This project runs a single Hugging Face decoder-only model over tasks defined in a CSV. For each row, it samples M question types (e.g. WHY_QS, WHAT_QS) and runs inference, producing one output per sampled question type. Each row-question pair is then sent individually to an LLM acting as a judge. The judge returns a chain-of-thought explanation and a score (0–100) in JSON format, validated with Pydantic.
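The sampling and judge-validation steps described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the names `JudgeVerdict`, `sample_question_types`, and `parse_verdict`, and the field names `explanation` and `score`, are assumptions. Only the question-type names, the 0–100 range, and the use of Pydantic come from this README.

```python
import json
import random

from pydantic import BaseModel, Field, ValidationError

# Question types named in this README; the real list may be longer.
QUESTION_TYPES = ["WHY_QS", "WHAT_QS", "HOW_QS", "DESCRIBE_QS", "ANALYZE_QS"]


class JudgeVerdict(BaseModel):
    """Hypothetical schema for the judge's JSON reply."""

    explanation: str                  # chain-of-thought explanation
    score: int = Field(ge=0, le=100)  # score constrained to 0-100


def sample_question_types(m, rng):
    """Pick m distinct question types for one CSV row."""
    return rng.sample(QUESTION_TYPES, m)


def parse_verdict(raw):
    """Return a validated verdict, or None if the reply is malformed."""
    try:
        return JudgeVerdict(**json.loads(raw))
    except (json.JSONDecodeError, TypeError, ValidationError):
        return None
```

Returning `None` for a malformed reply leaves the retry policy to the caller; whether the project retries or skips such rows is not stated here.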
pip install -r requirements.txt
apt install gcc screen htop iotop nano
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run
#Accept terms
#Install only the CUDA Toolkit (no driver, since PyTorch works already)
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
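Before building flash-attn, it can help to confirm that the exports above took effect in the current shell. The check below is a minimal sketch using only the standard library. The function name `check_cuda_toolchain` is hypothetical and is not part of this repository.

```python
import os
import shutil


def check_cuda_toolchain(env=None):
    """Return a list of problems with the CUDA build environment."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("CUDA_HOME"):
        problems.append("CUDA_HOME is not set")
    # flash-attn compiles CUDA kernels, so nvcc must be reachable on PATH.
    if shutil.which("nvcc", path=env.get("PATH", "")) is None:
        problems.append("nvcc not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_cuda_toolchain():
        print("WARNING:", problem)
```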
pip install --upgrade pip setuptools wheel
pip install flash-attn --no-build-isolation
python -m pip install --upgrade 'optree>=0.13.0'
Copy .env.template and fill in the required values in the .env file
or
touch .env
echo "OPENAI_API_KEY=your_key_here" >> .env
echo "HUGGINGFACE_TOKEN=your_token_here" >> .env
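A quick way to catch a forgotten placeholder before a long run is to scan the .env text for the two keys above. The sketch below assumes simple KEY=VALUE lines and is not how the project itself loads the file (that is not shown in this README). The helper name `missing_env_keys` is hypothetical.

```python
REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_TOKEN")
# Placeholder values taken from the echo commands above.
PLACEHOLDERS = {"", "your_key_here", "your_token_here"}


def missing_env_keys(env_text):
    """Return required keys that are absent or still set to a placeholder."""
    values = {}
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return [k for k in REQUIRED_KEYS if values.get(k, "") in PLACEHOLDERS]
```

For example, `missing_env_keys(open(".env").read())` returns an empty list once both values are filled in.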
Run inference:
python main.py inference --model_id="meta-llama/Llama-3.2-3B-Instruct" --question_types="WHY_QS,WHAT_QS,HOW_QS,DESCRIBE_QS,ANALYZE_QS" --batch_size=10
With nohup:
nohup python main.py inference --model_id="issai/LLama-3.1-KazLLM-1.0-70B" --question_types="WHY_QS,WHAT_QS,HOW_QS,DESCRIBE_QS,ANALYZE_QS" --batch_size=1 > output.log 2>&1 &
If your model needs its own chat template, prompt formatting, or generation parameters, add a dedicated block to inference.py inside run_inference_huggingface().
Follow the pattern already used for Sherkala, Gemma-3, Qwen-2.5, etc.:
# ── Custom model example ─────────────────────────────────────────────
if model_id == "your-org/Your-Model-Name":
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    ...
    ...
    return outputs
Add further elif blocks for every additional custom model; keep them above the catch-all section so they’re reached first.
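The ordering rule matters because the first matching branch wins. The toy function below illustrates this with stand-in settings. `pick_generation_settings`, the template labels, and the token limits are all invented for illustration and do not reflect what inference.py actually does.

```python
def pick_generation_settings(model_id: str) -> dict:
    """Toy dispatch: specific branches first, catch-all last."""
    if model_id == "your-org/Your-Model-Name":
        # Exact match for one custom model (hypothetical values).
        return {"template": "custom", "max_new_tokens": 512}
    elif model_id.startswith("your-org/"):
        # Broader rule for the same organisation (hypothetical values).
        return {"template": "org-default", "max_new_tokens": 256}
    # Catch-all: reached only when no custom branch matched.
    return {"template": "tokenizer-default", "max_new_tokens": 256}
```

If the catch-all returned before the custom branches were tested, `your-org/Your-Model-Name` would silently get the default settings.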
Run the judge:
python main.py judge
After the judge evaluations finish, you will get filename.json in the output/judge/ directory. You can use this file to submit your model to the Kaz Offline Arena.
Note: You don't need to run the code below (elo.py). We calculate Elo ratings in the Spaces leaderboard when you submit the model.
Compute the Elo leaderboard (optional):
It's actually a Bradley-Terry model, as described here
python main.py elo
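For readers unfamiliar with the Bradley-Terry model, the sketch below fits strengths from pairwise win counts using the standard minorization-maximization update, p_i = W_i / sum_j n_ij / (p_i + p_j). It is not the code in elo.py. The function names, the input layout, and the conversion to an Elo-like scale (400 * log10(p) + 1000) are assumptions made for illustration.

```python
import math


def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths.

    wins[i][j] is the number of times model i beat model j.
    Assumes every model appears as a top-level key of `wins`.
    """
    models = sorted(wins)
    p = {m: 1.0 for m in models}
    for _ in range(n_iter):
        new_p = {}
        for i in models:
            total_wins = sum(wins[i].get(j, 0) for j in models if j != i)
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0)) / (p[i] + p[j])
                for j in models
                if j != i
            )
            new_p[i] = total_wins / denom if denom > 0 else p[i]
        norm = sum(new_p.values())
        if norm == 0:
            return p  # no wins recorded at all; keep the previous estimate
        # Rescale so the strengths average to 1 (the scale is arbitrary).
        p = {m: v * len(models) / norm for m, v in new_p.items()}
    return p


def to_elo_scale(p, base=1000.0):
    """Map strengths to an Elo-like scale (assumed convention)."""
    return {m: 400.0 * math.log10(v) + base for m, v in p.items() if v > 0}
```

With two models where A beats B three times and loses once, the fitted strengths satisfy p_A / p_B = 3, which matches the observed 3:1 win ratio.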