Offline Arena

This project runs a single Hugging Face decoder-only model over tasks defined in a CSV file. For each row, it samples M question types (e.g. WHY_QS, WHAT_QS) and runs inference, producing one output per sampled question type. Each row-question pair is then sent individually to an LLM acting as a judge. The judge returns a chain-of-thought explanation and a score (0–100) as JSON, which is validated with Pydantic.
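The sampling step described above can be pictured with the following minimal sketch. It is not the project's code: the function name build_pairs, the dictionary keys, the fixed seed, and the choice to sample without replacement are all assumptions made for illustration. Only the question type names come from this README.

```python
import random

# Question type names as used in the commands in this README.
QUESTION_TYPES = ["WHY_QS", "WHAT_QS", "HOW_QS", "DESCRIBE_QS", "ANALYZE_QS"]


def build_pairs(rows, question_types, m, seed=0):
    """Return one row-question pair per sampled question type.

    Hypothetical helper: assumes M distinct types are drawn per row.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    k = min(m, len(question_types))
    pairs = []
    for row_id, row in enumerate(rows):
        for q_type in rng.sample(question_types, k=k):
            pairs.append({"row_id": row_id, "question_type": q_type, "row": row})
    return pairs


if __name__ == "__main__":
    demo_rows = [{"text": "first task"}, {"text": "second task"}]
    for pair in build_pairs(demo_rows, QUESTION_TYPES, m=2):
        print(pair["row_id"], pair["question_type"])
```

Each resulting pair would then receive one model output and one judge call.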

Install requirements

pip install -r requirements.txt

Install Flash Attention (optional)

apt install gcc screen htop iotop nano
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run
# Accept the terms
# Install only the CUDA Toolkit (no driver, since PyTorch works already)
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
pip install --upgrade pip setuptools wheel
pip install flash-attn --no-build-isolation
python -m pip install --upgrade 'optree>=0.13.0'
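Since this install is optional, it may help to check whether the package is importable before relying on it. The sketch below uses only the standard library. The helper name has_module is hypothetical, and flash_attn is assumed to be the import name that the flash-attn package provides.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "flash_attn" is assumed to be the import name of the flash-attn package.
    if has_module("flash_attn"):
        print("flash_attn is installed")
    else:
        print("flash_attn not found; the optional install did not complete")
```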

Env vars

Copy .env.template to .env and fill in the required values

or

touch .env
echo "OPENAI_API_KEY=your_key_here" >> .env
echo "HUGGINGFACE_TOKEN=your_token_here" >> .env
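Before starting a long run, it can be useful to confirm that both keys are present and are no longer placeholders. The following is a minimal standard-library sketch, not the way the project itself loads the file (that is not shown here). The helper names read_env_file and missing_keys are hypothetical.

```python
from pathlib import Path

# The two keys written by the commands above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_TOKEN")


def read_env_file(path=".env"):
    """Parse simple KEY=value lines, skipping blanks and comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return required keys that are absent, empty, or still placeholders."""
    return [
        key
        for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("your_")
    ]


if __name__ == "__main__":
    if Path(".env").exists():
        print("missing or placeholder keys:", missing_keys(read_env_file()))
    else:
        print(".env not found")
```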

Commands

Run inference:

python main.py inference --model_id="meta-llama/Llama-3.2-3B-Instruct" --question_types="WHY_QS,WHAT_QS,HOW_QS,DESCRIBE_QS,ANALYZE_QS" --batch_size=10

With nohup (runs in the background and writes logs to output.log):

nohup python main.py inference --model_id="issai/LLama-3.1-KazLLM-1.0-70B" --question_types="WHY_QS,WHAT_QS,HOW_QS,DESCRIBE_QS,ANALYZE_QS" --batch_size=1 > output.log 2>&1 &

Using a model with a custom chat template or special generation settings

If your model needs its own chat template, prompt formatting, or generation parameters, add a dedicated block to inference.py inside run_inference_huggingface().
Follow the pattern already used for Sherkala, Gemma-3, Qwen-2.5, etc.:

# ── Custom model example ─────────────────────────────────────────────
if model_id == "your-org/Your-Model-Name":
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    ...
    ...
    return outputs

Add further elif blocks for every additional custom model; keep them above the catch-all section so they’re reached first.

Run judge evaluations:

python main.py judge

After the judge evaluations finish, filename.json is written to the output/judge/ directory. Use this file to submit your model to the Kaz Offline Arena.
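The judge's JSON reply (an explanation plus a 0–100 score) is validated with Pydantic. A rough sketch of what such validation could look like follows. The class name JudgeVerdict, the field names explanation and score, and the helper parse_verdict are assumptions for illustration. The actual schema in judge.py may differ.

```python
import json

from pydantic import BaseModel, Field, ValidationError


class JudgeVerdict(BaseModel):
    """Hypothetical schema: field names are assumed, not taken from judge.py."""

    explanation: str
    score: int = Field(ge=0, le=100)  # reject scores outside 0-100


def parse_verdict(raw_reply):
    """Return a validated verdict, or None if the reply is malformed."""
    try:
        return JudgeVerdict(**json.loads(raw_reply))
    except (json.JSONDecodeError, ValidationError, TypeError):
        # TypeError covers valid JSON that is not an object (e.g. a list).
        return None


if __name__ == "__main__":
    reply = '{"explanation": "Accurate and complete answer.", "score": 87}'
    print(parse_verdict(reply))
```

Returning None for malformed replies lets the caller decide whether to retry the judge call or skip the pair.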

Note: You don't need to run the code below (elo.py). Elo is calculated on the Spaces leaderboard when you submit the model.

Compute the Elo leaderboard (optional):

Despite the name, the ranking is actually a Bradley-Terry model.

python main.py elo
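For readers unfamiliar with the Bradley-Terry model, the sketch below fits strengths from pairwise win counts using the standard iterative (minorization-maximization) update. It is an illustration only, not the contents of elo.py. How the project turns judge scores into pairwise outcomes is not shown here, so the wins input format and the function name fit_bradley_terry are assumptions.

```python
from collections import defaultdict


def fit_bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths.

    wins[(a, b)] is the number of times model a beat model b (assumed format).
    Returns a dict of strengths normalized to average 1.0.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = defaultdict(float)
    games = defaultdict(float)  # games played per unordered pair
    for (a, b), n in wins.items():
        total_wins[a] += n
        games[frozenset((a, b))] += n

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                n = games.get(frozenset((i, j)), 0.0)
                if j != i and n > 0:
                    denom += n / (strength[i] + strength[j])
            # p_i = W_i / sum_j n_ij / (p_i + p_j)
            updated[i] = total_wins[i] / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        strength = {m: v * len(models) / norm for m, v in updated.items()}
    return strength


if __name__ == "__main__":
    demo = {("A", "B"): 8, ("B", "A"): 2, ("B", "C"): 7, ("C", "B"): 3}
    for name, value in sorted(fit_bradley_terry(demo).items()):
        print(name, round(value, 3))
```

Under this model, the probability that model i beats model j is strength[i] / (strength[i] + strength[j]).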
