horde-research/Kaz-Offline-Arena

Offline Arena

This project runs a single Hugging Face decoder-only model over tasks defined in a CSV file. For each row, it samples M question types (e.g. WHY_QS, WHAT_QS) and runs inference once per sampled question type, producing one output each. Each row-question pair is then sent individually to an LLM acting as judge. The judge returns a chain-of-thought explanation and a score (0–100) as JSON, which is validated with Pydantic.
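To make the judge output concrete, here is a minimal sketch of the kind of check described above. The project itself validates with Pydantic; this stand-in uses only the standard library so it runs anywhere. The field names `explanation` and `score`, the class `JudgeVerdict`, and the function `parse_judge_reply` are assumptions for illustration and may not match the actual schema in the repository.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical shape of one judge reply: reasoning text plus a 0-100 score."""
    explanation: str
    score: int

    def __post_init__(self):
        if not isinstance(self.explanation, str) or not self.explanation.strip():
            raise ValueError("explanation must be a non-empty string")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.score, bool) or not isinstance(self.score, int):
            raise ValueError("score must be an integer")
        if not 0 <= self.score <= 100:
            raise ValueError("score must be between 0 and 100")


def parse_judge_reply(raw: str) -> JudgeVerdict:
    """Parse the judge's raw JSON text and validate it (assumed field names)."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    return JudgeVerdict(explanation=data.get("explanation"), score=data.get("score"))


if __name__ == "__main__":
    verdict = parse_judge_reply('{"explanation": "Clear and correct.", "score": 87}')
    print(verdict.score)
```

Rejecting malformed replies at this boundary means a bad judge response fails loudly for one row-question pair instead of silently skewing the aggregate scores.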

Install requirements

pip install -r requirements.txt

Install Flash Attention (optional)

apt install gcc screen htop iotop nano
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run
# Accept the license terms
# Install only the CUDA Toolkit (skip the driver, since PyTorch already works)
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
pip install --upgrade pip setuptools wheel
pip install flash-attn --no-build-isolation
python -m pip install --upgrade 'optree>=0.13.0'

Env vars

Copy .env.template to .env and fill in the required values

or

touch .env
echo "OPENAI_API_KEY=your_key_here" >> .env
echo "HUGGINGFACE_TOKEN=your_token_here" >> .env 
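Before starting a long run, it can help to confirm that the .env file really contains both values and not the placeholders. The snippet below is a small standalone check, not the project's own loader (how main.py reads .env is not shown here); the helper names `read_env_file` and `missing_keys` are made up for this sketch.

```python
from pathlib import Path

# The two variables named in this README.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_TOKEN")


def read_env_file(path=".env"):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return required keys that are absent or still hold a placeholder."""
    placeholders = {"", "your_key_here", "your_token_here"}
    return [key for key in required if values.get(key, "") in placeholders]


if __name__ == "__main__":
    problems = missing_keys(read_env_file(".env"))
    print("missing or placeholder values:", problems)
```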

Commands

Run inference:

python main.py inference --model_id="meta-llama/Llama-3.2-3B-Instruct" --question_types="WHY_QS,WHAT_QS,HOW_QS,DESCRIBE_QS,ANALYZE_QS" --batch_size=10

With nohup (runs in the background and writes logs to output.log):

nohup python main.py inference --model_id="issai/LLama-3.1-KazLLM-1.0-70B" --question_types="WHY_QS,WHAT_QS,HOW_QS,DESCRIBE_QS,ANALYZE_QS" --batch_size=1 > output.log 2>&1 &
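For intuition, the way a comma-separated `--question_types` value could be turned into M sampled question types per row can be sketched as follows. This is an illustration under assumptions, not the code in inference.py: the function names are hypothetical, and whether M is configurable or how the project seeds its sampling is not stated in this README.

```python
import random


def parse_question_types(arg):
    """Split a comma-separated flag value into a clean list of type names."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def sample_question_types(question_types, m, rng):
    """Pick m distinct question types for one CSV row."""
    if m > len(question_types):
        raise ValueError("cannot sample more question types than were provided")
    return rng.sample(question_types, m)


if __name__ == "__main__":
    types = parse_question_types("WHY_QS,WHAT_QS,HOW_QS,DESCRIBE_QS,ANALYZE_QS")
    rng = random.Random(0)  # fixed seed so repeated runs pick the same types
    for row_index in range(3):
        print(row_index, sample_question_types(types, 2, rng))
```

Sampling without replacement keeps the M outputs for a row distinct, so the judge never scores the same row-question pair twice.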

Using a model with a custom chat template or special generation settings

If your model needs its own chat template, prompt formatting, or generation parameters, add a dedicated block to inference.py inside run_inference_huggingface().
Follow the pattern already used for Sherkala, Gemma-3, Qwen-2.5, etc.:

# ── Custom model example ─────────────────────────────────────────────
if model_id == "your-org/Your-Model-Name":
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    ...
    ...
    return outputs

Add further elif blocks for every additional custom model; keep them above the catch-all section so they’re reached first.


Run judge evaluations:

python main.py judge

After the judge evaluations finish, you will find filename.json in the output/judge/ directory. Use this file to submit your model to the Kaz Offline Arena.

Note: You do not need to run the Elo step (elo.py) below. Elo is calculated on the Spaces leaderboard when you submit your model.

Compute the Elo leaderboard (optional):

It is actually a Bradley-Terry model, as described here

python main.py elo
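For readers unfamiliar with the Bradley-Terry model, the sketch below fits strengths from pairwise outcomes using the standard minorization-maximization update. It is not the code in elo.py: the function name `fit_bradley_terry` is hypothetical, and the input format is an assumption (a list of (winner, loser) pairs, for example obtained by comparing two models' judge scores on the same row-question pair).

```python
from collections import defaultdict


def fit_bradley_terry(matches, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each model to a strength; strengths sum to 1, and
    P(a beats b) is modeled as strength[a] / (strength[a] + strength[b]).
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)  # unordered pair -> number of comparisons
    models = set()
    for winner, loser in matches:
        if winner == loser:
            continue  # a model cannot be compared with itself
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strengths = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strengths[m] + strengths[other])
            updated[m] = wins[m] / denom if denom else strengths[m]
        total = sum(updated.values())
        strengths = {m: s / total for m, s in updated.items()}
    return strengths


if __name__ == "__main__":
    demo = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
    print(fit_bradley_terry(demo))
```

Unlike sequential Elo updates, this fit does not depend on the order in which comparisons are processed, which suits an offline batch of judge results.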

About

Offline LLM evaluation pipeline for Kazakh: run local HF models, auto-judge, export JSON for the Arena leaderboard: https://huggingface.co/spaces/kz-transformers/kaz-offline-arena
