| title | Citation Benchmark |
|---|---|
| colorFrom | blue |
| colorTo | indigo |
| sdk | docker |
| app_port | 7860 |
| pinned | false |
Citation-Agent is an OpenEnv-compliant environment for training and evaluating a reinforcement learning citation agent on academic literature review and citation selection.
Given a claim, the agent must navigate a local research database: it searches across papers, reads abstracts, traverses citation graphs, and ultimately selects the paper that supports the claim.
Most agent benchmarks focus on web scraping or bash terminal navigation. This environment tests an agent's ability to reason over complex scholarly graphs. The agent must:
- Differentiate between ArXiv IDs and Semantic Scholar Corpus IDs.
- Follow citation intents.
- Run SQL-backed queries to find core papers.
- Avoid getting stuck in infinite search loops.
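The first requirement in the list above can be illustrated with a small classifier. This is a minimal sketch, not the environment's actual logic: the function name `classify_paper_id` and its return labels are hypothetical, and it assumes arXiv identifiers follow the public `YYMM.NNNNN` (or legacy `archive/YYMMNNN`) scheme while Semantic Scholar Corpus IDs are plain integers.

```python
import re

# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix (v2).
ARXIV_NEW = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Legacy arXiv IDs: archive[.SUBJECT]/YYMMNNN, e.g. hep-th/9901001.
ARXIV_OLD = re.compile(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")
# Semantic Scholar Corpus IDs are plain non-negative integers.
CORPUS_ID = re.compile(r"^\d+$")


def classify_paper_id(raw: str) -> str:
    """Return 'arxiv', 'corpus_id', or 'unknown' for a paper identifier (hypothetical helper)."""
    value = raw.strip()
    # Tolerate the common "arXiv:" prefix.
    if value.lower().startswith("arxiv:"):
        value = value[len("arxiv:"):]
    if ARXIV_NEW.match(value) or ARXIV_OLD.match(value):
        return "arxiv"
    if CORPUS_ID.match(value):
        return "corpus_id"
    return "unknown"
```

Checking the arXiv patterns first matters only for readability here, since an arXiv ID always contains a `.` or `/` and therefore never matches the integer pattern.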
Project files:
- environment.py: RL environment with actions, transitions, and dense reward logic.
- tasks.py: 50 tasks with deterministic grader checks.
- rl_agent.py: Q-learning policy for citation actions.
- evaluation.py: Hybrid evaluation with programmatic checks and optional LLM scoring.
- hf_entry.py: Gradio app entrypoint for Hugging Face Spaces deployment.
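For orientation, a tabular Q-learning update over the four action types looks roughly like the sketch below. It is not the contents of rl_agent.py: the class name `TabularQ`, the string state keys, and the values of `alpha`, `gamma`, and `epsilon` are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

ACTIONS = ["search", "read_abstract", "get_citations", "submit"]


class TabularQ:
    """Minimal tabular Q-learning; hyperparameters are illustrative assumptions."""

    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # (state, action) -> estimated return
        self.rng = random.Random(seed)

    def best_action(self, state):
        # Greedy choice; ties resolve to the earliest entry in ACTIONS.
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def choose(self, state):
        # Epsilon-greedy exploration.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return self.best_action(state)

    def update(self, state, action, reward, next_state, done):
        # Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        future = 0.0 if done else max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * future
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

A terminal `submit` step passes `done=True`, so no future value is bootstrapped from the next state.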
- Install Dependencies: Requires Python 3.10+.
  pip install -r requirements.txt
- Set Environment Variables: We use the Hugging Face Router with the google/gemma-4-31B-it model.
  - PowerShell:
    $env:HF_TOKEN="your_huggingface_api_key"
    $env:MODEL_NAME="google/gemma-4-31B-it"
    $env:API_BASE_URL="https://router.huggingface.co/v1"
  - Bash/Zsh:
    export HF_TOKEN="your_huggingface_api_key"
    export MODEL_NAME="google/gemma-4-31B-it"
    export API_BASE_URL="https://router.huggingface.co/v1"
- Train + Evaluate the RL Agent (CLI):
  python inference.py --episodes-per-task 120
  Add --include-llm-score to enable LLM-based trajectory scoring.
  Useful controls:
  - Resume from saved checkpoint:
    python inference.py --load-checkpoint --checkpoint-path rl_policy.json
  - Evaluate only (skip train):
    python inference.py --load-checkpoint --skip-train
  - Force a fresh retrain:
    python inference.py --force-retrain
  - CI fast mode (reduced train/eval set, no LLM scoring):
    python inference.py --ci-fast
- Run the Hugging Face app locally:
  python hf_entry.py
  Then open http://localhost:7860. The Space uses full_policy.json by default, so it should return a result immediately instead of training on the first click.
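The variables from the setup step could be read on the Python side roughly as follows. This is a sketch under assumptions: `RouterConfig` and `load_router_config` are hypothetical names, and treating the model name and base URL as optional with defaults (only `HF_TOKEN` being required) is an illustrative choice, not necessarily how inference.py behaves.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RouterConfig:
    """Settings for the LLM endpoint (hypothetical container)."""
    token: str
    model_name: str
    api_base_url: str


def load_router_config(env: Optional[Mapping[str, str]] = None) -> RouterConfig:
    """Read HF_TOKEN, MODEL_NAME, and API_BASE_URL, failing early if the token is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running.")
    return RouterConfig(
        token=token,
        # Defaults mirror the values shown in the setup step (assumed fallback behavior).
        model_name=env.get("MODEL_NAME", "google/gemma-4-31B-it"),
        api_base_url=env.get("API_BASE_URL", "https://router.huggingface.co/v1").rstrip("/"),
    )
```

Passing a plain dict instead of relying on `os.environ` makes the loader easy to exercise in tests without touching the real environment.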
To check that your environment works with the automated Hugging Face Spaces pipeline:
docker build -t citation-agent .
docker run --env HF_TOKEN="your_token" --env MODEL_NAME="google/gemma-4-31B-it" --env API_BASE_URL="https://router.huggingface.co/v1" citation-agent
To deploy to a Hugging Face Space:
- Create a Space ID (for example your-username/citation-rl-agent).
- Set your token:
  - PowerShell:
    $env:HF_TOKEN="your_hf_token"
- Deploy:
  python deploy_hf_space.py --space-id your-username/citation-rl-agent
- Optional private Space:
  python deploy_hf_space.py --space-id your-username/citation-rl-agent --private
After upload, open https://huggingface.co/spaces/your-username/citation-rl-agent.
- Real-world task: Accurately simulates literature review and academic source verification.
- OpenEnv Spec Compliance:
  - Contains a valid openenv.yaml.
  - environment.py uses Pydantic Models (Observation, Action, Reward) and implements the required reset(), state(), and step(action) methods.
  - Observation Space: JSON containing current_claim, search_results, last_abstract, citations_data, message, and step_count.
  - Action Space: action_type (search, read_abstract, get_citations, submit), query, and paper_id.
- 50 Distinct Tasks with Graders:
  - tasks.py loads 50 distinct Easy, Medium, and Hard tasks from the SQLite database.
  - Includes a deterministic Grader class that checks the submitted paper_id against the ground truth.
- Database Architecture:
  - To stay under the 1GB limit for free HF Spaces, environment.py uses huggingface-hub to automatically pull the citation_db.sqlite database from the ruby56/Citation-Database dataset on startup.
- Continuous Reward Function:
  - Uses dense shaping. The agent gets a -0.05 step penalty to encourage efficiency, and -0.1 to -0.2 penalties for formatting errors. It gets the final grader score (+1.0 max) on submit.
- Hybrid Evaluation:
  - Uses deterministic programmatic checks and optional LLM scoring to produce the final benchmark score.
- Production Dockerfile:
  - Standard Dockerfile that installs dependencies and runs hf_entry.py directly for Hugging Face Spaces.
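The reward description in the list above can be sketched in a few lines. The numbers (-0.05 per step, -0.1 to -0.2 for formatting errors, a grader score of at most +1.0 on submit) come from that list; everything else is an assumption. In particular, the function names, the choice that a missing field costs -0.1 while an unknown action_type costs -0.2, the exact-match grader, and stacking the step penalty on top of the other terms are illustrative guesses, not the logic in environment.py or tasks.py.

```python
from typing import Optional

VALID_ACTIONS = {"search", "read_abstract", "get_citations", "submit"}
STEP_PENALTY = -0.05  # per-step cost, from the reward description


def grade(submitted_id: str, truth_id: str) -> float:
    """Deterministic check: 1.0 for an exact match, else 0.0 (assumed scoring rule)."""
    return 1.0 if submitted_id.strip() == truth_id.strip() else 0.0


def step_reward(action_type: str, query: Optional[str],
                paper_id: Optional[str], truth_id: str) -> float:
    """Dense reward for one step (hypothetical helper)."""
    reward = STEP_PENALTY
    if action_type not in VALID_ACTIONS:
        return reward - 0.2  # assumed: unknown action is the worst formatting error
    if action_type == "search" and not query:
        return reward - 0.1  # assumed: search requires a query
    if action_type != "search" and not paper_id:
        return reward - 0.1  # assumed: other actions require a paper_id
    if action_type == "submit":
        reward += grade(paper_id, truth_id)  # terminal grader score, at most +1.0
    return reward
```

Under these assumptions, an episode that searches once, reads one abstract, and then submits the correct paper accumulates roughly 0.85 in total reward, so shorter correct trajectories score higher.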