TL;DR: We built VulnHunter, an OpenEnv-compatible reinforcement learning environment for training AI agents to detect and patch web application vulnerabilities. Using Unsloth and GRPO (Group Relative Policy Optimization), we fine-tuned Qwen2.5-Coder-7B to identify SQL injection, XSS, and Path Traversal vulnerabilities, then generate secure patches.
Every company needs secure software. Yet:
- Static analyzers produce too many false positives
- Manual code review doesn't scale
- LLMs can identify obvious bugs but struggle with multi-step security reasoning
What if we could train an AI agent that doesn't just identify vulnerabilities, but reasons about them and generates fixes?
VulnHunter is a reinforcement learning environment where AI agents learn to:
- Explore source code to understand application structure
- Identify vulnerability types (SQL injection, XSS, Path Traversal)
- Generate patches that fix the vulnerability without breaking functionality
We implemented the OpenEnv standard by inheriting from the official base classes:
```python
from openenv.core.env_server import Environment, Action, Observation, State


class VulnHunterEnv(Environment[VulnHunterAction, VulnHunterObservation, VulnHunterState]):
    def reset(self, seed=None, episode_id=None, **kwargs) -> VulnHunterObservation:
        ...

    def step(self, action, timeout_s=None, **kwargs) -> VulnHunterObservation:
        ...

    @property
    def state(self) -> VulnHunterState:
        ...
```
```python
# Agent actions extend OpenEnv's Action base class
class VulnHunterAction(Action):
    command: str         # Run security tools (curl, sqlmap)
    read_file: str       # Examine source code
    identify_vuln: dict  # Report vulnerability findings
    patch: dict          # Submit a code fix
```

| Event | Reward | Rationale |
|---|---|---|
| Identify vulnerability type | +0.3 | Encourage correct classification |
| Generate valid patch | +0.2 | Reward attempting fixes |
| Patch blocks exploit | +1.0 | Main goal achieved |
| Syntax error in patch | -0.2 | Discourage broken code |
The cumulative max reward of 1.5 means the agent must both identify AND fix to score highly.
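The reward table can be sketched as a plain scoring function (an illustrative helper, not the environment's actual implementation, which computes these signals inside `step`):

```python
def security_reward(identified_correctly: bool,
                    patch_valid: bool,
                    patch_blocks_exploit: bool,
                    patch_has_syntax_error: bool) -> float:
    """Mirror of the reward table: identify + valid patch + blocked exploit = 1.5."""
    reward = 0.0
    if identified_correctly:
        reward += 0.3   # correct vulnerability classification
    if patch_valid:
        reward += 0.2   # reward attempting a fix
    if patch_blocks_exploit:
        reward += 1.0   # main goal: the exploit no longer works
    if patch_has_syntax_error:
        reward -= 0.2   # discourage broken code
    return reward

print(security_reward(True, True, True, False))  # 1.5
```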
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "gateremark/vulnhunter-agent"
)

# Analyze vulnerable code
prompt = "Analyze this code: query = f\"SELECT * FROM users WHERE id = {user_id}\""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

```bash
# Start the OpenEnv server
git clone https://github.com/gateremark/vulnhunter
cd vulnhunter
uvicorn vulnhunter.env_server.server:app --port 8000
```

```bash
# Start the A2A agent
cd vulnhunter/green_agent
python server.py
# Agent running at http://localhost:9009
```

| Model | Pros | Cons | Our Choice |
|---|---|---|---|
| Llama-3-8B | Popular, well-tested | General purpose | ❌ |
| DeepSeek-Coder-7B | Code-specialized | Less Unsloth support | ❌ |
| Qwen2.5-Coder-7B | Code-specialized, Unsloth-optimized | — | ✅ |
Key reasons:
- Pre-trained on code — Already understands Python, SQL, security patterns
- Instruct variant — Follows instructions out-of-the-box
- 7B size — Sweet spot between capability and training cost
- Unsloth support — 2x faster training with their optimizations
- Full precision (FP16): ~14 GB VRAM for a 7B model
- 4-bit quantized: ~4 GB VRAM for a 7B model

That's ~3.5x less memory with negligible quality loss.
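The memory figures follow directly from bytes per weight. The arithmetic below covers raw weights only; quantization overhead (scale factors, plus layers kept in higher precision) accounts for the gap between the theoretical 3.5 GB and the ~4 GB observed in practice, which is where the ~3.5x figure comes from:

```python
params = 7e9                  # 7B parameters
fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # 4-bit: 0.5 bytes per weight

print(fp16_gb)  # 14.0 GB
print(int4_gb)  # 3.5 GB (raw weights; ~4 GB in practice with overhead)
```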
GRPO needs to generate multiple responses per prompt (we used 4). 4-bit quantization gave us the headroom to do this efficiently.
| Method | Memory | Complexity | Fit for Our Use Case |
|---|---|---|---|
| SFT | Low | Simple | Too passive — doesn't explore |
| PPO | High (needs critic) | Complex | Memory-prohibitive |
| DPO | Medium | Medium | Needs preference pairs |
| GRPO ✅ | Low | Medium | Perfect for reward-based RL |
GRPO's advantage: It eliminates the critic model by comparing responses within groups:
- Generate 4 responses to each prompt
- Score each with our security reward function
- Reinforce responses that scored above the group average
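The group-relative comparison above can be sketched in a few lines (a simplified illustration of the idea, not TRL's actual implementation):

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """GRPO-style advantages: score each response relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero for uniform groups
    return [(r - mu) / sigma for r in rewards]

# Four responses to one prompt, scored by the security reward function
rewards = [1.5, 0.3, 0.5, 0.3]
advs = group_advantages(rewards)

# Responses above the group mean get positive advantage and are reinforced;
# the rest are pushed down. No separate critic model is needed.
print([round(a, 2) for a in advs])
```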
- Hardware: NVIDIA A100-SXM4-40GB on Lightning.ai
- Framework: Unsloth + TRL (GRPOTrainer)
- Training Time: ~90 minutes
- Total Steps: 200
- Model: Qwen2.5-Coder-7B-Instruct (4-bit quantized)
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
)
# Only 1.05% of parameters trained!
```

```python
GRPOConfig(
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4
    num_generations=4,              # group size for GRPO
    max_completion_length=512,
    num_train_epochs=2,
    bf16=True,
)
```

We tracked training with Weights & Biases:
Key observations:
- Security reward improved from ~0.5 → 1.4
- Loss stabilized near zero
- Consistent training over 200 steps
- GPU Power Usage: 200-350W (active training)
- ~90 minutes total on A100
After training, VulnHunter accurately identifies and fixes vulnerabilities:
Input:

```python
query = f"SELECT * FROM users WHERE username = '{username}'"
cursor.execute(query)
```

VulnHunter Output:

SQL injection vulnerability detected. Use parameterized queries:

```python
query = "SELECT * FROM users WHERE username = %s"
cursor.execute(query, (username,))
```
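You can see why the parameterized form blocks the attack with a quick stdlib demo (using `sqlite3`, whose placeholder is `?` rather than the `%s` used by drivers like psycopg2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, username TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

malicious = "x' OR '1'='1"

# Vulnerable: string interpolation lets the payload rewrite the WHERE clause
leaked = conn.execute(
    f"SELECT * FROM users WHERE username = '{malicious}'"
).fetchall()
print(len(leaked))  # 2 -- every row leaks

# Patched: the driver treats the entire input as a literal value
safe = conn.execute(
    "SELECT * FROM users WHERE username = ?", (malicious,)
).fetchall()
print(len(safe))  # 0 -- no user matches the literal string
```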
Input:

```python
return f"<h1>Results for: {query}</h1>"
```

VulnHunter Output:

Cross-Site Scripting vulnerability. Escape user input:

```python
from html import escape
return f"<h1>Results for: {escape(query)}</h1>"
```
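The effect of the patch is easy to verify in isolation: `html.escape` neutralizes the characters the browser would interpret as markup.

```python
from html import escape

payload = "<script>alert('xss')</script>"

# Unescaped: the payload ships as live markup and the script executes
print(f"<h1>Results for: {payload}</h1>")

# Escaped: angle brackets and quotes become entities, rendered as plain text
print(f"<h1>Results for: {escape(payload)}</h1>")
```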
Input:

```python
return send_file(f"/uploads/{filename}")
```

VulnHunter Output:

Path traversal vulnerability. Use safe_join:

```python
from werkzeug.utils import safe_join
return send_file(safe_join('/uploads', filename))
```
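For intuition about what `safe_join` protects against, here is a stdlib-only approximation (`safe_join_sketch` is a hypothetical helper for illustration, not Werkzeug's implementation):

```python
import os.path

def safe_join_sketch(directory: str, filename: str):
    """Stdlib-only approximation of werkzeug's safe_join (illustrative)."""
    base = os.path.realpath(directory)
    target = os.path.realpath(os.path.join(base, filename))
    # Reject any resolved path that escapes the base directory
    if os.path.commonpath([base, target]) != base:
        return None
    return target

print(safe_join_sketch("/uploads", "report.pdf"))     # stays inside /uploads
print(safe_join_sketch("/uploads", "../etc/passwd"))  # None: traversal blocked
```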
VulnHunter implements the A2A protocol for interoperability with other agents:
```
GET /.well-known/agent.json
```

```json
{
  "name": "VulnHunter",
  "description": "AI security agent that detects and fixes web vulnerabilities",
  "skills": [{"id": "analyze_code", "name": "Analyze Code"}]
}
```

```python
import requests

response = requests.post("http://localhost:9009/agent/task", json={
    "type": "analyze",
    "code": "query = f\"SELECT * FROM users WHERE id = {user_id}\""
})
print(response.json())
```

```
vulnhunter/
├── vulnhunter/
│   ├── env_server/          # OpenEnv server
│   │   ├── server.py        # FastAPI endpoints
│   │   └── models.py        # Pydantic models
│   └── green_agent/         # A2A agent wrapper
│       ├── agent.py         # Core agent logic
│       ├── server.py        # A2A HTTP server
│       └── Dockerfile       # Container support
├── train_simple.py          # GRPO training script
├── train_grpo.py            # Full training with environment
├── test_vulnhunter.py       # Model testing script
└── requirements.txt
```
| Resource | Link |
|---|---|
| Model | huggingface.co/gateremark/vulnhunter-agent |
| Model (W&B) | huggingface.co/gateremark/vulnhunter-agent-wandb |
| W&B Run | wandb.ai/gatere-ai/huggingface/runs/v0dge86p |
| OpenEnv | github.com/meta-pytorch/OpenEnv |
Built as part of the [AgentBeats OpenEnv Challenge](https://rdi.berkeley.edu/agentx-agentbeats.html). Thanks to the teams behind PyTorch (Meta), Hugging Face, Unsloth, and others.




