An OpenEnv-compatible reinforcement learning environment where an AI agent acts as a senior code reviewer. The agent reads code diffs and decides whether to flag_bug, approve, or ignore them — receiving rewards based on correctness, explanation quality, and severity classification accuracy.
Built for the Meta × Scaler OpenEnv AI Hackathon.
## Environment Overview

| Property | Value |
|---|---|
| Action Space | Discrete: `flag_bug` · `approve` · `ignore` |
| Observation | Code diff + context flags + step feedback |
| Reward Range | 0.10 to 0.90 (8-level partial credit) |
| Difficulties | `easy` · `medium` · `hard` · `all` |
| Episode Length | Up to 15 steps |
| Task Pool | 21 unique scenarios (5 easy, 5 medium, 11 hard) |
| Per-episode | 5 tasks sampled per difficulty (seed-reproducible) |
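An episode follows the usual OpenEnv `reset()`/`step()` loop. The sketch below uses a minimal stub environment so it runs standalone; `StubEnv` and its dict observations are illustrative stand-ins for the real environment client, not this repo's actual API.

```python
class StubEnv:
    """Illustrative stand-in with the reset()/step() shape of an OpenEnv episode."""

    def __init__(self, num_tasks: int = 5):
        self.num_tasks = num_tasks
        self.step_count = 0

    def reset(self) -> dict:
        self.step_count = 0
        return {"done": False, "steps_remaining": self.num_tasks, "reward": 0.0}

    def step(self, action: dict) -> dict:
        # Toy dynamics: every action earns a fixed reward; the episode ends
        # after num_tasks steps (the real env scores actions per the tables below).
        self.step_count += 1
        return {
            "done": self.step_count >= self.num_tasks,
            "steps_remaining": self.num_tasks - self.step_count,
            "reward": 0.9,
        }

env = StubEnv()
obs = env.reset()
total = 0.0
while not obs["done"]:
    # A trivial policy: always approve (a real agent would read the diff first).
    obs = env.step({"action_type": "approve", "severity": None, "comment": ""})
    total += obs["reward"]
```

The same loop structure applies to the real environment; only the policy and the observation fields change.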
## Action Space

```python
class CodeReviewAction(Action):
    action_type: str          # "flag_bug" | "approve" | "ignore"
    severity: Optional[str]   # "critical" | "medium" | "low" — affects reward
    comment: str              # Explanation — required for full marks on hard tasks
```
Observation Space
classCodeReviewObservation(Observation):
current_diff: Diff# The code diff to reviewsteps_remaining: int# Steps left in this episodestep: int# Current step numbertotal_tasks: int# Total tasks in this episodefeedback: str# Human-readable feedback on the last actiontask_difficulty: str# "easy" | "medium" | "hard"reward: float# Reward received for the last actiondone: bool# Whether the episode is completemetadata: dict# Debug info (task_id, episode_id, etc.)
The `current_diff` contains:

```python
class Diff(BaseModel):
    id: str                   # Unique diff identifier
    diff_text: str            # The raw code change as a diff string
    risk_hints: List[str]     # Risk patterns detected (e.g. "sql_concat", "ssrf")
    has_tests: bool           # Whether the diff includes tests
    touches_auth: bool        # Whether the diff touches auth/security logic
```
## Reward Function (8-Level Partial Credit)

The reward function scores both correctness and explanation quality, and uses severity classification as an additional signal.
### Easy / Medium tasks (`flag_bug` actions)

| Situation | Reward |
|---|---|
| Correct action + correct severity | 0.90 |
| Correct action, no severity specified | 0.83 |
| Ignored a real bug | 0.30 |
| Approved a buggy diff (worst outcome) | 0.10 |
### Hard tasks (`flag_bug` actions)

| Situation | Reward |
|---|---|
| Correct + comment + correct severity | 0.90 (full marks) |
| Correct + comment, wrong/no severity | 0.87 |
| Correct + correct severity, no comment | 0.85 |
| Correct, no comment, no severity | 0.80 (minimum partial) |
| Missed the bug (ignored) | 0.30 |
| Approved a critical vulnerability | 0.10 |
### Approve / Ignore tasks (all difficulties)

| Situation | Reward |
|---|---|
| Correct approve | 0.90 |
| Cautious ignore on safe code | 0.50 |
| False positive (flagged safe code) | 0.15 |
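The three tables above can be condensed into one function. This is a sketch of the scoring logic, not the repo's implementation: the returned values come straight from the tables, but the handling of edge cases the tables leave open (e.g. a wrong severity on easy/medium tasks, or approving an ignore-task diff) is an assumption.

```python
from typing import Optional

def reward(difficulty: str, correct_action: str, expected_severity: Optional[str],
           action: str, severity: Optional[str], comment: str) -> float:
    """Score one action against the reward tables (values taken from the README)."""
    if correct_action == "flag_bug":            # the diff really is buggy
        if action == "ignore":
            return 0.30                          # missed a real bug
        if action == "approve":
            return 0.10                          # approved a buggy diff (worst outcome)
        # action == "flag_bug"
        if difficulty == "hard":
            if comment and severity == expected_severity:
                return 0.90                      # full marks
            if comment:
                return 0.87                      # comment but wrong/no severity
            if severity == expected_severity:
                return 0.85                      # severity right, no comment
            return 0.80                          # minimum partial credit
        # Easy/medium: assumed that wrong severity scores like no severity.
        return 0.90 if severity == expected_severity else 0.83
    # Safe diff: correct action is approve (or ignore).
    if action == correct_action:
        return 0.90
    if action == "flag_bug":
        return 0.15                              # false positive: flagged safe code
    return 0.50                                  # cautious ignore on safe code
```

Note the 8 distinct levels (0.10, 0.15, 0.30, 0.50, 0.80, 0.83/0.85/0.87, 0.90) that give the function its name.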
## Task Descriptions
### Easy (5 tasks — always all 5)

| ID | Vulnerability | Correct Action | Severity |
|---|---|---|---|
| e1 | Hardcoded password in source code | `flag_bug` | critical |
| e2 | Division function with no zero-guard | `flag_bug` | medium |
| e3 | Unused imports only | `ignore` | — |
| e4 | SQL query built with f-string interpolation | `flag_bug` | critical |
| e5 | Safe, tested `greet()` utility function | `approve` | — |
### Medium (5 tasks — always all 5)

| ID | Vulnerability | Correct Action | Severity |
|---|---|---|---|
| m1 | User profile fetch with no auth check | `flag_bug` | critical |
| m2 | Password stored in plaintext | `flag_bug` | critical |
| m3 | Loop accessing `items[i+1]` — off-by-one | `flag_bug` | medium |
| m4 | Auth token written to application logs | `flag_bug` | critical |
| m5 | Shared counter across threads without a lock | `flag_bug` | medium |
### Hard (11-task pool — 5 randomly sampled per episode)

| ID | Vulnerability | Correct Action | Severity |
|---|---|---|---|
| h1 | JWT decoded with `verify_signature: False` | `flag_bug` | critical |
| h2 | `pickle.loads()` on untrusted user input (RCE) | `flag_bug` | critical |
| h3 | Regex with nested quantifiers — ReDoS | `flag_bug` | medium |
| h4 | HTTP request to user-controlled URL — SSRF | `flag_bug` | critical |
| h5 | Clean, tested order-processing function | `approve` | — |
| h6 | Secret compared with `==` instead of `hmac.compare_digest` | `flag_bug` | medium |
| h7 | User input used directly in `open()` path — path traversal |