22elix3r/ClipQualityEnv

title: CLIP Quality Analyzer
colorFrom: purple
colorTo: gray
sdk: docker
app_port: 7860
base_path: /dashboard/
tags:
  - openenv
  - reinforcement-learning
  - clip-quality
  - quality-analysis
  - lora-training
  - dataset-curation
  - talking-head

ClipQualityEnv

An OpenEnv-compliant reinforcement learning environment for curating high-quality talking-head video clips intended for Audio-Visual (AV) LoRA fine-tuning. The agent learns to classify clips as KEEP, BORDERLINE, or REJECT by evaluating per-clip metadata against a versioned quality rubric, ensuring only the cleanest, most training-appropriate clips make it into a LoRA dataset.

Architecture

flowchart TD
    CFG["⚙️ Task Config\ntask_id = task_easy | task_medium | task_hard"] --> ENV
    ENV["🎬 ClipQualityEnv\n25-step clip review episodes"] --> OBS
    OBS["📋 Observation\n• clip_metadata (14 features)\n• rubric_summary\n• ICL history"] --> AGT
    AGT["🤖 LLM Agent\nICL-RL Feedback Loop"] --> STEP
    STEP["⚡ Action → env.step()\nKEEP / BORDERLINE / REJECT + reasoning"] --> SCORE
    SCORE["🏁 Deterministic Grader\nformat + label + reasoning + calibration\n0.01 – 0.99"]

Baseline Performance

xychart-beta
    title "ClipQualityEnv Baseline Scores (Deterministic Fallback)"
    x-axis ["task_easy", "task_medium", "task_hard"]
    y-axis "Average Reward" 0.00 --> 1.00
    bar [0.78, 0.75, 0.60]

Reference Model

This benchmark is designed to evaluate agents working with talking-head clip datasets used for training models like:

elix3r/LTX-2.3-22b-AV-LoRA-talking-head

Motivation

While training the LTX 2.3 AV LoRA adapter for talking-head video generation, I kept running into the same frustrating problem: the training metrics looked fine, but the outputs were inconsistent in ways that were hard to pin down. After a lot of debugging, the training data turned out to be the culprit. Clips that looked fine at a glance were causing all sorts of issues. Some had faces that were slightly off-angle; others had audio that was clean enough to pass a quick listen but too noisy for the model to learn from properly. A few clips had subtle motion blur or poor lighting that I only noticed after the model started producing shaky outputs.

Manually reviewing hundreds of clips is slow, and your eyes get tired. You end up with inconsistent standards across a large dataset and no good way to audit the decisions you made earlier. The quality bar shifts depending on how fatigued you are.

ClipQualityEnv came out of that experience. The idea was to turn what I learned during that LoRA training run into a structured, programmable rubric and then teach an agent to apply it consistently. Instead of a human eyeballing clips, an LLM agent evaluates each clip's extracted metadata (face confidence, head pose, audio SNR, motion score, lighting uniformity, and more), then produces a graded KEEP / BORDERLINE / REJECT decision. The agent learns from a reward signal tied to a human-authored ground-truth store, improving its classification accuracy through in-context reinforcement learning without any weight updates.

Only KEEP-labelled clips are passed downstream to the LoRA training pipeline.

What it does

The environment presents an LLM agent with a 25-step episode. Each step shows one clip's metadata, a quality rubric, and the agent's prior prediction history for that clip. The agent classifies the clip and receives a structured reward signal broken down into four components:

  • Format score (max 0.10): validates that the label, reasoning, and confidence are all well-formed
  • Label score (max 0.68): checks label correctness against ground truth or rubric-derived labels, with deterministic per-clip noise
  • Reasoning score (max 0.30): checks that the reasoning mentions dominant features with directional language and contains no hallucinated feature names
  • Calibration adjustment (max +/- 0.05): rewards well-calibrated confidence (e.g., +0.03 bonus for correct + confident predictions)

Scores are normalized and clamped to the range [0.01, 0.99] across all tasks. Rather than capping each task's score at an artificial ceiling, the environment makes harder tasks harder through stricter rules: the hard task uses much stricter reasoning thresholds and gives minimal partial credit for borderline labels, which naturally lowers average rewards.
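The decomposition above can be sketched in a few lines. This is a minimal illustration, not the code in grader.py: the component caps and the [0.01, 0.99] range come from the text, while the `RewardBreakdown` class and the choice to simply sum and clamp (the actual normalization step is not described here) are assumptions.

```python
from dataclasses import dataclass

# Component caps taken from the list above.
FORMAT_MAX = 0.10
LABEL_MAX = 0.68
REASONING_MAX = 0.30
CALIBRATION_MAX = 0.05
SCORE_MIN, SCORE_MAX = 0.01, 0.99


def _clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class RewardBreakdown:
    """Hypothetical container for the four reward components."""

    format_score: float
    label_score: float
    reasoning_score: float
    calibration: float

    def total(self) -> float:
        # Assumption: cap each component, sum, then clamp the result.
        # The real grader may normalize differently.
        raw = (
            _clamp(self.format_score, 0.0, FORMAT_MAX)
            + _clamp(self.label_score, 0.0, LABEL_MAX)
            + _clamp(self.reasoning_score, 0.0, REASONING_MAX)
            + _clamp(self.calibration, -CALIBRATION_MAX, CALIBRATION_MAX)
        )
        return round(_clamp(raw, SCORE_MIN, SCORE_MAX), 4)
```

Note that the four caps sum to more than 1.0, so a perfect answer saturates at the 0.99 ceiling, and a fully wrong, overconfident answer bottoms out at 0.01.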

Key Features

Mixed-Difficulty Episodes

The task_mixed mode builds a single episode that transitions across difficulty levels: 10 easy clips, followed by 8 medium clips, followed by 7 hard clips. This progressive escalation within a single episode tests the agent's ability to adapt its strategy as signal quality degrades.
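As a rough sketch of how the 10/8/7 schedule expands into a per-step difficulty list (the kind of list the `difficulty_trend` observation field carries), assuming lowercase level names and a hypothetical helper `mixed_difficulty_trend`:

```python
from itertools import chain, repeat

# Segment sizes from the description above: 10 easy, 8 medium, 7 hard.
MIXED_SCHEDULE = (("easy", 10), ("medium", 8), ("hard", 7))


def mixed_difficulty_trend(schedule=MIXED_SCHEDULE):
    """Expand (level, count) pairs into one difficulty label per step."""
    return list(chain.from_iterable(repeat(level, n) for level, n in schedule))
```

The segments add up to the standard 25-step episode, with the transitions falling at step indices 10 and 18.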

Multi-Episode Curriculum

The environment tracks cross-episode performance and automatically promotes or demotes the agent through difficulty levels:

  • Promotion: if the agent's average reward exceeds 0.75 for 2 consecutive episodes at the current level, difficulty increases
  • Demotion: if the average reward drops below 0.35 for 2 consecutive episodes, difficulty decreases
  • Progression order: Easy → Medium → Hard → Mixed

Enable curriculum mode by passing curriculum=True to reset(). The environment will automatically select the appropriate task based on the agent's current level.
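The promotion and demotion rules can be expressed as a small state machine. The sketch below is illustrative only: `CurriculumTracker` is a hypothetical name, and resetting the streak after a level change is an assumption rather than documented behavior of env.py.

```python
LEVELS = ["easy", "medium", "hard", "mixed"]
PROMOTE_ABOVE = 0.75   # average reward must exceed this to count toward promotion
DEMOTE_BELOW = 0.35    # average reward must fall below this to count toward demotion
STREAK = 2             # consecutive episodes required


class CurriculumTracker:
    """Hypothetical tracker for the cross-episode curriculum rules."""

    def __init__(self):
        self.level_index = 0
        self.high_streak = 0
        self.low_streak = 0

    @property
    def level(self) -> str:
        return LEVELS[self.level_index]

    def record_episode(self, average_reward: float) -> str:
        """Update streaks with one episode's average reward; return the new level."""
        if average_reward > PROMOTE_ABOVE:
            self.high_streak += 1
            self.low_streak = 0
        elif average_reward < DEMOTE_BELOW:
            self.low_streak += 1
            self.high_streak = 0
        else:
            self.high_streak = self.low_streak = 0

        if self.high_streak >= STREAK and self.level_index < len(LEVELS) - 1:
            self.level_index += 1
            self.high_streak = 0  # assumption: streak restarts at the new level
        elif self.low_streak >= STREAK and self.level_index > 0:
            self.level_index -= 1
            self.low_streak = 0
        return self.level
```

An episode that lands between the two thresholds breaks both streaks, so the agent needs two back-to-back strong (or weak) episodes to move.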

Seeded Determinism

Passing seed=N to reset() guarantees an identical clip sequence every time. The corpus is sorted by clip_id before sampling, so insertion order cannot affect the result. The same seed with the same task always produces the exact same episode.
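A minimal sketch of the sort-then-sample idea follows. The function name `sample_episode` and the use of `random.Random(seed).sample` are assumptions; the point is only that sorting by `clip_id` first makes the result independent of corpus insertion order.

```python
import random


def sample_episode(corpus, seed, episode_length=25):
    """Return a reproducible clip sequence for a given seed.

    corpus: list of dicts that each carry a "clip_id" key.
    """
    # Sort first so that insertion order cannot influence the draw.
    ordered = sorted(corpus, key=lambda clip: clip["clip_id"])
    # A private Random instance avoids touching the global RNG state.
    rng = random.Random(seed)
    return rng.sample(ordered, k=min(episode_length, len(ordered)))
```

Using a dedicated `random.Random` instance rather than `random.seed()` keeps the episode reproducible even if other code draws from the global generator.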

Step-Level Progression Tracking

Each observation includes real-time progression metadata:

| Field | Description |
| --- | --- |
| difficulty_trend | List of difficulty labels for all steps in the episode |
| cumulative_accuracy | Fraction of correct predictions so far |
| curriculum_level | Current difficulty level in the curriculum |
| curriculum_history | Last 5 episodes of performance history |

Confidence Calibration

The fourth reward dimension evaluates whether the agent's confidence matches its actual accuracy. A correct prediction with high confidence earns a bonus. An incorrect prediction with high confidence incurs a penalty. This prevents the agent from defaulting to maximum or minimum confidence on every prediction.
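One way to picture the calibration term is sketched below. Only the +0.03 bonus and the ±0.05 cap come from this README; the 0.80 "high confidence" cutoff, the -0.05 penalty value, and the neutral treatment of low-confidence predictions are assumptions made for illustration.

```python
HIGH_CONFIDENCE = 0.80            # assumed cutoff for a "confident" prediction
CONFIDENT_CORRECT_BONUS = 0.03    # stated in the reward breakdown
CONFIDENT_WRONG_PENALTY = -0.05   # assumed; only the +/- 0.05 cap is documented


def calibration_adjustment(is_correct: bool, confidence: float) -> float:
    """Return a small bonus or penalty based on confidence and correctness."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= HIGH_CONFIDENCE:
        return CONFIDENT_CORRECT_BONUS if is_correct else CONFIDENT_WRONG_PENALTY
    # Assumption: low-confidence predictions are left unadjusted in this sketch.
    return 0.0
```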

Difficulty-Aware GT Promotion

Ground-truth promotion thresholds are scaled by difficulty:

| Difficulty | Reward Threshold | Confidence Threshold |
| --- | --- | --- |
| Easy | 0.85 | 0.80 |
| Medium | 0.75 | 0.80 |
| Hard | 0.65 | 0.75 |

This ensures hard tasks (which tend to earn lower rewards) can still contribute to GT expansion.
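The promotion check can be sketched as a lookup against the threshold table. The function name `can_promote`, the inclusive comparisons, and the argument layout are assumptions; the thresholds and the rule that a promoted label must match any existing expected label come from this README.

```python
# (reward threshold, confidence threshold) per difficulty, from the table above.
PROMOTION_THRESHOLDS = {
    "easy": (0.85, 0.80),
    "medium": (0.75, 0.80),
    "hard": (0.65, 0.75),
}


def can_promote(difficulty, reward, confidence, predicted_label, expected_label=None):
    """Decide whether an agent prediction may enter the ground-truth store."""
    min_reward, min_confidence = PROMOTION_THRESHOLDS[difficulty.lower()]
    # Assumption: thresholds are inclusive.
    if reward < min_reward or confidence < min_confidence:
        return False
    # A prediction must agree with an expected label when one already exists.
    return expected_label is None or predicted_label == expected_label
```

With these numbers, a reward of 0.70 at confidence 0.78 qualifies on a hard clip but not on an easy one.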

Project Structure

clip_quality_env/
  env.py             # OpenEnv environment, episode management, GT promotion, curriculum
  grader.py          # Deterministic reward decomposition (format + label + reasoning + calibration)
  rubric.py          # Versioned threshold definitions and feature status logic
  ground_truth.py    # Append-only GT store with difficulty-aware promotion
  icl_memory.py      # Per-session ICL memory, context injection, hint feedback
  agent.py           # Lightweight LLM agent (XML tag parser, used in standalone mode)
  difficulty.py      # Difficulty normalization utilities
  models.py          # Pydantic models: Action, Observation, State, Reward, ClipMetadata
  real_clips.py      # Real clip manifest loader

server/
  app.py             # FastAPI application, Gradio dashboard, baseline runner
  grader.py          # /grader endpoint handler
  tasks/             # Task registry: task_easy, task_medium, task_hard

inference.py         # ClipQualityAgent with ICL-RL loop, CLI baseline runner
scripts/
  extract_mp4_metadata.py  # Extracts clip features from MP4 files into manifest
data/
  real_clips_manifest.jsonl  # Per-clip metadata extracted from real video files
  seed_gt.json               # Seed ground truth labels

Clip Metadata Features

Each clip observation exposes the following fields:

| Feature | Description |
| --- | --- |
| face_area_ratio | Fraction of frame occupied by the detected face |
| face_confidence | Face detection confidence score |
| head_pose_yaw_deg | Head yaw angle in degrees |
| motion_score | Frame-to-frame motion intensity |
| bg_complexity_score | Background visual complexity |
| audio_snr_db | Audio signal-to-noise ratio |
| duration_s | Clip length in seconds |
| mouth_open_ratio | Ratio of mouth openness |
| lighting_uniformity | Consistency of lighting across the frame |
| sharpness_score | Frame sharpness estimate |
| temporal_flicker | Frame-to-frame brightness flicker |
| bg_entropy | Background entropy |
| eye_contact_ratio | Fraction of frames with estimated eye contact |
| speech_rate_wpm | Estimated words per minute |

Rubric and Grading

The rubric defines per-feature thresholds with three modes: higher (feature should be above threshold to KEEP), lower (feature should be below threshold to KEEP), and band (feature should fall within a range to KEEP). The rubric is versioned and can tighten automatically over time as the agent's accuracy on easy and medium tasks improves.
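The three threshold modes could be modeled roughly as follows. This is a sketch, not the contents of rubric.py: `FeatureRule` and `feature_status` are hypothetical names, the 0.80 and 0.25 values are borrowed from the example reasoning string shown for the /grader endpoint, and the `duration_s` band is purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureRule:
    """One rubric entry: a mode plus the numbers that mode needs."""

    mode: str                          # "higher", "lower", or "band"
    threshold: Optional[float] = None  # used by "higher" and "lower"
    low: Optional[float] = None        # used by "band"
    high: Optional[float] = None       # used by "band"

    def passes(self, value: float) -> bool:
        """Return True when the value is on the KEEP side of the rule."""
        if self.mode == "higher":
            return value >= self.threshold
        if self.mode == "lower":
            return value <= self.threshold
        if self.mode == "band":
            return self.low <= value <= self.high
        raise ValueError(f"unknown rubric mode: {self.mode}")


RUBRIC = {
    "face_confidence": FeatureRule("higher", threshold=0.80),
    "motion_score": FeatureRule("lower", threshold=0.25),
    "duration_s": FeatureRule("band", low=3.0, high=10.0),  # illustrative values
}


def feature_status(clip: dict) -> dict:
    """Map each rubric feature present in the clip to a pass/fail flag."""
    return {
        name: rule.passes(clip[name])
        for name, rule in RUBRIC.items()
        if name in clip
    }
```

A versioned rubric would then amount to swapping in a new `RUBRIC` mapping with tighter numbers while keeping the same evaluation logic.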

Labels are derived from the rubric when no explicit ground truth exists. An agent-predicted label can be promoted into the ground truth store if it meets difficulty-specific reward and confidence thresholds and matches any existing expected label.

In-Context Learning

The ICLMemory class tracks per-clip prediction history within a session. After each step, the agent's label, raw label score, and reasoning are recorded. On subsequent attempts at the same clip:

  • If a prior attempt scored well, the agent is nudged to keep that label and sharpen the reasoning
  • If a prior attempt scored modestly, the agent is prompted to consider a different label
  • If a prior attempt scored poorly, the agent is told to try something completely different

The memory never reveals the expected label. All feedback is based on the reward signal alone, and reward noise is added to prevent the agent from treating any single score as a definitive answer.
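The three feedback tiers can be sketched with a small per-clip history. `HintMemory` is a hypothetical stand-in for ICLMemory, and the score cutoffs (0.5 and 0.2 on the raw label score) are assumptions chosen only to make the tiers concrete. The hint text refers to the agent's own previous label, never to the expected one.

```python
from collections import defaultdict

GOOD_SCORE = 0.5   # assumed cutoff for "scored well"
POOR_SCORE = 0.2   # assumed cutoff for "scored poorly"


class HintMemory:
    """Hypothetical per-session memory keyed by clip ID."""

    def __init__(self):
        self._history = defaultdict(list)

    def record(self, clip_id, label, label_score, reasoning):
        """Store one attempt after a step has been graded."""
        self._history[clip_id].append(
            {"label": label, "label_score": label_score, "reasoning": reasoning}
        )

    def hint(self, clip_id) -> str:
        """Build a soft directive from the most recent attempt, if any."""
        attempts = self._history.get(clip_id)
        if not attempts:
            return ""
        last = attempts[-1]
        if last["label_score"] >= GOOD_SCORE:
            return f"Your previous label {last['label']} scored well; keep it and sharpen the reasoning."
        if last["label_score"] <= POOR_SCORE:
            return f"Your previous label {last['label']} scored poorly; try something completely different."
        return f"Your previous label {last['label']} scored modestly; consider a different label."
```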

Tasks

| Task ID | Difficulty | Description |
| --- | --- | --- |
| task_easy | Easy | Clear, unambiguous quality signals across most features |
| task_medium | Medium | Mixed indicators requiring trade-off reasoning |
| task_hard | Hard | Conflicting signals with no dominant clear indicator |
| task_mixed | Mixed | Progressive difficulty: 10 easy, 8 medium, 7 hard |

Each task corpus contains 25 clips with balanced label distributions across KEEP, BORDERLINE, and REJECT.

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | / | Status and environment metadata |
| GET | /health | Health check |
| GET | /state | Current environment state |
| GET | /tasks | List all tasks with action schema |
| POST | /grader | Score a single action against a task |
| POST | /baseline/start | Start an async baseline run |
| GET | /baseline/status/{run_id} | Poll a running baseline |
| GET | /baseline | Alias for baseline start (GET-compatible) |
| POST | /reset | Reset the environment to a new episode |
| POST | /step | Submit one action and advance the episode |
| GET | /metadata | OpenEnv environment metadata (includes curriculum config) |
| GET | /schema | OpenEnv action/observation schema |
| GET | /dashboard/ | Gradio interactive dashboard |

Quick Start (curl)

# Health Check
curl http://localhost:7860/health

# List Tasks (includes grader paths)
curl http://localhost:7860/tasks

# Reset Environment
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id": "task_easy"}'

# Submit Step
curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d '{"label": "KEEP", "reasoning": "face_confidence is 0.95", "confidence": 0.9, "clip_id": "clip_0001"}'

The /grader endpoint accepts the same action schema as /step:

{
  "label": "KEEP",
  "reasoning": "face_confidence is 0.91, above the KEEP threshold (0.80). motion_score is 0.12, stable below the KEEP ceiling (0.25).",
  "confidence": 0.85,
  "clip_id": "clip_0001"
}
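For scripted use, the same calls can be made from Python with only the standard library. This is a sketch under assumptions: the helper names (`build_post`, `post_json`, `run_one_step`) are hypothetical, the server is assumed to be running locally on port 7860, and the response is assumed to be JSON (its exact fields are not shown here).

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumes a locally running server

EXAMPLE_ACTION = {
    "label": "KEEP",
    "reasoning": "face_confidence is 0.91, above the KEEP threshold (0.80).",
    "confidence": 0.85,
    "clip_id": "clip_0001",
}


def build_post(path, payload, base_url=BASE_URL):
    """Create a JSON POST request without sending it."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, base_url=BASE_URL, timeout=30):
    """Send the request and decode the JSON response body."""
    request = build_post(path, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_one_step(base_url=BASE_URL):
    """Reset to task_easy, then submit one action. Requires a running server."""
    post_json("/reset", {"task_id": "task_easy"}, base_url)
    return post_json("/step", EXAMPLE_ACTION, base_url)
```

Calling `run_one_step()` mirrors the reset and step curl commands in the Quick Start.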

Dashboard

The Gradio dashboard at /dashboard/ provides a full interactive session:

  • Difficulty-tiered input tabs (Easy, Medium, Hard) with structured reasoning fields
  • Task selector dropdown including the mixed-difficulty mode
  • Live clip corpus queue sorted by clip ID with predicted and expected labels
  • Dominant feature table showing closest-boundary features and their rubric status
  • Reward breakdown cards for format, label, reasoning, and calibration scores
  • Session history table with difficulty column, submitted vs expected labels, and per-step rewards
  • Match results with difficulty badges ([E], [M], [H]) and progression summary
  • Curriculum progress panel with level display and episode history table
  • ICL learning progress panel tracking reward trends across runs per clip
  • "Load Quality Hint" button that generates a pre-filled hint from rubric thresholds
  • "Run LLM Baseline Agent" button that runs the full ICL-RL agent in a background thread

Local Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run the server:

PYTHONPATH=. python -m uvicorn server.app:app --host 0.0.0.0 --port 7860

Run tests:

PYTHONPATH=. python -m pytest tests/ -q

Baseline Inference

The baseline agent runs all three tasks in sequence and prints step-level rewards. It uses the LLM if a token is available, otherwise falls back to a deterministic heuristic with grader-aligned reasoning.

Set environment variables:

export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="llama-3.3-70b-versatile"
export HF_TOKEN="your_token_here"

Run all tasks:

PYTHONPATH=. python inference.py

Run the deterministic baseline used by validation:

PYTHONPATH=. python inference.py --tasks easy --episodes 1 --seed 42 --deterministic-baseline --max-steps 5

Run a single task:

PYTHONPATH=. python inference.py task_easy

Run the mixed-difficulty task:

PYTHONPATH=. python inference.py task_mixed

Structured stdout format:

[START] task=task_easy episode=1 seed=42 mode=deterministic max_steps=5
[STEP] task=task_easy episode=1 step=1 action=KEEP patient_id=clip_0001 reward=0.8000 done=false status=ok
...
[END] task=task_easy episode=1 seed=42 score=0.7800 steps=5 done=true
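Because each log line is a tag followed by space-separated key=value pairs, the output is easy to post-process. The parser below is a small sketch (the name `parse_log_line` is hypothetical); it keeps every value as a string and leaves type conversion to the caller.

```python
import re

# Matches "[START] ...", "[STEP] ...", or "[END] ..." and captures the rest.
LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line: str):
    """Parse one structured stdout line into a dict, or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    # Split on the first "=" only, and skip any token that has no "=".
    fields = dict(pair.split("=", 1) for pair in rest.split() if "=" in pair)
    return {"tag": tag, **fields}
```

For example, averaging `float(row["reward"])` over all parsed `[STEP]` rows gives the per-episode mean reward.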

Extracting Real Clip Metadata

To build a real clip manifest from MP4 files:

pip install -r requirements_extractor.txt
PYTHONPATH=. python scripts/extract_mp4_metadata.py path/to/clips/ --output data/real_clips_manifest.jsonl

The manifest is loaded at startup if present at data/real_clips_manifest.jsonl. If missing, the environment falls back to the static task corpora.
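The load-or-fall-back behavior might look roughly like this. It is a sketch rather than the code in real_clips.py: `load_manifest` is a hypothetical name, and returning `None` to signal "use the static task corpora" is an assumption.

```python
import json
from pathlib import Path


def load_manifest(path="data/real_clips_manifest.jsonl"):
    """Read one JSON object per line; return None if the file is absent."""
    manifest = Path(path)
    if not manifest.is_file():
        return None  # caller falls back to the static task corpora
    clips = []
    with manifest.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                clips.append(json.loads(line))
    return clips
```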

Docker

docker build -t clip-quality-env .
docker run -p 7860:7860 -e HF_TOKEN=your_token clip-quality-env

Deployment

The project is deployed as a Hugging Face Space using the Docker SDK. The openenv.yaml file and the Hugging Face Space frontmatter at the top of this README configure the deployment.

openenv push --repo-id elix3r/clip-quality-env

Or manually:

git remote add hf-space https://huggingface.co/spaces/your-username/ClipQualityEnv
git push hf-space main

Dependencies

  • openenv-core >= 0.2.3: OpenEnv environment base classes and FastAPI server factory
  • fastapi >= 0.104.0 + uvicorn >= 0.24.0: HTTP server
  • gradio >= 5.0.0, < 6.0.0: Interactive dashboard
  • openai >= 1.0.0: OpenAI-compatible client (used with HuggingFace inference router)
  • pydantic >= 2.0.0: Data validation and serialization
  • opencv-python-headless >= 4.10.0: Video processing for metadata extraction
  • numpy >= 1.26.0: Numerical operations in extraction pipeline
  • pandas >= 2.0.0: DataFrame rendering in the dashboard

Contributing

See CONTRIBUTING.md for task creation guidelines, grading invariants, and development workflow.

License

MIT


Academic References

ClipQualityEnv draws from several foundational research areas. The connections below tie each paper directly to specific components of the implementation.

Curriculum Learning

| Paper | Year | Relevance |
| --- | --- | --- |
| Bengio et al. "Curriculum Learning" | 2009 | Foundation for the Easy to Medium to Hard task progression and the multi-episode curriculum auto-promotion system. Key insight: ordering training samples by difficulty accelerates learning and improves convergence. |
| Graves et al. "Automated Curriculum Learning for Neural Networks" | 2017 | Adaptive curriculum where difficulty self-adjusts based on learner performance. Directly matches both the recalibrate() logic in rubric.py and the cross-episode curriculum promotion/demotion in env.py. |
| Kumar et al. "Self-Paced Learning for Latent Variable Models" | 2010 | The learner chooses its own curriculum pace. The confidence-weighted GT promotion in try_promote() is a form of self-pacing, where the agent only promotes predictions it is confident in. |

Application in this environment: The 4-mode difficulty system (easy, medium, hard, mixed) implements curriculum learning at the task level. The multi-episode curriculum auto-promotes or demotes the agent based on rolling performance. Rubric calibration (recalibrate()) implements it across episodes, so the environment automatically gets harder as the agent succeeds on simpler clips.


Active Learning & Self-Training

| Paper | Year | Relevance |
| --- | --- | --- |
| Culotta & McCallum "Confidence-Weighted Active Learning" | 2005 | Selectively promote high-confidence predictions to the training set. Direct precedent for GTStore.try_promote(), which uses difficulty-aware thresholds before accepting a new ground-truth label. |
| Zhu et al. "Semi-Supervised Learning with Graphs" | 2003 | Self-training expands the labeled set iteratively with the model's own confident predictions. The GT expansion flywheel (more promoted clips, richer GT store, better grading signal) follows this pattern. |
| Settles "Active Learning Literature Survey" | 2010 | Comprehensive overview of query strategies including uncertainty sampling. ClipQualityEnv inverts uncertainty sampling: rather than querying uncertain examples for human labeling, it promotes certain agent predictions into the GT store. |

Application in this environment: GT expansion via try_promote() is active learning in reverse. The agent autonomously extends the ground-truth store by promoting high-confidence, high-reward predictions, progressively replacing rubric-derived labels with agent-confirmed ones. Difficulty-aware thresholds ensure hard tasks can still contribute to GT growth.


Preference Optimization

| Paper | Year | Relevance |
| --- | --- | --- |
| Rafailov et al. "Direct Preference Optimization (DPO)" | 2023 | Preference-based training without explicit reward models. Partial label credit on BORDERLINE cases mirrors the preference pair structure, where a KEEP prediction on a BORDERLINE clip is treated as a useful signal rather than a hard failure. |
| Christiano et al. "Deep RL from Human Preferences" | 2017 | RLHF foundation. ClipQualityEnv replaces human preference comparisons with a fully verifiable reward function, retaining the reward decomposition insight while eliminating human-in-the-loop overhead. |

Application in this environment: Partial label credit (0.25 for KEEP/REJECT when ground truth is BORDERLINE, scaled by difficulty) treats directionally-correct but imprecise decisions as informative signal rather than noise, analogous to weak preferences in RLHF training.
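A rough sketch of partial label credit follows. The 0.68 label cap and the 0.25 partial credit are taken from this README; the per-difficulty scale factors, the zero score for all other mismatches, and the function name `label_score` are assumptions, and the deterministic per-clip noise is omitted.

```python
LABELS = ("KEEP", "BORDERLINE", "REJECT")
LABEL_MAX = 0.68       # full credit for an exact match
PARTIAL_CREDIT = 0.25  # KEEP/REJECT predicted when ground truth is BORDERLINE
# Hypothetical scale factors; the README only says credit is "scaled by difficulty".
PARTIAL_SCALE = {"easy": 1.0, "medium": 0.6, "hard": 0.2}


def label_score(predicted: str, expected: str, difficulty: str) -> float:
    """Score a predicted label against the expected one (noise omitted)."""
    if predicted not in LABELS or expected not in LABELS:
        raise ValueError("labels must be KEEP, BORDERLINE, or REJECT")
    if predicted == expected:
        return LABEL_MAX
    if expected == "BORDERLINE":
        return round(PARTIAL_CREDIT * PARTIAL_SCALE[difficulty], 4)
    # Assumption: every other mismatch earns no label credit.
    return 0.0
```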


Verifiable Rewards

| Paper | Year | Relevance |
| --- | --- | --- |
| Sutton & Barto "Reinforcement Learning: An Introduction" | 2018 | Core RL principles. The grade() function in grader.py is a deterministic reward function decomposed into format, label, reasoning, and calibration components. |
| Ng & Russell "Algorithms for Inverse RL" | 2000 | Reward shaping foundations. The rubric calibration cycle, where thresholds tighten based on episode performance, is a form of dynamic reward shaping that keeps the task challenging as the agent improves. |

Application in this environment: The grader is fully deterministic and rubric-derived, with no LLM judge involved. The confidence calibration dimension adds a fourth reward signal that prevents degenerate confidence strategies. This guarantees reproducibility, enables automated validation, and satisfies the OpenEnv spec requirement for programmatic graders that return valid 0.0 to 1.0 scores.


Self-Play & Co-Evolution

| Paper | Year | Relevance |
| --- | --- | --- |
| Bansal et al. "Emergent Complexity via Multi-Agent Competition" | 2018 | Agents and environments co-evolve, generating emergent difficulty without manual curriculum design. The rubric and GT co-evolution in ClipQualityEnv is a single-agent analogue of this pattern. |
| Leibo et al. "Multi-Agent RL in Sequential Social Dilemmas" | 2017 | Environment complexity scales with agent capability. Matches the calibration logic: as the agent succeeds on BORDERLINE clips, the rubric tightens, creating new BORDERLINE cases. |

Application in this environment: The learning flywheel works as follows: GT expands as the agent promotes confident predictions, then the rubric tightens based on accuracy, then harder BORDERLINE cases emerge. This is co-evolution in a single-agent setting. The environment adapts to the agent's current capability level without external intervention.


In-Context Learning

| Paper | Year | Relevance |
| --- | --- | --- |
| Brown et al. "Language Models are Few-Shot Learners" | 2020 | In-context learning (ICL) enables LLMs to improve on a task purely from examples in the context window, without weight updates. The per-episode ICL loop uses this for within-episode improvement. |
| Xie et al. "An Explanation of In-Context Learning as Implicit Bayesian Inference" | 2022 | Theoretical grounding for why ICL works. The model implicitly updates a prior over task hypotheses from context examples, which validates the step-by-step reward feedback injection in ICLMemory.get_context_text(). |

Application in this environment: The ICLMemory class carries reward feedback from prior attempts at each clip into subsequent steps within the same session. The agent's context window includes label_score signals and soft directives across attempts, enabling learning without gradient updates over a 25-step episode and across multiple episode runs within a session.

About

An RL environment where an LLM agent learns to curate talking-head video clips for AV LoRA training. No labels exposed, rewards only.
