ClipQualityEnv

title: CLIP Quality Analyzer
colorFrom: purple
colorTo: gray
sdk: docker
app_port: 7860
base_path: /dashboard/

ClipQualityEnv

An OpenEnv-compliant reinforcement learning environment for curating high-quality talking-head video clips intended for Audio-Visual (AV) LoRA fine-tuning. The agent learns to classify clips as KEEP, BORDERLINE, or REJECT by evaluating per-clip metadata against a versioned quality rubric, ensuring only the cleanest, most training-appropriate clips make it into a LoRA dataset.

Architecture

flowchart TD
    CFG["⚙️ Task Config\ntask_id = task_easy | task_medium | task_hard"] --> ENV
    ENV["🎬 ClipQualityEnv\n25-step clip review episodes"] --> OBS
    OBS["📋 Observation\n• clip_metadata (14 features)\n• rubric_summary\n• ICL history"] --> AGT
    AGT["🤖 LLM Agent\nICL-RL Feedback Loop"] --> STEP
    STEP["⚡ Action → env.step()\nKEEP / BORDERLINE / REJECT + reasoning"] --> SCORE
    SCORE["🏁 Deterministic Grader\nformat + label + reasoning + calibration\n0.01 – 0.99"]

Baseline Performance

xychart-beta
    title "ClipQualityEnv Baseline Scores (Deterministic Fallback)"
    x-axis ["task_easy", "task_medium", "task_hard"]
    y-axis "Average Reward" 0.00 --> 1.00
    bar [0.78, 0.75, 0.60]

Reference Model

This benchmark is designed to evaluate agents working with talking-head clip datasets used for training models like:

elix3r/LTX-2.3-22b-AV-LoRA-talking-head

Motivation

While training the LTX 2.3 AV LoRA adapter for talking-head video generation, I kept running into the same frustrating problem: training metrics looked fine, but the outputs were inconsistent in ways that were hard to pin down. After a lot of debugging, the training data turned out to be the culprit. Clips that looked fine at a glance were causing all sorts of issues. Some had faces that were slightly off-angle; others had audio that was clean enough to pass a quick listen but too noisy for the model to learn from properly. A few clips had subtle motion blur or poor lighting that I only noticed after the model started producing shaky outputs.

Manually reviewing hundreds of clips is slow, and fatigue sets in quickly. You end up with inconsistent standards across a large dataset and no good way to audit earlier decisions, because the quality bar shifts with how tired you are.

ClipQualityEnv came out of that experience. The idea was to turn what I learned during that LoRA training run into a structured, programmable rubric and then teach an agent to apply it consistently. Instead of a human eyeballing clips, an LLM agent evaluates each clip's extracted metadata (face confidence, head pose, audio SNR, motion score, lighting uniformity, and more), then produces a graded KEEP / BORDERLINE / REJECT decision. The agent learns from a reward signal tied to a human-authored ground-truth store, improving its classification accuracy through in-context reinforcement learning without any weight updates.

Only KEEP-labelled clips are passed downstream to the LoRA training pipeline.

What it does

The environment presents an LLM agent with a 25-step episode. Each step shows one clip's metadata, a quality rubric, and the agent's prior prediction history for that clip. The agent classifies the clip and receives a structured reward signal broken down into four components:

Format score (max 0.10): validates that the label, reasoning, and confidence are all well-formed
Label score (max 0.68): checks label correctness against ground truth or rubric-derived labels, with deterministic per-clip noise
Reasoning score (max 0.30): checks that the reasoning mentions dominant features with directional language and contains no hallucinated feature names
Calibration adjustment (max +/- 0.05): rewards well-calibrated confidence (e.g., +0.03 bonus for correct + confident predictions)

Scores are normalized and clamped to the range [0.01, 0.99] across all tasks. Instead of artificial band ceilings, the environment enforces strict rule-based difficulty: the hard task uses much stricter reasoning thresholds and provides minimal partial credit for borderline labels, naturally leading to lower average rewards.
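As a rough illustration of how the four components above could be combined, the sketch below caps each component at its documented maximum, sums them, and clamps the total to [0.01, 0.99]. The function name `combine_reward` is hypothetical, and the sketch does not attempt to reproduce whatever normalization `grader.py` applies before clamping; treat it as one plausible reading, not the actual grader.

```python
# Component maxima as documented above.
FORMAT_MAX = 0.10
LABEL_MAX = 0.68
REASONING_MAX = 0.30
CALIBRATION_MAX = 0.05  # applies in both directions (+/-)
SCORE_MIN, SCORE_MAX = 0.01, 0.99


def _cap(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def combine_reward(format_score: float, label_score: float,
                   reasoning_score: float, calibration: float) -> float:
    """Hypothetical combination: cap each part, sum, then clamp the total."""
    total = (
        _cap(format_score, 0.0, FORMAT_MAX)
        + _cap(label_score, 0.0, LABEL_MAX)
        + _cap(reasoning_score, 0.0, REASONING_MAX)
        + _cap(calibration, -CALIBRATION_MAX, CALIBRATION_MAX)
    )
    return _cap(total, SCORE_MIN, SCORE_MAX)
```

Because the component maxima sum to more than 1.0, the final clamp is what keeps a perfect answer at 0.99 rather than above it in this sketch.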

Key Features

Mixed-Difficulty Episodes

The task_mixed mode builds a single episode that transitions across difficulty levels: 10 easy clips, followed by 8 medium clips, followed by 7 hard clips. This progressive escalation within a single episode tests the agent's ability to adapt its strategy as signal quality degrades.
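The 10/8/7 split can be pictured as a per-step list of difficulty labels, similar in spirit to the `difficulty_trend` field exposed in observations. The helper below is a minimal sketch; the names `MIXED_PLAN` and `build_difficulty_trend` are illustrative and not taken from the codebase.

```python
# Documented layout of a task_mixed episode: 10 easy, 8 medium, 7 hard.
MIXED_PLAN = (("easy", 10), ("medium", 8), ("hard", 7))


def build_difficulty_trend(plan=MIXED_PLAN):
    """Expand (difficulty, count) pairs into one label per episode step."""
    trend = []
    for difficulty, count in plan:
        trend.extend([difficulty] * count)
    return trend


trend = build_difficulty_trend()
print(len(trend), trend[0], trend[10], trend[18])
```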

Multi-Episode Curriculum

The environment tracks cross-episode performance and automatically promotes or demotes the agent through difficulty levels:

Promotion: if the agent's average reward exceeds 0.75 for 2 consecutive episodes at the current level, difficulty increases
Demotion: if the average reward drops below 0.35 for 2 consecutive episodes, difficulty decreases
Progression order: Easy, then Medium, then Hard, then Mixed

Enable curriculum mode by passing curriculum=True to reset(). The environment will automatically select the appropriate task based on the agent's current level.
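The promotion and demotion rules can be sketched as a small state machine. The thresholds and level order below come from the rules above; the class name `CurriculumTracker`, the streak-reset behavior after a level change, and the handling of rewards between the two thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass

LEVELS = ["easy", "medium", "hard", "mixed"]  # documented progression order
PROMOTE_ABOVE = 0.75
DEMOTE_BELOW = 0.35
STREAK_REQUIRED = 2  # consecutive episodes


@dataclass
class CurriculumTracker:
    """Hypothetical tracker for cross-episode difficulty changes."""
    level_index: int = 0
    promote_streak: int = 0
    demote_streak: int = 0

    @property
    def level(self) -> str:
        return LEVELS[self.level_index]

    def record_episode(self, avg_reward: float) -> str:
        """Update streaks with one episode's average reward; return the level."""
        if avg_reward > PROMOTE_ABOVE:
            self.promote_streak += 1
            self.demote_streak = 0
        elif avg_reward < DEMOTE_BELOW:
            self.demote_streak += 1
            self.promote_streak = 0
        else:
            # Assumption: a middling episode breaks both streaks.
            self.promote_streak = 0
            self.demote_streak = 0

        if self.promote_streak >= STREAK_REQUIRED:
            self.level_index = min(self.level_index + 1, len(LEVELS) - 1)
            self.promote_streak = 0
        elif self.demote_streak >= STREAK_REQUIRED:
            self.level_index = max(self.level_index - 1, 0)
            self.demote_streak = 0
        return self.level
```

Two strong episodes in a row move the tracker from easy to medium; two weak ones move it back down, and the level never leaves the documented range.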

Seeded Determinism

Passing seed=N to reset() guarantees an identical clip sequence every time. The corpus is sorted by clip_id before sampling, so insertion order cannot affect the result. The same seed with the same task always produces the exact same episode.
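A minimal sketch of the sort-then-sample idea follows. The document states that the corpus is sorted by `clip_id` before sampling; the use of `random.Random` and the function name `sample_episode` are assumptions about how that could be done, not a description of `env.py`.

```python
import random


def sample_episode(corpus, seed, n_steps=25):
    """Return a reproducible clip sequence for a given seed.

    Sorting by clip_id first removes any dependence on the order in which
    clips were loaded; a private Random instance avoids touching global state.
    """
    ordered = sorted(corpus, key=lambda clip: clip["clip_id"])
    rng = random.Random(seed)
    return rng.sample(ordered, k=min(n_steps, len(ordered)))


corpus = [{"clip_id": f"clip_{i:04d}"} for i in range(40)]
shuffled = list(reversed(corpus))
# Same seed, different insertion order: the episodes match.
print(sample_episode(corpus, 42) == sample_episode(shuffled, 42))
```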

Step-Level Progression Tracking

Each observation includes real-time progression metadata:

Field	Description
`difficulty_trend`	List of difficulty labels for all steps in the episode
`cumulative_accuracy`	Fraction of correct predictions so far
`curriculum_level`	Current difficulty level in the curriculum
`curriculum_history`	Last 5 episodes of performance history

Confidence Calibration

The fourth reward dimension evaluates whether the agent's confidence matches its actual accuracy. A correct prediction with high confidence earns a bonus. An incorrect prediction with high confidence incurs a penalty. This prevents the agent from defaulting to maximum or minimum confidence on every prediction.

Difficulty-Aware GT Promotion

Ground-truth promotion thresholds are scaled by difficulty:

Difficulty	Reward Threshold	Confidence Threshold
Easy	0.85	0.80
Medium	0.75	0.80
Hard	0.65	0.75

This ensures hard tasks (which yield lower average rewards because of their stricter grading) can still contribute to GT expansion.
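The table above translates directly into a lookup. The sketch below checks only the two documented thresholds; the function name `meets_promotion_thresholds` is hypothetical, and treating the thresholds as inclusive is an assumption.

```python
# (reward threshold, confidence threshold) per difficulty, from the table above.
PROMOTION_THRESHOLDS = {
    "easy": (0.85, 0.80),
    "medium": (0.75, 0.80),
    "hard": (0.65, 0.75),
}


def meets_promotion_thresholds(difficulty: str, reward: float,
                               confidence: float) -> bool:
    """Return True when a prediction clears both thresholds for its difficulty.

    Assumption: comparisons are inclusive (>=).
    """
    reward_min, confidence_min = PROMOTION_THRESHOLDS[difficulty.lower()]
    return reward >= reward_min and confidence >= confidence_min


# The same prediction qualifies on a hard clip but not on an easy one.
print(meets_promotion_thresholds("hard", 0.70, 0.78))
print(meets_promotion_thresholds("easy", 0.70, 0.78))
```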

Architecture

clip_quality_env/
  env.py             # OpenEnv environment, episode management, GT promotion, curriculum
  grader.py          # Deterministic reward decomposition (format + label + reasoning + calibration)
  rubric.py          # Versioned threshold definitions and feature status logic
  ground_truth.py    # Append-only GT store with difficulty-aware promotion
  icl_memory.py      # Per-session ICL memory, context injection, hint feedback
  agent.py           # Lightweight LLM agent (XML tag parser, used in standalone mode)
  difficulty.py      # Difficulty normalization utilities
  models.py          # Pydantic models: Action, Observation, State, Reward, ClipMetadata
  real_clips.py      # Real clip manifest loader

server/
  app.py             # FastAPI application, Gradio dashboard, baseline runner
  grader.py          # /grader endpoint handler
  tasks/             # Task registry: task_easy, task_medium, task_hard

inference.py         # ClipQualityAgent with ICL-RL loop, CLI baseline runner
scripts/
  extract_mp4_metadata.py  # Extracts clip features from MP4 files into manifest
data/
  real_clips_manifest.jsonl  # Per-clip metadata extracted from real video files
  seed_gt.json               # Seed ground truth labels

Clip Metadata Features

Each clip observation exposes the following fields:

Feature	Description
`face_area_ratio`	Fraction of frame occupied by the detected face
`face_confidence`	Face detection confidence score
`head_pose_yaw_deg`	Head yaw angle in degrees
`motion_score`	Frame-to-frame motion intensity
`bg_complexity_score`	Background visual complexity
`audio_snr_db`	Audio signal-to-noise ratio
`duration_s`	Clip length in seconds
`mouth_open_ratio`	Ratio of mouth openness
`lighting_uniformity`	Consistency of lighting across the frame
`sharpness_score`	Frame sharpness estimate
`temporal_flicker`	Frame-to-frame brightness flicker
`bg_entropy`	Background entropy
`eye_contact_ratio`	Fraction of frames with estimated eye contact
`speech_rate_wpm`	Estimated words per minute

Rubric and Grading

The rubric defines per-feature thresholds with three modes: higher (feature should be above threshold to KEEP), lower (feature should be below threshold to KEEP), and band (feature should fall within a range to KEEP). The rubric is versioned and can tighten automatically over time as the agent's accuracy on easy and medium tasks improves.
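The three threshold modes can be sketched as a single check. The 0.80 and 0.25 values mirror the example reasoning string in the Quick Start section; the `duration_s` band, the function name `feature_status`, and the inclusive comparisons are assumptions for illustration only.

```python
def feature_status(value, mode, threshold=None, band=None):
    """Return True when a feature value is on the KEEP side of its rubric rule.

    mode == "higher": value should be at or above the threshold.
    mode == "lower":  value should be at or below the threshold.
    mode == "band":   value should fall inside the (low, high) range.
    """
    if mode == "higher":
        return value >= threshold
    if mode == "lower":
        return value <= threshold
    if mode == "band":
        low, high = band
        return low <= value <= high
    raise ValueError(f"unknown rubric mode: {mode}")


print(feature_status(0.91, "higher", threshold=0.80))   # face_confidence
print(feature_status(0.12, "lower", threshold=0.25))    # motion_score
print(feature_status(12.0, "band", band=(3.0, 10.0)))   # duration_s, hypothetical band
```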

Labels are derived from the rubric when no explicit ground truth exists. An agent-predicted label can be promoted into the ground truth store if it meets the difficulty-specific reward and confidence thresholds and agrees with the existing expected label, if one exists.

In-Context Learning

The ICLMemory class tracks per-clip prediction history within a session. After each step, the agent's label, raw label score, and reasoning are recorded. On subsequent attempts at the same clip:

If a prior attempt scored well, the agent is nudged to keep that label and sharpen the reasoning
If a prior attempt scored modestly, the agent is prompted to consider a different label
If a prior attempt scored poorly, the agent is told to try something completely different

The memory never reveals the expected label. All feedback is based on the reward signal alone, and reward noise is added to prevent the agent from treating any single score as a definitive answer.
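The feedback tiers above could be sketched as follows. This is not the real `ICLMemory` class: the name `HintMemory`, the cutoff values `GOOD_SCORE` and `POOR_SCORE`, and the hint wording are all assumptions. The one property carried over from the description is that hints refer only to the agent's own previous label, never to the expected one.

```python
from collections import defaultdict

GOOD_SCORE = 0.5  # assumption: at or above this counts as "scored well"
POOR_SCORE = 0.2  # assumption: at or below this counts as "scored poorly"


class HintMemory:
    """Hypothetical per-clip history that turns past scores into soft hints."""

    def __init__(self):
        self._history = defaultdict(list)

    def record(self, clip_id, label, label_score, reasoning):
        """Store one attempt for a clip."""
        self._history[clip_id].append((label, label_score, reasoning))

    def hint_for(self, clip_id):
        """Return a hint based on the most recent attempt, or '' if none."""
        attempts = self._history.get(clip_id)
        if not attempts:
            return ""
        label, score, _ = attempts[-1]
        if score >= GOOD_SCORE:
            return f"Previous label {label} scored well; keep it and sharpen the reasoning."
        if score > POOR_SCORE:
            return f"Previous label {label} scored modestly; consider a different label."
        return f"Previous label {label} scored poorly; try something completely different."
```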

Tasks

Task ID	Difficulty	Description
`task_easy`	Easy	Clear, unambiguous quality signals across most features
`task_medium`	Medium	Mixed indicators requiring trade-off reasoning
`task_hard`	Hard	Conflicting signals with no dominant clear indicator
`task_mixed`	Mixed	Progressive difficulty: 10 easy, 8 medium, 7 hard

Each task corpus contains 25 clips with balanced label distributions across KEEP, BORDERLINE, and REJECT.

API Endpoints

Method	Path	Description
`GET`	`/`	Status and environment metadata
`GET`	`/health`	Health check
`GET`	`/state`	Current environment state
`GET`	`/tasks`	List all tasks with action schema
`POST`	`/grader`	Score a single action against a task
`POST`	`/baseline/start`	Start an async baseline run
`GET`	`/baseline/status/{run_id}`	Poll a running baseline
`GET`	`/baseline`	Alias for baseline start (GET-compatible)
`POST`	`/reset`	Reset the environment to a new episode
`POST`	`/step`	Submit one action and advance the episode
`GET`	`/metadata`	OpenEnv environment metadata (includes curriculum config)
`GET`	`/schema`	OpenEnv action/observation schema
`GET`	`/dashboard/`	Gradio interactive dashboard

Quick Start (curl)

# Health Check
curl http://localhost:7860/health

# List Tasks (includes grader paths)
curl http://localhost:7860/tasks

# Reset Environment
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id": "task_easy"}'

# Submit Step
curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d '{"label": "KEEP", "reasoning": "face_confidence is 0.95", "confidence": 0.9, "clip_id": "clip_0001"}'

The /grader endpoint accepts the same action schema as /step:

{
  "label": "KEEP",
  "reasoning": "face_confidence is 0.91, above the KEEP threshold (0.80). motion_score is 0.12, stable below the KEEP ceiling (0.25).",
  "confidence": 0.85,
  "clip_id": "clip_0001"
}
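For scripted use, the same requests can be issued from Python with the standard library. The sketch below assumes the server is running locally on port 7860 as in the curl examples; the helper names are hypothetical, the [0, 1] confidence range is an assumption, and the response is returned as parsed JSON without assuming its fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # same host/port as the curl examples
VALID_LABELS = {"KEEP", "BORDERLINE", "REJECT"}


def build_action(label, reasoning, confidence, clip_id):
    """Build an action payload with the four fields shown above."""
    if label not in VALID_LABELS:
        raise ValueError(f"label must be one of {sorted(VALID_LABELS)}")
    if not 0.0 <= confidence <= 1.0:  # assumption: confidence lies in [0, 1]
        raise ValueError("confidence must be between 0 and 1")
    return {
        "label": label,
        "reasoning": reasoning,
        "confidence": confidence,
        "clip_id": clip_id,
    }


def post_json(path, payload, base_url=BASE_URL):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_one_step():
    """Reset to task_easy and submit a single action (requires a live server)."""
    post_json("/reset", {"task_id": "task_easy"})
    action = build_action("KEEP", "face_confidence is 0.95", 0.9, "clip_0001")
    return post_json("/step", action)
```

Calling `run_one_step()` mirrors the Reset and Submit Step curl commands; swapping `"/step"` for `"/grader"` sends the same payload to the grader endpoint.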

Dashboard

The Gradio dashboard at /dashboard/ provides a full interactive session:

Difficulty-tiered input tabs (Easy, Medium, Hard) with structured reasoning fields
Task selector dropdown including the mixed-difficulty mode
Live clip corpus queue sorted by clip ID with predicted and expected labels
Dominant feature table showing closest-boundary features and their rubric status
Reward breakdown cards for format, label, reasoning, and calibration scores
Session history table with difficulty column, submitted vs expected labels, and per-step rewards
Match results with difficulty badges ([E], [M], [H]) and progression summary
Curriculum progress panel with level display and episode history table
ICL learning progress panel tracking reward trends across runs per clip
"Load Quality Hint" button that generates a pre-filled hint from rubric thresholds
"Run LLM Baseline Agent" button that runs the full ICL-RL agent in a background thread

Local Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run the server:

PYTHONPATH=. python -m uvicorn server.app:app --host 0.0.0.0 --port 7860

Run tests:

PYTHONPATH=. python -m pytest tests/ -q

Baseline Inference

The baseline agent runs all three tasks in sequence and prints step-level rewards. It uses the LLM if a token is available, otherwise falls back to a deterministic heuristic with grader-aligned reasoning.

Set environment variables:

export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="llama-3.3-70b-versatile"
export HF_TOKEN="your_token_here"

Run all tasks:

PYTHONPATH=. python inference.py

Run the deterministic baseline used by validation:

PYTHONPATH=. python inference.py --tasks easy --episodes 1 --seed 42 --deterministic-baseline --max-steps 5

Run a single task:

PYTHONPATH=. python inference.py task_easy

Run the mixed-difficulty task:

PYTHONPATH=. python inference.py task_mixed

Structured stdout format:

[START] task=task_easy episode=1 seed=42 mode=deterministic max_steps=5
[STEP] task=task_easy episode=1 step=1 action=KEEP patient_id=clip_0001 reward=0.8000 done=false status=ok
...
[END] task=task_easy episode=1 seed=42 score=0.7800 steps=5 done=true
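Because each line is a bracketed tag followed by `key=value` pairs, the output is straightforward to parse for downstream analysis. The parser below is a minimal sketch based only on the sample lines above; the function name `parse_log_line` is hypothetical, and all values are left as strings.

```python
import re

# Matches "[START] ...", "[STEP] ...", or "[END] ..." and captures the rest.
LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line):
    """Split one structured log line into (kind, fields) or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    kind, rest = match.groups()
    # Tokens without "=" are skipped rather than treated as errors.
    fields = dict(pair.split("=", 1) for pair in rest.split() if "=" in pair)
    return kind, fields


sample = "[END] task=task_easy episode=1 seed=42 score=0.7800 steps=5 done=true"
kind, fields = parse_log_line(sample)
print(kind, float(fields["score"]))
```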

Extracting Real Clip Metadata

To build a real clip manifest from MP4 files:

pip install -r requirements_extractor.txt
PYTHONPATH=. python scripts/extract_mp4_metadata.py path/to/clips/ --output data/real_clips_manifest.jsonl

The manifest is loaded at startup if present at data/real_clips_manifest.jsonl. If missing, the environment falls back to the static task corpora.
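The load-or-fall-back behavior can be sketched as below. The function name `load_manifest` and the convention of returning `None` to signal the fallback are assumptions; the actual loader lives in `real_clips.py`.

```python
import json
from pathlib import Path


def load_manifest(path="data/real_clips_manifest.jsonl"):
    """Read one JSON object per line; return None if the file is absent.

    A None result is this sketch's way of telling the caller to fall back
    to the static task corpora.
    """
    manifest_path = Path(path)
    if not manifest_path.is_file():
        return None
    clips = []
    with manifest_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                clips.append(json.loads(line))
    return clips
```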

Docker

docker build -t clip-quality-env .
docker run -p 7860:7860 -e HF_TOKEN=your_token clip-quality-env

Deployment

The project is deployed as a Hugging Face Space using the Docker SDK. The openenv.yaml file and the Hugging Face Space frontmatter in this file configure the deployment.

openenv push --repo-id elix3r/clip-quality-env

Or manually:

git remote add hf-space https://huggingface.co/spaces/your-username/ClipQualityEnv
git push hf-space main

Dependencies

openenv-core >= 0.2.3: OpenEnv environment base classes and FastAPI server factory
fastapi >= 0.104.0 + uvicorn >= 0.24.0: HTTP server
gradio >= 5.0.0, < 6.0.0: Interactive dashboard
openai >= 1.0.0: OpenAI-compatible client (used with the Hugging Face inference router)
pydantic >= 2.0.0: Data validation and serialization
opencv-python-headless >= 4.10.0: Video processing for metadata extraction
numpy >= 1.26.0: Numerical operations in extraction pipeline
pandas >= 2.0.0: DataFrame rendering in the dashboard

Contributing

See CONTRIBUTING.md for task creation guidelines, grading invariants, and development workflow.

License

MIT

Academic References

ClipQualityEnv draws from several foundational research areas. The connections below tie each paper directly to specific components of the implementation.

Curriculum Learning

Paper	Year	Relevance
Bengio et al. "Curriculum Learning"	2009	Foundation for the Easy to Medium to Hard task progression and the multi-episode curriculum auto-promotion system. Key insight: ordering training samples by difficulty accelerates learning and improves convergence.
Graves et al. "Automated Curriculum Learning for Neural Networks"	2017	Adaptive curriculum where difficulty self-adjusts based on learner performance. Directly matches both the `recalibrate()` logic in `rubric.py` and the cross-episode curriculum promotion/demotion in `env.py`.
Kumar et al. "Self-Paced Learning for Latent Variable Models"	2010	Learner chooses its own curriculum pace. The confidence-weighted GT promotion in `try_promote()` is a form of self-pacing, where the agent only promotes predictions it is confident in.

Application in this environment: The four-mode difficulty system (easy, medium, hard, mixed) implements curriculum learning at the task level. The multi-episode curriculum auto-promotes or demotes the agent based on rolling performance. Rubric calibration (recalibrate()) applies the same idea across episodes, so the environment automatically gets harder as the agent succeeds on simpler clips.

Active Learning & Self-Training

Paper	Year	Relevance
Culotta & McCallum "Confidence-Weighted Active Learning"	2005	Selectively promote high-confidence predictions to the training set. Direct precedent for `GTStore.try_promote()`, which uses difficulty-aware thresholds before accepting a new ground-truth label.
Zhu et al. "Semi-Supervised Learning with Graphs"	2003	Self-training expands the labeled set iteratively with the model's own confident predictions. The GT expansion flywheel (more promoted clips, richer GT store, better grading signal) follows this pattern.
Settles "Active Learning Literature Survey"	2010	Comprehensive overview of query strategies including uncertainty sampling. ClipQualityEnv inverts uncertainty sampling: rather than querying uncertain examples for human labeling, it promotes certain agent predictions into the GT store.

Application in this environment: GT expansion via try_promote() is active learning in reverse. The agent autonomously extends the ground-truth store by promoting high-confidence, high-reward predictions, progressively replacing rubric-derived labels with agent-confirmed ones. Difficulty-aware thresholds ensure hard tasks can still contribute to GT growth.

Preference Optimization

Paper	Year	Relevance
Rafailov et al. "Direct Preference Optimization (DPO)"	2023	Preference-based training without explicit reward models. Partial label credit on BORDERLINE cases mirrors the preference pair structure, where a KEEP prediction on a BORDERLINE clip is treated as a useful signal rather than a hard failure.
Christiano et al. "Deep RL from Human Preferences"	2017	RLHF foundation. ClipQualityEnv replaces human preference comparisons with a fully verifiable reward function, retaining the reward decomposition insight while eliminating human-in-the-loop overhead.

Application in this environment: Partial label credit (0.25 for KEEP/REJECT when ground truth is BORDERLINE, scaled by difficulty) treats directionally-correct but imprecise decisions as informative signal rather than noise, analogous to weak preferences in RLHF training.

Verifiable Rewards

Paper	Year	Relevance
Sutton & Barto "Reinforcement Learning: An Introduction"	2018	Core RL principles. The `grade()` function in `grader.py` is a deterministic reward function decomposed into format, label, reasoning, and calibration components.
Ng & Russell "Algorithms for Inverse RL"	2000	Reward shaping foundations. The rubric calibration cycle, where thresholds tighten based on episode performance, is a form of dynamic reward shaping that keeps the task challenging as the agent improves.

Application in this environment: The grader is fully deterministic and rubric-derived, with no LLM judge involved. The confidence calibration dimension adds a fourth reward signal that prevents degenerate confidence strategies. This guarantees reproducibility, enables automated validation, and satisfies the OpenEnv spec requirement for programmatic graders that return valid 0.0 to 1.0 scores.

Self-Play & Co-Evolution

Paper	Year	Relevance
Bansal et al. "Emergent Complexity via Multi-Agent Competition"	2018	Agents and environments co-evolve, generating emergent difficulty without manual curriculum design. The rubric and GT co-evolution in ClipQualityEnv is a single-agent analogue of this pattern.
Leibo et al. "Multi-Agent RL in Sequential Social Dilemmas"	2017	Environment complexity scales with agent capability. Matches the calibration logic: as the agent succeeds on BORDERLINE clips, the rubric tightens, creating new BORDERLINE cases.

Application in this environment: The learning flywheel works as follows: GT expands as the agent promotes confident predictions, then the rubric tightens based on accuracy, then harder BORDERLINE cases emerge. This is co-evolution in a single-agent setting. The environment adapts to the agent's current capability level without external intervention.

In-Context Learning

Paper	Year	Relevance
Brown et al. "Language Models are Few-Shot Learners"	2020	In-context learning (ICL) enables LLMs to improve on a task purely from examples in the context window, without weight updates. The per-episode ICL loop uses this for within-episode improvement.
Xie et al. "An Explanation of In-Context Learning as Implicit Bayesian Inference"	2022	Theoretical grounding for why ICL works. The model implicitly updates a prior over task hypotheses from context examples, which validates the step-by-step reward feedback injection in `ICLMemory.get_context_text()`.

Application in this environment: The ICLMemory class carries reward feedback from prior attempts at each clip into subsequent steps within the same session. The agent's context window includes label_score signals and soft directives across attempts, enabling learning without gradient updates over a 25-step episode and across multiple episode runs within a session.
