eval/main.py is the unified entry point. The current code supports:
- Models: qwen2.5vl, qwen3vl, llava-st-qwen2, videomolmo
- Datasets: hcstvg, vidstg, doro-stvg
The default script is eval/run_eval.sh. You can edit it directly to change model paths, annotation paths, video paths, and output paths.
For llava-st-qwen2, make sure PYTHONPATH includes your local LLaVA-ST repository:
export PYTHONPATH="/path/to/LLaVA-ST:${PYTHONPATH:-}"

For videomolmo, set the external VideoMolmo runtime before evaluation:
export VIDEOMOLMO_REPO=/path/to/VideoMolmo
export VIDEOMOLMO_PYTHON=/path/to/videomolmo/bin/python
export VIDEOMOLMO_COMPACT_QUERY=1

Then run evaluation with --model_name videomolmo --model_path videomolmo.
Typical outputs:
- results.json: per-sample predictions, parsed outputs, GT, and metrics
- status.json: overall summary and averaged metrics
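As a quick sanity check after a run, you can pull the averaged metrics out of status.json. A minimal sketch, assuming the summary stores averaged metrics under keys prefixed with `avg_` (the actual key names written by eval/main.py may differ):

```python
import json

def load_summary(path):
    """Return only the averaged-metric entries from status.json.

    The "avg_" prefix is an assumption for illustration; check the
    actual keys produced by eval/main.py.
    """
    with open(path) as f:
        summary = json.load(f)
    return {k: v for k, v in summary.items() if k.startswith("avg_")}
```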
graph_generator/ generates structured scene-graph data from raw videos. Based on the current code, the main pipeline includes:
- Scene splitting
- Object detection and tracking
- Attribute generation
- Action detection
- Relation generation
- Cross-shot reference edge generation (optional)
- STVG query generation from scene graphs
- Formatting query outputs into training-friendly JSONL
Relevant entry points:
- graph_generator/main.py: main scene graph generation entry
- graph_generator/modules/query_generator_cpsat.py: generate queries from scene graphs
- graph_generator/utils/format_train.py: convert query outputs into training format
- graph_generator/scripts/run_generator.sh: current command collection used in practice
This repository does not currently use a single root-level setup script. The actual setup should follow the module-specific pyproject.toml files under envs/.
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc
cd /path/to/DORO-STVG/envs/eval
uv sync

If uv sync times out on files.pythonhosted.org in this environment, refresh the lock and sync against the configured mirror:
cd /path/to/DORO-STVG/envs/eval
uv lock --refresh
uv sync --refresh

cd /path/to/DORO-STVG/envs/graph_generator/main
uv sync

This environment is used for:
- graph_generator/main.py
- the main pipeline modules for attributes, relations, reference edges, and query generation
cd /path/to/DORO-STVG/envs/graph_generator/action_detector
uv sync

This separate environment is mainly used by the action detection module to avoid dependency conflicts with the main environment.
The evaluation script currently defaults to decord:
export FORCE_QWENVL_VIDEO_READER=decord

You can switch to torchvision or torchcodec if needed.
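The backend-selection logic can be pictured as a simple environment lookup with a decord fallback. This is an illustrative sketch, not the actual code in the evaluation script; the valid values are the three backends named above:

```python
import os

# The three decode backends documented for the evaluation script.
VALID_READERS = {"decord", "torchvision", "torchcodec"}

def pick_video_reader(env=None):
    """Return the requested video reader, defaulting to decord."""
    env = os.environ if env is None else env
    reader = env.get("FORCE_QWENVL_VIDEO_READER", "decord")
    if reader not in VALID_READERS:
        raise ValueError(f"unsupported video reader: {reader!r}")
    return reader
```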
graph_generator depends on both model checkpoints and API-related environment variables. The repository already contains graph_generator/.env, and the scripts load it automatically.
The most important variables are:
API_KEYS=your_key_1,your_key_2
MM_API_BASE_URL=https://your-compatible-endpoint

You also need to prepare:
- YOLO weights
- SAM2 / Grounded-SAM2 checkpoints
- VideoMAE action detection checkpoints
- DAM or other attribute-description models
For those details, refer to graph_generator/README.md.
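Since API_KEYS is a comma-separated list, a common pattern is to rotate through the keys round-robin to spread rate limits. The helper below is purely illustrative; the actual key-rotation logic lives inside graph_generator and may differ:

```python
import itertools
import os

def key_cycle(env=None):
    """Yield API keys round-robin from the comma-separated API_KEYS variable.

    Illustrative only: graph_generator's real rotation strategy may differ.
    """
    env = os.environ if env is None else env
    keys = [k.strip() for k in env.get("API_KEYS", "").split(",") if k.strip()]
    if not keys:
        raise RuntimeError("API_KEYS is empty; set it in graph_generator/.env")
    return itertools.cycle(keys)
```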
cd /path/to/DORO-STVG/eval
bash run_eval.sh

For llava-st-qwen2, the evaluation environment also expects:
- a local LLaVA-ST source checkout
- local LLaVA-ST-Qwen2-7B model weights
The default runner reads these environment variables:
- LLAVA_ST_SOURCE_DIR
- MODEL_PATH
- ANNOTATION_PATH
- VIDEO_DIR
- OUTPUT_DIR
- CUDA_VISIBLE_DEVICES
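Before kicking off a long run, it can help to verify that all of these variables are set. A small sketch (a convenience check, not part of the repository):

```python
import os

# Variables read by eval/run_eval.sh, per the list above.
REQUIRED_VARS = [
    "LLAVA_ST_SOURCE_DIR", "MODEL_PATH", "ANNOTATION_PATH",
    "VIDEO_DIR", "OUTPUT_DIR", "CUDA_VISIBLE_DEVICES",
]

def missing_vars(env=None):
    """Return the runner variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [v for v in REQUIRED_VARS if not env.get(v)]
```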
A typical smoke-test command is:
cd /path/to/DORO-STVG
CUDA_VISIBLE_DEVICES=3 \
LLAVA_ST_SOURCE_DIR=/path/to/LLaVA-ST \
MODEL_PATH=/path/to/LLaVA-ST-Qwen2-7B \
ANNOTATION_PATH=/path/to/query_train_for_eval_smoke1.jsonl \
VIDEO_DIR=/path/to/video_test1_smoke \
OUTPUT_DIR=eval/res_llava_st_smoke \
bash eval/run_eval.sh

If you prefer not to use the shell script, you can call the entry point directly:
cd /path/to/DORO-STVG/eval
python main.py run \
--model_name=llava-st-qwen2 \
--model_path=/path/to/model \
--data_name=doro-stvg \
--annotation_path=/path/to/test.json \
--video_dir=/path/to/videos \
--output_dir=./eval/res

The current run_generator.sh contains the full pipeline command examples, and the bottom part of the script keeps the active query-generation example.
A typical workflow is:
- Generate scene_graphs.jsonl
- Generate query.jsonl
- Convert it into query_train.jsonl
This is the training-friendly formatted output generated from query.jsonl by utils/format_train.py. The main fields include:
- video
- path
- query_id
- query
- Difficulty
- Width / Height
- box
box is a trajectory string in the following format:
target description: <frame_idx, time_sec, x1, y1, x2, y2; ... />
Here the coordinates are already normalized to [0, 1] using the video width and height, which makes this format easier to use for training and annotation consumption.
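The trajectory string above can be parsed and de-normalized back to pixel coordinates with a few lines of string handling. A minimal sketch, assuming the exact separators shown in the format description (`: <` after the description, `;` between frames, `,` within a frame, `/>` at the end); the real consumer code may differ:

```python
def parse_box(box_str, width, height):
    """Parse a `box` trajectory string into per-frame pixel boxes.

    Expects "description: <frame_idx, time_sec, x1, y1, x2, y2; ... />"
    with coordinates normalized to [0, 1].
    """
    desc, _, traj = box_str.partition(": <")
    traj = traj.rstrip(" />")
    frames = []
    for entry in traj.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        f, t, x1, y1, x2, y2 = (float(v) for v in entry.split(","))
        frames.append({
            "frame_idx": int(f),
            "time_sec": t,
            # de-normalize using the video Width/Height fields
            "box": [x1 * width, y1 * height, x2 * width, y2 * height],
        })
    return desc, frames
```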