DARDcollect — DETECTOR Archive Data Collector

A GPU-accelerated multi-modal toolkit for downloading, processing, and annotating historical public-domain media from the Internet Archive. Originally developed for the DETECTOR project, it downloads videos, images, audio, and documents organised by language; extracts person detections and pose keypoints; transcribes speech; extracts text from PDFs and plain-text files; and produces 616×616 face crops with rich .json sidecars — bounding boxes, keypoints, quality scores, transcriptions, and full provenance — with FAIR metadata throughout.

Use it two ways:

Complete pipeline for bulk processing of historical media collections.
Modular library — import individual components (detection, transcription, OCR, face crops, quality scoring) into custom workflows.

Key Features

Eleven decoupled stages — download, video person detection, image person detection, video face crop extraction, image face crop extraction, quality filtering, quality annotation, video transcription, audio transcription, document text extraction, frame extraction — each resumable and independently re-runnable.
Pose-based filtering — face visibility, size, frontal orientation, and duplicate suppression all derived from CIGPose 133-keypoint poses; robust to the grain and low resolution of pre-1960 film.
Speech transcription — Whisper-Small transcribes both person-clip audio (video pipeline) and standalone audio files, writing .transcription.json sidecars with language detection.
Document text extraction — extracts text from PDFs (text layer or PaddleOCR fallback for scanned pages) and plain-text files with encoding detection, producing .text.txt + .annotation.json pairs.
GPU accelerated — YOLOX, CIGPose, and PaddleOCR run via ONNX on CUDA 12, and Whisper runs via PyTorch; CPU-only mode activates automatically when no compatible GPU is available.
FAIR + EU AI Act — every artifact gets a UUID and full provenance chain; every model and rule-based algorithm is documented per Annex IV.
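
The automatic CPU-only mode in the GPU bullet above can be pictured with ONNX Runtime's execution-provider list. This is a minimal sketch, not DARDcollect's own code: `select_providers` is a hypothetical helper, and only the provider names `CUDAExecutionProvider` and `CPUExecutionProvider` come from ONNX Runtime itself.

```python
def select_providers(available: list) -> list:
    """Prefer CUDA when it is reported as available, and always keep CPU as a fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [name for name in preferred if name in available]
    # Guard against an empty list so the caller always gets a usable provider.
    return chosen or ["CPUExecutionProvider"]


# Hard-coded lists for illustration; in practice `available` would come from
# onnxruntime.get_available_providers().
print(select_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
print(select_providers(["CPUExecutionProvider"]))
```

The returned list could then be passed as the `providers` argument of `onnxruntime.InferenceSession`.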

Installation

As a Pipeline

For bulk processing of Archive.org media with the complete eleven-stage workflow:

git clone https://github.com/Vicomtech/dardcollect.git && cd dardcollect
uv sync

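The pipeline processes media through four parallel modality tracks; the video and image tracks converge at quality filtering, while the audio and document tracks end at transcription and text extraction: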

                        ┌─ person clips ── face crops ─┐
Videos  ─── download ───┤                              ├── filter ── annotate
                        └─ transcriptions              │
                                                       │
Images  ─── download ─── detections ──── face crops ───┘

Audio   ─── download ─── transcriptions

Documents── download ─── extracted text

Then follow the step-by-step walkthrough in docs/0-GETTING-STARTED.md.
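
The track diagram above can also be read as a dependency graph. The sketch below is illustrative only: the stage labels are taken from the diagram rather than from the pipeline's script names, and it uses the standard-library `graphlib` module to derive one valid run order.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on (labels are illustrative).
STAGES = {
    "video_download": set(),
    "video_person_clips": {"video_download"},
    "video_transcriptions": {"video_download"},
    "video_face_crops": {"video_person_clips"},
    "image_download": set(),
    "image_detections": {"image_download"},
    "image_face_crops": {"image_detections"},
    "filter": {"video_face_crops", "image_face_crops"},
    "annotate": {"filter"},
    "audio_download": set(),
    "audio_transcriptions": {"audio_download"},
    "document_download": set(),
    "document_text": {"document_download"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(STAGES).static_order())
print(order)
```

Because the audio and document tracks share no edges with the others, they can run at any point relative to the video and image tracks.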

As a Library (Custom Workflows)

To use individual components in your own Python scripts:

uv pip install git+https://github.com/Vicomtech/dardcollect.git

Then import and use components:

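# Example: Custom transcription + person detection workflow
from pathlib import Path

import cv2  # OpenCV, used here only to read a frame
import yaml  # PyYAML, used here only to load config.yaml

from dardcollect import AudioTranscriber, PersonDetector, download_item

# Download from archive.org with FAIR metadata
result = download_item("example_item_id", dest_dir=Path("media/"))

if result["success"]:
    # Transcribe audio
    transcriber = AudioTranscriber(model_size="small")
    text = transcriber.transcribe_file(result["path"])

    # Assumption: the detector accepts the parsed config.yaml from the repository root
    config = yaml.safe_load(Path("config.yaml").read_text())

    # Read the first video frame to run detection on
    capture = cv2.VideoCapture(str(result["path"]))
    ok, frame = capture.read()
    capture.release()

    if ok:
        # Detect people in the frame
        detector = PersonDetector(config, model_path="models/yolox_tiny.onnx")
        bboxes, scores = detector.get_detections(frame)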

Available components:

PersonDetector, PersonTracker, PoseEstimator — Detection & tracking
AudioTranscriber — Whisper speech-to-text
DocumentExtractor — OCR for scanned PDFs
process_image(), process_video() — Face crop extraction (OFIQ 616×616)
load_models(), score_video() — OFIQ 7-dimensional quality scoring
add_fair_metadata(), generate_uuid() — Provenance tracking
check_face_visibility(), check_frontal_face() — Face validation
extract_frames() — Video to PNG frames
download_item() — Archive.org downloads

For detailed examples and API reference, see docs/5-LIBRARY-API.md.

Output Structure

DARD/
├── archive_org_public_domain/            # Downloaded source files
│   ├── videos/eng/, videos/spa/, ...     # Language-organised video downloads (ISO 639-2)
│   ├── videos/und/                       # Videos with no language metadata on Archive.org
│   ├── images/                           # Image downloads (no language subfolder)
│   ├── audio/eng/, audio/spa/, ...       # Language-organised audio downloads (ISO 639-2)
│   ├── audio/und/                        # Audio with no language metadata on Archive.org
│   ├── texts/ger/, texts/fra/, ...       # Language-organised text downloads (ISO 639-2)
│   ├── texts/und/                        # Texts with no language metadata on Archive.org
│   └── downloads.csv                     # Unified metadata (one row per file)
├── extracted_person_clips/               # Person clip videos + JSON sidecars + clips_extraction.csv + transcriptions_extraction.csv
├── extracted_image_detections/           # Per-image detection JSON + image_person_detection.csv
├── video_face_crops/                     # 616×616 OFIQ-aligned crops (video) + video_face_crops_extraction.csv
├── image_face_crops/                     # 616×616 OFIQ-aligned crops (image) + image_face_crops_extraction.csv
├── filtered_video_face_crops/            # Quality-filtered video crops + video_filtered_face_crops.csv + video_face_quality_annotation.csv
├── filtered_image_face_crops/            # Quality-filtered image crops + image_filtered_face_crops.csv + image_face_quality_annotation.csv
├── extracted_frames/                     # Optional PNG frames + frames_extraction.csv
├── audio_transcriptions/                 # Whisper sidecars for audio files + audio_transcriptions_extraction.csv
└── preprocessed_documents/               # Extracted text + annotation JSON + document_text_extraction.csv
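
The download layout in the tree above follows a simple rule: videos, audio, and texts get an ISO 639-2 language subfolder, with `und` for items that carry no language metadata, while images do not. A minimal sketch of that rule follows; `download_dir` and `LANGUAGE_ORGANISED` are hypothetical names, not part of the library.

```python
from pathlib import Path
from typing import Optional

# Modalities that get a language subfolder, per the tree above.
LANGUAGE_ORGANISED = {"videos", "audio", "texts"}


def download_dir(root: Path, modality: str, language: Optional[str] = None) -> Path:
    """Return the folder a downloaded file of this modality would land in."""
    base = root / "archive_org_public_domain" / modality
    if modality not in LANGUAGE_ORGANISED:
        return base  # images/ has no language subfolder
    # "und" is the ISO 639-2 code for an undetermined language.
    return base / (language or "und")


print(download_dir(Path("DARD"), "videos", "spa"))
print(download_dir(Path("DARD"), "audio"))
print(download_dir(Path("DARD"), "images", "eng"))
```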

Every artifact is linked to its source via UUID: Archive.org ID → Download → Clip → Crop → Quality scores. See docs/2-LINEAGE.md for CSV schemas and traceability queries, and docs/3-ANNOTATIONS.md for sidecar JSON formats.
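
The UUID chain described above amounts to following parent links from an artifact back to its download. The sketch below assumes a simplified record shape with hypothetical `kind` and `parent` fields and short stand-in IDs; the real CSV schemas are the ones in docs/2-LINEAGE.md.

```python
# Hypothetical lineage records: each artifact stores its parent's ID.
RECORDS = {
    "dl-1": {"kind": "download", "parent": None, "archive_id": "example_item_id"},
    "clip-1": {"kind": "clip", "parent": "dl-1"},
    "crop-1": {"kind": "crop", "parent": "clip-1"},
    "score-1": {"kind": "quality_scores", "parent": "crop-1"},
}


def trace_to_source(artifact_id: str, records: dict) -> list:
    """Walk parent links upward and return the chain from source to artifact."""
    chain = []
    seen = set()
    current = artifact_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current}")
        seen.add(current)
        chain.append(current)
        current = records[current]["parent"]
    return chain[::-1]


print(trace_to_source("score-1", RECORDS))
# prints ['dl-1', 'clip-1', 'crop-1', 'score-1']
```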

AI Systems (EU AI Act Annex IV)

Each automated component is documented as an AI system per Annex IV, regardless of whether it uses a learned model or a rule-based algorithm. Face quality measures follow ISO/IEC 29794-5 via the OFIQ reference implementation.

Task	System	Type	Implementation	Documentation
Detection	YOLOX-Tiny (HumanArt)	Neural network (ONNX)	`dardcollect/detector.py`	Model card
Tracking	OC-SORT	Algorithm (model-free)	`dardcollect/tracker.py`	System card
Pose estimation	CIGPose Wholebody (COCO 133)	Neural network (ONNX)	`dardcollect/poser.py`	Model card
Scene change detection	Luminance histogram + bbox area	Algorithm (rule-based)	`pipeline/extract_person_clips_from_videos.py`	System card
Clip segmentation	Face/duration/frontal rules	Algorithm (rule-based)	`pipeline/extract_person_clips_from_videos.py`	System card
Face quality — unified score	MagFace IResNet50 (ISO/IEC 29794-5)	Neural network (ONNX)	`pipeline/filter_face_crops_by_quality.py`, `pipeline/annotate_face_quality.py`	Model card
Face quality — sharpness	Face sharpness random forest (OFIQ `Sharpness`)	Algorithm (random forest)	`pipeline/annotate_face_quality.py`	Model card
Face quality — compression	SSIM CNN (OFIQ `CompressionArtifacts`)	Neural network (ONNX)	`pipeline/annotate_face_quality.py`	Model card
Face quality — expression neutrality	HSEmotion EfficientNet + AdaBoost (OFIQ `ExpressionNeutrality`)	Neural network + algorithm	`pipeline/annotate_face_quality.py`	Model card
Face quality — head coverings / occlusion	BiSeNet face parsing (OFIQ `NoHeadCoverings` / `FaceOcclusionPrevention`)	Neural network (ONNX)	`pipeline/annotate_face_quality.py`	Model card
Face quality — face occlusion segmentation	Face occlusion segmentation CNN	Neural network (ONNX)	`pipeline/annotate_face_quality.py`	Model card
Face quality — head pose	MobileNetV1 3DDFAV2 (OFIQ `HeadPose`)	Neural network (ONNX)	`pipeline/annotate_face_quality.py`	Model card
Audio transcription	Whisper-Small	Neural network (PyTorch)	`pipeline/transcribe_video_clips.py`, `pipeline/transcribe_audio_files.py`	Model card
Document OCR	PaddleOCR PP-OCRv4 (det + rec + cls)	Neural network (ONNX)	`pipeline/extract_text_from_doc.py`	Model card

Contributing

Contributions are welcome. Please read docs/CONTRIBUTING.md for:

Development setup and pre-commit hooks
Code style: Ruff (linting & formatting) + ty (type checking)
PR guidelines — including the requirement to document any new pipeline component as an AI system per EU AI Act Annex IV

License

The source code is licensed under the Apache License 2.0.

The bundled model weights in dardcollect/models/ carry separate licenses and are not covered by the Apache 2.0 license:

Model	File	License
YOLOX-Tiny (HumanArt)	`yolox_tiny_...onnx`	Architecture: Apache 2.0 — weights trained on HumanArt data ⚠ verify before commercial use
CIGPose Wholebody	`cigpose-m_...onnx`	Architecture: Apache 2.0 — trained on COCO WholeBody ⚠ verify before commercial use
Whisper Small	`openai_whisper_small.pt`	MIT
MagFace IResNet50	`magface_iresnet50_norm.onnx`	Apache 2.0 (MagFace); ONNX packaging MIT (OFIQ)
HSEmotion EfficientNet-B0/B2	`enet_b0_...onnx`, `enet_b2_...onnx`	MIT (HSEmotion)
AdaBoost neutrality classifier	`hse_1_2_C_adaboost.yml.gz`	MIT (OFIQ)
BiSeNet, occlusion, head pose, sharpness, SSIM	various `.onnx` / `.xml.gz`	MIT (OFIQ)
PaddleOCR PP-OCRv4 (3 files)	`ch_PP-OCRv4_.onnx`, `ch_ppocr_.onnx`	Apache 2.0 (PaddleOCR)

See NOTICE for full third-party attributions and dependency licenses (including ⚠ PyMuPDF AGPL-3.0).
