A GPU-accelerated multi-modal toolkit for downloading, processing, and annotating historical public-domain media from the Internet Archive. Originally developed for the DETECTOR project, it downloads videos, images, audio, and documents organised by language; extracts person detections and pose keypoints; transcribes speech; extracts text from PDFs and plain-text files; and produces 616×616 face crops with rich .json sidecars — bounding boxes, keypoints, quality scores, transcriptions, and full provenance — with FAIR metadata throughout.
Use it two ways:
- Complete pipeline: bulk processing of historical media collections.
- Modular library: import individual components (detection, transcription, OCR, face crops, quality scoring) into custom workflows.

Key features:
- Eleven decoupled stages: download, video person detection, image person detection, video face crop extraction, image face crop extraction, quality filtering, quality annotation, video transcription, audio transcription, document text extraction, and frame extraction. Each stage is resumable and independently re-runnable.
- Pose-based filtering: face visibility, size, frontal orientation, and duplicate suppression are all derived from CIGPose 133-keypoint poses, which keeps filtering robust to the grain and low resolution of pre-1960 film.
- Speech transcription: Whisper-Small transcribes both person-clip audio (video pipeline) and standalone audio files, writing `.transcription.json` sidecars with language detection.
- Document text extraction: extracts text from PDFs (text layer, or PaddleOCR fallback for scanned pages) and plain-text files with encoding detection, producing `.text.txt` + `.annotation.json` pairs.
- GPU accelerated: YOLOX, CIGPose, Whisper, and PaddleOCR run via ONNX on CUDA 12; CPU-only mode activates automatically.
- FAIR + EU AI Act: every artifact gets a UUID and a full provenance chain; every model and rule-based algorithm is documented per Annex IV.
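
The pose-based filtering described above can be illustrated with a minimal sketch. It assumes the COCO-WholeBody 133-keypoint ordering (indices 23 to 90 are the 68 face landmarks) and keypoints given as `(x, y, confidence)` tuples. The function name `face_is_usable` and all thresholds are hypothetical and are not taken from dardcollect.

```python
# Sketch only: face visibility and size checks from a 133-keypoint wholebody pose.
# Assumes the COCO-WholeBody ordering: 17 body, 6 foot, 68 face, 42 hand keypoints.
FACE_START, FACE_END = 23, 91  # face landmarks occupy indices 23..90


def face_is_usable(keypoints, min_conf=0.5, min_visible=0.8, min_size_px=40):
    """Return True if enough face landmarks are confident and the face is large enough.

    keypoints: sequence of 133 (x, y, confidence) tuples.
    All thresholds are illustrative assumptions, not dardcollect defaults.
    """
    if len(keypoints) != 133:
        raise ValueError("expected 133 wholebody keypoints")
    face = keypoints[FACE_START:FACE_END]
    visible = [(x, y) for x, y, conf in face if conf >= min_conf]
    if len(visible) < min_visible * len(face):
        return False  # too many occluded or low-confidence landmarks
    xs = [x for x, _ in visible]
    ys = [y for _, y in visible]
    # Size check: shorter side of the bounding box around visible landmarks
    return min(max(xs) - min(xs), max(ys) - min(ys)) >= min_size_px


# Synthetic pose: confident face landmarks spread over roughly 100 px
pose = [(0.0, 0.0, 0.0)] * 133
for i in range(FACE_START, FACE_END):
    k = i - FACE_START
    pose[i] = (200.0 + (k % 10) * 11.0, 100.0 + (k // 10) * 16.0, 0.9)

print(face_is_usable(pose))  # True
```

Deriving several checks from one pose estimate means a single model pass per person, which is one plausible reason to prefer it over a separate face detector on degraded footage.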
For bulk processing of Archive.org media with the complete 10-stage workflow:

    git clone https://github.com/Vicomtech/dardcollect.git && cd dardcollect
    uv sync

The pipeline processes media through four parallel modality tracks; the video and image tracks converge at quality filtering:

                            ┌─ person clips ── face crops ─┐
    Videos  ─── download ───┤                              ├── filter ── annotate
                            └─ transcriptions              │
                                                           │
    Images  ─── download ─── detections ──── face crops ───┘
    Audio   ─── download ─── transcriptions
    Documents── download ─── extracted text

Then follow the step-by-step walkthrough in docs/0-GETTING-STARTED.md.
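
The track layout in the diagram can also be read as a dependency graph. The sketch below is an illustration only: the stage names are made-up labels rather than dardcollect identifiers, the video transcription stage is assumed to depend on person clip extraction, and the optional frame extraction stage is left out because the diagram does not show it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the set of stages it depends on.
# Names are illustrative labels, not dardcollect identifiers.
STAGES = {
    "download": set(),
    "video_person_detection": {"download"},
    "image_person_detection": {"download"},
    "video_face_crops": {"video_person_detection"},
    "image_face_crops": {"image_person_detection"},
    "video_transcription": {"video_person_detection"},  # assumption, see lead-in
    "audio_transcription": {"download"},
    "document_text_extraction": {"download"},
    "quality_filtering": {"video_face_crops", "image_face_crops"},
    "quality_annotation": {"quality_filtering"},
}


def run_order(stages=STAGES):
    """Return one valid execution order with dependencies first."""
    return list(TopologicalSorter(stages).static_order())


order = run_order()
for position, stage in enumerate(order, start=1):
    print(position, stage)
```

Because the stages are decoupled, any order that respects these dependencies is valid, and a single stage can be re-run once its inputs exist.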
To use individual components in your own Python scripts:

    uv pip install git+https://github.com/Vicomtech/dardcollect.git

Then import and use components:

    # Example: custom transcription + person detection workflow
    from pathlib import Path

    from dardcollect import AudioTranscriber, PersonDetector, download_item

    # Download from archive.org with FAIR metadata
    result = download_item("example_item_id", dest_dir=Path("media/"))
    if result["success"]:
        # Transcribe audio
        transcriber = AudioTranscriber(model_size="small")
        text = transcriber.transcribe_file(result["path"])

    # Detect people in a video frame.
    # `config` (detector settings) and `frame` (a decoded video frame) are not
    # defined in this snippet and must be supplied by the caller.
    detector = PersonDetector(config, model_path="models/yolox_tiny.onnx")
    bboxes, scores = detector.get_detections(frame)

Available components:
- `PersonDetector`, `PersonTracker`, `PoseEstimator`: detection and tracking
- `AudioTranscriber`: Whisper speech-to-text
- `DocumentExtractor`: OCR for scanned PDFs
- `process_image()`, `process_video()`: face crop extraction (OFIQ 616×616)
- `load_models()`, `score_video()`: OFIQ 7-dimensional quality scoring
- `add_fair_metadata()`, `generate_uuid()`: provenance tracking
- `check_face_visibility()`, `check_frontal_face()`: face validation
- `extract_frames()`: video to PNG frames
- `download_item()`: Archive.org downloads

For detailed examples and API reference, see docs/5-LIBRARY-API.md.

    DARD/
    ├── archive_org_public_domain/        # Downloaded source files
    │   ├── videos/eng/, videos/spa/, ... # Language-organised video downloads (ISO 639-2)
    │   ├── videos/und/                   # Videos with no language metadata on Archive.org
    │   ├── images/                       # Image downloads (no language subfolder)
    │   ├── audio/eng/, audio/spa/, ...   # Language-organised audio downloads (ISO 639-2)
    │   ├── audio/und/                    # Audio with no language metadata on Archive.org
    │   ├── texts/ger/, texts/fra/, ...   # Language-organised text downloads (ISO 639-2)
    │   ├── texts/und/                    # Texts with no language metadata on Archive.org
    │   └── downloads.csv                 # Unified metadata (one row per file)
    ├── extracted_person_clips/           # Person clip videos + JSON sidecars + clips_extraction.csv + transcriptions_extraction.csv
    ├── extracted_image_detections/       # Per-image detection JSON + image_person_detection.csv
    ├── video_face_crops/                 # 616×616 OFIQ-aligned crops (video) + video_face_crops_extraction.csv
    ├── image_face_crops/                 # 616×616 OFIQ-aligned crops (image) + image_face_crops_extraction.csv
    ├── filtered_video_face_crops/        # Quality-filtered video crops + video_filtered_face_crops.csv + video_face_quality_annotation.csv
    ├── filtered_image_face_crops/        # Quality-filtered image crops + image_filtered_face_crops.csv + image_face_quality_annotation.csv
    ├── extracted_frames/                 # Optional PNG frames + frames_extraction.csv
    ├── audio_transcriptions/             # Whisper sidecars for audio files + audio_transcriptions_extraction.csv
    └── preprocessed_documents/           # Extracted text + annotation JSON + document_text_extraction.csv

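
The download layout above follows a simple rule: videos, audio, and texts are placed in a per-language subfolder (falling back to `und`), while images are not. A minimal sketch of that rule follows; the helper name `destination_dir` is hypothetical and is not part of dardcollect.

```python
from pathlib import Path

# Root of the download tree, as shown in the layout above
ROOT = Path("DARD/archive_org_public_domain")
LANGUAGE_ORGANISED = {"videos", "audio", "texts"}


def destination_dir(media_type, language=None, root=ROOT):
    """Return the folder for a downloaded file (hypothetical helper).

    language: ISO 639-2 code such as "eng"; None means Archive.org
    provided no language metadata, which maps to "und".
    """
    if media_type == "images":
        return root / "images"  # images have no language subfolder
    if media_type not in LANGUAGE_ORGANISED:
        raise ValueError(f"unknown media type: {media_type!r}")
    return root / media_type / (language or "und").lower()


print(destination_dir("videos", "eng").as_posix())
print(destination_dir("audio").as_posix())
print(destination_dir("images", "spa").as_posix())
```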
Every artifact is linked to its source via UUID: Archive.org ID → Download → Clip → Crop → Quality scores. See docs/2-LINEAGE.md for CSV schemas and traceability queries, and docs/3-ANNOTATIONS.md for sidecar JSON formats.
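
To make the UUID chain concrete, here is a small sketch of how parent links allow any artifact to be traced back to its Archive.org item. The record fields (`uuid`, `kind`, `parent_uuid`, `archive_id`) and both helper functions are assumptions made for illustration; the actual schemas are the ones documented in docs/2-LINEAGE.md.

```python
import uuid


def new_record(kind, parent=None, **fields):
    """Create a record with a fresh UUID and a link to its parent record."""
    return {
        "uuid": str(uuid.uuid4()),
        "kind": kind,
        "parent_uuid": parent["uuid"] if parent else None,
        **fields,
    }


def trace_to_source(record, index):
    """Follow parent links from any artifact back to the original download."""
    chain = [record]
    while chain[-1]["parent_uuid"] is not None:
        chain.append(index[chain[-1]["parent_uuid"]])
    return chain


# Build the chain: download -> clip -> crop -> quality scores
download = new_record("download", archive_id="example_item_id")
clip = new_record("clip", parent=download)
crop = new_record("crop", parent=clip)
scores = new_record("quality_scores", parent=crop)
index = {r["uuid"]: r for r in (download, clip, crop, scores)}

chain = trace_to_source(scores, index)
print(" <- ".join(r["kind"] for r in chain))
# quality_scores <- crop <- clip <- download
print(chain[-1]["archive_id"])
# example_item_id
```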
Each automated component is documented as an AI system per Annex IV of the EU AI Act, regardless of whether it uses a learned model or a rule-based algorithm. Face quality measures follow ISO/IEC 29794-5 via the OFIQ reference implementation.
| Task | System | Type | Implementation | Documentation |
|---|---|---|---|---|
| Detection | YOLOX-Tiny (HumanArt) | Neural network (ONNX) | `dardcollect/detector.py` | Model card |
| Tracking | OC-SORT | Algorithm (model-free) | `dardcollect/tracker.py` | System card |
| Pose estimation | CIGPose Wholebody (COCO 133) | Neural network (ONNX) | `dardcollect/poser.py` | Model card |
| Scene change detection | Luminance histogram + bbox area | Algorithm (rule-based) | `pipeline/extract_person_clips_from_videos.py` | System card |
| Clip segmentation | Face/duration/frontal rules | Algorithm (rule-based) | `pipeline/extract_person_clips_from_videos.py` | System card |
| Face quality: unified score | MagFace IResNet50 (ISO/IEC 29794-5) | Neural network (ONNX) | `pipeline/filter_face_crops_by_quality.py`, `pipeline/annotate_face_quality.py` | Model card |
| Face quality: sharpness | Face sharpness random forest (OFIQ Sharpness) | Algorithm (random forest) | `pipeline/annotate_face_quality.py` | Model card |
| Face quality: compression | SSIM CNN (OFIQ CompressionArtifacts) | Neural network (ONNX) | `pipeline/annotate_face_quality.py` | Model card |
| Face quality: expression neutrality | HSEmotion EfficientNet + AdaBoost (OFIQ ExpressionNeutrality) | Neural network + algorithm | `pipeline/annotate_face_quality.py` | Model card |
| Face quality: head coverings / occlusion | BiSeNet face parsing (OFIQ NoHeadCoverings / FaceOcclusionPrevention) | Neural network (ONNX) | `pipeline/annotate_face_quality.py` | Model card |
| Face quality: face occlusion segmentation | Face occlusion segmentation CNN | Neural network (ONNX) | `pipeline/annotate_face_quality.py` | Model card |
| Face quality: head pose | MobileNetV1 3DDFAV2 (OFIQ HeadPose) | Neural network (ONNX) | `pipeline/annotate_face_quality.py` | Model card |
| Audio transcription | Whisper-Small | Neural network (PyTorch) | `pipeline/transcribe_video_clips.py`, `pipeline/transcribe_audio_files.py` | Model card |
| Document OCR | PaddleOCR PP-OCRv4 (det + rec + cls) | Neural network (ONNX) | `pipeline/extract_text_from_doc.py` | Model card |
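
As an illustration of the rule-based scene change detector listed in the table (luminance histogram plus bounding-box area), the sketch below compares two frames. It is not the project's implementation: how the two cues are combined is not stated here, so this version flags a cut when either one fires, and the bin count and both thresholds are invented for the example. Frames are plain nested lists of 8-bit luminance values to keep the code dependency-free.

```python
def luminance_histogram(frame, bins=16):
    """Normalised histogram of 8-bit luminance values (frame: rows of ints 0..255)."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for value in row:
            counts[value * bins // 256] += 1
            total += 1
    if total == 0:
        raise ValueError("empty frame")
    return [c / total for c in counts]


def is_scene_change(prev_frame, frame, prev_area, area,
                    hist_threshold=0.5, area_ratio=2.0):
    """Flag a cut on a large histogram shift or a jump in person bbox area.

    Thresholds are illustrative assumptions, not dardcollect defaults.
    """
    hist_a = luminance_histogram(prev_frame)
    hist_b = luminance_histogram(frame)
    # L1 distance between normalised histograms lies in [0, 2]
    distance = sum(abs(a - b) for a, b in zip(hist_a, hist_b))
    bigger, smaller = max(prev_area, area), min(prev_area, area)
    area_jump = smaller > 0 and bigger / smaller >= area_ratio
    return distance >= hist_threshold or area_jump


dark = [[10] * 8 for _ in range(8)]
bright = [[240] * 8 for _ in range(8)]
print(is_scene_change(dark, bright, 1000, 1000))  # True: histograms do not overlap
print(is_scene_change(dark, dark, 1000, 1100))    # False: same content, similar area
print(is_scene_change(dark, dark, 1000, 2500))    # True: bbox area more than doubled
```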
Contributions are welcome. Please read docs/CONTRIBUTING.md for:
- Development setup and pre-commit hooks
- Code style: Ruff (linting & formatting) + ty (type checking)
- PR guidelines — including the requirement to document any new pipeline component as an AI system per EU AI Act Annex IV
The source code is licensed under the Apache License 2.0.
The bundled model weights in dardcollect/models/ carry separate licenses and are not covered by the Apache 2.0 license:
| Model | File | License |
|---|---|---|
| YOLOX-Tiny (HumanArt) | `yolox_tiny_...onnx` | Architecture: Apache 2.0; weights trained on HumanArt data ⚠ verify before commercial use |
| CIGPose Wholebody | `cigpose-m_...onnx` | Architecture: Apache 2.0; trained on COCO WholeBody ⚠ verify before commercial use |
| Whisper Small | `openai_whisper_small.pt` | MIT |
| MagFace IResNet50 | `magface_iresnet50_norm.onnx` | Apache 2.0 (MagFace); ONNX packaging MIT (OFIQ) |
| HSEmotion EfficientNet-B0/B2 | `enet_b0_...onnx`, `enet_b2_...onnx` | MIT (HSEmotion) |
| AdaBoost neutrality classifier | `hse_1_2_C_adaboost.yml.gz` | MIT (OFIQ) |
| BiSeNet, occlusion, head pose, sharpness, SSIM | various `.onnx` / `.xml.gz` | MIT (OFIQ) |
| PaddleOCR PP-OCRv4 (3 files) | `ch_PP-OCRv4_*.onnx`, `ch_ppocr_*.onnx` | Apache 2.0 (PaddleOCR) |
See NOTICE for full third-party attributions and dependency licenses (including ⚠ PyMuPDF AGPL-3.0).