DARDcollect — DETECTOR Archive Data Collector

A GPU-accelerated multi-modal toolkit for downloading, processing, and annotating historical public-domain media from the Internet Archive. Originally developed for the DETECTOR project, it downloads videos, images, audio, and documents organised by language; extracts person detections and pose keypoints; transcribes speech; extracts text from PDFs and plain-text files; and produces 616×616 face crops with rich .json sidecars — bounding boxes, keypoints, quality scores, transcriptions, and full provenance — with FAIR metadata throughout.

Use it in two ways:

  • Complete pipeline for bulk processing of historical media collections.
  • Modular library — import individual components (detection, transcription, OCR, face crops, quality scoring) into custom workflows.

Key Features

  • Eleven decoupled stages — download, video person detection, image person detection, video face crop extraction, image face crop extraction, quality filtering, quality annotation, video transcription, audio transcription, document text extraction, frame extraction — each resumable and independently re-runnable.
  • Pose-based filtering — face visibility, size, frontal orientation, and duplicate suppression all derived from CIGPose 133-keypoint poses; robust to the grain and low resolution of pre-1960 film.
  • Speech transcription — Whisper-Small transcribes both person-clip audio (video pipeline) and standalone audio files, writing .transcription.json sidecars with language detection.
  • Document text extraction — extracts text from PDFs (text layer or PaddleOCR fallback for scanned pages) and plain-text files with encoding detection, producing .text.txt + .annotation.json pairs.
  • GPU accelerated — YOLOX, CIGPose, and PaddleOCR run via ONNX on CUDA 12, and Whisper runs via PyTorch; CPU-only mode activates automatically when no GPU is available.
  • FAIR + EU AI Act — every artifact gets a UUID and full provenance chain; every model and rule-based algorithm is documented per Annex IV.
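
The automatic CPU-only mode can be pictured as an ordered preference list of execution providers. The sketch below assumes the ONNX models are served through ONNX Runtime, which this README does not state explicitly; `pick_providers` and `PREFERRED` are hypothetical names, not part of dardcollect.

```python
# Hypothetical helper: prefer CUDA when present, otherwise fall back to CPU.
# The provider names are the ones ONNX Runtime uses; whether dardcollect
# selects providers this way is an assumption.
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(available):
    """Return the preferred providers that are actually available, in order."""
    chosen = [name for name in PREFERRED if name in available]
    if not chosen:
        raise RuntimeError(f"no usable execution provider in {list(available)}")
    return chosen


# With ONNX Runtime installed, the list would typically come from
# onnxruntime.get_available_providers() and be passed on to
# onnxruntime.InferenceSession(model_path, providers=...).
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
print(pick_providers(["CPUExecutionProvider"]))
```

Keeping the CPU provider last in the list means the same code path works on machines with and without a GPU.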

Installation

As a Pipeline

For bulk processing of Archive.org media with the complete 11-stage workflow:

git clone https://github.com/Vicomtech/dardcollect.git && cd dardcollect
uv sync

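The pipeline processes media through four parallel modality tracks. The video and image face-crop tracks converge at quality filtering; the audio and document tracks end at transcription and text extraction respectively: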

                        ┌─ person clips ── face crops ─┐
Videos  ─── download ───┤                              ├── filter ── annotate
                        └─ transcriptions              │
                                                       │
Images  ─── download ─── detections ──── face crops ───┘

Audio   ─── download ─── transcriptions

Documents── download ─── extracted text

Then follow the step-by-step walkthrough in docs/0-GETTING-STARTED.md.
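
The diagram above can also be read as a dependency graph between stages. The following is a minimal sketch of that reading using the standard-library `graphlib` module. The stage names are illustrative labels, not real script names, and video transcription is assumed to depend on person clips because the feature list says Whisper transcribes person-clip audio.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on. Stage names are
# illustrative labels taken from the diagram, not real script names.
STAGES = {
    "download": set(),
    "video_person_clips": {"download"},
    "video_transcriptions": {"video_person_clips"},  # assumption, see above
    "video_face_crops": {"video_person_clips"},
    "image_detections": {"download"},
    "image_face_crops": {"image_detections"},
    "filter": {"video_face_crops", "image_face_crops"},
    "annotate": {"filter"},
    "audio_transcriptions": {"download"},
    "document_text": {"download"},
}


def run_order(stages):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(stages).static_order())


order = run_order(STAGES)
print(order)
```

Because only the two face-crop stages feed `filter`, the audio and document tracks can run at any point after `download` without affecting the face-crop results.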

As a Library (Custom Workflows)

To use individual components in your own Python scripts:

uv pip install git+https://github.com/Vicomtech/dardcollect.git

Then import and use components:

# Example: custom transcription + person detection workflow
from pathlib import Path

from dardcollect import AudioTranscriber, PersonDetector, download_item

# Download from archive.org with FAIR metadata
result = download_item("example_item_id", dest_dir=Path("media/"))

if result["success"]:
    # Transcribe audio
    transcriber = AudioTranscriber(model_size="small")
    text = transcriber.transcribe_file(result["path"])

    # Detect people in a single video frame.
    # `config` and `frame` are not defined in this snippet: supply your
    # detector configuration and a decoded video frame
    # (see docs/5-LIBRARY-API.md).
    detector = PersonDetector(config, model_path="models/yolox_tiny.onnx")
    bboxes, scores = detector.get_detections(frame)

Available components:

  • PersonDetector, PersonTracker, PoseEstimator — Detection & tracking
  • AudioTranscriber — Whisper speech-to-text
  • DocumentExtractor — Text extraction from PDFs and plain-text files, with PaddleOCR fallback for scanned pages
  • process_image(), process_video() — Face crop extraction (OFIQ 616×616)
  • load_models(), score_video() — OFIQ 7-dimensional quality scoring
  • add_fair_metadata(), generate_uuid() — Provenance tracking
  • check_face_visibility(), check_frontal_face() — Face validation
  • extract_frames() — Video to PNG frames
  • download_item() — Archive.org downloads

For detailed examples and API reference, see docs/5-LIBRARY-API.md.


Output Structure

DARD/
├── archive_org_public_domain/            # Downloaded source files
│   ├── videos/eng/, videos/spa/, ...     # Language-organised video downloads (ISO 639-2)
│   ├── videos/und/                       # Videos with no language metadata on Archive.org
│   ├── images/                           # Image downloads (no language subfolder)
│   ├── audio/eng/, audio/spa/, ...       # Language-organised audio downloads (ISO 639-2)
│   ├── audio/und/                        # Audio with no language metadata on Archive.org
│   ├── texts/ger/, texts/fra/, ...       # Language-organised text downloads (ISO 639-2)
│   ├── texts/und/                        # Texts with no language metadata on Archive.org
│   └── downloads.csv                     # Unified metadata (one row per file)
├── extracted_person_clips/               # Person clip videos + JSON sidecars + clips_extraction.csv + transcriptions_extraction.csv
├── extracted_image_detections/           # Per-image detection JSON + image_person_detection.csv
├── video_face_crops/                     # 616×616 OFIQ-aligned crops (video) + video_face_crops_extraction.csv
├── image_face_crops/                     # 616×616 OFIQ-aligned crops (image) + image_face_crops_extraction.csv
├── filtered_video_face_crops/            # Quality-filtered video crops + video_filtered_face_crops.csv + video_face_quality_annotation.csv
├── filtered_image_face_crops/            # Quality-filtered image crops + image_filtered_face_crops.csv + image_face_quality_annotation.csv
├── extracted_frames/                     # Optional PNG frames + frames_extraction.csv
├── audio_transcriptions/                 # Whisper sidecars for audio files + audio_transcriptions_extraction.csv
└── preprocessed_documents/               # Extracted text + annotation JSON + document_text_extraction.csv
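
The language-organised layout of the download folder can be expressed as a small path-building rule. This is a sketch based only on the tree above; `download_dir` and `LANGUAGE_ORGANISED` are hypothetical names and do not come from the dardcollect API.

```python
from pathlib import Path

# Media types that get a per-language subfolder, per the tree above.
LANGUAGE_ORGANISED = {"videos", "audio", "texts"}


def download_dir(root, media_type, language=None):
    """Build the download folder for one item (hypothetical helper)."""
    base = Path(root) / "archive_org_public_domain" / media_type
    if media_type not in LANGUAGE_ORGANISED:
        return base  # images have no language subfolder
    # "und" (undetermined) stands in for missing language metadata.
    return base / (language or "und")


print(download_dir("DARD", "videos", "eng").as_posix())
print(download_dir("DARD", "audio").as_posix())
print(download_dir("DARD", "images", "spa").as_posix())
```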

Every artifact is linked to its source via UUID: Archive.org ID → Download → Clip → Crop → Quality scores. See docs/2-LINEAGE.md for CSV schemas and traceability queries, and docs/3-ANNOTATIONS.md for sidecar JSON formats.
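
As a rough illustration of such a traceability query, the sketch below walks from a crop back to its Archive.org ID through in-memory stand-ins for the per-stage CSVs. Every column name and value here is hypothetical; the real schemas are described in docs/2-LINEAGE.md.

```python
import csv
import io

# Tiny in-memory stand-ins for the per-stage CSVs. All column names and
# values are hypothetical placeholders, not the project's real schema.
CROPS = "crop_uuid,clip_uuid\ncrop-1,clip-1\n"
CLIPS = "clip_uuid,download_uuid\nclip-1,dl-1\n"
DOWNLOADS = "download_uuid,archive_org_id\ndl-1,example_item_id\n"


def index(csv_text, key):
    """Read CSV text into a dict keyed by the given column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def trace_crop(crop_uuid):
    """Walk crop -> clip -> download and return the chain of identifiers."""
    crop = index(CROPS, "crop_uuid")[crop_uuid]
    clip = index(CLIPS, "clip_uuid")[crop["clip_uuid"]]
    download = index(DOWNLOADS, "download_uuid")[clip["download_uuid"]]
    return [
        download["archive_org_id"],
        download["download_uuid"],
        clip["clip_uuid"],
        crop["crop_uuid"],
    ]


print(trace_crop("crop-1"))
```

The same join pattern would extend one step further to quality scores, given a table keyed by the crop identifier.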


AI Systems (EU AI Act Annex IV)

Each automated component is documented as an AI system per Annex IV, regardless of whether it uses a learned model or a rule-based algorithm. Face quality measures follow ISO/IEC 29794-5 via the OFIQ reference implementation.

| Task | System | Type | Implementation | Documentation |
|---|---|---|---|---|
| Detection | YOLOX-Tiny (HumanArt) | Neural network (ONNX) | dardcollect/detector.py | Model card |
| Tracking | OC-SORT | Algorithm (model-free) | dardcollect/tracker.py | System card |
| Pose estimation | CIGPose Wholebody (COCO 133) | Neural network (ONNX) | dardcollect/poser.py | Model card |
| Scene change detection | Luminance histogram + bbox area | Algorithm (rule-based) | pipeline/extract_person_clips_from_videos.py | System card |
| Clip segmentation | Face/duration/frontal rules | Algorithm (rule-based) | pipeline/extract_person_clips_from_videos.py | System card |
| Face quality — unified score | MagFace IResNet50 (ISO/IEC 29794-5) | Neural network (ONNX) | pipeline/filter_face_crops_by_quality.py, pipeline/annotate_face_quality.py | Model card |
| Face quality — sharpness | Face sharpness random forest (OFIQ Sharpness) | Algorithm (random forest) | pipeline/annotate_face_quality.py | Model card |
| Face quality — compression | SSIM CNN (OFIQ CompressionArtifacts) | Neural network (ONNX) | pipeline/annotate_face_quality.py | Model card |
| Face quality — expression neutrality | HSEmotion EfficientNet + AdaBoost (OFIQ ExpressionNeutrality) | Neural network + algorithm | pipeline/annotate_face_quality.py | Model card |
| Face quality — head coverings / occlusion | BiSeNet face parsing (OFIQ NoHeadCoverings / FaceOcclusionPrevention) | Neural network (ONNX) | pipeline/annotate_face_quality.py | Model card |
| Face quality — face occlusion segmentation | Face occlusion segmentation CNN | Neural network (ONNX) | pipeline/annotate_face_quality.py | Model card |
| Face quality — head pose | MobileNetV1 3DDFAV2 (OFIQ HeadPose) | Neural network (ONNX) | pipeline/annotate_face_quality.py | Model card |
| Audio transcription | Whisper-Small | Neural network (PyTorch) | pipeline/transcribe_video_clips.py, pipeline/transcribe_audio_files.py | Model card |
| Document OCR | PaddleOCR PP-OCRv4 (det + rec + cls) | Neural network (ONNX) | pipeline/extract_text_from_doc.py | Model card |
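
To give a feel for what a rule-based luminance-histogram check can look like, here is a generic sketch of comparing two frames by their histograms. It is not the project's implementation: the bin count, the distance measure, and the threshold are illustrative assumptions, and the bbox-area criterion listed for scene change detection is not covered.

```python
def luminance_histogram(pixels, bins=16):
    """Normalised histogram of 8-bit luminance values (0-255)."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    counts = [0] * bins
    for value in pixels:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [count / total for count in counts]


def histogram_distance(a, b):
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint."""
    return sum(abs(x - y) for x, y in zip(a, b)) / 2


def is_scene_change(prev_pixels, next_pixels, threshold=0.5):
    """Flag a cut when the two histograms differ by more than the threshold."""
    prev_hist = luminance_histogram(prev_pixels)
    next_hist = luminance_histogram(next_pixels)
    return histogram_distance(prev_hist, next_hist) > threshold


# Two synthetic "frames": uniformly dark and uniformly bright.
dark = [10] * 100
bright = [240] * 100
print(is_scene_change(dark, dark), is_scene_change(dark, bright))
```

A histogram comparison ignores where pixels sit in the frame, so it tolerates camera motion better than a per-pixel difference.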

Contributing

Contributions are welcome. Please read docs/CONTRIBUTING.md for:

  • Development setup and pre-commit hooks
  • Code style: Ruff (linting & formatting) + ty (type checking)
  • PR guidelines — including the requirement to document any new pipeline component as an AI system per EU AI Act Annex IV

License

The source code is licensed under the Apache License 2.0.

The bundled model weights in dardcollect/models/ carry separate licenses and are not covered by the Apache 2.0 license:

| Model | File | License |
|---|---|---|
| YOLOX-Tiny (HumanArt) | yolox_tiny_...onnx | Architecture: Apache 2.0 — weights trained on HumanArt data ⚠ verify before commercial use |
| CIGPose Wholebody | cigpose-m_...onnx | Architecture: Apache 2.0 — trained on COCO WholeBody ⚠ verify before commercial use |
| Whisper Small | openai_whisper_small.pt | MIT |
| MagFace IResNet50 | magface_iresnet50_norm.onnx | Apache 2.0 (MagFace); ONNX packaging MIT (OFIQ) |
| HSEmotion EfficientNet-B0/B2 | enet_b0_...onnx, enet_b2_...onnx | MIT (HSEmotion) |
| AdaBoost neutrality classifier | hse_1_2_C_adaboost.yml.gz | MIT (OFIQ) |
| BiSeNet, occlusion, head pose, sharpness, SSIM | various .onnx / .xml.gz | MIT (OFIQ) |
| PaddleOCR PP-OCRv4 (3 files) | ch_PP-OCRv4_*.onnx, ch_ppocr_*.onnx | Apache 2.0 (PaddleOCR) |

See NOTICE for full third-party attributions and dependency licenses (including ⚠ PyMuPDF AGPL-3.0).
