Flask app for identifying, storing and browsing historical documents from the GLOBALISE corpus.
- Python 3.13 or higher
- uv package manager (recommended)
```
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with pip
pip install uv

# Or with pipx
pipx install uv
```

Install all dependencies (this creates `.venv/` automatically):

```
uv sync
```

This will automatically:
- Create a virtual environment in `.venv/`
- Install all required Python packages
- Set up the project for development
The application requires several data files that are too large to include in the repository. These files must be obtained separately and placed in the data/ directory before running the import scripts.
Place the following files in the data/ directory:
- documents_for_django.csv - Scan metadata and inventory information (original dataset)
- documents_for_django_2025.csv - Additional scan metadata (2025 dataset)
- page_metadata.csv - Page-level metadata including folio numbers and scan types
- page_metadata_new_inventories.csv - Page metadata for newly added inventories
- inventory2dates.json - Date ranges for each inventory
- inventory2dates_extra.json - Extra dates missing in EAD for inventories
- inventory2handle.json - Handle URLs for inventories
- inventory2titles.json - Titles for inventories
- inventory2uuid.json - UUID mappings for inventories
- inventories.json - Complete inventory information
- overview_general_missives.csv - Ground truth overview of the General Missives (used by step 8)
- archival_hierarchy.json - Archival series and hierarchy structure
- pp_project_globalisethesaurus.ttl - SKOS thesaurus with GLOBALISE and TANAP document types
- location_index.csv - Settlement/location index with GLOB IDs and spelling variants
- globalise_digitized_indexes.csv - TANAP-digitized catalog records (OBP index)
- annotationpages.csv - Per-scan flags indicating availability of transcription, entity, and event annotation pages
Your data/ directory should look like:
```
data/
├── documents_for_django.csv
├── documents_for_django_2025.csv
├── page_metadata.csv
├── page_metadata_new_inventories.csv
├── inventory2dates.json
├── inventory2dates_extra.json
├── inventory2handle.json
├── inventory2titles.json
├── inventory2uuid.json
├── inventories.json
├── overview_general_missives.csv
├── archival_hierarchy.json
├── pp_project_globalisethesaurus.ttl
├── location_index.csv
├── annotationpages.csv
└── globalise_digitized_indexes.csv
```
Run the numbered import scripts sequentially to create and populate the SQLite database:
```
uv run python 1_import_scans_and_inventories.py
```

This script:
- Creates the database tables
- Imports inventory records from JSON files
- Imports scan metadata from CSV files
- Links scans to their respective inventories
- Expected runtime: 2-5 minutes depending on data size
```
uv run python 2_import_pages.py
```

This script:
- Updates scan types (single/double page)
- Creates page records with detailed metadata
- Links pages to scans
- Maps folio numbers and recto/verso positions
- Expected runtime: 5-10 minutes
```
uv run python 3_import_hierarchy.py data/archival_hierarchy.json
```

This script:
- Imports archival series (sets) and subseries
- Establishes parent-child relationships
- Updates inventory records with series information
- Expected runtime: 1-2 minutes
```
uv run python 4_identify_documents_baseline.py
```

This script implements a baseline document identification method for early modern archival documents:
- Creates a document identification method record
- Skips empty pages at the beginning of inventories (covers, archival covers)
- Identifies document boundaries based on:
  - Empty page sequences (is_blank=True)
  - Pages with signatures (indicating document end)
- Creates Document records and links them to pages
- Expected runtime: Varies based on inventory size
Note: This is a baseline implementation. More sophisticated document identification methods can be added as additional scripts that create different DocumentIdentificationMethod records.
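The boundary rules above can be sketched in a few lines of Python. This is a simplified illustration over a plain page list, not the repository's actual implementation (which works against the SQLAlchemy models):

```python
from dataclasses import dataclass

@dataclass
class Page:
    number: int
    is_blank: bool = False
    has_signature: bool = False

def identify_documents(pages):
    """Split a page sequence into documents using the baseline rules:
    skip leading blanks, end a document on a signature or a blank run."""
    docs, current = [], []
    # Skip empty pages at the start of the inventory (covers, archival covers)
    i = 0
    while i < len(pages) and pages[i].is_blank:
        i += 1
    for page in pages[i:]:
        if page.is_blank:
            if current:          # a blank run closes the open document
                docs.append(current)
                current = []
            continue
        current.append(page)
        if page.has_signature:   # a signature marks the document's end
            docs.append(current)
            current = []
    if current:                  # flush a trailing, unterminated document
        docs.append(current)
    return docs
```

Because the method is recorded as a DocumentIdentificationMethod, alternative segmenters can coexist in the database alongside this baseline.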
```
uv run python 5_import_document_types.py

# or with explicit paths:
uv run python 5_import_document_types.py --ttl /path/to/thesaurus.ttl --database sqlite:///globalise_documents.db
```

This script reads document types from the pp_project_globalisethesaurus.ttl file and stores each type's UUID, its English and Dutch prefLabels, and whether it is a GLOBALISE or TANAP document type.
```
uv run python 6_import_settlements.py

# or with explicit paths:
uv run python 6_import_settlements.py --csv /path/to/location_index.csv --database sqlite:///globalise_documents.db
```

This script:
- Imports settlement (location) data from location_index.csv
- Creates one Settlement per unique GLOB ID
- Creates multiple SettlementLabel records per settlement for spelling variants and alternative names
- Skips already existing settlements and labels on re-runs
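The grouping step can be sketched as follows. The column names `glob_id` and `label` are assumptions for illustration, not the actual headers of location_index.csv:

```python
import csv
from collections import defaultdict

def group_settlement_labels(csv_path):
    """Collect all spelling variants per GLOB ID: one Settlement per key,
    one SettlementLabel per distinct variant. Column names are assumed."""
    labels = defaultdict(set)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            labels[row["glob_id"]].add(row["label"])
    return labels
```

Using a set per GLOB ID makes the grouping idempotent, which mirrors the script's skip-existing behaviour on re-runs.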
```
uv run python 7_import_obp_index.py

# or with explicit paths:
uv run python 7_import_obp_index.py --csv /path/to/globalise_digitized_indexes.csv --database sqlite:///globalise_documents.db
```

This script:
- Imports TANAP-digitized catalog records (OBP index) from CSV
- Creates Document records with titles, dates, folio ranges, and locations
- Links documents to document types extracted from PoolParty URIs
- Resolves settlement labels to settlement records
- Creates external ID records for OBP_INDEX, TANAP, and DIGITIZED TYPOSCRIPTS contexts
- Creates a "TANAP Digitized Index" document identification method
- Depends on steps 1–6 (requires inventories, document types, and settlements in the database)
```
uv run python 8_import_GM.py [--dry-run]
```

Uses the ground truth for General Missives to add documents. Requires the file overview_general_missives.csv to be in the data/ directory.
```
uv run python 9_import_annotation_pages_exist.py [--dry-run]
```

Sets has_transcriptions, has_entities, and has_events flags on Scan records based on annotationpages.csv. These flags control whether annotation page links are included in IIIF manifest exports.
After running the import scripts, you should have a populated globalise_documents.db file.
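A quick way to sanity-check the populated database is to list the row count of every table. This is a convenience sketch using the standard library, not part of the repository:

```python
import sqlite3

def table_counts(db_path="globalise_documents.db"):
    """Return {table_name: row_count} for every table in the SQLite file,
    as a quick sanity check after the import scripts have run."""
    con = sqlite3.connect(db_path)
    try:
        tables = [r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        return {t: con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
                for t in tables}
    finally:
        con.close()
```

Zero rows in a core table (Inventory, Scan, Page) usually means a data file was missing or a script was skipped.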
Start the Flask web server:
```
uv run python app.py
```

Then open your browser to: http://localhost:5000
Alternatively, you can run the application using Docker:
```
docker pull ghcr.io/globalise-huygens/documents:latest
docker run -p 8000:8000 -v ./globalise_documents.db:/app/globalise_documents.db --rm ghcr.io/globalise-huygens/documents:latest
```

Then open your browser to: http://localhost:8000
- Home - Overview and statistics
- Inventories - Browse all archive inventories
- Documents - Search and filter documents
- Scans - View document scans with IIIF images
- Pages - Explore individual pages with metadata
- Series - Browse archival series hierarchy
- Settlements - Browse settlement locations
- Document Types - Browse document type classifications (GLOBALISE and TANAP)
- Methods - View document identification methods with timeline visualization
- Search - Full-text search across all content
```
uv run python export_manifests.py
```

Exports individual IIIF 3.0 Manifest JSON files for every inventory (objects/inventory/<number>.manifest.json). Output is gzipped and ready for S3 upload.

```
uv run python export_collection.py
```

Exports a top-level IIIF 3.0 Collection JSON file (objects/inventory/collection.json) that references all inventory manifests. Output is gzipped and ready for S3 upload.

Sync the exported files with:

```
aws s3 sync objects/inventory/ s3://globalise-data/objects/inventory --acl=public-read --content-encoding gzip
```
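The exported Collection follows the IIIF Presentation API 3.0 shape: a `Collection` whose `items` reference one `Manifest` per inventory. A minimal sketch of that structure (the base URL and labels are placeholders, not the exporter's actual values):

```python
def make_collection(inventory_numbers, base="https://example.org/iiif"):
    """Build a minimal IIIF Presentation 3.0 Collection referencing one
    Manifest per inventory. `base` is a placeholder, not the real host."""
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base}/collection.json",
        "type": "Collection",
        "label": {"en": ["GLOBALISE inventories"]},
        "items": [
            {
                "id": f"{base}/{inv}.manifest.json",
                "type": "Manifest",
                "label": {"en": [f"Inventory {inv}"]},
            }
            for inv in inventory_numbers
        ],
    }
```

IIIF viewers that understand Collections can then page through all inventories from this single entry point.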
The application uses SQLite with the following main tables:
- Inventory - Archive inventory records
- InventoryTitle - Titles for inventories
- Series - Archival series hierarchy (sets and subsets)
- Scan - Digital scans with IIIF URLs
- Page - Individual pages with folio numbers and metadata
- Document - Document records with date ranges
- DocumentIdentificationMethod - Methods used to identify documents
- Page2Document - Many-to-many relationships between pages and documents
- DocumentType - GLOBALISE and TANAP document type classifications
- Document2DocumentType - Many-to-many relationships between documents and types
- Settlement - Settlement/location entities with GLOB IDs
- SettlementLabel - Spelling variants and alternative names for settlements
- ExternalID - External identifiers (OBP, TANAP, etc.)
- Document2ExternalID - Many-to-many relationships between documents and external IDs
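A many-to-many link table such as Page2Document is typically expressed in SQLAlchemy with an association table and `secondary` relationships. The column and attribute names below are illustrative, not copied from models.py:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Page(Base):
    __tablename__ = "page"
    id = Column(Integer, primary_key=True)
    folio = Column(String)  # e.g. "1r", "1v"
    documents = relationship("Document", secondary="page2document",
                             back_populates="pages")

class Document(Base):
    __tablename__ = "document"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    pages = relationship("Page", secondary="page2document",
                         back_populates="documents")

class Page2Document(Base):
    """Association table: a page may belong to several documents and
    a document spans many pages."""
    __tablename__ = "page2document"
    page_id = Column(Integer, ForeignKey("page.id"), primary_key=True)
    document_id = Column(Integer, ForeignKey("document.id"), primary_key=True)
```

The same pattern covers Document2DocumentType and Document2ExternalID.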
```
documents/
├── app.py                               # Main Flask application
├── models.py                            # SQLAlchemy database models
├── db_utils.py                          # Database management utilities
├── 1_import_scans_and_inventories.py    # Import script (step 1)
├── 2_import_pages.py                    # Import script (step 2)
├── 3_import_hierarchy.py                # Import script (step 3)
├── 4_identify_documents_baseline.py     # Document identification (baseline method)
├── 5_import_document_types.py           # Import document types from thesaurus (step 5)
├── 6_import_settlements.py              # Import settlements (step 6)
├── 7_import_obp_index.py                # Import OBP index records (step 7)
├── 8_import_GM.py                       # Import GM data (step 8)
├── 9_import_annotation_pages_exist.py   # Import annotation page flags (step 9)
├── export.py                            # Linked Art JSON-LD serialization helpers
├── export_collection.py                 # Export IIIF Collection
├── export_manifests.py                  # Export IIIF Manifests
├── Dockerfile                           # Container configuration
├── requirements.txt                     # Python dependencies (for pip)
├── pyproject.toml                       # UV/project configuration
├── data/                                # Data files (not in repo)
└── templates/                           # HTML templates
    ├── base.html
    ├── index.html
    ├── inventories.html
    ├── inventory_detail.html
    ├── documents.html
    ├── document_detail.html
    ├── document_types.html
    ├── document_type_detail.html
    ├── scans.html
    ├── scan_detail.html
    ├── pages.html
    ├── page_detail.html
    ├── settlements.html
    ├── settlement_detail.html
    ├── methods.html
    └── method_detail.html
```
```
# Add a new package
uv add package-name

# Add a development dependency
uv add --dev package-name

# Update all packages
uv sync --upgrade
```

[TODO: Add license information]