Command-line tool to export data from the Science Museum Group Collections Online Elasticsearch index to CSV.
- Python 3.9+
- Access to the Collections Online Elasticsearch instance
```
cd collections-exporter
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Copy the template config and fill in your ES credentials:

```
cp .config.template .config
```

Edit `.config` with your Elasticsearch connection details:
```ini
[elasticsearch]
node = http://user:pass@your-es-host/path-prefix/
index = ciim

[export]
output_dir = exports
base_url = https://collection.sciencemuseumgroup.org.uk
media_path = https://coimages.sciencemuseumgroup.org.uk/
```

Note: If your ES instance is behind a reverse proxy on port 80 (no port in the URL), the client handles this automatically; there is no need to specify `:80`.
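For reference, the `.config` file above can be read with Python's standard `configparser`; a minimal sketch (the `load_settings` helper is an invented name, not part of the tool):

```python
import configparser

def load_settings(path=".config"):
    """Read the .config file and return (node, index, export options)."""
    cfg = configparser.ConfigParser()
    cfg.read(path)
    es = cfg["elasticsearch"]
    return es["node"], es["index"], dict(cfg["export"])
```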
The simplest way to run an export is with an export config file. These are JSON files in `export_configs/` that define the filters and options for a particular export. Example configs (suffixed `.example.json`) are checked into git as templates; copy one to a plain `.json` name to make your own:
```
cp export_configs/railway_pre1976.example.json export_configs/railway_pre1976.json
python exporter.py export_configs/railway_pre1976.json
```

Note: Plain `*.json` files in `export_configs/` are gitignored, so your private exports stay local. Only `*.example.json` files are tracked.
An export config looks like this:
```json
{
  "name": "Railway objects pre-1976",
  "description": "Passenger Comforts and Railway Models made before 1976",
  "categories": ["Passenger Comforts", "Railway Models"],
  "before_year": 1976,
  "include_images": true
}
```

Available fields:
| Field | Type | Description |
|---|---|---|
| `name` | string | Display name shown when the export runs |
| `description` | string | Human-readable description of the export |
| `categories` | string[] | Category names to filter by |
| `exclude_categories` | string[] | Category names to exclude |
| `collections` | string[] | Named collection titles to filter by (e.g. "Daily Herald Archive") |
| `before_year` | int | Only include objects made before this year |
| `include_images` | bool | Include image path, licence, copyright, and credit columns |
| `all_image_licences` | bool | Include images with any licence (default: only open licences) |
| `all_images` | bool | Include every image per record as numbered `image_<n>_*` columns (default: first image only); implies `include_images` |
| `max_images` | int | Cap per-record image columns when `all_images` is true (default: 10; 0 = no cap) |
| `download_images` | bool | Download images locally (implies `include_images`) |
| `jsonl` | bool | Also write `objects.jsonl` containing the raw ES `_source` per record |
| `output` | string | Output folder path (overrides the default timestamped folder) |
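As an illustration only, the filter fields above could translate into an Elasticsearch bool query along these lines. The `cumulation.collector` path appears later in this README; `categories.name` and `date_made_year` are invented stand-ins for the real index fields:

```python
def build_query(config):
    """Sketch: translate an export config dict into an ES bool query."""
    must, must_not = [], []
    if config.get("categories"):
        must.append({"terms": {"categories.name": config["categories"]}})
    if config.get("exclude_categories"):
        must_not.append({"terms": {"categories.name": config["exclude_categories"]}})
    if config.get("collections"):
        must.append({"terms": {"cumulation.collector": config["collections"]}})
    if config.get("before_year"):
        must.append({"range": {"date_made_year": {"lt": config["before_year"]}}})
    return {"query": {"bool": {"must": must, "must_not": must_not}}}
```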
To create a new export, add a JSON file to `export_configs/` and run it:

```
python exporter.py export_configs/my_export.json
```

Run all your private export configs in `export_configs/` in one go (skips `*.example.json` templates):

```
python exporter.py --all
```

Or specify multiple config files explicitly:

```
python exporter.py export_configs/railway_pre1976.json export_configs/my_other_export.json
```

Each config creates its own individual output folder as usual. A summary is printed at the end showing the total records exported across all configs.
Any CLI argument will override the corresponding export config value:
```
# Use config but override the date filter
python exporter.py export_configs/railway_pre1976.example.json --before-year 2000

# Use config but send output to a specific folder
python exporter.py export_configs/railway_pre1976.example.json -o my_export_folder
```

You can also run directly with CLI arguments:
```
# Export all Mimsy objects
python exporter.py

# Filter by category and date
python exporter.py --categories "Passenger Comforts" "Railway Models" --before-year 1976

# Exclude specific categories
python exporter.py --exclude-categories "Photographs" "Art"

# Filter by named collection (cumulation.collector)
python exporter.py --collections "Daily Herald Archive"

# Include image data (open licences only by default)
python exporter.py --categories "Railway Models" --include-images

# Include images with any licence
python exporter.py --categories "Railway Models" --include-images --all-image-licences

# Download images locally
python exporter.py --categories "Railway Models" --download-images

# Include every image per record (image_1_*, image_2_*, ...)
python exporter.py --collections "Daily Herald Archive" --all-images

# Also write a JSONL file with the raw ES record per line
python exporter.py --categories "Railway Models" --jsonl
```

By default the CSV captures only the first image (legacy columns `image_path`, `image_licence`, `image_copyright`, `image_credit`). Pass `--all-images` (or set `"all_images": true` in an export config) to emit every image in the multimedia array as numbered column groups:
```
image_1_path, image_1_licence, image_1_copyright, image_1_credit,
image_2_path, image_2_licence, image_2_copyright, image_2_credit,
...
image_N_path, image_N_licence, image_N_copyright, image_N_credit
```
N is determined upfront from the query (via an aggregation that finds the max multimedia array length across matching records), but capped by `--max-images` (default 10) to keep the CSV from blowing up when a single outlier record skews the column count. Records with fewer than N images leave trailing columns empty.
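For illustration, the same capping logic expressed client-side (the tool itself derives N server-side via an aggregation; `image_column_count` is an invented name):

```python
def image_column_count(records, max_images=10):
    """Widest multimedia array across records, capped (0 = no cap)."""
    n = max((len(r.get("multimedia", [])) for r in records), default=0)
    if max_images:
        n = min(n, max_images)
    return n
```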
```
# Default cap of 10 columns
python exporter.py --collections "Daily Herald Archive" --all-images

# Raise the cap
python exporter.py --collections "Daily Herald Archive" --all-images --max-images 50

# No cap: use the actual maximum across matching records
python exporter.py --collections "Daily Herald Archive" --all-images --max-images 0
```

When a record has more images than the cap, the extras are silently dropped from both the CSV columns and (if `--download-images` is on) the download queue. The open-licence filter (and the `--all-image-licences` override) applies per image; entries that don't pass become empty cells.
`--download-images` combined with `--all-images` downloads every image within the cap.
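Put together, the cap and the per-image licence filter behave roughly like this sketch (the open-licence list, field names, and `image_cells` helper are stand-ins, not the tool's actual values):

```python
OPEN_LICENCES = ("CC0", "CC BY")  # stand-in prefixes, not the tool's real set

def image_cells(images, n, all_licences=False):
    """One row's image cells: capped at n groups, closed licences blanked."""
    cells = []
    for img in images[:n]:
        if all_licences or str(img.get("licence", "")).startswith(OPEN_LICENCES):
            cells.extend([img.get("path", ""), img.get("licence", ""),
                          img.get("copyright", ""), img.get("credit", "")])
        else:
            cells.extend(["", "", "", ""])  # image exists but licence not open
    cells.extend([""] * 4 * (n - min(len(images), n)))  # pad short records
    return cells
```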
Pass `--jsonl` (or set `"jsonl": true` in an export config) to write a second file, `objects.jsonl`, alongside `objects.csv`. Each line is the full raw `_source` from Elasticsearch, useful for archival, downstream re-processing, or piping into `jq`. The query/filters are unchanged; this just adds a second output format.
Any `note` field is stripped at every nesting level before the record is written (top-level, inside nested objects, inside arrays of objects). This field typically contains cataloguer-internal annotations (sometimes with PII). Everything else passes through unchanged.
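A recursive strip like this would produce that behaviour (a sketch only; the tool's real implementation may differ):

```python
def strip_notes(value):
    """Remove every 'note' key at any nesting depth before writing JSONL."""
    if isinstance(value, dict):
        return {k: strip_notes(v) for k, v in value.items() if k != "note"}
    if isinstance(value, list):
        return [strip_notes(v) for v in value]
    return value
```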
When `--jsonl` is set, the script fetches the full ES document (no `_source` field filtering), so it's slightly slower than CSV-only mode.
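The JSONL writing itself is a one-object-per-line loop; a sketch using the standard `json` module (`write_jsonl` is an invented name):

```python
import json

def write_jsonl(records, path="objects.jsonl"):
    """Write one compact JSON object per line (one raw _source per record)."""
    with open(path, "w", encoding="utf-8") as f:
        for source in records:
            f.write(json.dumps(source, ensure_ascii=False) + "\n")
```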
Use `--download-images` to save images into a local `images/` folder within the export. The `image_path` column in the CSV will reference local paths instead of remote URLs:

```
python exporter.py --categories "Railway Models" --before-year 1850 --download-images
```

This produces:
```
exports/export_20260401_140513/
├── objects.csv        # image_path = images/288/534/medium_image.jpg
├── export_info.txt
└── images/
    ├── 288/534/medium_image.jpg
    ├── 105/964/medium_other.jpg
    └── ...
```
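The `288/534/medium_image.jpg` layout suggests the local path reuses the trailing segments of the remote URL; a hedged sketch of that mapping (the exact rule the tool uses, and the `local_image_path` name, are assumptions):

```python
from urllib.parse import urlparse

def local_image_path(url, depth=3):
    """Map a remote image URL to a local images/ path from its last segments."""
    parts = urlparse(url).path.strip("/").split("/")
    return "images/" + "/".join(parts[-depth:])
```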
Preview the query and document count without exporting:
```
python exporter.py export_configs/railway_pre1976.example.json --dry-run
```

```
usage: exporter.py [-h] [-c CONFIG] [-o OUTPUT] [-a]
                   [--categories CATEGORIES [CATEGORIES ...]]
                   [--exclude-categories EXCLUDE [EXCLUDE ...]]
                   [--collections COLLECTIONS [COLLECTIONS ...]]
                   [--before-year BEFORE_YEAR] [--include-images] [--all-image-licences]
                   [--download-images] [--all-images] [--max-images MAX_IMAGES] [--jsonl]
                   [--batch-size BATCH_SIZE] [--dry-run]
                   [export_configs ...]

positional arguments:
  export_configs        Path(s) to export config JSON file(s)

options:
  -h, --help            show this help message and exit
  -c, --config CONFIG   Path to server config file (default: .config)
  -o, --output OUTPUT   Output folder path (default: exports/<config>_<timestamp>/)
  -a, --all             Run all export configs in export_configs/
  --categories          Filter by category names (overrides export config)
  --exclude-categories  Exclude these category names (overrides export config)
  --collections         Filter by named collection title (overrides export config)
  --before-year         Only include objects made before this year (overrides export config)
  --include-images      Include image path, licence, copyright, and credit columns
  --all-image-licences  Include images with any licence (default: only open licences)
  --download-images     Download images locally (implies --include-images)
  --all-images          Include every image per record as numbered image_<n>_* columns (implies --include-images)
  --max-images N        Cap per-record image columns when --all-images is on (default: 10; 0 = no cap)
  --jsonl               Also write objects.jsonl with the raw ES _source per record
  --batch-size          Scroll batch size (default: 1000)
  --dry-run             Show the query and estimated count without exporting
```
Each export creates a timestamped folder:
```
exports/export_20260401_120000/
├── objects.csv        # the exported data
└── export_info.txt    # summary of settings and record count
```
| Field | Source |
|---|---|
| identifier | Primary identifier (accession number) |
| uid | Collection record ID (e.g. co12345) |
| created | Record created date (UTC) |
| modified | Record last modified date (UTC) |
| title | Primary title |
| object_name | Primary name |
| description | Primary description |
| date_made | Creation date (catalogue entry) |
| place_made | Creation place (catalogue entry) |
| maker | Creation maker (catalogue entry) |
| category | Category names (semicolon-separated) |
| materials | Material values (semicolon-separated) |
| measurements | Measurements display string |
| url | Public collection URL |
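Columns like category and materials are semicolon-separated joins of list fields; roughly like this sketch (the `name` key, the separator spacing, and the `join_values` helper are assumptions about the source data, not taken from the tool):

```python
def join_values(items, key="name"):
    """Semicolon-join a list of {key: ...} dicts into one CSV cell."""
    return "; ".join(str(i[key]) for i in items if i.get(key))
```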
With `--include-images` or `--download-images` (first image only):
| Field | Source |
|---|---|
| image_path | URL to large image, or local path if downloading |
| image_licence | Image licence (e.g. CC BY-NC-SA 4.0) |
| image_copyright | Image copyright holder |
| image_credit | Image credit line |
With `--all-images` (every image per record), the four columns above become numbered groups: `image_1_path`, `image_1_licence`, `image_1_copyright`, `image_1_credit`, `image_2_path`, …, up to `image_N_*` where N is the largest multimedia array across all matching records. Records with fewer than N images leave trailing columns empty.
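Generating the numbered header groups is mechanical; for illustration (`image_headers` is an invented helper):

```python
def image_headers(n):
    """Column names image_1_path ... image_<n>_credit for --all-images mode."""
    fields = ("path", "licence", "copyright", "credit")
    return [f"image_{i}_{f}" for i in range(1, n + 1) for f in fields]
```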