discogs-xml-converter

Purpose-built Rust tool for converting Discogs XML data dumps to CSV files compatible with the discogs-etl ETL pipeline.

Replaces three Python scripts with a single binary:

| Python script | What it does | How Rust replaces it |
|---|---|---|
| discogs-xml2db (external) | XML to CSV | Built-in quick-xml parser |
| fix_csv_newlines.py | Fix embedded newlines | Rust csv crate produces RFC 4180 output |
| filter_csv.py | Filter by artist | --library-artists flag filters during parsing |
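The embedded-newline cleanup that fix_csv_newlines.py performed is unnecessary with an RFC 4180 writer, because fields containing newlines are quoted rather than split across records. A minimal Python sketch of that behavior (the tool itself does this with the Rust csv crate; the sample values are made up):

```python
import csv
import io

# RFC 4180: a field containing a newline, comma, or quote is wrapped in
# double quotes, so a multi-line notes value stays one logical record.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\r\n")
writer.writerow(["1", "Accepted", "Some Title\nwith an embedded newline"])

# Reading it back recovers the original field intact.
row = next(csv.reader(io.StringIO(buf.getvalue())))
assert row[2] == "Some Title\nwith an embedded newline"
```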

Usage

The CLI follows the WXYC cache-builder convention with two subcommands:

discogs-xml-converter build  <input> [--data-dir DIR] [--library-artists FILE] [--limit N] [--resume]
discogs-xml-converter import <input> [--database-url URL] [--data-dir DIR] [--fresh] [--batch-size N]

build writes CSV files; import streams releases directly into PostgreSQL via COPY.

build — XML to CSV

# Convert all releases to CSV
discogs-xml-converter build releases.xml.gz --data-dir /path/to/csv/

# Convert and filter to library artists only
discogs-xml-converter build releases.xml.gz --data-dir /path/to/filtered/ \
  --library-artists library_artists.txt

# Limit records for testing
discogs-xml-converter build releases.xml.gz --data-dir /tmp/test/ --limit 100

import — XML to PostgreSQL

A database connection is required: pass --database-url, or set the DATABASE_URL_DISCOGS environment variable, which the CLI falls back to when the flag is omitted.

# Stream releases directly into PostgreSQL
discogs-xml-converter import /path/to/xml-dumps/ \
  --data-dir /path/to/supplementary/ \
  --library-artists library_artists.txt \
  --database-url postgresql://localhost:5432/discogs

# Same, using DATABASE_URL_DISCOGS env var
DATABASE_URL_DISCOGS=postgresql://localhost:5432/discogs \
  discogs-xml-converter import /path/to/xml-dumps/ \
    --library-artists library_artists.txt

# Drop existing release rows before importing (--fresh truncates release CASCADE)
discogs-xml-converter import releases.xml.gz \
  --database-url postgresql://localhost:5432/discogs \
  --fresh

# Tune batch size (default: 100000)
discogs-xml-converter import releases.xml.gz \
  --database-url postgresql://localhost:5432/discogs \
  --batch-size 50000

Supplementary CSVs (artist_alias.csv, label_hierarchy.csv) are still written to --data-dir in directory-input mode.
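The --batch-size flag controls how many parsed releases are buffered before each flush to PostgreSQL. A minimal Python sketch of that buffering logic (the flush callback stands in for the COPY round-trip; names are illustrative, not the tool's internals):

```python
def import_releases(releases, flush, batch_size=100_000):
    """Buffer parsed releases and hand them off in fixed-size batches.

    `flush` is a stand-in for the COPY round-trip to PostgreSQL.
    """
    batch = []
    for release in releases:
        batch.append(release)
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:  # flush the final partial batch
        flush(batch)

# 250 releases with batch_size=100 produce flushes of 100, 100, and 50.
flushed = []
import_releases(range(250), flushed.append, batch_size=100)
assert [len(b) for b in flushed] == [100, 100, 50]
```

A larger batch size means fewer COPY round-trips at the cost of more memory held per flush.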

Options

Common to both subcommands:

| Flag | Description |
|---|---|
| --data-dir DIR | Working directory for CSV outputs (default: ./data) |
| --library-artists FILE | Filter to releases by artists in this file (one per line) |
| --limit N | Stop after N releases |
| --progress-interval N | Log progress every N releases (default: 100000) |
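The --library-artists file holds one artist name per line, and matching uses NFKD name normalization (see Performance). A Python sketch of how such a filter might be applied, assuming accent- and case-insensitive matching (the exact normalization rules are the tool's, not shown here):

```python
import unicodedata

def norm(name: str) -> str:
    # NFKD-decompose, drop combining marks, case-fold: "Björk" -> "bjork".
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()

def load_library_artists(lines):
    # One artist per line; blank lines ignored, names normalized for lookup.
    return {norm(line) for line in lines if line.strip()}

library = load_library_artists(["Björk", "Stereolab"])
assert norm("BJORK") in library          # accent/case-insensitive match
assert norm("Autechre") not in library   # not in the library file
```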

build only:

| Flag | Description |
|---|---|
| --resume | Reserved for future resumable builds (currently a no-op; this tool is single-pass) |
| --state-file FILE | Path to the state file used by --resume (default: ./state.json) |

import only:

| Flag | Description |
|---|---|
| --database-url URL | PostgreSQL connection URL. Falls back to DATABASE_URL_DISCOGS |
| --fresh | TRUNCATE release ... CASCADE before importing |
| --batch-size N | Releases to buffer before flushing to PostgreSQL (default: 100000) |

--output-dir is accepted as a deprecated alias for --data-dir and emits a stderr warning. It will be removed in the next release.

Gzipped input is auto-detected by .gz extension.

CSV Output

The build subcommand produces six CSV files:

| File | Key columns |
|---|---|
| release.csv | id, status, title, country, released, notes, data_quality, master_id, format |
| release_artist.csv | release_id, artist_id, artist_name, extra, anv, position, join_field |
| release_label.csv | release_id, label, catno |
| release_track.csv | release_id, sequence, position, title, duration |
| release_track_artist.csv | release_id, track_sequence, artist_name |
| release_image.csv | release_id, type, width, height, uri |

These are consumed by discogs-etl/scripts/import_csv.py using csv.DictReader.
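A minimal sketch of how a csv.DictReader consumer sees these files, using the release.csv header from the table above (the sample row is made up, not real Discogs data):

```python
import csv
import io

# A release.csv fragment: header row as in the table above, one sample row.
sample = (
    "id,status,title,country,released,notes,data_quality,master_id,format\n"
    "1,Accepted,Some Album,UK,1994,,Correct,5678,LP\n"
)

# DictReader keys each field by column name, so consumers are insensitive
# to column order as long as the header row is present.
rows = list(csv.DictReader(io.StringIO(sample)))
assert rows[0]["title"] == "Some Album"
assert rows[0]["master_id"] == "5678"
```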

Performance

Release processing is parallelized across all CPU cores:

  1. A scanner thread reads the XML input and finds <release> element boundaries by byte scanning
  2. A rayon worker pool parses XML and performs NFKD artist name normalization in parallel
  3. The main thread writes matched releases to CSV sequentially, preserving document order
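Step 1 can be illustrated in Python (the real scanner is Rust and streams rather than holding the file in memory; this sketch assumes each release element has attributes, i.e. a space after the tag name, as in the Discogs dumps):

```python
def scan_releases(data: bytes):
    """Yield each <release ...>...</release> span by raw byte scanning.

    No XML parsing happens here: spans are located with byte searches and
    handed off whole, so the expensive per-release parse can run on a
    worker pool. Searching for b"<release " (with a trailing space) avoids
    matching the enclosing <releases> wrapper element.
    """
    pos = 0
    while True:
        start = data.find(b"<release ", pos)
        if start == -1:
            return
        end = data.find(b"</release>", start)
        if end == -1:
            return
        end += len(b"</release>")
        yield data[start:end]
        pos = end

xml = (b'<releases><release id="1"><title>A</title></release>'
       b'<release id="2"><title>B</title></release></releases>')
chunks = list(scan_releases(xml))
assert len(chunks) == 2
assert chunks[0].startswith(b'<release id="1"')
```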

Artist and label XML files are also processed in parallel when both are present in directory mode. In directory mode, the release scanner starts before artist/label processing completes, overlapping the large file read with smaller-file CPU work.

Gzipped input

The converter streams .xml.gz files via on-the-fly decompression (flate2 GzDecoder), auto-detected by the .gz extension. This reduces disk I/O from ~57GB to ~5GB for the releases file, but adds CPU overhead for decompression. Net win when the bottleneck is disk throughput (slow or external drives); neutral or slightly slower on fast NVMe where CPU is the bottleneck.
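The extension-based auto-detection amounts to wrapping the reader in a streaming decompressor. A Python sketch of the idea (the tool uses flate2's GzDecoder; the function name here is illustrative):

```python
import gzip
import io

def open_xml(path: str, raw):
    # Auto-detect by extension: wrap .gz input in a streaming decompressor,
    # otherwise read the file as-is. Decompression happens on the fly, so
    # the decompressed XML is never materialized on disk.
    if path.endswith(".gz"):
        return gzip.GzipFile(fileobj=raw)
    return raw

payload = b"<releases></releases>"
compressed = io.BytesIO(gzip.compress(payload))
assert open_xml("releases.xml.gz", compressed).read() == payload
assert open_xml("releases.xml", io.BytesIO(payload)).read() == payload
```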

Building

Requires the Rust toolchain.

cargo build --release
cargo install --path .

cargo install copies the binary to ~/.cargo/bin/discogs-xml-converter, which is on your PATH if you installed Rust via rustup. Alternatively, reference the binary directly at target/release/discogs-xml-converter.

Testing

cargo test

All tests use hand-written XML fixtures; no external data dumps needed.

Integration with discogs-etl

Direct-to-PostgreSQL pipeline (recommended)

Streams releases directly into PostgreSQL, eliminating the CSV round-trip:

python scripts/run_pipeline.py \
  --xml /path/to/xml-dumps/ \
  --library-artists library_artists.txt \
  --database-url postgresql://localhost:5432/discogs \
  --direct-pg

CSV pipeline

Feed the output into the --csv-dir pipeline mode:

# Convert and filter
discogs-xml-converter build releases.xml.gz \
  --data-dir /path/to/filtered/ \
  --library-artists library_artists.txt

# Run database build
python scripts/run_pipeline.py \
  --csv-dir /path/to/filtered/ \
  --database-url postgresql://localhost:5432/discogs
