Open Dictionary is a staged dictionary-production pipeline built on top of Wiktionary / Wiktextract data.
The current rewrite is PostgreSQL-first and contract-driven:
- raw source snapshots are ingested into tracked PostgreSQL tables
- assembled entries are produced as word-centric learner-entry skeletons
- definition generation produces learner-facing explanatory fields in an explicit target definition language
- final read-only artifacts are exported as distribution JSONL, distribution SQLite, and audit JSONL
- Install project dependencies: `uv sync`
- Configure a `.env` file with `DATABASE_URL`
- Ensure a PostgreSQL database is reachable via that URL
- Configure a model env file with `LLM_API`, `LLM_KEY`, and `LLM_MODEL` when running `generate-definitions` or any pipeline command that reaches the definition-generation stage
Default env-file behavior:
- if you do not pass `--env-file`, commands read `./.env` from the current working directory
- if you do not pass `--model-env-file`, LLM commands fall back to the same `--env-file` value
- there is no automatic parent-directory or repo-root search beyond that explicit `./.env` default
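Putting the variables above together, a combined `.env` file might look like this (all values are placeholders, not real defaults; the model name reuses an example that appears later in this document):

```shell
# PostgreSQL connection used by all stages (placeholder credentials)
DATABASE_URL=postgresql://opend:opend@localhost:5432/opend

# Model settings read by generate-definitions (placeholder values)
LLM_API=https://api.example.com/v1
LLM_KEY=sk-replace-me
LLM_MODEL=Qwen/Qwen3.5-35B-A3B-FP8
```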
The rewrite pipeline is explicitly staged:
```
database foundation                         -> init-db
source snapshot acquisition                 -> ingest-snapshot
raw PostgreSQL rows                         -> assemble-entries
curated learner-entry skeletons             -> generate-definitions
curated structure + generated explanations  -> export-distribution
                                            -> export-distribution-sqlite
                                            -> validate-distribution
optional debug artifact                     -> export-audit
```
The one-command wrapper for this flow is `run`, which still calls the stage contracts in order instead of hiding them behind implicit side effects.
All CLI commands now follow the same output conventions:
- stdout: one structured JSON result object
- stderr: progress events and warnings
Example progress lines:
```
[progress] stage=definitions.generate event=generate_progress processed=150 queued_entries=742 succeeded=150 failed=0
[progress] stage=distribution.validate event=validate_complete validated_entries=741
```
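Because each progress event is a `[progress]` prefix followed by flat `key=value` pairs, downstream tooling can parse stderr lines with a few lines of code. A minimal sketch, not part of the project itself; field names are taken from the examples above:

```python
def parse_progress(line: str) -> dict:
    """Parse one '[progress] key=value ...' stderr line into a dict."""
    fields = {}
    for token in line.removeprefix("[progress]").split():
        key, _, value = token.partition("=")
        # keep numeric counters as ints, everything else as strings
        fields[key] = int(value) if value.isdigit() else value
    return fields

event = parse_progress(
    "[progress] stage=definitions.generate event=generate_progress "
    "processed=150 queued_entries=742 succeeded=150 failed=0"
)
print(event["stage"], event["processed"])  # definitions.generate 150
```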
Apply the rewrite schemas and metadata tables:
```
uv run opend init-db
```

This creates the initial `meta`, `raw`, `curated`, `llm`, and `export` schemas.
Run the full staged pipeline from one CLI command:
```
uv run opend run \
  --archive-path fixtures/wiktionary/raw.jsonl \
  --model-env-file .env \
  --distribution-output data/export/distribution.jsonl \
  --validate-distribution
```

Recommended real-model run with adaptive concurrency tiers and both export artifacts:
```
uv run opend run \
  --archive-path fixtures/wiktionary/raw.jsonl \
  --model-env-file .env \
  --worker-tiers 50 12 4 1 \
  --distribution-output data/export/distribution.jsonl \
  --distribution-sqlite-output data/export/distribution.sqlite \
  --audit-output data/export/audit.jsonl \
  --validate-distribution
```

Example with an explicit non-default definition language:
```
uv run opend run \
  --archive-path fixtures/wiktionary/raw.jsonl \
  --lang-codes en \
  --definition-language-code fr \
  --definition-language-name French \
  --model-env-file .env \
  --distribution-output data/export/en-headwords-fr-definitions.jsonl \
  --validate-distribution
```

Useful pipeline flags:

- `--skip-init-db`
- `--lang-codes en zh`
- `--limit-groups 100`
- `--limit-entries 50`
- `--worker-tiers 50 12 4 1`
- `--distribution-output data/export/distribution.jsonl`
- `--distribution-sqlite-output data/export/distribution.sqlite`
- `--audit-output data/export/audit.jsonl`
- `--validate-distribution`
- `--definition-language-code fr`
- `--definition-language-name French`
Ingest a Wiktionary snapshot into the tracked raw tables:
```
uv run opend ingest-snapshot --workdir data/raw
```

Or ingest from an already downloaded local archive:
```
uv run opend ingest-snapshot \
  --archive-path /path/to/raw-wiktextract-data.jsonl.gz \
  --workdir data/raw
```

This command:
- downloads or registers a source snapshot
- records a tracked pipeline run in `meta.pipeline_runs`
- records the source snapshot in `meta.source_snapshots`
- writes stage progress into `meta.stage_checkpoints`
- loads entries into `raw.wiktionary_entries`
- records malformed source records in `raw.wiktionary_ingest_anomalies`
The rewrite `ingest-snapshot` stage reads `.jsonl.gz` archives directly and does not require a fully materialized extracted JSONL file.
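Reading a `.jsonl.gz` archive without ever writing the extracted file is straightforward with streaming decompression. A hedged sketch of that general technique over toy data (this is not the project's actual ingest code, and the record fields are invented):

```python
import gzip
import json
import tempfile

records = [{"word": "run", "lang_code": "en"}, {"word": "gehen", "lang_code": "de"}]

# build a tiny stand-in archive: one JSON object per line, gzip-compressed
with tempfile.NamedTemporaryFile(suffix=".jsonl.gz", delete=False) as tmp:
    path = tmp.name
with gzip.open(path, "wt", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")

# stream it back line by line; no extracted .jsonl file is materialized
loaded = []
with gzip.open(path, "rt", encoding="utf-8") as fh:
    for line in fh:
        loaded.append(json.loads(line))

print(len(loaded))  # 2
```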
Transform raw Wiktionary records into word-centric assembled entries:
```
uv run opend assemble-entries
```

Useful flags:
```
uv run opend assemble-entries --limit-groups 100
uv run opend assemble-entries --lang-codes en zh
uv run opend assemble-entries --replace-existing
```

This stage writes to:

- `curated.entries`
- `curated.entry_relations`
- `curated.triage_queue`
Generate structured learner-facing definitions from assembled entries:
```
uv run opend generate-definitions
```

This stage writes to:

- `llm.prompt_versions`
- `llm.entry_enrichments`
Useful flags:
```
uv run opend generate-definitions --limit-entries 50
uv run opend generate-definitions --model-env-file .env
uv run opend generate-definitions --max-workers 50
uv run opend generate-definitions --max-retries 3
uv run opend generate-definitions --recompute-existing
uv run opend generate-definitions --definition-language-code en --definition-language-name English
```

Export the current merged entries-plus-definitions audit artifact:
```
uv run opend export-audit --output data/export/audit.jsonl
```

Useful flags:

```
uv run opend export-audit --include-unenriched
uv run opend export-audit --model Qwen/Qwen3.5-35B-A3B-FP8
uv run opend export-audit --prompt-version curated_v1_distribution_fields_v2
uv run opend export-audit --definition-language-code en --definition-language-name English
```

This stage records metadata in:

- `export.artifacts`
Important:
- this audit artifact is not the final learner-facing distribution contract
- it intentionally preserves the internal `curated` and `definitions` stage split for debugging, replay, and auditability
- the learner-facing export is a separate command and schema
Export the learner-facing final JSONL artifact:
```
uv run opend export-distribution --output data/export/distribution.jsonl
```

Useful flags:

```
uv run opend export-distribution --model Qwen/Qwen3.5-35B-A3B-FP8
uv run opend export-distribution --prompt-version curated_v1_distribution_fields_v2
uv run opend export-distribution --definition-language-code en --definition-language-name English
```

This export:

- requires successful definition-generation rows with the distribution-field prompt contract
- flattens curated structure and generated explanatory fields into `distribution_entry_v1`
- skips entries that do not contain any distributable meanings after merge
- keeps model, prompt, retries, and provenance in artifact metadata rather than leaking them into each distribution row
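Since the distribution artifact is plain JSONL, one document per line, consumers can stream it without loading everything into memory. A minimal consumer sketch; the `headword` and `meanings` field names here are illustrative assumptions, not the documented `distribution_entry_v1` schema:

```python
import io
import json

# stand-in for data/export/distribution.jsonl (field names are assumptions)
sample = io.StringIO(
    '{"headword": "run", "meanings": [{"definition": "to move quickly"}]}\n'
    '{"headword": "walk", "meanings": []}\n'
)

# keep only entries that still carry at least one meaning
distributable = []
for line in sample:
    entry = json.loads(line)
    if entry["meanings"]:
        distributable.append(entry["headword"])

print(distributable)  # ['run']
```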
Validate an existing distribution JSONL file:
```
uv run opend validate-distribution \
  --input data/export/distribution.jsonl
```

Export the learner-facing final SQLite artifact:
```
uv run opend export-distribution-sqlite --output data/export/distribution.sqlite
```

Useful flags:

```
uv run opend export-distribution-sqlite --model Qwen/Qwen3.5-35B-A3B-FP8
uv run opend export-distribution-sqlite --prompt-version curated_v1_distribution_fields_v2
uv run opend export-distribution-sqlite --definition-language-code en --definition-language-name English
```

This export:

- contains the same learner-facing `distribution_entry_v1` content as the JSONL export
- stores the exact document row in `entries.document_json` for lossless round-trip
- also normalizes the artifact into queryable SQLite tables such as `entries`, `pos_groups`, `meanings`, `meaning_examples`, and relation tables
- records upstream run lineage and export metadata both in PostgreSQL `export.artifacts` and inside the SQLite `metadata` table
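The lossless round-trip means a stored `entries.document_json` row parses back into the exact exported document. A hedged sketch against a stand-in in-memory database (a single-column `entries` table and an invented document; the real artifact has more tables and columns):

```python
import json
import sqlite3

# stand-in for data/export/distribution.sqlite with a minimal entries table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (document_json TEXT NOT NULL)")
document = {"headword": "run", "language_code": "en"}
conn.execute("INSERT INTO entries VALUES (?)", (json.dumps(document),))

# round-trip: the stored row parses back to the original document
(row,) = conn.execute("SELECT document_json FROM entries").fetchone()
assert json.loads(row) == document
print(json.loads(row)["headword"])  # run
```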
Download the compressed snapshot archive for local inspection:
```
uv run opend fetch-snapshot --output data/raw-wiktextract-data.jsonl.gz
```

Extract the JSONL file when you need to inspect the raw records directly:

```
uv run opend unpack-snapshot \
  --input data/raw-wiktextract-data.jsonl.gz \
  --output data/raw-wiktextract-data.jsonl
```

The main commands are:

- `init-db`
- `fetch-snapshot`
- `unpack-snapshot`
- `ingest-snapshot`
- `assemble-entries`
- `generate-definitions`
- `export-audit`
- `export-distribution`
- `export-distribution-sqlite`
- `validate-distribution`
- `run`