feat(custom_data): emit MCF Source/Provenance nodes for DCP strict mode #121

Merged: jm-rivera merged 5 commits into main from feat/source-provenance-mcf-107 on Jun 29, 2026

@jm-rivera

Collaborator

What

Implements #107: emit first-class MCF Source/Provenance nodes and list-format inputFiles with dcid: provenance references, so that generated bundles pass DCP strict mode (DATA_RUN_MODE=dcpbridge). Also adds the richer Source/Provenance metadata that the platform surfaces through Mixer's GetVariableMetadata.

Breaking change. The legacy sources{} shorthand and the old add_provenance(...) signature are removed (not deprecated-in-place).

New API

  • add_source(*, name, url, …) and add_provenance(*, name, url, source, …) build MCF nodes (dcid:source/<name>, dcid:provenance/<name>, typeOf: dcs:Source/dcs:Provenance, sourceLink) into a default provenance.mcf. A bare name is minted to a dcid: id; an already-namespaced value is used verbatim.
  • 11 optional metadata kwargs on the builders (license, licenseType, refresh/release/observation dates, curator, isPartOf, …) plus an additional_properties escape hatch.
  • add_explicit_schema_file gains a pattern kwarg (config-only, mutually exclusive with file_name).
  • inputFiles is now a JSON list; each entry's provenance serializes as dcid:provenance/<name>.
  • export_all fails fast (before writing anything) if an input file references a provenance whose MCF file is not in mcf_file_names, so a partial bundle can never reach disk. export_config stays guard-free as the "config now, MCF later" escape hatch.
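
To illustrate the minting rule and the node shape described above, here is a minimal sketch. It is not the library's implementation: the token regex, the `kind` parameter, `render_node`, and the `name`/`url` property layout are assumptions made for this example. It uses the dcs: type form from this description; a later commit in this PR switches node types to dcid:.

```python
import re

# Assumed token rule for illustration: no whitespace, only dcid-safe characters.
_DCID_TOKEN = re.compile(r"^[A-Za-z0-9_\-./]+$")


def mint_dcid(value: str, kind: str) -> str:
    """Return a dcid reference: namespaced values pass through, bare names are minted."""
    if value.startswith("dcid:"):
        return value  # already namespaced, so it is used verbatim
    if not _DCID_TOKEN.match(value):
        raise ValueError(f"{value!r} is not a valid dcid token (no whitespace allowed)")
    return f"dcid:{kind}/{value}"


def render_node(dcid: str, type_of: str, props: dict[str, str]) -> str:
    """Render one MCF node as text; property values are emitted exactly as given."""
    lines = [f"Node: {dcid}", f"typeOf: {type_of}"]
    lines += [f"{key}: {value}" for key, value in props.items()]
    return "\n".join(lines)


# Hypothetical names and URL, for illustration only.
source_id = mint_dcid("ExampleAgency", "source")
provenance_id = mint_dcid("ExampleDataset", "provenance")
source_mcf = render_node(
    source_id,
    "dcs:Source",
    {"name": '"Example Agency"', "url": '"https://example.org"'},
)
```

Under these assumptions, an inputFiles entry would carry `provenance_id` (`dcid:provenance/ExampleDataset`) as its provenance reference, and a name containing whitespace would be rejected before any node is written.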

Removed (breaking)

Config.sources, the Source model, the old add_provenance(provenance_name, provenance_url, source_name, source_url), and rename_source / remove_source / rename_provenance / remove_by_source / remove_by_provenance. The provenance↔node integrity check moved from Config to CustomDataManager.

Testing

  • uv run ruff format --check, ruff check, and ty check are all clean; uv run pytest: 106 passed.
  • Goldens regenerated from real export output; the bundle matches the importer's single_entity_official_keys fixture shape (list inputFiles, dcid: refs, provenance.mcf with Source/Provenance, no sources{}).
  • Confirmed by tracing datacommonsorg/import that the strict validator normalizes the dcs: namespace on typeOf (kg_util/mcf_parser.py strips it before any triple reaches validation), so the emitted dcs:Source/dcs:Provenance form passes.
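
The namespace normalization traced in the last bullet can be pictured with a short sketch. This is not the importer's code: `strip_namespace` and the prefix list are assumptions for illustration, standing in for whatever kg_util/mcf_parser.py actually does.

```python
# Assumed set of reference prefixes; the real parser may recognize others.
_NAMESPACES = ("dcid:", "dcs:", "schema:")


def strip_namespace(value: str) -> str:
    """Drop a leading namespace prefix so prefixed forms compare equal."""
    for prefix in _NAMESPACES:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


# Both spellings reduce to the same bare type name before validation.
normalized = {strip_namespace(t) for t in ("dcs:Source", "dcid:Source")}
```

If the validator only ever sees the stripped value, the choice between dcs:Source and dcid:Source is cosmetic, which is the property this PR relies on.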

Remaining verification (not blocking this PR)

Two of the issue's acceptance criteria live downstream of this repo: end-to-end "passes DCP_BRIDGE strict validation" and "metadata surfaces through GetVariableMetadata". They're verified here only by fixture-shape matching plus the validator code trace above; a final confirmation by running a generated bundle through the real dcpbridge importer / Mixer is recommended before relying on it in production. A local pre-flight check is tracked separately in #110.

Co-authored-by: Claude <noreply@anthropic.com>

Replace the legacy sources{} shorthand with MCF-first add_source/add_provenance builders and reshape inputFiles from a dict to a list with dcid: provenance references, so generated bundles pass DCP strict mode (DATA_RUN_MODE=dcpbridge). Adds optional Source/Provenance metadata kwargs and a pattern kwarg on add_explicit_schema_file. export_all fails fast if a referenced provenance's MCF file is not exported.

BREAKING CHANGE: Config.sources and the Source model are removed. The old add_provenance(provenance_name, provenance_url, source_name, source_url) is replaced by add_source(name, url, ...) + add_provenance(name, url, source, ...). inputFiles is now a list, not a dict. rename_source, remove_source, rename_provenance, remove_by_source, and remove_by_provenance are removed.

Closes #107

Co-authored-by: Claude <noreply@anthropic.com>
@jm-rivera requested a review from tillywoodfield on June 26, 2026 14:54
@jm-rivera self-assigned this on Jun 26, 2026
Address PR review (3 findings):

- get_unregistered_csv_files now treats GCS files matching an inputFile pattern as registered, instead of reporting them as stray unregistered files.

- export_all's complete-bundle guard now also requires the MCF file defining each referenced provenance's sourceLink Source, not just the provenance file, so a Source kept in a separate unexported MCF file is caught.

- docs: source/provenance example names use valid dcid tokens (no whitespace) to match the strict mint_dcid contract, with a note explaining names become dcids.
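
The first two findings can be sketched as follows. This is a hedged illustration, not the code in this PR: the entry keys (`file_name`, `pattern`, `provenance`), the glob semantics for `pattern`, and the `node_files` / `source_of` lookup tables are all assumptions made for the example.

```python
from fnmatch import fnmatchcase


def is_registered(file_name: str, input_files: list[dict]) -> bool:
    """A file counts as registered if an entry names it or its pattern matches it."""
    for entry in input_files:
        if entry.get("file_name") == file_name:
            return True
        pattern = entry.get("pattern")
        # Assumption: patterns are shell-style globs, matched case-sensitively.
        if pattern and fnmatchcase(file_name, pattern):
            return True
    return False


def missing_mcf_files(
    input_files: list[dict],
    node_files: dict[str, str],  # node dcid -> MCF file that defines it
    source_of: dict[str, str],   # provenance dcid -> its Source dcid
    exported: set[str],          # MCF file names included in the export
) -> list[str]:
    """Return MCF files needed by referenced provenances (and their Sources) but not exported."""
    missing = set()
    for entry in input_files:
        provenance = entry["provenance"]
        # The guard covers the provenance node and the Source it links to.
        for node in (provenance, source_of[provenance]):
            mcf_file = node_files[node]
            if mcf_file not in exported:
                missing.add(mcf_file)
    return sorted(missing)
```

A fail-fast export would call something like `missing_mcf_files` before writing any file and raise if the result is non-empty, so a Source kept in a separate, unexported MCF file is caught up front.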

Co-authored-by: Claude <noreply@anthropic.com>
@jm-rivera marked this pull request as ready for review on June 26, 2026 15:15
@jm-rivera changed the title from "feat(custom_data)!: emit MCF Source/Provenance nodes for DCP strict mode" to "feat(custom_data): emit MCF Source/Provenance nodes for DCP strict mode" on Jun 26, 2026
Comment thread src/dcp_tools/custom_data/models/common.py
Comment thread src/dcp_tools/custom_data/models/sources.py Outdated
Comment thread src/dcp_tools/custom_data/models/sources.py Outdated
Add additional tests
Match the repo's existing node-type convention (dcid:Topic,
dcid:StatisticalVariable). The DC importer strips the namespace prefix,
so dcs: and dcid: import identically; this is a consistency change only.
@jm-rivera requested a review from tillywoodfield on June 29, 2026 12:37
@jm-rivera merged commit 765240f into main on Jun 29, 2026 (3 checks passed)
@jm-rivera deleted the feat/source-provenance-mcf-107 branch on June 29, 2026 12:41