feat(custom_data): emit MCF Source/Provenance nodes for DCP strict mode#121
Merged
Replace the legacy sources{} shorthand with MCF-first add_source/add_provenance builders and reshape inputFiles from a dict to a list with dcid: provenance references, so generated bundles pass DCP strict mode (DATA_RUN_MODE=dcpbridge). Adds optional Source/Provenance metadata kwargs and a pattern kwarg on add_explicit_schema_file. export_all fails fast if a referenced provenance's MCF file is not exported.
BREAKING CHANGE: Config.sources and the Source model are removed. The old add_provenance(provenance_name, provenance_url, source_name, source_url) is replaced by add_source(name, url, ...) + add_provenance(name, url, source, ...). inputFiles is now a list, not a dict. rename_source, remove_source, rename_provenance, remove_by_source, and remove_by_provenance are removed.
Closes #107
Co-authored-by: Claude <noreply@anthropic.com>
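As a rough illustration of the shape this commit describes, the sketch below mints dcids, renders a Source and a Provenance as MCF nodes, and builds a list-format `inputFiles` entry that references the provenance by dcid. It is not the library's implementation: the helper names (`mint_dcid_sketch`, `render_node`), the token rule, the `pattern` key, and the example names and URLs are all assumptions made for illustration. Only the `dcid:source/<name>` / `dcid:provenance/<name>` ids, the `sourceLink` property, and the "namespaced values are used verbatim" rule come from this PR.

```python
import json
import re

# Assumed token rule: the PR only says names must be valid dcid tokens (no whitespace).
_TOKEN = re.compile(r"^[A-Za-z0-9_.\-]+$")


def mint_dcid_sketch(kind: str, value: str) -> str:
    """Return an already-namespaced value verbatim, else mint dcid:<kind>/<value>."""
    if value.startswith(("dcid:", "dcs:")):
        return value
    if not _TOKEN.match(value):
        raise ValueError(f"{value!r} is not a valid dcid token")
    return f"dcid:{kind}/{value}"


def render_node(dcid: str, type_of: str, props: dict[str, str]) -> str:
    """Render one MCF node; dcid references stay bare, other values are quoted."""
    lines = [f"Node: {dcid}", f"typeOf: {type_of}"]
    for key, val in props.items():
        rendered = val if val.startswith("dcid:") else json.dumps(val)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines)


source_id = mint_dcid_sketch("source", "ExampleOrg")
prov_id = mint_dcid_sketch("provenance", "ExampleDataset")

# Hypothetical content for a default provenance.mcf.
provenance_mcf = "\n\n".join([
    render_node(source_id, "dcid:Source",
                {"name": "ExampleOrg", "url": "https://example.org"}),
    render_node(prov_id, "dcid:Provenance",
                {"name": "ExampleDataset",
                 "url": "https://example.org/data",
                 "sourceLink": source_id}),
])

# inputFiles is a list; the "pattern" key is a hypothetical stand-in for the
# real entry fields, which this page does not show.
input_files = [{"pattern": "observations_*.csv", "provenance": prov_id}]

print(provenance_mcf)
print(json.dumps({"inputFiles": input_files}, indent=2))
```

Keeping the minting rule in one helper means a name with whitespace fails at build time rather than producing an invalid dcid in the exported bundle.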
Address PR review (3 findings):
- get_unregistered_csv_files now treats GCS files matching an inputFile pattern as registered, instead of reporting them as stray unregistered files.
- export_all's complete-bundle guard now also requires the MCF file defining each referenced provenance's sourceLink Source, not just the provenance file, so a Source kept in a separate unexported MCF file is caught.
- docs: source/provenance example names use valid dcid tokens (no whitespace) to match the strict mint_dcid contract, with a note explaining names become dcids.

Co-authored-by: Claude <noreply@anthropic.com>
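The first two findings can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repo's code: the function names (`check_complete_bundle`, `is_registered`), the lookup tables, the `pattern` key, and the use of glob-style matching are all hypothetical.

```python
from fnmatch import fnmatchcase


def check_complete_bundle(input_files, node_files, source_links, exported_mcf_files):
    """Raise before any write if a referenced node's MCF file is not exported."""
    exported = set(exported_mcf_files)
    problems = set()
    for entry in input_files:
        provenance = entry["provenance"]
        # Require both the Provenance node and the Source it links to.
        for node in (provenance, source_links.get(provenance)):
            if node is None:
                continue
            mcf_file = node_files.get(node)
            if mcf_file not in exported:
                problems.add(f"{node} (defined in {mcf_file})")
    if problems:
        raise ValueError("incomplete bundle, not exported: " + ", ".join(sorted(problems)))


def is_registered(csv_name, input_files):
    """Treat a file as registered if it matches any entry's pattern (assumed glob)."""
    return any(fnmatchcase(csv_name, entry["pattern"]) for entry in input_files)


# Hypothetical bundle: the Source lives in a separate MCF file.
input_files = [{"pattern": "observations_*.csv",
                "provenance": "dcid:provenance/ExampleDataset"}]
node_files = {"dcid:provenance/ExampleDataset": "provenance.mcf",
              "dcid:source/ExampleOrg": "sources.mcf"}
source_links = {"dcid:provenance/ExampleDataset": "dcid:source/ExampleOrg"}

# Passes only when both defining files are exported.
check_complete_bundle(input_files, node_files, source_links,
                      ["provenance.mcf", "sources.mcf"])
```

Collecting every problem before raising lets the error name all missing files at once, and running the check before the first write is what keeps a partial bundle off disk.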
Add additional tests
Match the repo's existing node-type convention (dcid:Topic, dcid:StatisticalVariable). The DC importer strips the namespace prefix, so dcs: and dcid: import identically; this is a consistency change only.
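A minimal sketch of why the two prefixes are interchangeable on import, assuming the importer simply drops a known namespace prefix before comparing values. The function name and prefix list are hypothetical; the importer's actual code is not reproduced here.

```python
def strip_namespace(value: str) -> str:
    """Drop a leading dcid: or dcs: prefix (assumed prefix list)."""
    for prefix in ("dcid:", "dcs:"):
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


# Both spellings reduce to the same bare type name.
same = strip_namespace("dcs:Provenance") == strip_namespace("dcid:Provenance")
print(same)
```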
tillywoodfield
approved these changes
Jun 29, 2026
What

Implements #107: emit first-class MCF `Source`/`Provenance` nodes and list-format `inputFiles` with `dcid:` provenance references so generated bundles pass DCP strict mode (`DATA_RUN_MODE=dcpbridge`), plus the richer Source/Provenance metadata the platform surfaces through Mixer's `GetVariableMetadata`.

Breaking change. The legacy `sources{}` shorthand and the old `add_provenance(...)` signature are removed (not deprecated-in-place).

New API

- `add_source(*, name, url, …)` and `add_provenance(*, name, url, source, …)` build MCF nodes (`dcid:source/<name>`, `dcid:provenance/<name>`, `typeOf: dcs:Source`/`dcs:Provenance`, `sourceLink`) into a default `provenance.mcf`. A bare name is minted to a `dcid:` id; an already-namespaced value is used verbatim.
- Optional Source/Provenance metadata kwargs, plus an `additional_properties` escape hatch.
- `add_explicit_schema_file` gains a `pattern` kwarg (config-only, mutually exclusive with `file_name`).
- `inputFiles` is now a JSON list; each entry's `provenance` serializes as `dcid:provenance/<name>`.
- `export_all` fails fast (before writing anything) if an input file references a provenance whose MCF file is not in `mcf_file_names`, so a partial bundle can never reach disk. `export_config` stays guard-free as the "config now, MCF later" escape hatch.

Removed (breaking)

- `Config.sources`, the `Source` model, the old `add_provenance(provenance_name, provenance_url, source_name, source_url)`, and `rename_source` / `remove_source` / `rename_provenance` / `remove_by_source` / `remove_by_provenance`.
- The provenance↔node integrity check moved from `Config` to `CustomDataManager`.

Testing

- `uv run ruff format --check`, `ruff check`, and `ty check` are all clean; `uv run pytest` → 106 passed.
- Output matches the `single_entity_official_keys` fixture shape (list `inputFiles`, `dcid:` refs, `provenance.mcf` with Source/Provenance, no `sources{}`).
- Per `datacommonsorg/import`, the strict validator normalizes the `dcs:` namespace on `typeOf` (`kg_util/mcf_parser.py` strips it before any triple reaches validation), so the emitted `dcs:Source`/`dcs:Provenance` form passes.

Remaining verification (not blocking this PR)

Two of the issue's acceptance criteria live downstream of this repo: end-to-end "passes `DCP_BRIDGE` strict validation" and "metadata surfaces through `GetVariableMetadata`". They're verified here only by fixture-shape matching plus the validator code trace above; a final confirmation by running a generated bundle through the real `dcpbridge` importer / Mixer is recommended before relying on it in production. A local pre-flight check is tracked separately in #110.

Co-authored-by: Claude <noreply@anthropic.com>