Data Discovery converts a natural-language hypothesis or task query into a small, auditable set of candidate data artifacts. It is designed as the retrieval layer before a downstream coding agent.
The core boundary is:
repository folder -> artifact manifests
hypothesis / task query -> LM-authored generic retrieval requirements -> LM-authored ARG
ARG + manifests -> deterministic support report -> greedy selected file set
The LM handles only natural-language interpretation. Indexing, support scoring, selection, and provenance writing are deterministic.
pip install -e .
Default LM calls use the regular OpenAI client.
Required:
export OPENAI_API_KEY="..."
Optional:
export OPENAI_BASE_URL="..."
export OPENAI_ORG_ID="..."
export OPENAI_PROJECT="..."
Azure OpenAI is available only as an explicit opt-in helper for users who call
make_azure_client_from_env() directly. It is not the CLI/default notebook
client. Azure env names are:
export AZURE_OPENAI_ENDPOINT="..."
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_API_VERSION="2025-04-01-preview"
There is no no-API fallback for LM authoring. If an API key is unavailable, LM discovery should fail loudly.
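The fail-loudly rule can be enforced before any LM call is attempted. The following is a minimal sketch, not the package's own check: `require_openai_api_key` is a hypothetical helper name, and only the `OPENAI_API_KEY` variable name comes from this README.

```python
import os
from typing import Mapping, Optional


def require_openai_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY, or raise instead of falling back silently.

    Hypothetical helper for illustration; not part of data_discovery.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # No offline substitute exists for LM authoring, so stop here.
        raise RuntimeError(
            "OPENAI_API_KEY is not set; LM discovery has no no-API fallback."
        )
    return key
```

Calling this at the top of a notebook or script surfaces a missing key immediately, before any index or run directory is touched.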
For each hypothesis run, the package writes:
- hypothesis_intake.json
- hypothesis_admissibility_report.json
- analysis_requirement_graph.json
- artifact_support_report.json
- artifact_selection_result.json
- run_summary.json
- lm_arg_authoring.json
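To confirm that a run produced the full set of files, a small check along these lines may help. It is a sketch under one assumption: that all files land directly in the run's output directory. `missing_outputs` is a hypothetical helper, not a package function.

```python
from pathlib import Path
from typing import List

# File names as listed in this README.
EXPECTED_OUTPUTS = (
    "hypothesis_intake.json",
    "hypothesis_admissibility_report.json",
    "analysis_requirement_graph.json",
    "artifact_support_report.json",
    "artifact_selection_result.json",
    "run_summary.json",
    "lm_arg_authoring.json",
)


def missing_outputs(run_dir: str) -> List[str]:
    """Return the expected output files that are absent from run_dir."""
    run_path = Path(run_dir)
    # Assumption: outputs are written flat into run_dir, not into subfolders.
    return [name for name in EXPECTED_OUTPUTS if not (run_path / name).is_file()]
```

An empty list means every expected file is present; it says nothing about whether the contents are valid.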
artifact_selection_result.json reports the decision "selected" only when the
greedy file set covers every required ARG target. A "rejected" decision can
still include useful partial retrieval; it means at least one required ARG
target remained unsatisfied.
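The selected/rejected rule can be illustrated with a generic greedy set cover. This is a sketch of the idea only: the package's actual support scoring and tie-breaking are not described here, and `greedy_select`, the target names, and the file names are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(
    required: Set[str], supports: Dict[str, Set[str]]
) -> Tuple[str, List[str], Set[str]]:
    """Greedily pick files until every required target is covered.

    Returns (decision, selected_files, uncovered_targets).
    """
    uncovered = set(required)
    selected: List[str] = []
    while uncovered:
        # Iterate in sorted order so ties resolve to the smallest file name,
        # which keeps the result deterministic.
        best = max(
            sorted(supports),
            key=lambda name: len(supports[name] & uncovered),
            default=None,
        )
        if best is None or not (supports[best] & uncovered):
            break  # No remaining file helps; stop with partial coverage.
        selected.append(best)
        uncovered -= supports[best]
    decision = "selected" if not uncovered else "rejected"
    return decision, selected, uncovered
```

A "rejected" result still returns the files chosen so far, which mirrors the point that partial retrieval can remain useful even when a required target is unsatisfied.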
The default profile is generic. The LM authors retrieval concepts that are not
tied to a medical or pathology schema:
- entities
- identifiers
- measurements
- variables/features
- labels/targets
- modalities
- time/index axes
- join requirements
- provenance requirements
The normalized ARG uses generic node types:
- entity_set
- identifier_mapping
- measurement_source
- feature_source
- label_source
- modality_source
- time_index_source
- provenance_source
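One way to sanity-check an ARG against the generic profile is to flag node types outside the list above. This is a sketch only: the ARG file layout is not documented here, so the `node_type` key and the shape of `nodes` are assumptions, and `unknown_node_types` is a hypothetical helper.

```python
from typing import Any, Iterable, List, Mapping

# Node types as listed in this README for the generic profile.
GENERIC_NODE_TYPES = frozenset(
    {
        "entity_set",
        "identifier_mapping",
        "measurement_source",
        "feature_source",
        "label_source",
        "modality_source",
        "time_index_source",
        "provenance_source",
    }
)


def unknown_node_types(nodes: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return, sorted, every node type that is not a generic node type.

    Assumption: each node is a mapping with a "node_type" key.
    Nodes without that key are reported as "<missing>".
    """
    seen = {str(node.get("node_type", "<missing>")) for node in nodes}
    return sorted(seen - GENERIC_NODE_TYPES)
```

An empty result suggests the ARG stays within the generic vocabulary; a non-empty one may simply mean the clinical profile was used.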
The older clinical profile remains available as discovery_profile="clinical"
for compatibility, but it is not the default.
Repository-specific metadata is allowed when it represents real repository data or documented repository conventions. It must be explicit input with provenance.
Allowed:
This repository contains a table mapping whole-slide image IDs to patient IDs.
Index that table as linkage evidence.
Disallowed:
If the hypothesis mentions a marker, force-select a specific workbook.
python -m data_discovery.cli build-index \
--repository-root /path/to/repository_data \
--output-dir runs/my_repository_index
Optional repository metadata:
python -m data_discovery.cli build-index \
--repository-root /path/to/repository_data \
--output-dir runs/my_repository_index \
--repository-metadata-config configs/repository_linkage.json

from data_discovery import (
    discover_for_hypothesis_with_lm_arg,
    make_openai_client_from_env,
)

client = make_openai_client_from_env()
result = discover_for_hypothesis_with_lm_arg(
    client=client,
    model="gpt-5-mini",
    repository_id="my_repository",
    hypothesis_text="High marker expression is associated with worse survival.",
    repository_context="This repository contains molecular expression tables and clinical survival tables.",
    discovery_profile="generic",
    manifest_path="runs/my_repository_index/artifact_manifests.jsonl",
    output_dir="runs/my_discovery_run",
)

python -m data_discovery.cli discover-lm-arg \
--repository-id my_repository \
--hypothesis-text "High marker expression is associated with worse survival." \
--repository-context "This repository contains molecular expression and survival tables." \
--manifest-path runs/my_repository_index/artifact_manifests.jsonl \
--output-dir runs/my_discovery_run \
--model gpt-5-mini
This repository is intended to run without importing the original development
monorepo. Runtime modules must not import HypothesisGeneration.*.
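The import rule can be checked statically. The sketch below uses the standard-library `ast` module to find absolute imports whose top-level package is `HypothesisGeneration`. The helper names are hypothetical, and dynamic imports (for example via `importlib`) are not detected.

```python
import ast
from pathlib import Path
from typing import List, Tuple

FORBIDDEN_ROOT = "HypothesisGeneration"


def forbidden_imports_in_source(source: str) -> List[str]:
    """Return imported module names rooted at FORBIDDEN_ROOT."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative ones are skipped.
            names = [node.module]
        else:
            continue
        found.extend(n for n in names if n.split(".")[0] == FORBIDDEN_ROOT)
    return found


def scan_tree(src_root: str) -> List[Tuple[str, str]]:
    """Return (file path, module name) pairs for every forbidden import."""
    hits: List[Tuple[str, str]] = []
    for path in sorted(Path(src_root).rglob("*.py")):
        source = path.read_text(encoding="utf-8")
        hits.extend((str(path), name) for name in forbidden_imports_in_source(source))
    return hits
```

Running `scan_tree` on the package's source directory and asserting an empty result would fit naturally into a test suite.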
For a clean local check from outside any parent monorepo:
cd /tmp
PYTHONPATH=/path/to/data-discovery/src python -m data_discovery.cli build-index \
--repository-root /path/to/repository_data \
--output-dir /tmp/data_discovery_index \
--no-progress