Data Discovery

Data Discovery converts a natural-language hypothesis or task query into a small, auditable set of candidate data artifacts. It is designed as the retrieval layer before a downstream coding agent.

The core boundary is:

repository folder -> artifact manifests
hypothesis / task query -> LM-authored generic retrieval requirements -> LM-authored ARG
ARG + manifests -> deterministic support report -> greedy selected file set

The LM handles only natural-language interpretation: it authors the retrieval requirements and the ARG (analysis requirement graph). Indexing, support scoring, selection, and provenance writing are deterministic.
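
The deterministic selection step can be pictured as a greedy cover over required ARG targets. The following is a minimal sketch under stated assumptions, not the package's implementation: the function name greedy_select, the shape of the support mapping (file path to the set of targets it supports), and the tie-breaking rule are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(
    required: Set[str], support: Dict[str, Set[str]]
) -> Tuple[List[str], Set[str]]:
    """Pick files until every required target is covered or no file helps.

    `support` maps a file path to the targets it supports (assumed shape).
    Returns the selected files and any targets left uncovered.
    """
    uncovered = set(required)
    selected: List[str] = []
    while uncovered:
        # Iterating in sorted order makes ties deterministic: max() keeps the
        # first best candidate, i.e. the lexicographically smallest path.
        best = max(
            sorted(support),
            key=lambda path: len(support[path] & uncovered),
            default=None,
        )
        if best is None or not (support[best] & uncovered):
            break  # no remaining file covers anything new
        selected.append(best)
        uncovered -= support[best]
    return selected, uncovered


if __name__ == "__main__":
    demo_support = {
        "expression.csv": {"feature_source", "entity_set"},
        "survival.csv": {"label_source", "entity_set"},
    }
    files, missing = greedy_select(
        {"feature_source", "label_source", "entity_set"}, demo_support
    )
    print(files, missing)
```

An empty second return value corresponds to full coverage of the required targets; a non-empty one corresponds to a partial retrieval.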

Install

pip install -e .

Environment

By default, LM calls use the standard OpenAI client.

Required:

export OPENAI_API_KEY="..."

Optional:

export OPENAI_BASE_URL="..."
export OPENAI_ORG_ID="..."
export OPENAI_PROJECT="..."

Azure OpenAI is available only as an explicit opt-in helper for users who call make_azure_client_from_env() directly. It is not the client used by the CLI or the default notebook. The Azure environment variables are:

export AZURE_OPENAI_ENDPOINT="..."
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_API_VERSION="2025-04-01-preview"

LM authoring has no offline fallback. If an API key is unavailable, LM discovery should fail loudly.
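
A fail-loudly check of this kind can be sketched as follows. This is an illustration only: require_api_key is a hypothetical helper, not a function the package is known to export, and the error message is invented for the example.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY or raise instead of silently degrading.

    Hypothetical helper; `env` defaults to the process environment and can be
    replaced with a plain dict for testing.
    """
    source = os.environ if env is None else env
    key = source.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; LM discovery cannot run without it."
        )
    return key
```

Calling such a check before any indexing or authoring work surfaces a missing key immediately, rather than partway through a run.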

What The Pipeline Returns

For each hypothesis run, the package writes:

hypothesis_intake.json
hypothesis_admissibility_report.json
analysis_requirement_graph.json
artifact_support_report.json
artifact_selection_result.json
run_summary.json
lm_arg_authoring.json
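
A quick way to confirm that a run directory is complete is to check for each of the files listed above. The sketch below assumes all seven files are written directly into the output directory; missing_outputs is a hypothetical helper name.

```python
from pathlib import Path
from typing import List, Union

# File names taken from the list above.
EXPECTED_OUTPUTS = (
    "hypothesis_intake.json",
    "hypothesis_admissibility_report.json",
    "analysis_requirement_graph.json",
    "artifact_support_report.json",
    "artifact_selection_result.json",
    "run_summary.json",
    "lm_arg_authoring.json",
)


def missing_outputs(output_dir: Union[str, Path]) -> List[str]:
    """Return the expected output files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```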

artifact_selection_result.json reports the decision "selected" only when the greedy file set covers every required ARG target. A "rejected" decision can still include useful partial retrieval; it means that at least one required ARG target remained unsatisfied.
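
A downstream consumer might branch on that decision as sketched below. The exact JSON layout is not documented here, so this assumes a top-level "decision" key holding the string "selected" or "rejected"; load_decision is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Union


def load_decision(output_dir: Union[str, Path]) -> str:
    """Read the selection decision from a run directory.

    Assumes a top-level "decision" key (not confirmed by the package docs).
    """
    path = Path(output_dir) / "artifact_selection_result.json"
    payload = json.loads(path.read_text(encoding="utf-8"))
    decision = payload.get("decision")
    if decision not in ("selected", "rejected"):
        raise ValueError(f"unexpected decision value: {decision!r}")
    return decision
```

Because a rejected run may still carry partial retrieval, a caller would typically log or inspect it rather than discard it outright.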

Generic ARG Front End

The default profile is generic. The LM authors retrieval concepts that are not tied to a medical or pathology schema:

entities
identifiers
measurements
variables/features
labels/targets
modalities
time/index axes
join requirements
provenance requirements

The normalized ARG uses generic node types:

entity_set
identifier_mapping
measurement_source
feature_source
label_source
modality_source
time_index_source
provenance_source
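
For code that consumes a normalized ARG, the node types above can be held in an enumeration so that unknown values are rejected early. This is a sketch only: ArgNodeType and parse_node_type are hypothetical names, and the package may represent node types differently.

```python
from enum import Enum


class ArgNodeType(str, Enum):
    """Generic ARG node types, copied from the list above."""

    ENTITY_SET = "entity_set"
    IDENTIFIER_MAPPING = "identifier_mapping"
    MEASUREMENT_SOURCE = "measurement_source"
    FEATURE_SOURCE = "feature_source"
    LABEL_SOURCE = "label_source"
    MODALITY_SOURCE = "modality_source"
    TIME_INDEX_SOURCE = "time_index_source"
    PROVENANCE_SOURCE = "provenance_source"


def parse_node_type(value: str) -> ArgNodeType:
    """Convert a raw string into an ArgNodeType, rejecting unknown values."""
    try:
        return ArgNodeType(value)
    except ValueError:
        raise ValueError(f"unknown ARG node type: {value!r}") from None
```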

The older clinical profile remains available as discovery_profile="clinical" for compatibility, but it is not the default.

Repository Metadata

Repository-specific metadata is allowed when it represents real repository data or documented repository conventions. It must be explicit input with provenance.

Allowed:

This repository contains a table mapping whole-slide image IDs to patient IDs.
Index that table as linkage evidence.

Disallowed:

If the hypothesis mentions a marker, force-select a specific workbook.

Build An Index

python -m data_discovery.cli build-index \
  --repository-root /path/to/repository_data \
  --output-dir runs/my_repository_index

Optional repository metadata:

python -m data_discovery.cli build-index \
  --repository-root /path/to/repository_data \
  --output-dir runs/my_repository_index \
  --repository-metadata-config configs/repository_linkage.json

Run Discovery From Python

from data_discovery import (
    discover_for_hypothesis_with_lm_arg,
    make_openai_client_from_env,
)

client = make_openai_client_from_env()

result = discover_for_hypothesis_with_lm_arg(
    client=client,
    model="gpt-5-mini",
    repository_id="my_repository",
    hypothesis_text="High marker expression is associated with worse survival.",
    repository_context="This repository contains molecular expression tables and clinical survival tables.",
    discovery_profile="generic",
    manifest_path="runs/my_repository_index/artifact_manifests.jsonl",
    output_dir="runs/my_discovery_run",
)

Run Discovery From CLI

python -m data_discovery.cli discover-lm-arg \
  --repository-id my_repository \
  --hypothesis-text "High marker expression is associated with worse survival." \
  --repository-context "This repository contains molecular expression and survival tables." \
  --manifest-path runs/my_repository_index/artifact_manifests.jsonl \
  --output-dir runs/my_discovery_run \
  --model gpt-5-mini

Independence Contract

This repository is intended to run without importing the original development monorepo. Runtime modules must not import HypothesisGeneration.*.

For a clean local check from outside any parent monorepo:

cd /tmp
PYTHONPATH=/path/to/data-discovery/src python -m data_discovery.cli build-index \
  --repository-root /path/to/repository_data \
  --output-dir /tmp/data_discovery_index \
  --no-progress
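
The import rule above can also be checked statically. The sketch below uses Python's ast module to list absolute imports rooted at HypothesisGeneration in a source string; forbidden_imports is a hypothetical helper and is not part of the package.

```python
import ast
from typing import List

FORBIDDEN_ROOT = "HypothesisGeneration"


def forbidden_imports(source: str) -> List[str]:
    """Return imported module names whose top-level package is forbidden."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module or ""]
        else:
            continue
        found.extend(n for n in names if n.split(".")[0] == FORBIDDEN_ROOT)
    return found
```

Applied to the text of each .py file under src/data_discovery, an empty result for every file would indicate that the contract holds for static imports; dynamic imports through importlib would not be detected.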
