JinchuLi2002/data-discovery

Data Discovery

Data Discovery converts a natural-language hypothesis or task query into a small, auditable set of candidate data artifacts. It is designed as the retrieval layer before a downstream coding agent.

The core boundary is:

repository folder -> artifact manifests
hypothesis / task query -> LM-authored generic retrieval requirements -> LM-authored ARG
ARG + manifests -> deterministic support report -> greedy selected file set

The LM handles only natural-language interpretation. Indexing, support scoring, selection, and provenance writing are deterministic. ARG is the analysis requirement graph that each run writes to analysis_requirement_graph.json.

Install

pip install -e .

Environment

Default LM calls use the regular OpenAI client.

Required:

export OPENAI_API_KEY="..."

Optional:

export OPENAI_BASE_URL="..."
export OPENAI_ORG_ID="..."
export OPENAI_PROJECT="..."

Azure OpenAI is available only as an explicit opt-in helper for users who call make_azure_client_from_env() directly. It is not the client used by the CLI or by the default notebook. The Azure environment variables are:

export AZURE_OPENAI_ENDPOINT="..."
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_API_VERSION="2025-04-01-preview"

LM authoring has no fallback that works without an API. If an API key is unavailable, LM discovery should fail loudly.
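
As an illustration of failing loudly, a caller could check the key before starting a run. This is a minimal sketch, not the package's own check: require_openai_api_key is a hypothetical helper, and only the OPENAI_API_KEY variable name comes from this README.

```python
import os


def require_openai_api_key(env=None):
    """Return the API key, or raise a clear error if it is missing.

    Hypothetical helper for illustration; the package may perform
    its own check in a different place and with a different error.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Fail before any indexing or LM work starts.
        raise RuntimeError(
            "OPENAI_API_KEY is not set; LM discovery cannot run without it."
        )
    return key
```

Calling require_openai_api_key() at the top of a script surfaces a missing key immediately instead of partway through a discovery run.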

What The Pipeline Returns

For each hypothesis run, the package writes:

  • hypothesis_intake.json
  • hypothesis_admissibility_report.json
  • analysis_requirement_graph.json
  • artifact_support_report.json
  • artifact_selection_result.json
  • run_summary.json
  • lm_arg_authoring.json
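
A quick way to confirm that a run wrote everything listed above is to check the output directory for each file name. The sketch below assumes all files land directly in the run's output directory; missing_outputs is a hypothetical helper, not part of the package.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_OUTPUTS = (
    "hypothesis_intake.json",
    "hypothesis_admissibility_report.json",
    "analysis_requirement_graph.json",
    "artifact_support_report.json",
    "artifact_selection_result.json",
    "run_summary.json",
    "lm_arg_authoring.json",
)


def missing_outputs(output_dir):
    """Return the expected output files that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).is_file()]
```

For example, missing_outputs("runs/my_discovery_run") returns an empty list when every file is present.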

artifact_selection_result.json reports the decision "selected" only when the greedy file set covers every required ARG target. A "rejected" decision can still include useful partial retrieval; it means only that at least one required ARG target remained unsatisfied.
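
The selected/rejected rule can be illustrated with a plain greedy set cover. This is a sketch under stated assumptions, not the package's selection code: the support mapping, the result keys selected_files and unsatisfied_targets, and the tie-break by file name are all hypothetical, and the real support scoring is likely richer than set membership.

```python
def greedy_select(required, support):
    """Greedily pick files until every required target is covered.

    required: iterable of required target ids.
    support:  dict mapping file name -> set of target ids it supports
              (hypothetical shape, for illustration only).
    """
    uncovered = set(required)
    selected = []
    while uncovered:
        # Pick the file covering the most still-uncovered targets.
        # Iterating in sorted order makes ties deterministic.
        best = max(
            sorted(support),
            key=lambda name: len(support[name] & uncovered),
            default=None,
        )
        if best is None or not (support[best] & uncovered):
            break  # No remaining file helps; stop with a partial set.
        selected.append(best)
        uncovered -= support[best]
    return {
        "decision": "selected" if not uncovered else "rejected",
        "selected_files": selected,
        "unsatisfied_targets": sorted(uncovered),
    }
```

A rejected result still carries the partial file set, which matches the point above that a rejection can include useful retrieval.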

Generic ARG Front End

The default profile is generic. The LM authors retrieval concepts that are not tied to a medical or pathology schema:

  • entities
  • identifiers
  • measurements
  • variables/features
  • labels/targets
  • modalities
  • time/index axes
  • join requirements
  • provenance requirements

The normalized ARG uses generic node types:

  • entity_set
  • identifier_mapping
  • measurement_source
  • feature_source
  • label_source
  • modality_source
  • time_index_source
  • provenance_source
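
When inspecting an ARG by hand, it can help to flag any node whose type is outside the generic set above. The sketch below assumes each node is a dict with a "type" key; that layout is an assumption for illustration, so check analysis_requirement_graph.json for the actual schema.

```python
# Node type names taken from the list above.
GENERIC_NODE_TYPES = frozenset({
    "entity_set",
    "identifier_mapping",
    "measurement_source",
    "feature_source",
    "label_source",
    "modality_source",
    "time_index_source",
    "provenance_source",
})


def unknown_node_types(nodes):
    """Return node types that are not generic ARG node types.

    nodes: list of dicts with a "type" key (hypothetical layout).
    A node with no "type" key is reported as None.
    """
    seen = {node.get("type") for node in nodes}
    return sorted(seen - GENERIC_NODE_TYPES, key=str)
```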

The older clinical profile remains available as discovery_profile="clinical" for compatibility, but it is not the default.

Repository Metadata

Repository-specific metadata is allowed when it represents real repository data or documented repository conventions. It must be supplied as explicit input, with provenance.

Allowed:

This repository contains a table mapping whole-slide image IDs to patient IDs.
Index that table as linkage evidence.

Disallowed:

If the hypothesis mentions a marker, force-select a specific workbook.

Build An Index

python -m data_discovery.cli build-index \
  --repository-root /path/to/repository_data \
  --output-dir runs/my_repository_index

Optional repository metadata:

python -m data_discovery.cli build-index \
  --repository-root /path/to/repository_data \
  --output-dir runs/my_repository_index \
  --repository-metadata-config configs/repository_linkage.json
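
After indexing, the manifests can be inspected from Python. The sketch below assumes the index directory contains artifact_manifests.jsonl (the path used in the discovery examples) and that the file is in JSON Lines form, one JSON object per line. load_manifests is a hypothetical helper, and no manifest field names are assumed.

```python
import json
from pathlib import Path


def load_manifests(path):
    """Read a JSON Lines manifest file into a list of dicts."""
    manifests = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # Tolerate blank lines.
            try:
                manifests.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line instead of failing opaquely.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return manifests
```

For example, len(load_manifests("runs/my_repository_index/artifact_manifests.jsonl")) gives the number of indexed artifacts under these assumptions.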

Run Discovery From Python

from data_discovery import (
    discover_for_hypothesis_with_lm_arg,
    make_openai_client_from_env,
)

client = make_openai_client_from_env()

result = discover_for_hypothesis_with_lm_arg(
    client=client,
    model="gpt-5-mini",
    repository_id="my_repository",
    hypothesis_text="High marker expression is associated with worse survival.",
    repository_context="This repository contains molecular expression tables and clinical survival tables.",
    discovery_profile="generic",
    manifest_path="runs/my_repository_index/artifact_manifests.jsonl",
    output_dir="runs/my_discovery_run",
)

Run Discovery From CLI

python -m data_discovery.cli discover-lm-arg \
  --repository-id my_repository \
  --hypothesis-text "High marker expression is associated with worse survival." \
  --repository-context "This repository contains molecular expression and survival tables." \
  --manifest-path runs/my_repository_index/artifact_manifests.jsonl \
  --output-dir runs/my_discovery_run \
  --model gpt-5-mini

Independence Contract

This repository is intended to run without importing the original development monorepo. Runtime modules must not import HypothesisGeneration.*.

For a clean local check from outside any parent monorepo:

cd /tmp
PYTHONPATH=/path/to/data-discovery/src python -m data_discovery.cli build-index \
  --repository-root /path/to/repository_data \
  --output-dir /tmp/data_discovery_index \
  --no-progress
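
The import rule can also be checked statically. The sketch below walks Python sources with the standard ast module and reports absolute imports rooted at HypothesisGeneration. Both function names are hypothetical, and the scan sees only static import statements, not dynamic imports such as importlib.import_module.

```python
import ast
from pathlib import Path

FORBIDDEN_ROOT = "HypothesisGeneration"


def forbidden_imports_in_source(source):
    """Return sorted (line, module) pairs for forbidden absolute imports."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import; relative ones are skipped.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] == FORBIDDEN_ROOT:
                hits.append((node.lineno, name))
    return sorted(hits)


def scan_package(src_root):
    """Map each offending .py file under src_root to its forbidden imports."""
    report = {}
    for path in sorted(Path(src_root).rglob("*.py")):
        hits = forbidden_imports_in_source(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

An empty result from scan_package("/path/to/data-discovery/src") suggests that no runtime module statically imports the monorepo package.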
