Skip to content

Latest commit

 

History

History
332 lines (278 loc) · 12.6 KB

File metadata and controls

332 lines (278 loc) · 12.6 KB

Data Formats

BioSample JSON Input (bs_entries)

A list of BioSample entries. Supports JSON array or JSONL (one JSON object per line) format.

Each entry must have an accession field.

[
  {
    "accession": "SAMN00000001",
    "title": "HeLa cell RNA-seq",
    "characteristics": {
      "cell_line": "HeLa",
      "organism": "Homo sapiens"
    }
  }
]

JSONL format:

{"accession": "SAMN00000001", "title": "HeLa cell RNA-seq", ...}
{"accession": "SAMN00000002", "title": "HEK293 cell ChIP-seq", ...}

Mapping TSV (for evaluation)

A TSV file used for evaluating Select accuracy. A header row is required.

Note: The extraction answer column is the output of a previous tool (MetaSRA), not a human-curated ground truth. It is not used for evaluation. Only mapping answer ID (human-curated) is used as the gold standard for Select mode evaluation.

Column Description
BioSample ID BioSample accession
Experiment type Experiment type
extraction answer Previous tool output (not used for evaluation)
mapping answer ID Human-curated ground truth mapping ID (used for Select evaluation)
mapping answer label Ground truth mapping label
BioSample ID Experiment type extraction answer mapping answer ID mapping answer label
SAMN00000001 RNA-seq HeLa CVCL_0030 HeLa
SAMN00000002 RNA-seq HEK293 CVCL_0045 HEK293

Extract Result JSON (ExtractResult)

Saved to bsllmner2-results/extract/{run_name}.json.

{
  "entries": [
    {
      "accession": "SAMN00000001",
      "extracted": { "cell_line": "HeLa" },
      "raw_output": "{\"cell_line\": \"HeLa\"}",
      "llm_timing": {
        "total_duration": 1000000000,
        "load_duration": 100000000,
        "eval_count": 50,
        "eval_duration": 500000000,
        "prompt_eval_count": 100
      }
    }
  ],
  "run_metadata": {
    "run_name": "llama3.1:70b_20250101_120000",
    "model": "llama3.1:70b",
    "thinking": false,
    "start_time": "2025-01-01T12:00:00Z",
    "end_time": "2025-01-01T12:10:00Z",
    "status": "completed",
    "processing_time_sec": 600.0,
    "total_entries": 1
  },
  "performance": null,
  "errors": []
}

Key Fields

Path Type Description
entries[].accession string BioSample accession
entries[].extracted dict | list | null Parsed extraction result
entries[].raw_output string | null Raw JSON string from LLM
entries[].llm_timing LlmTimingFields Lightweight timing data (nanoseconds)
run_metadata.run_name string Run identifier
run_metadata.model string Model name
run_metadata.start_time datetime ISO 8601 UTC start time
run_metadata.end_time datetime | null ISO 8601 UTC end time
run_metadata.status "running" | "completed" | "failed" Run status
run_metadata.processing_time_sec float | null Processing time (seconds)
run_metadata.total_entries int | null Total processed entries
errors list[ErrorLog] Error information

LlmTimingFields

Lightweight timing fields extracted from ChatResponse (nanoseconds). Replaces the full ChatResponse in persisted output.

Field Type Description
total_duration int Total duration (ns)
load_duration int Model load duration (ns)
eval_count int Number of tokens generated
eval_duration int Token generation duration (ns)
prompt_eval_count int Number of prompt tokens

Select Result JSON (SelectResult)

Saved to bsllmner2-results/select/select_{run_name}.json.

{
  "entries": [
    {
      "extract": {
        "accession": "SAMN00000001",
        "extracted": { "cell_line": "HeLa", "tissue": "cervix" },
        "raw_output": "{\"cell_line\": \"HeLa\", \"tissue\": \"cervix\"}",
        "llm_timing": { "total_duration": 0, "load_duration": 0, "eval_count": 0, "eval_duration": 0, "prompt_eval_count": 0 }
      },
      "search_results": {
        "cell_line": {
          "HeLa": [
            {
              "term_uri": "http://purl.obolibrary.org/obo/CVCL_0030",
              "term_id": "CVCL:0030",
              "prop_uri": "http://www.w3.org/2000/01/rdf-schema#label",
              "value": "HeLa",
              "label": "HeLa",
              "exact_match": true,
              "text2term_score": null,
              "reasoning": null,
              "definitions": null,
              "comments": ["Disease: Cervical adenocarcinoma"]
            }
          ]
        }
      },
      "text2term_results": {},
      "select_timings": {
        "cell_line": {
          "HeLa": { "total_duration": 500000000, "load_duration": 0, "eval_count": 20, "eval_duration": 200000000, "prompt_eval_count": 50 }
        }
      },
      "results": {
        "cell_line": [
          {
            "value": "HeLa",
            "term_id": "CVCL:0030",
            "term_uri": "http://purl.obolibrary.org/obo/CVCL_0030",
            "label": "HeLa",
            "exact_match": true,
            "reasoning": "Exact match found for HeLa"
          }
        ]
      }
    }
  ],
  "run_metadata": {
    "run_name": "llama3.1:70b_20250101_120000",
    "model": "llama3.1:70b",
    "thinking": false,
    "start_time": "2025-01-01T12:00:00Z",
    "end_time": "2025-01-01T12:15:00Z",
    "status": "completed",
    "processing_time_sec": 900.0,
    "total_entries": 1
  },
  "evaluation": null,
  "performance": null,
  "errors": []
}

Key Fields

Path Type Description
entries[].extract ExtractEntry Embedded extract result for this entry
entries[].search_results dict[field, dict[value, list[SearchResult]]] Stage 2a ontology search results
entries[].text2term_results dict[field, dict[value, list[SearchResult]]] Stage 2b text2term results
entries[].search_results.*.[].definitions list[str] | null obo:IAO_0000115 values collected from the subset OWL. Passed to the Stage 3 LLM as term-level context
entries[].search_results.*.[].comments list[str] | null rdfs:comment values. In the default subset OWLs only ChEBI populates this (with has_role info as "{role_type}: {role_label}"); most other ontologies leave it null
entries[].select_timings dict[field, dict[value, LlmTimingFields]] Per-field LLM timing
entries[].results dict[field, list[ResolvedValue]] Final mapping results
evaluation EvaluationMetrics | null Evaluation metrics (independent from RunMetadata). All ratio fields (accuracy, precision, recall, f1) are stored as 0–1 ratios, not percentages.
errors list[ErrorLog] Error information

ResolvedValue

Unified result type for Select mode output.

Field Type Description
value string Original extracted value
term_id string | null Matched ontology term ID
term_uri string | null Matched ontology term URI
label string | null Ontology term label
exact_match bool | null Whether it was an exact match
reasoning string | null LLM reasoning for selection

Select Config JSON

Configuration file for Select mode. Defines the ontology file and prompt per field.

{
  "fields": {
    "cell_line": {
      "ontology_file": "/app/ontology/cellosaurus_human.owl",
      "prompt_description": "Cell line is a group of cells that are genetically identical...",
      "value_type": "string"
    },
    "drug": {
      "ontology_file": "/app/ontology/chebi_subset.owl",
      "prompt_description": "Drug is a chemical or biological substance...",
      "value_type": "array"
    },
    "knockout_gene": {
      "ontology_file": "/app/ontology/ncbi_gene_human.owl",
      "prompt_description": "Knockout gene refers to a gene that has been rendered completely non-functional...",
      "value_type": "array"
    }
  }
}

All ontologies are delivered as pre-subsetted OWLs: cellosaurus_{human,mouse}.owl (built by scripts/preprocess_cellosaurus.py --taxid ...), {cl,uberon}_{human,mouse}_subset.owl, chebi_subset.owl, and mondo_human_subset.owl (built by scripts/build_subset_ontologies.sh). No runtime filter is applied.

For the full specification of each field, see Select Mode - Select Config Customization.

Prompt YAML

Prompts are defined in YAML as a list of role and content.

- role: system
  content: |-
    You are a smart curator of biological data
- role: user
  content: |-
    I will input JSON formatted metadata of a sample...
    Here is the input metadata:

role must be one of "system", "user", or "assistant".

Format JSON Schema

A JSON Schema that controls the LLM output format. Passed to the Ollama format parameter.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "cell_line": { "type": ["string", "null"] }
  },
  "required": ["cell_line"],
  "additionalProperties": true
}

In Select mode, the schema is dynamically generated from the SelectConfig field definitions (build_extract_schema_for_select). For value_type: "array", it is generated as {"type": ["array", "null"], "items": {"type": "string"}}. The generated schema always includes "additionalProperties": false.

PerformanceSummary

Performance data is embedded in the performance field of ExtractResult and SelectResult. There is no separate benchmark file; all data lives inside the result JSON.

Key Fields

Path Type Description
performance.total_input_entries int Total input entries
performance.completed_count int Entries that completed processing
performance.total_wall_sec float | null Total wall-clock time (seconds)
performance.stage_timings[] StageTimings[] Per-batch stage breakdown
performance.ner_llm_timing LlmTimingSummary | null Aggregated NER LLM timing stats
performance.select_llm_timing LlmTimingSummary | null Aggregated Select LLM timing stats (Select mode only)
performance.disk_io DiskIoTimings Disk I/O timing breakdown (Select mode only)

Accuracy metrics (accuracy, precision, recall, f1) are in SelectResult.evaluation, not in PerformanceSummary.

LlmTimingSummary Fields

Field Description
call_count Number of LLM calls
total_duration_sec Sum of total_duration across all calls
mean_latency_sec Mean latency per call (total_duration - load_duration)
p50/p95/p99_latency_sec Latency percentiles
mean_tokens_per_sec Mean generation speed (eval_count / eval_duration)
p50/p95_tokens_per_sec tokens/sec percentiles
mean_load_duration_sec Mean model load time (high = cold start)
max_load_duration_sec Max model load time
total_prompt_tokens Total prompt tokens processed
total_eval_tokens Total tokens generated

StageTimings Fields

One entry per processed batch (performance.stage_timings[]).

Field Type Description
batch_idx int Zero-based batch index
batch_size int Number of entries in this batch
ner_sec float | null Stage 1 NER wall-clock time
ontology_search_sec float | null Stage 2a word-combination search time
text2term_sec float | null Stage 2b text2term.map_terms() time (cache load + scoring once the text2term cache is warm)
llm_select_sec float | null Stage 3 LLM selection time (asyncio.gather max across fields)
resume_write_sec float | null Resume checkpoint write time after the batch completes

DiskIoTimings Fields

Run-wide disk I/O timing lists (performance.disk_io). Entries are appended in the order operations occur, so len(list) indicates how many times the operation ran.

Field Type Description
index_cache_load_sec list[float] OntologyIndex cache load time per ontology file (hit)
index_cache_save_sec list[float] OntologyIndex cache save time per ontology file (miss -> rebuilt)
index_build_from_file_sec list[float] OntologyIndex build time per OWL/TSV file (cache miss)
text2term_cache_build_sec list[float] text2term.cache_ontology() time per OWL (first run only)
text2term_cache_load_sec list[float] text2term.cache_exists() check time per OWL (cache hit path)
resume_write_sec list[float] Per-batch resume write time

For interpretation guidance, see benchmarking.md.