
[ENHANCEMENT] : Optimize LLM.main_loop() for batch extraction of fields #403

@dibyajyoti-mandal

Description

Currently, the LLM extraction pipeline in src/llm.py fires a separate HTTP request
to Ollama for every field in the form via LLM.main_loop(). A form with 20 fields
produces 20 sequential round-trips to the local model, making the pipeline slow,
fragile, and wasteful. There is also no structured output contract: the LLM returns
raw strings, and the -1 sentinel for missing values can silently end up written
into a PDF field.

Proposed Solution

Replace the per-field loop with a single batch extraction method, LLM.extract_all(),
that sends all field names to Mistral at once and receives a single JSON object
containing every field value in one shot.

Suggested Implementation

New method: LLM.extract_all()

# Assumes `import json` at the top of src/llm.py.
def extract_all(self, fields: list[str], transcript: str) -> dict:
    prompt = f"""
    SYSTEM PROMPT:
    You are an AI assistant designed to extract information from transcribed voice
    recordings and return the results as a structured JSON object.

    You will receive a list of field names and a transcript. For each field, identify
    its value in the transcript and include it in the JSON response.

    Rules:
    - Return a single JSON object where every key is a field name from the list.
    - If a field is plural and multiple values are found, return them separated by ";".
    - If a value cannot be found in the transcript, return null for that field.
    - Return JSON only. No explanation, no markdown, no extra text.

    ---
    Fields to extract: {json.dumps(fields)}
    Transcript: {transcript}
    """
    response = self.client.chat(
        model="mistral",
        messages=[{"role": "user", "content": prompt}],
        format="json",  # Ollama's JSON mode constrains the reply to valid JSON
    )
    return json.loads(response.message.content)
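For illustration, the parsed reply can be normalized before it reaches the
downstream Validator. The helper below is a hypothetical sketch (not part of
src/llm.py): it guarantees every requested field is present, maps absent or
null entries to None, and splits ";"-separated plural values into lists.

```python
import json


def normalize_extraction(fields: list[str], raw_json: str) -> dict:
    """Parse the model's JSON reply and return one entry per requested field:
    None for missing/null values, a list for ';'-separated plural values."""
    data = json.loads(raw_json)
    result = {}
    for field in fields:
        value = data.get(field)
        if value is None:
            result[field] = None
        elif isinstance(value, str) and ";" in value:
            result[field] = [part.strip() for part in value.split(";")]
        else:
            result[field] = value
    return result


reply = '{"name": "Ada Lovelace", "languages": "English; French"}'
print(normalize_extraction(["name", "languages", "fax"], reply))
```

A field the model never mentions ("fax" above) comes back as None rather than
a sentinel string, which is the contract the Validator relies on.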

Fallback for backward compatibility

The old per-field loop is retained as LLM._legacy_per_field_extract() and is called
automatically if the batch method fails (e.g. on older Ollama versions that do not
support JSON mode).

Benefits

  • Performance — eliminates N sequential Ollama round-trips, replacing them with
    a single call regardless of form size
  • Safety — null for missing values is unambiguous and handled cleanly by the
    downstream Validator, preventing sentinel strings from reaching the PDF
  • Backward compatibility — legacy per-field loop is preserved as a fallback,
    so deployments on older Ollama versions are unaffected
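The safety benefit can be sketched with a minimal guard. drop_missing() is a
hypothetical illustration of what the downstream Validator would do, not an
existing function in this repository.

```python
def drop_missing(values: dict) -> dict:
    """Keep only fields the model actually found, so a missing value is
    never written into a PDF field (unlike the old -1 sentinel, which
    looked like a valid string and slipped through)."""
    return {key: value for key, value in values.items() if value is not None}


print(drop_missing({"name": "Ada", "fax": None}))  # only "name" survives
```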

Files Affected

  • src/llm.py — primary change, new extract_all() method and the
    _legacy_per_field_extract() fallback
  • tests/test_llm.py — new unit tests for batch extraction, JSON parsing, plural
    handling, null sentinel, and fallback behaviour
