Case Templates

A CaseTemplate defines the blueprint for generating multi-document cases. It specifies what entities exist, what facts can be introduced, what document types appear, and how they're distributed across a case timeline.

Quick Example

from synthdocs import (
    CaseTemplate,
    DocumentTypeSpec,
    EntitySchema,
    EntityField,
    FactType,
    FactField,
    StyleVariant,
)

template = CaseTemplate(
    name="lease-dispute",
    description=(
        "Tenant {{ tenant.name }} is renting {{ property.address }} "
        "from {{ landlord.name }}. They are in a dispute."
    ),
    entity_schemas=[
        EntitySchema(
            name="Tenant",
            description="The tenant involved in the dispute",
            fields=[
                EntityField(name="name", description="Full name", field_type="str"),
                EntityField(name="email", description="Email", field_type="str"),
            ],
        ),
        # ... more entity schemas
    ],
    fact_types=[
        FactType(
            name="LeaseTerms",
            description="Key lease terms",
            fields=[
                FactField(name="monthly_rent", description="Rent amount", field_type="int"),
                FactField(name="lease_start", description="Start date", field_type="date"),
            ],
            template="Lease terms: {{ monthly_rent }} per month, starting {{ lease_start }}",
        ),
        # ... more fact types
    ],
    document_types=[
        DocumentTypeSpec(
            name="Lease Agreement",
            description="Signed lease contract",
            probability=1.0,
            min_count=1,
            max_count=1,
            introduces_fact_types=["LeaseTerms"],
        ),
        # ... more document types
    ],
    target_document_count=(3, 6),
)

See examples/lease_case_template.py for a complete working example.

Core Concepts

Entities (generated once per case)

Entities are the "actors" in a case: people, organizations, properties, and so on. They're generated once at the start of case generation and stay consistent across all documents.

EntitySchema(
    name="Tenant",
    description="The tenant involved in the lease dispute",
    fields=[
        EntityField(name="name", description="Full legal name", field_type="str"),
        EntityField(name="email", description="Email address", field_type="str"),
        EntityField(name="phone", description="Phone number", field_type="str"),
    ],
)

Entity values are referenced in the case description template using Jinja2 syntax: {{ tenant.name }}.

Facts (generated just-in-time per document)

Facts are the claims that documents introduce. Unlike entities, facts are generated just-in-time before each document, allowing the LLM to make contextually appropriate choices based on:

  • The case entities and description
  • Previously introduced facts
  • The running summary of documents so far

FactType(
    name="RentPayment",
    description="Status of rent payment for a specific month",
    fields=[
        FactField(name="month", description="Month of payment", field_type="date"),
        FactField(name="amount_paid", description="Amount paid", field_type="int"),
        FactField(
            name="status",
            description="Payment status",
            field_type="enum",
            options=["paid", "partial", "late", "unpaid"],
        ),
    ],
    template="Rent payment {{ month }}: {{ status }} ({{ amount_paid }})",
)

The template field is a Jinja2 template that renders the fact to human-readable text. This rendered text is what gets located in the generated document.
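As a rough illustration of what "located in the generated document" could mean, the sketch below finds the character span of a rendered fact with an exact substring search. The `locate_fact` helper and the sample document text are hypothetical; how synthdocs actually matches facts to document text is not described here, and it may well be more tolerant than an exact match.

```python
from typing import Optional, Tuple


def locate_fact(document: str, rendered_fact: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) character span of the fact, or None if absent.

    Hypothetical helper: exact substring matching is an assumption,
    not the library's documented behavior.
    """
    start = document.find(rendered_fact)
    if start == -1:
        return None
    return (start, start + len(rendered_fact))


# Text the RentPayment template above would render for these sample values.
rendered = "Rent payment 2024-03-01: late (1200)"

# Hypothetical document body that contains the rendered fact verbatim.
document = "Ledger entry. " + rendered + " Notice sent to tenant."

span = locate_fact(document, rendered)
print(span)
```

Slicing the document with the returned span gives back the rendered fact, which makes the span usable as a ground-truth annotation.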

Document Types (sampled per case)

Document types define what documents can appear in a case and how they're distributed.

DocumentTypeSpec(
    name="Inspection Report",
    description="Report describing property condition and issues",
    style_rules=["Objective tone", "Checklist format"],
    
    # Distribution
    probability=0.7,        # 70% chance this doc type appears
    min_count=0,            # Can be skipped
    max_count=2,            # Up to 2 instances
    
    # Timing (days after case start)
    days_after_start_min=30,
    days_after_start_max=180,
    
    # What facts this document introduces
    introduces_fact_types=["InspectionFinding"],
    
    # Style variants for diversity
    style_variants=[
        StyleVariant(name="detailed", description="Thorough, itemized findings"),
        StyleVariant(name="brief", description="Short, direct observations"),
    ],
    styles_to_sample=1,     # Pick 1 style variant per instance
)
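To make the distribution and timing fields concrete, here is a minimal sketch of one plausible way they could interact. The `sample_document_dates` function is hypothetical, and the sampling rule (include the type with `probability`, otherwise fall back to `min_count`) is an assumption rather than the library's documented algorithm.

```python
import random
from datetime import date, timedelta


def sample_document_dates(probability, min_count, max_count,
                          days_min, days_max, case_start, rng):
    """Sample how many instances of a document type appear, and when.

    Assumed rule (not confirmed by synthdocs): with `probability` the type
    appears with between max(1, min_count) and max_count instances;
    otherwise only the mandatory min_count instances are produced.
    Assumes max_count >= 1.
    """
    if rng.random() < probability:
        count = rng.randint(max(1, min_count), max_count)
    else:
        count = min_count
    # Each instance gets a day offset inside the allowed window.
    offsets = sorted(rng.randint(days_min, days_max) for _ in range(count))
    return [case_start + timedelta(days=offset) for offset in offsets]


rng = random.Random(7)  # fixed seed so the sketch is reproducible
case_start = date(2024, 1, 1)

# Inspection Report: optional, up to 2 instances, 30 to 180 days in.
inspection_dates = sample_document_dates(0.7, 0, 2, 30, 180, case_start, rng)

# A mandatory single document; the 0-day window is a hypothetical choice.
lease_dates = sample_document_dates(1.0, 1, 1, 0, 0, case_start, rng)

print(inspection_dates, lease_dates)
```

Under this reading, `probability=1.0` with `min_count=1` and `max_count=1` always yields exactly one instance, which matches the intent of a mandatory document.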

Field Types

Both EntityField and FactField support these types:

| Type | Description | Extra fields |
| --- | --- | --- |
| str | Free-form text | |
| int | Integer | min_value, max_value |
| date | ISO date string | |
| enum | One of fixed options | options (list of strings) |
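The sketch below shows how a generated value might be checked against the four field types. The `is_valid_value` function is hypothetical and is not part of synthdocs; it only illustrates what each type and its extra fields imply.

```python
from datetime import date


def is_valid_value(value, field_type, *, min_value=None, max_value=None, options=None):
    """Check a value against a field type (hypothetical helper, not the library's API)."""
    if field_type == "str":
        return isinstance(value, str)
    if field_type == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            return False
        if min_value is not None and value < min_value:
            return False
        if max_value is not None and value > max_value:
            return False
        return True
    if field_type == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)  # raises ValueError on non-ISO strings
        except ValueError:
            return False
        return True
    if field_type == "enum":
        return options is not None and value in options
    raise ValueError(f"Unknown field type: {field_type}")


print(is_valid_value(1500, "int", min_value=500, max_value=5000))
print(is_valid_value("2024-03-01", "date"))
print(is_valid_value("late", "enum", options=["paid", "partial", "late", "unpaid"]))
```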

Case Description Template

The description field on CaseTemplate is a Jinja2 template that produces the case overview. Reference entities by their schema name (lowercased):

CaseTemplate(
    description=(
        "Tenant {{ tenant.name }} is renting {{ property.address }} "
        "{{ property.unit }}, {{ property.city }} from {{ landlord.name }}."
    ),
    # ...
)

This description is passed to the LLM when generating each document, ensuring consistency.
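The lowercased-name lookup can be sketched as follows. To stay dependency-free, this uses a small regex-based `render` function as a stand-in for Jinja2 (it handles only `{{ name.field }}` placeholders), and the entity values are made up for illustration.

```python
import re


def render(template: str, context: dict) -> str:
    """Substitute {{ name.field }} placeholders.

    A simplified stand-in for Jinja2 rendering, not the library's renderer.
    """
    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# Hypothetical generated entities, keyed by schema name.
entities = {
    "Tenant": {"name": "Dana Reyes"},
    "Landlord": {"name": "Harbor Holdings LLC"},
    "Property": {"address": "12 Elm Street"},
}

# Schema names are lowercased to form the template variables.
context = {name.lower(): fields for name, fields in entities.items()}

description = render(
    "Tenant {{ tenant.name }} is renting {{ property.address }} "
    "from {{ landlord.name }}.",
    context,
)
print(description)
# Tenant Dana Reyes is renting 12 Elm Street from Harbor Holdings LLC.
```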

Controlling Output

Document Count

Use target_document_count to set bounds on total documents per case:

CaseTemplate(
    # ...
    target_document_count=(3, 6),  # Generate 3-6 documents per case
)

Style Preferences

Use style_preferences to set global style rules that apply to all documents:

CaseTemplate(
    # ...
    style_preferences=[
        "Use clear, formal legal language",
        "Include realistic addresses and dates",
    ],
)

Using the Template

from pathlib import Path

from synthdocs import generate_case_batch, MistralBackend

results = generate_case_batch(
    template=my_template,
    count=5,
    backend=MistralBackend(),
    output_dir=Path("./output"),
    variation_hints="Mix of urban and rural addresses",
)

The variation_hints parameter guides the LLM to produce diverse entity values across cases.