Local LLM-driven entity descriptions for archive gaps #15

Description

@adhit-r

Skills: Transformers.js / WebLLM · prompt engineering · build scripting · QA review
Time: ~10 hours
Good for: ML engineers · NLP folks · lore curators
Difficulty: Advanced


Context

Of our 310 entities, ~120 don't have Wikipedia entries (entity.long is
empty). These are mostly minor characters, ships, and vehicles. Semantic search
struggles on them because there's no narrative for the embedding model to chew on.

Goal

At build time, run a small local LLM (Phi-3-mini, Gemma-2B, or similar via
Transformers.js / WebLLM) to generate canonical descriptions from the entity's
relations + name + type. Each description is manually verified before merging
into kb.json.

Where to start

  • New scripts/build-llm-descriptions.ts — pure Node script that:
    • Loads kb.json
    • For each entity with empty long, builds a structured prompt from
      name + type + relations + short
    • Calls a local LLM (no external API)
    • Caches output to data/.cache/llm/<entityId>.json for review
  • A small CLI UI to approve/reject each generated description before merging
  • Re-run build:embeddings after merge
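
The first steps of the script could be sketched roughly as below. This is a minimal sketch, not the project's actual code: the `Entity` and `Relation` shapes (including the `id` field and the `kind`/`target` layout of relations), the prompt wording, and the `INSUFFICIENT` sentinel are all assumptions made for illustration; only the field names `name`, `type`, `relations`, `short`, `long` and the cache path layout come from the list above.

```typescript
// Hypothetical shapes: the real kb.json schema may differ.
interface Relation {
  kind: string;   // assumed: relation label, e.g. "crew_of"
  target: string; // assumed: name of the related entity
}

interface Entity {
  id: string;     // assumed: used as <entityId> in the cache path
  name: string;
  type: string;
  short: string;
  long?: string;
  relations: Relation[];
}

// Select the entities whose `long` field is missing or blank.
function needsDescription(entities: Entity[]): Entity[] {
  return entities.filter((e) => (e.long ?? "").trim() === "");
}

// Build a structured prompt from name + type + relations + short.
// The instruction text and the INSUFFICIENT sentinel are illustrative only.
function buildPrompt(e: Entity): string {
  const relations =
    e.relations.length > 0
      ? e.relations.map((r) => `- ${r.kind}: ${r.target}`).join("\n")
      : "- (none)";
  return [
    "Write a two-sentence description using ONLY the facts below.",
    "If the facts are insufficient, reply with exactly: INSUFFICIENT",
    `Name: ${e.name}`,
    `Type: ${e.type}`,
    `Summary: ${e.short.trim() === "" ? "(none)" : e.short}`,
    "Relations:",
    relations,
  ].join("\n");
}

// Review cache location: data/.cache/llm/<entityId>.json
function cachePath(e: Entity): string {
  return `data/.cache/llm/${e.id}.json`;
}
```

Keeping these as pure functions means the model call and the file writes can be layered on top and swapped without touching the prompt logic. If entity ids can contain path separators, they would need sanitizing before being used in `cachePath`.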

Acceptance criteria

  • 80%+ of empty long fields populated with plausible descriptions
  • Every generated description manually reviewed (one-shot pass is fine)
  • No hallucinated facts; if the LLM doesn't have enough signal, leave the
    field empty
  • Local-only, no API keys, no network calls beyond model download
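
One way to make the first three criteria checkable is a small gate applied after review. The sketch below is an assumption about how that could look: the `Review` record, the `INSUFFICIENT` sentinel (a marker the prompt would ask the model to emit when it lacks signal), and both function names are hypothetical.

```typescript
// Hypothetical review record written by the approve/reject step.
type Review = {
  entityId: string;
  text: string;      // raw model output
  approved: boolean; // reviewer's decision
};

// Assumed sentinel the model is asked to return when it lacks signal.
const INSUFFICIENT = "INSUFFICIENT";

// Text that may be merged into `long`; an empty string means "leave the field empty".
function finalText(r: Review): string {
  const text = r.text.trim();
  if (!r.approved || text === "" || text === INSUFFICIENT) return "";
  return text;
}

// Fraction of originally empty `long` fields that end up populated.
function coverage(reviews: Review[], emptyCount: number): number {
  if (emptyCount === 0) return 1;
  const filled = reviews.filter((r) => finalText(r) !== "").length;
  return filled / emptyCount;
}

// True when the 80%+ criterion is met.
function meetsCoverage(reviews: Review[], emptyCount: number): boolean {
  return coverage(reviews, emptyCount) >= 0.8;
}
```

With ~120 empty fields, this would require at least 96 approved, non-empty descriptions.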

Notes

  • Model size budget: ≤2 GB on disk, ≤4 GB RAM
  • Generation budget: ≤2s per entity on CPU (so ~4 minutes total)
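
The "~4 minutes" figure follows from ~120 entities × 2 s = 240 s. A tiny helper, sketched here with hypothetical names, could report the same arithmetic and flag individual generations that exceed the per-entity budget:

```typescript
// Per-entity budget from the notes above: 2 seconds on CPU.
const PER_ENTITY_BUDGET_MS = 2000;

// Total wall-clock budget in minutes for `count` entities.
function totalBudgetMinutes(count: number, perEntityMs: number = PER_ENTITY_BUDGET_MS): number {
  return (count * perEntityMs) / 60_000;
}

// Whether a single measured generation stayed within budget.
function withinBudget(elapsedMs: number, perEntityMs: number = PER_ENTITY_BUDGET_MS): boolean {
  return elapsedMs <= perEntityMs;
}
```

For example, `totalBudgetMinutes(120)` evaluates to `4`, matching the estimate in the note.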

Metadata

Assignees

No one assigned

    Labels

    advanced (Deep technical chops needed)
    area: ai/ml (Embeddings, NLP, on-device ML)
    area: data (Pipelines, ingestion, schema)
    enhancement (New feature or request)
