Tag every legal document with precise, machine-readable legal concepts, individuals, and relationships — automatically.
Legal documents contain thousands of concepts buried in dense prose: causes of action, contract terms, regulatory frameworks, named entities, and inter-concept relationships. FOLIO Enrich reads your documents, identifies those concepts, maps each one to the FOLIO ontology (18,000+ standardized legal concepts), extracts named individuals (citations, parties, dates, amounts) and OWL object properties (legal verbs and relationships), scores confidence, and exports structured results in 13 formats — all through a single API call.
Upload complaints, contracts, or regulatory filings.
Seconds later, receive a structured annotation layer that machines can search, filter, sort, and analyze.
- Features
- Architecture
- Pipeline Stages
- Quick Start
- Configuration
- API Reference
- Export Formats
- Frontend
- LLM Integration
- Embedding & Semantic Search
- Confidence Scoring
- Individual Extraction
- Property Extraction
- Metadata Extraction
- Document Type Detection
- Synthetic Document Generation
- Testing
- Project Structure
- Dependencies
- License
- Multi-format ingestion — PDF, DOCX, HTML, Markdown, RTF, email (EML/MSG), and plain text
- Dual-path enrichment — spaCy EntityRuler and LLM concept extraction run in parallel, then reconcile
- FOLIO ontology mapping — resolves every annotation to a FOLIO IRI with multi-candidate backup lists
- Multi-branch classification — assigns multiple FOLIO branch categories per concept with nested same-type span support
- Individual extraction — three-path hybrid (eyecite/citeurl citations, 14 regex/spaCy extractors, LLM class linking) for OWL individuals
- Property extraction — Aho-Corasick + LLM identification of OWL ObjectProperties (legal verbs and relationships) with domain/range/inverse linking
- 28-field metadata extraction — pipeline-aware context engineering extracts parties, dates, courts, claims, attorneys, and more
- Document type detection — early parallel LLM classification with post-pipeline quality cross-check
- Calibrated confidence scoring — graduated initial scores, contextual LLM reranking, branch judge blending, and embedding triage across 5 stages
- Containment-aware dedup — nested spans (A inside B) survive across all stages; partial overlaps resolve to longer match
- 13 export formats — JSON, JSON-LD, XML, CSV, JSONL, Parquet, Elasticsearch bulk, Neo4j CSV, RAG chunks, RDF/Turtle, brat standoff, HTML, Excel
- Real-time streaming — Server-Sent Events (SSE) for live pipeline progress including individuals, properties, and document type
- Annotation lifecycle — promote, reject, restore, cascade-promote, and bulk-reject operations with full lineage tracking
- Per-task LLM routing — assign different LLM providers to 9 pipeline tasks (classifier, extractor, concept, branch judge, area of law, synthetic, individual, property, document type)
- 14 LLM providers — OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Meta Llama, Groq, xAI, GitHub Models, Ollama, LM Studio, Llamafile, and custom OpenAI-compatible endpoints
- Semantic search — FAISS-backed embedding index for fast concept lookup and conflict resolution
- Synthetic document generation — LLM-powered generation of realistic legal test documents across 40+ document types
- Legal citation parsing — eyecite + citeurl integration for citation extraction and URL resolution
- Feedback system — per-annotation user feedback with aggregated insights dashboard
- Dark-themed UI — single-file browser frontend with concept graph visualization (Cytoscape), individual/property tabs, multi-branch overlays, and polyhierarchy tree views
┌──────────────────────────────────────────────────────────────┐
│ Frontend (index.html) │
│ Vanilla JS · Dark theme · Cytoscape graph viz │
│ Tabs: Annotated Text · Concepts · Individuals · Properties │
│ Triples · Metadata · Export · Insights │
└───────────────────────────┬──────────────────────────────────┘
│ REST + SSE
┌───────────────────────────▼──────────────────────────────────┐
│ FastAPI Backend │
├──────────────────────────────────────────────────────────────┤
│ Middleware: Rate Limit → Security → CORS → Error Handler │
├──────────────────────────────────────────────────────────────┤
│ Routes: /enrich · /export · /concepts · /synthetic │
│ /feedback · /settings · /health │
├──────────────────────────────────────────────────────────────┤
│ Pipeline Orchestrator (16 stages · 8 UI stations) │
│ │
│ Phase 1 — Sequential │
│ 1. Ingestion → UI: Ingest │
│ 2. Normalization → UI: Normalize │
│ │
│ Phase 2 — Parallel (run concurrently) │
│ 3. EntityRuler → UI: String │
│ 4. LLM Concept → UI: LLM Extract │
│ 5. EarlyIndividual → UI: String │
│ 6. EarlyProperty → UI: String │
│ 7. DocumentType → UI: LLM Extract │
│ │
│ Phase 3 — Sequential (post-parallel) │
│ 8. Reconciliation → UI: Resolve │
│ 9. Resolution → UI: Resolve │
│ 10. Contextual Rerank → UI: Resolve │
│ 11. Branch Judge → UI: Judge │
│ 12. String Match → UI: Match │
│ 13. LLM Individual → UI: Finalize │
│ 14. LLM Property → UI: Finalize │
│ 15. Dependency → UI: Finalize │
│ 16. Metadata → UI: Finalize │
│ │
│ Post-pipeline → UI: Finalize │
│ Area of Law · Document Type Quality Check │
├──────────────────────────────────────────────────────────────┤
│ Services: FOLIO · Embedding · LLM Registry · Individual · │
│ Property · Quality · Job Store │
├──────────────────────────────────────────────────────────────┤
│ Storage: ~/.folio-enrich/jobs/ (JSON, atomic writes) │
└──────────────────────────────────────────────────────────────┘
The pipeline runs in three phases with 5 parallel stages and 9 sequential post-parallel stages. LLM-dependent stages are automatically skipped when no LLM is configured.
In the UI's live progress journey these 16 backend stages roll up into 8 user-facing stations — Ingest → Normalize → String → LLM Extract → Resolve → Judge → Match → Finalize. The UI Station column shows which station each backend stage appears under (several parallel/post stages share a station).
| # | Backend Stage | UI Station | Phase | Description |
|---|---|---|---|---|
| 1 | Ingestion | Ingest | Pre-parallel | Converts PDF, DOCX, HTML, Markdown, RTF, email, or plain text to raw text |
| 2 | Normalization | Normalize | Pre-parallel | Chunks text into semantic chunks, builds sentence index |
| 3 | EntityRuler | String | Parallel | spaCy pattern matching against FOLIO preferred and alternative labels |
| 4 | LLM Concept | LLM Extract | Parallel | LLM-based concept extraction per chunk (runs concurrently with EntityRuler) |
| 5 | EarlyIndividual | String | Parallel | Citation parsing (eyecite/citeurl) + 14 regex/spaCy entity extractors |
| 6 | EarlyProperty | String | Parallel | Aho-Corasick automaton matching FOLIO ObjectProperty labels |
| 7 | DocumentType | LLM Extract | Parallel | LLM identifies what the document "calls itself" from title/header |
| 8 | Reconciliation | Resolve | Post-parallel | Merges EntityRuler + LLM results using embedding-powered triage |
| 9 | Resolution | Resolve | Post-parallel | Resolves concept text to FOLIO IRIs with multi-candidate backup lists |
| 10 | Contextual Rerank | Resolve | Post-parallel | LLM reranking using full document context (50/50 blend with pipeline score) |
| 11 | Branch Judge | Judge | Post-parallel | LLM assigns FOLIO branch categories for ambiguous concepts (70/30 blend) |
| 12 | String Match | Match | Post-parallel | Aho-Corasick matching with containment-aware dedup and multi-branch keying |
| 13 | LLM Individual | Finalize | Post-parallel | LLM links individuals to resolved OWL class annotations |
| 14 | LLM Property | Finalize | Post-parallel | LLM identifies properties with domain/range cross-linking |
| 15 | Dependency | Finalize | Post-parallel | spaCy dependency parsing to extract subject-predicate-object triples |
| 16 | Metadata | Finalize | Post-parallel | LLM extracts 28 metadata fields using full pipeline context |
Post-pipeline (UI station: Finalize): Area of Law assessment classifies legal domains. Document Type quality cross-check validates findings against pipeline output.
- Python 3.11+
- A spaCy English model (
en_core_web_sm)
# Clone the repository
git clone https://github.com/alea-institute/folio-enrich.git
cd folio-enrich/backend
# Create virtual environment and install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
# Download spaCy model
python -m spacy download en_core_web_smcd backend
.venv/bin/uvicorn app.main:app --host 0.0.0.0 --port 8731 --reloadThe API is now available at http://localhost:8731. Open frontend/index.html in a browser for the UI.
# Submit a document for enrichment
curl -X POST http://localhost:8731/enrich \
-H "Content-Type: application/json" \
-d '{"text": "The plaintiff filed a motion for summary judgment in the District Court."}'
# Check job status (replace JOB_ID)
curl http://localhost:8731/enrich/JOB_ID
# Export results as JSON-LD
curl http://localhost:8731/enrich/JOB_ID/export?format=jsonld
# Stream pipeline progress via SSE
curl http://localhost:8731/enrich/JOB_ID/streamAll settings use environment variables with the FOLIO_ENRICH_ prefix. Managed via pydantic-settings.
| Variable | Default | Description |
|---|---|---|
FOLIO_ENRICH_JOBS_DIR |
~/.folio-enrich/jobs |
Job storage directory |
FOLIO_ENRICH_MAX_UPLOAD_SIZE |
52428800 (50 MB) |
Maximum upload size in bytes |
FOLIO_ENRICH_MAX_CONCURRENT_JOBS |
10 |
Maximum concurrent pipeline jobs |
FOLIO_ENRICH_JOB_RETENTION_DAYS |
30 |
Days before jobs are auto-cleaned |
| Variable | Default | Description |
|---|---|---|
FOLIO_ENRICH_LLM_PROVIDER |
google |
Global default LLM provider |
FOLIO_ENRICH_LLM_MODEL |
gemini-3-flash |
Global default model |
FOLIO_ENRICH_OPENAI_API_KEY |
— | OpenAI API key |
FOLIO_ENRICH_ANTHROPIC_API_KEY |
— | Anthropic API key |
FOLIO_ENRICH_GOOGLE_API_KEY |
— | Google Gemini API key |
Each pipeline task can use a different provider/model:
| Variable Pattern | Example |
|---|---|
FOLIO_ENRICH_LLM_{TASK}_PROVIDER |
FOLIO_ENRICH_LLM_CLASSIFIER_PROVIDER=anthropic |
FOLIO_ENRICH_LLM_{TASK}_MODEL |
FOLIO_ENRICH_LLM_CLASSIFIER_MODEL=claude-sonnet-4-6 |
Tasks: CLASSIFIER, EXTRACTOR, CONCEPT, BRANCH_JUDGE, AREA_OF_LAW, SYNTHETIC, INDIVIDUAL, PROPERTY, DOCUMENT_TYPE
| Variable | Default | Description |
|---|---|---|
FOLIO_ENRICH_INDIVIDUAL_EXTRACTION_ENABLED |
true |
Enable individual (named entity) extraction |
FOLIO_ENRICH_INDIVIDUAL_REGEX_ONLY |
false |
Skip LLM class linking, use only regex/spaCy |
FOLIO_ENRICH_PROPERTY_EXTRACTION_ENABLED |
true |
Enable property (verb/relation) extraction |
FOLIO_ENRICH_PROPERTY_REGEX_ONLY |
false |
Skip LLM property identification, use only Aho-Corasick |
| Variable | Default | Description |
|---|---|---|
FOLIO_ENRICH_EMBEDDING_PROVIDER |
local |
Embedding provider (local, ollama, openai) |
FOLIO_ENRICH_EMBEDDING_MODEL |
all-MiniLM-L6-v2 |
Embedding model name |
FOLIO_ENRICH_EMBEDDING_DISABLED |
false |
Disable embedding features entirely |
FOLIO_ENRICH_SEMANTIC_SIMILARITY_THRESHOLD |
0.80 |
Minimum similarity for conflict resolution |
| Variable | Default | Description |
|---|---|---|
FOLIO_ENRICH_MAX_CHUNK_CHARS |
3000 |
Maximum characters per text chunk |
FOLIO_ENRICH_CHUNK_OVERLAP_CHARS |
200 |
Overlap between adjacent chunks |
| Variable | Default | Description |
|---|---|---|
FOLIO_ENRICH_RATE_LIMIT_REQUESTS |
60 |
Max requests per window |
FOLIO_ENRICH_RATE_LIMIT_WINDOW |
60 |
Window size in seconds |
| Method | Endpoint | Description |
|---|---|---|
POST |
/enrich |
Submit a document for enrichment (returns 202 with job_id) |
GET |
/enrich/{job_id} |
Get job status and results (includes annotations, individuals, properties) |
GET |
/enrich/{job_id}/stream |
SSE stream of pipeline progress |
GET |
/enrich/branches |
List all FOLIO branches with display colors |
| Method | Endpoint | Description |
|---|---|---|
POST |
/enrich/{job_id}/annotations/{id}/promote |
Promote a backup concept to primary |
POST |
/enrich/{job_id}/annotations/{id}/reject |
Dismiss an annotation as a false positive |
POST |
/enrich/{job_id}/annotations/{id}/restore |
Restore a dismissed annotation |
POST |
/enrich/{job_id}/cascade-promote |
Bulk promote a concept across all matching annotations |
POST |
/enrich/{job_id}/annotations/bulk-reject |
Bulk reject all annotations with a given IRI |
GET |
/enrich/{job_id}/annotations/{id}/lineage |
Full event history for an annotation |
| Method | Endpoint | Description |
|---|---|---|
GET |
/enrich/{job_id}/export?format=json |
Export results in any of the 13 supported formats |
Query parameters: format (required), include_dismissed (default false)
All export formats include annotations, individuals, and properties sections.
| Method | Endpoint | Description |
|---|---|---|
GET |
/concepts/{iri_hash} |
Look up a FOLIO concept by IRI hash |
GET |
/concepts/{iri_hash}/graph |
BFS entity graph around a concept |
| Method | Endpoint | Description |
|---|---|---|
POST |
/synthetic |
Generate a synthetic legal document |
GET |
/synthetic/types |
List available document types (10 categories, 40+ types) |
| Method | Endpoint | Description |
|---|---|---|
POST |
/feedback |
Submit feedback on an annotation |
GET |
/feedback/insights |
Aggregated feedback insights |
GET |
/feedback/insights/csv |
Export feedback data as CSV |
| Method | Endpoint | Description |
|---|---|---|
GET |
/settings/llm/providers |
List available LLM providers |
GET |
/settings/llm/models |
List models for a provider |
POST |
/settings/llm/test |
Test an LLM connection |
GET |
/settings/embedding/providers |
List embedding providers |
| Method | Endpoint | Description |
|---|---|---|
GET |
/health |
Simple health check |
GET |
/health/detail |
Detailed subsystem health (FOLIO, embedding, LLM, spaCy) |
| Format | Content-Type | Description |
|---|---|---|
json |
application/json |
Flat JSON with annotations, individuals, and properties arrays |
jsonld |
application/ld+json |
JSON-LD with @context for Linked Data |
xml |
application/xml |
Hierarchical XML with annotation elements |
csv |
text/csv |
One row per annotation |
jsonl |
application/x-ndjson |
Line-delimited JSON |
| Format | Content-Type | Description |
|---|---|---|
parquet |
application/octet-stream |
Apache Parquet columnar format |
elasticsearch |
application/x-ndjson |
Elasticsearch bulk indexing format |
neo4j |
text/csv |
CSV formatted for Neo4j graph import |
rag |
application/json |
RAG-optimized chunks with embedded annotations |
rdf |
text/turtle |
RDF/Turtle for Semantic Web integration |
brat |
text/plain |
brat standoff annotation format |
html |
text/html |
Interactive HTML with styled tooltips and confidence bars |
excel |
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
Spreadsheet with color-coded confidence |
The frontend is a single-file SPA at frontend/index.html — no build step required.
Capabilities:
- Drag-and-drop document upload (or paste text directly)
- Real-time pipeline progress via SSE with progressive rendering
- Annotation viewer with ranked concept candidates and confidence bars
- Color-coded FOLIO branch badges (Actor, Area of Law, Document, Engagement, Event, Location, etc.)
- Multi-branch overlays with stacked gradient borders and colored dot indicators
- Polyhierarchy tree views for multi-parent concept hierarchies
- Individuals tab with citation details, entity cards, and OWL class links
- Properties tab with verb/relation cards showing definition, domain/range, and inverse
- Document type banner with confidence indicator
- "N sources agree" badges for multi-source annotation confirmation
- Annotation state management (preliminary / confirmed / rejected)
- Concept graph visualization using Cytoscape with Dagre layout
- Right detail panel with concept info, entity graph, and ancestry tree
- Cascade-promote and bulk-reject operations
- Export menu for all 13 formats
- LLM provider settings panel
- Dark theme
| Provider | Env Key Suffix | Default Model |
|---|---|---|
| OpenAI | OPENAI_API_KEY |
gpt-4o |
| Anthropic | ANTHROPIC_API_KEY |
claude-sonnet-4-6 |
| Google Gemini | GOOGLE_API_KEY |
gemini-3-flash-preview |
| Mistral | MISTRAL_API_KEY |
mistral-medium-latest |
| Cohere | COHERE_API_KEY |
command-a-03-2025 |
| Meta Llama | META_API_KEY |
— |
| Groq | GROQ_API_KEY |
— |
| xAI (Grok) | XAI_API_KEY |
— |
| GitHub Models | GITHUB_TOKEN |
— |
| Ollama | — (local) | — |
| LM Studio | — (local) | — |
| Llamafile | — (local) | — |
| Custom endpoint | CUSTOM_API_KEY |
— |
Each task can be independently routed to a different provider:
| Task | Pipeline Role |
|---|---|
| Classifier | Document type classification |
| Extractor | Structured metadata field extraction (28 fields) |
| Concept | Legal concept identification from text chunks |
| Branch Judge | FOLIO branch category assignment for ambiguous concepts |
| Area of Law | Post-pipeline legal domain classification |
| Individual | OWL individual class linking |
| Property | OWL ObjectProperty identification with domain/range |
| Document Type | Early parallel document self-identification |
| Synthetic | Test document generation |
The pipeline degrades gracefully — LLM-dependent stages are skipped when no provider is available, and the EntityRuler + String Match + EarlyIndividual + EarlyProperty stages still produce useful results.
At startup, FOLIO Enrich pre-computes embeddings for all FOLIO concept labels and builds a FAISS index for fast similarity search.
| Provider | Model | Notes |
|---|---|---|
| Local (default) | all-MiniLM-L6-v2 |
Runs on CPU, no external calls |
| OpenAI | text-embedding-3-small |
Requires API key |
| Ollama | Configurable | Local inference server |
- Reconciliation — resolves conflicts when EntityRuler and LLM disagree on a concept
- Semantic Ruler — supplements pattern matching with embedding similarity
- Resolution — helps find the best FOLIO IRI when exact label match fails
- Concept Lookup — powers the
/conceptsAPI for semantic search
Annotations pass through a multi-stage confidence calibration system. Each stage blends its signal with the running score using explicit weights.
| Match Type | Base Score |
|---|---|
| Multi-word preferred label | 0.90 |
| Single-word preferred label | 0.72 |
| Multi-word alternative label | 0.65 |
| Single-word alternative label | 0.35 |
- If EntityRuler and LLM agree: takes the higher confidence
- If they disagree: embedding similarity breaks the tie
- LLM-only concepts enter at LLM-reported confidence
An LLM evaluates each annotation against the full document context using a calibrated rubric:
new_confidence = pipeline_score * 0.5 + context_score * 0.5
Rubric anchors: 0.95 = unambiguous match, 0.70 = plausible, 0.40 = weak, 0.20 = likely false positive
An LLM assigns the correct FOLIO branch category:
new_confidence = existing_confidence * 0.7 + judge_score * 0.3
High-confidence annotations are promoted to document-level metadata fields.
FOLIO Enrich extracts OWL individuals (named instances) using a three-path hybrid approach across two pipeline stages.
- Legal citations — eyecite parses U.S. legal citations; citeurl normalizes forms and resolves URLs
- 14 regex/spaCy extractors — dates, monetary amounts, addresses, durations, percentages, phone numbers, emails, URLs, statutory references, court names, case numbers, organization entities, person entities, and geopolitical entities
- Class linking — LLM links extracted individuals to resolved OWL class annotations with confidence scores and relationship types
- Deduplication — merges duplicates across extraction paths
| Source | Extracted Types |
|---|---|
| eyecite + citeurl | Case citations, statutory citations, regulatory citations with normalized forms and URLs |
| Regex/spaCy | DATE, MONEY, DURATION, PERCENT, ADDRESS, PHONE, EMAIL, URL, STATUTE, COURT, CASE_NUMBER, ORG, PERSON, GPE |
| LLM | Class-linked individuals with relationship to OWL class annotations |
FOLIO Enrich identifies OWL ObjectProperties (legal verbs and relationships) using a two-stage approach.
- Aho-Corasick automaton — fast multi-pattern matching against all FOLIO ObjectProperty labels
- Containment-aware overlap resolution — contained spans survive, partial overlaps resolve to longer match
- Contextual identification — LLM identifies property verbs in context with domain/range cross-linking
- Merge and dedup — combines with early results, deduplicates
Each extracted property includes:
- FOLIO IRI — resolved ObjectProperty identifier
- Label and definition — from FOLIO ontology
- Domain and range — OWL class constraints
- Inverse relation — linked inverse property (if any)
- Alternative labels and examples — from SKOS metadata
- Confidence score — extraction confidence
The Metadata stage runs last in the pipeline and uses all prior pipeline output as structured context for LLM extraction.
| Category | Fields |
|---|---|
| Case identification | case_name, case_number, docket_entry_number, court, jurisdiction |
| Parties and people | parties (with roles), judge, attorneys (with affiliations), signatories, witnesses, author, recipient |
| Legal substance | cause_of_action, claim_types, relief_sought, disposition, standard_of_review, governing_law, procedural_posture |
| Dates | date_filed, date_signed, date_effective, date_due, dates_mentioned |
| Document info | document_title, related_documents, confidentiality, contract_type |
Additional context-aware fields: counterparties, term_duration, termination_conditions, consideration, addresses, has_exhibits, exhibit_list, language
The LLM receives structured context from all upstream stages:
- Individuals grouped by type (citations, persons, organizations, dates, amounts)
- Low-confidence entities with surrounding sentence context
- Extracted properties with FOLIO labels
- Resolved concepts with branch classifications
- Subject-predicate-object triples from dependency parsing
- Areas of law assessment
Asks the LLM what the document "calls itself" from the title, caption, or header. Stores self_identified_type in metadata for use by all downstream stages.
After pipeline completion, a DocumentTypeChecker cross-validates:
- Compares document type against discovered branches, concepts, and annotation counts
- Detects missing expected branches or unexpected dominant branches
- Stores
quality_signalsarray with severity levels in metadata
Generate realistic legal documents for testing and demonstration.
| Category | Example Types |
|---|---|
| Litigation | Complaint, Motion for Summary Judgment, Appellate Brief, Discovery Request, Settlement Agreement, Court Order |
| Contracts | Employment Agreement, NDA, SaaS Agreement, License Agreement, Lease Agreement |
| Corporate | Articles of Incorporation, Board Resolution, Merger Agreement, Operating Agreement |
| Regulatory | SEC Filing, Agency Comment Letter, Compliance Policy |
| Law Firm Operations | Engagement Letter, Legal Opinion |
| Real Estate | Purchase Agreement, Deed of Trust, Easement Agreement |
| Intellectual Property | Patent Application, Trademark Registration |
| Estate Planning | Last Will and Testament, Trust Agreement |
| Immigration | Visa Petition, Naturalization Application |
# List available document types
curl http://localhost:8731/synthetic/types
# Generate a document
curl -X POST http://localhost:8731/synthetic \
-H "Content-Type: application/json" \
-d '{"doc_type": "complaint", "length": "medium", "jurisdiction": "California"}'cd backend
.venv/bin/python -m pytest tests/ -v586 tests across 45 test files covering:
| Area | Tests Cover |
|---|---|
| Pipeline | End-to-end pipeline, progressive pipeline, parallel execution, task LLM routing |
| Stages | EntityRuler, reconciliation, resolution, rerank, branch judge, annotation states |
| Individuals | Citation extraction, entity extraction, class linking, deduplication |
| Properties | Aho-Corasick matching, property dedup, domain/range resolution |
| Document type | Self-identification, quality cross-check signals |
| Metadata | 28-field extraction, pipeline context building |
| Ingestion | PDF, HTML, RTF, email, DOCX, Markdown, plain text |
| LLM | Provider registry, per-task routing, connection testing, pricing |
| Embedding | Semantic ruler, embedding index, FAISS index |
| Export | All Tier 1 formats, all Tier 2 formats |
| Concepts | Concept identification, concept detail, resolver |
| Matching | Aho-Corasick automaton, containment-aware dedup, multi-branch keying |
| Feedback | Submission, insights aggregation, dismiss/restore |
| Infrastructure | Security middleware, SSE streaming, job retention, rate limiting |
folio-enrich/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app, startup, middleware
│ │ ├── config.py # Pydantic settings (env vars)
│ │ ├── api/routes/
│ │ │ ├── enrich.py # Document enrichment endpoints
│ │ │ ├── export.py # Export endpoints (13 formats)
│ │ │ ├── concepts.py # FOLIO concept lookup + graph
│ │ │ ├── synthetic.py # Synthetic document generation
│ │ │ ├── feedback.py # User feedback + insights
│ │ │ ├── settings.py # LLM/embedding configuration
│ │ │ └── health.py # Health checks
│ │ ├── models/
│ │ │ ├── annotation.py # Annotation, Individual, PropertyAnnotation
│ │ │ ├── job.py # Job, JobStatus, JobResult
│ │ │ ├── document.py # Document formats and chunks
│ │ │ ├── llm_models.py # LLM provider types (14)
│ │ │ └── embedding_models.py # Embedding model config
│ │ ├── pipeline/
│ │ │ ├── orchestrator.py # Three-phase parallel orchestrator
│ │ │ └── stages/
│ │ │ ├── base.py # PipelineStage ABC
│ │ │ ├── ingestion_stage.py # Multi-format document ingestion
│ │ │ ├── normalization_stage.py # Chunking + sentence indexing
│ │ │ ├── entity_ruler_stage.py # spaCy pattern matching
│ │ │ ├── llm_concept_stage.py # LLM concept extraction
│ │ │ ├── individual_stage.py # EarlyIndividual + LLMIndividual
│ │ │ ├── property_stage.py # EarlyProperty + LLMProperty
│ │ │ ├── document_type_stage.py # Early document type classification
│ │ │ ├── reconciliation_stage.py # Dual-path merge
│ │ │ ├── resolution_stage.py # FOLIO IRI resolution
│ │ │ ├── rerank_stage.py # Contextual LLM reranking
│ │ │ ├── branch_judge_stage.py # Branch category assignment
│ │ │ ├── string_match_stage.py # Aho-Corasick matching
│ │ │ ├── metadata_stage.py # 28-field extraction + classification
│ │ │ └── dependency_stage.py # SPO triple extraction
│ │ ├── services/
│ │ │ ├── folio/ # FOLIO ontology (resolver, search, graph)
│ │ │ ├── llm/ # LLM registry + provider implementations
│ │ │ ├── embedding/ # Embedding service + FAISS index
│ │ │ ├── entity_ruler/ # Pattern builder + semantic ruler
│ │ │ ├── matching/ # Aho-Corasick string matching
│ │ │ ├── concept/ # LLM concept ID, branch judge, area of law
│ │ │ ├── individual/ # Citation + entity extractors, deduplicator
│ │ │ ├── property/ # Property matcher, deduplicator, LLM identifier
│ │ │ ├── quality/ # Document type cross-check
│ │ │ ├── metadata/ # Classifier, extractor, promoter
│ │ │ ├── normalization/ # Text chunking + sentence splitting
│ │ │ ├── reconciliation/ # EntityRuler + LLM merge logic
│ │ │ ├── dependency/ # spaCy dependency parser
│ │ │ ├── streaming/ # SSE implementation
│ │ │ └── testing/ # Synthetic document generator
│ │ ├── middleware/
│ │ │ ├── error_handler.py # Global exception handling
│ │ │ ├── rate_limit.py # Request rate limiting
│ │ │ └── security.py # Header validation + CORS
│ │ └── storage/
│ │ ├── job_store.py # Atomic JSON job persistence
│ │ └── feedback_store.py # Feedback persistence
│ ├── tests/ # 586 tests across 45 files
│ └── pyproject.toml # Dependencies + build config
├── frontend/
│ └── index.html # Single-file SPA (vanilla JS, dark theme)
└── README.md
| Library | Purpose |
|---|---|
| FastAPI | Web framework |
| folio-python | FOLIO ontology access (classes, properties, search) |
| spaCy | NLP, entity ruler, NER, dependency parsing |
| pydantic-settings | Configuration via environment variables |
| Library | Purpose |
|---|---|
| pyahocorasick | Multi-pattern string matching (concepts + properties) |
| faiss-cpu | Vector similarity search |
| nupunkt | Sentence segmentation |
| eyecite | Legal citation parsing |
| citeurl | Citation normalization + URL resolution |
| Library | Purpose |
|---|---|
| PyMuPDF | PDF text extraction |
| python-docx | DOCX handling |
| beautifulsoup4 | HTML parsing |
| markdown-it-py | Markdown parsing |
| striprtf | RTF format handling |
| extract-msg | Email (EML/MSG) parsing |
| Library | Purpose |
|---|---|
| pyarrow | Parquet export |
| rdflib | RDF/Turtle export |
| openpyxl | Excel export |
| Library | Purpose |
|---|---|
| uvicorn | ASGI server |
| httpx | HTTP client for LLM APIs |
| sse-starlette | Server-Sent Events |
See LICENSE for details.