Semantic Search Framework - Architecture

Technical Specification and Design Decisions

Version: 1.0
Author: Jordan Minor
Last Updated: November 2025


Table of Contents

  1. System Overview
  2. 4-Layer Architecture
  3. Chunking Strategy
  4. Embedding Model Selection
  5. Vector Database Configuration
  6. Update Mechanisms
  7. Performance Characteristics
  8. Design Trade-offs

System Overview

Purpose

Transform markdown document collection into semantically searchable knowledge base for AI assistant context management.

Design Goals

  1. Zero-cost operation (no API calls)
  2. Sub-second search performance (<500ms)
  3. Minimal context overhead (search = 0 tokens)
  4. Maintainable updates (incremental, not full reindex)
  5. Privacy-first (all local, no data upload)

System Boundaries

In Scope:

  • Markdown (.md) files only
  • Section-based chunking (header hierarchy)
  • Semantic similarity search
  • Local inference (no cloud dependencies)
  • Incremental updates (modified files only)

Out of Scope:

  • Real-time file watching (manual reindex trigger)
  • Non-markdown formats (PDF, DOCX, HTML)
  • Cross-document relationship mapping
  • Multi-language support (English optimized)
  • Distributed deployment (single-machine only)

4-Layer Architecture

Design Philosophy

Separation of concerns: Raw data → Processing → Intelligence → Interface

Each layer has single responsibility, enabling independent optimization and testing.

Layer 1: Raw Markdown Files

Purpose: Source of truth for business documentation

Structure:

G:\My Drive\MRMINOR\
├── 01-Strategy/           # Business planning (9 docs)
├── 04-Technical/          # Technical specs (17 docs)
├── 08-Data/              # Operational protocols (17 docs)
├── 10-Financial/         # Financial tracking (10 docs)
├── 11-Legal-Compliance/  # Legal docs (6 docs)
└── [other folders...]    # 77 total documents

Total: 124 markdown files

Characteristics:

  • Standard markdown syntax (headers, lists, code blocks)
  • Hierarchical headers (# ## ### ####)
  • Mix of structured data (tables) and prose
  • File sizes: 2KB - 50KB (median: 15KB)

Design Decision: Keep source files unchanged (read-only from system perspective)

  • Rationale: Maintain single source of truth
  • Benefit: No risk of data corruption
  • Trade-off: Requires reindex on file changes

Layer 2: Processed Chunks

Purpose: Break documents into searchable units while preserving context

Processing Pipeline:

  1. File Discovery - Recursive scan of directory tree
  2. Markdown Parsing - Extract headers and content sections
  3. Chunking - Split into 1,000-character blocks with 200-char overlap
  4. Metadata Extraction - Capture file path, section headers, hierarchy
  5. Validation - Ensure no empty chunks, verify metadata completeness

Chunk Structure:

{
    "content": "Actual text content (1000 chars max)",
    "metadata": {
        "file_path": "G:\\My Drive\\MRMINOR\\01-Strategy\\...",
        "section_title": "Revenue Tracking Protocol",
        "header_path": "Strategy > Financial > Revenue",
        "chunk_index": 3,
        "total_chunks": 12
    }
}

Chunking Parameters:

  • CHUNK_SIZE: 1,000 characters
  • CHUNK_OVERLAP: 200 characters (20%)
  • MIN_CHUNK_SIZE: 100 characters (discard smaller)
  • MAX_CHUNK_SIZE: 1,500 characters (split larger)

Design Decision: Section-based chunking with overlap

  • Rationale: Preserves logical document structure
  • Benefit: Search results include meaningful context
  • Trade-off: Slightly more storage vs character-only chunking

Example:

Document: 5,000 characters
Sections: 4 (by ## headers)
Output: 6 chunks
- Section 1: 1 chunk (800 chars)
- Section 2: 2 chunks (1,200 chars → split)
- Section 3: 2 chunks (1,800 chars → split)
- Section 4: 1 chunk (600 chars)

Layer 3: Vector Embeddings

Purpose: Convert text chunks into mathematical representations for similarity comparison

Embedding Model: all-MiniLM-L6-v2

  • Architecture: Sentence Transformer (BERT-based)
  • Dimensions: 384 (dense vectors)
  • Size: 80MB (local download)
  • Speed: ~200ms per chunk (CPU inference)
  • Quality: 85% accuracy on semantic similarity tasks

Embedding Process:

  1. Load pre-trained model (one-time, cached locally)
  2. Tokenize chunk text (WordPiece tokenization)
  3. Forward pass through transformer (6 layers)
  4. Mean pooling of token embeddings
  5. L2 normalization (unit vector)
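
A minimal sketch of steps 1-5 using the sentence-transformers library (mean pooling and L2 normalization happen inside encode() when normalize_embeddings=True is passed; model name as documented above):

from sentence_transformers import SentenceTransformer

# Step 1: one-time load; the ~80MB model is cached locally after the first download
model = SentenceTransformer("all-MiniLM-L6-v2")

# Steps 2-5: encode() tokenizes (WordPiece), runs the 6-layer transformer,
# mean-pools the token embeddings, and L2-normalizes the result
vector = model.encode(
    "Revenue tracking requires monthly reconciliation",
    normalize_embeddings=True,
)

print(vector.shape)  # (384,)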

Vector Properties:

Input:  "Revenue tracking requires monthly reconciliation"
Output: [0.023, -0.156, 0.089, ..., 0.234]  # 384 dimensions
Norm:   1.0 (normalized for cosine similarity)

Storage in ChromaDB:

  • Vector: 384 × 4 bytes = 1.5KB per chunk
  • Metadata: ~500 bytes per chunk
  • Total: ~2KB per chunk
  • 7,396 chunks = ~14.5MB storage

Design Decision: Local inference vs API

  • Chosen: Local (sentence-transformers)
  • Alternative: OpenAI embeddings API
  • Rationale: Zero cost, privacy, offline capability
  • Trade-off: Slightly lower quality (85% vs 90%) acceptable for use case

Layer 4: Semantic Search Interface

Purpose: Enable AI assistant to query knowledge base via Model Context Protocol (MCP)

MCP Server Implementation:

# Server structure (simplified)
@server.call_tool()
async def search_markdown(query: str, num_results: int = 5):
    # 1. Embed query (same model as chunks)
    query_embedding = model.encode(query)
    
    # 2. Query ChromaDB (cosine similarity)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=num_results
    )
    
    # 3. Format results with metadata
    return format_results(results)

Search Flow:

  1. Client sends MCP request: search_markdown("revenue tracking")
  2. Server embeds query → 384-dim vector
  3. ChromaDB performs cosine similarity search
  4. Top N results ranked by similarity score
  5. Results formatted with file path + section context
  6. Response sent via MCP (stdio transport)

Result Format:

{
  "file": "G:\\My Drive\\MRMINOR\\10-Financial\\revenue-tracking.md",
  "section": "Monthly Reconciliation Process",
  "header_path": "Financial > Revenue > Reconciliation",
  "content": "Revenue tracking requires monthly reconciliation...",
  "similarity": 0.847
}

Performance Characteristics:

  • Query embedding: ~50ms
  • ChromaDB search: ~100-200ms
  • Result formatting: ~10ms
  • Total latency: 160-260ms (< 500ms target)

Design Decision: MCP vs REST API

  • Chosen: MCP (Model Context Protocol)
  • Alternative: REST API + HTTP server
  • Rationale: Native Claude Desktop integration, stdio transport (no ports)
  • Trade-off: MCP ecosystem required vs universal HTTP

Chunking Strategy

Section-Based Approach

Philosophy: Respect document structure for coherent search results

Algorithm:

  1. Parse markdown headers (# ## ### ####)
  2. Identify section boundaries
  3. Extract section content
  4. Apply character-based chunking if section > CHUNK_SIZE
  5. Add overlap between adjacent chunks
  6. Preserve section title in metadata

Example Document:

# Financial Management

## Revenue Tracking
Monthly reconciliation process requires...
[800 characters]

## Expense Categories
Operating expenses are categorized...
[1,500 characters → splits into 2 chunks]

### Travel Expenses
Travel costs include...
[400 characters]

Chunking Output:

  • Chunk 1: "Revenue Tracking" section (800 chars, no split)
  • Chunk 2: "Expense Categories" section part 1 (1,000 chars)
  • Chunk 3: "Expense Categories" section part 2 (700 chars, 200 overlap)
  • Chunk 4: "Travel Expenses" subsection (400 chars)

Overlap Strategy:

Purpose: Prevent information loss at chunk boundaries

Implementation:

  • Last 200 characters of Chunk N → First 200 characters of Chunk N+1
  • Ensures sentences aren't cut mid-thought
  • Maintains context continuity for embedding quality

Example:

Chunk 1: "...systems must validate input. Security protocols require..."
         [overlap region: "Security protocols require..."]
Chunk 2: "Security protocols require multi-factor authentication..."

Why 20% Overlap?

  • 10% overlap: Tested, insufficient for sentence completion
  • 20% overlap: Optimal (captures sentence context)
  • 30% overlap: Tested, marginal benefit vs storage cost
  • Conclusion: 20% balances context preservation and efficiency

Alternative Approaches Considered:

  1. Fixed-length chunking (rejected)

    • Pro: Simple implementation
    • Con: Splits mid-sentence, loses context
    • Example: "The process requires auth[SPLIT]entication and validation"
  2. Sentence-based chunking (rejected)

    • Pro: Natural boundaries
    • Con: Highly variable chunk sizes (50-2,000 chars)
    • Impact: Inconsistent embedding quality
  3. Paragraph-based chunking (rejected)

    • Pro: Logical units
    • Con: Markdown doesn't enforce paragraph structure
    • Reality: Many docs use lists, not paragraphs

Embedding Model Selection

Evaluation Criteria

Requirements:

  1. Speed: <500ms per query (target: <300ms)
  2. Quality: >80% semantic similarity accuracy
  3. Size: <500MB (reasonable local storage)
  4. Cost: Zero (no API calls)
  5. Maintenance: Stable, well-supported

Model Comparison

| Model | Dimensions | Size | Speed* | Quality** | Selected |
| --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | 80MB | 200ms | 85% | ✅ YES |
| all-mpnet-base-v2 | 768 | 420MB | 450ms | 90% | ❌ NO |
| paraphrase-MiniLM-L3-v2 | 384 | 61MB | 150ms | 75% | ❌ NO |
| sentence-t5-base | 768 | 220MB | 600ms | 87% | ❌ NO |
| OpenAI text-embedding-3-small | 1536 | API | 300ms | 92% | ❌ NO |

* Speed measured on Intel i7-10th gen CPU
** Quality from MTEB benchmark (semantic similarity tasks)

Selection Rationale: all-MiniLM-L6-v2

Chosen because:

  1. Speed: 200ms meets <300ms target
  2. Quality: 85% exceeds 80% threshold
  3. Size: 80MB easily fits local storage
  4. Balance: Best speed/quality/size trade-off
  5. Proven: 50M+ downloads, actively maintained

Why not all-mpnet-base-v2?

  • 5% quality improvement (90% vs 85%)
  • 2.2x slower (450ms vs 200ms)
  • 5.2x larger (420MB vs 80MB)
  • Conclusion: Marginal quality gain not worth speed/size cost

Why not OpenAI API?

  • Highest quality (92%)
  • Fast API response (300ms)
  • Blockers: Recurring cost (est. ~$50/month; see Design Trade-offs), requires internet, privacy concerns
  • Use case doesn't justify API dependency

Model Architecture

all-MiniLM-L6-v2 Technical Details:

  • Base: Distilled from BERT
  • Layers: 6 transformer layers
  • Hidden Size: 384 dimensions
  • Attention Heads: 12
  • Parameters: 22M (lightweight)
  • Training: Sentence similarity datasets (NLI, STS, etc.)
  • Tokenizer: WordPiece (30K vocab)

Why Sentence-Transformers Framework?

  • Designed specifically for semantic similarity
  • Optimized mean pooling of token embeddings
  • Cosine similarity built-in
  • Widely adopted (vs raw BERT models)

Vector Database Configuration

ChromaDB Architecture

Why ChromaDB?

| Criteria | ChromaDB | Pinecone | Weaviate | Qdrant |
| --- | --- | --- | --- | --- |
| Deployment | Local | Cloud | Self-host/Cloud | Self-host |
| Cost | $0 | $70+/month | $0-$25/month | $0 |
| Setup | pip install | Account + API | Docker | Docker |
| Latency | 100-200ms | 100-500ms | 100-300ms | 100-200ms |
| Privacy | Fully local | Data uploaded | Local option | Local option |
| Python Integration | Native | SDK | SDK | SDK |
| Maturity | Growing | Mature | Mature | Growing |

Decision: ChromaDB

  • Rationale: Zero cost, fully local, simple setup, good-enough performance
  • Trade-off: Less mature than Pinecone, but meets all requirements

Configuration

Collection Settings:

collection = client.get_or_create_collection(
    name="mrminor_docs",
    metadata={"hnsw:space": "cosine"},  # Cosine similarity
    embedding_function=None  # We provide embeddings
)

Distance Metric: Cosine Similarity

  • Formula: cosine(A, B) = (A · B) / (||A|| × ||B||)
  • Range: -1 to 1 (1 = identical, 0 = orthogonal, -1 = opposite)
  • Our vectors: L2 normalized (||A|| = 1), so cosine = dot product
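
A small numeric check of that last point (assuming NumPy): once both vectors are unit length, the full cosine formula and a plain dot product return the same value.

import numpy as np

a = np.array([0.3, 0.4, 0.0])
b = np.array([3.0, 4.0, 0.0])   # same direction, larger magnitude

# Normalize to unit length, as the embedding pipeline does
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cosine, 3))          # 1.0 - identical direction
print(round(np.dot(a, b), 3))    # 1.0 - for unit vectors, cosine == dot product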

Why Cosine vs Euclidean?

  • Cosine: Measures angle between vectors (semantic similarity)
  • Euclidean: Measures absolute distance (magnitude matters)
  • Example:
    • "big dog" vs "large dog" → High cosine (same direction)
    • "big dog" vs "big dog big dog" → Low Euclidean (different magnitudes)
  • Conclusion: Cosine is correct for semantic search

Storage Backend:

  • Default: DuckDB (embedded SQL database)
  • Location: ./chroma_db/ directory
  • Persistence: Automatic on collection.add()
  • Size: ~2KB per chunk × 7,396 chunks = ~14.5MB
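
A minimal persistence sketch (assuming a recent chromadb release where PersistentClient is available; older releases configured the same behavior through a Settings object):

import chromadb

# Embedded, file-backed client; data lands in ./chroma_db/ and is
# persisted automatically whenever collection.add() is called
client = chromadb.PersistentClient(path="./chroma_db")

collection = client.get_or_create_collection("mrminor_docs")
print(collection.count())  # number of stored chunks, e.g. 7396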

Indexing Algorithm: HNSW (Hierarchical Navigable Small World)

  • Purpose: Fast approximate nearest neighbor search
  • Complexity: O(log N) search time (vs O(N) brute force)
  • Accuracy: >95% recall for top-10 results
  • Trade-off: Slight accuracy loss for major speed gain

Performance Optimization

Batch Size Configuration:

# Original (caused failures)
collection.add(
    embeddings=all_embeddings,  # 7,396 vectors in a single call
    documents=all_texts,
    metadatas=all_metadata,
    ids=all_ids
)
# Error: ChromaDB max batch size = 5,461

# Fixed (batched)
BATCH_SIZE = 1000
for i in range(0, len(all_embeddings), BATCH_SIZE):
    collection.add(
        embeddings=all_embeddings[i:i + BATCH_SIZE],
        documents=all_texts[i:i + BATCH_SIZE],
        metadatas=all_metadata[i:i + BATCH_SIZE],
        ids=all_ids[i:i + BATCH_SIZE]
    )
# Success: 8 batches, no failures

Lesson Learned: Always batch large operations

  • Discovery: Production bug during incremental update testing
  • Root Cause: ChromaDB undocumented batch limit
  • Solution: Implement batching with BATCH_SIZE = 1000
  • Result: Reliable updates for any collection size

Update Mechanisms

Full Reindex

Purpose: Complete rebuild of vector database

Process:

  1. Delete existing collection
  2. Scan all .md files in directory tree
  3. Parse, chunk, and embed each file
  4. Batch insert to ChromaDB
  5. Verify chunk count and collection health
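
A minimal sketch of that rebuild loop, reusing the illustrative chunk_markdown helper and BATCH_SIZE pattern from earlier sections (not the production reindex_documents tool; assumes the collection already exists):

from pathlib import Path

def full_reindex(root: str) -> int:
    # 1. Delete and recreate the collection
    client.delete_collection("mrminor_docs")
    collection = client.get_or_create_collection(
        name="mrminor_docs", metadata={"hnsw:space": "cosine"}
    )

    # 2-3. Scan every .md file, then parse, chunk, and embed it
    all_chunks = []
    for md_file in Path(root).rglob("*.md"):
        for chunk in chunk_markdown(md_file.read_text(encoding="utf-8")):
            chunk["metadata"]["file_path"] = str(md_file)
            all_chunks.append(chunk)
    embeddings = model.encode([c["content"] for c in all_chunks])

    # 4. Batch insert (ChromaDB enforces a maximum batch size)
    for i in range(0, len(all_chunks), BATCH_SIZE):
        batch = all_chunks[i:i + BATCH_SIZE]
        collection.add(
            embeddings=embeddings[i:i + BATCH_SIZE].tolist(),
            documents=[c["content"] for c in batch],
            metadatas=[c["metadata"] for c in batch],
            ids=[f"{c['metadata']['file_path']}::{i + j}" for j, c in enumerate(batch)],
        )

    # 5. Verify chunk count
    return collection.count()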

Performance:

  • 124 files, 7,396 chunks
  • Embedding time: 7,396 × 200ms = 1,479s ≈ 25 minutes
  • ChromaDB insert: 8 seconds (batched)
  • Total: ~26 minutes

When to Use:

  • Initial setup (first time)
  • Major document reorganization
  • ChromaDB corruption or errors
  • File count changes dramatically (>10% of collection)

Tool: reindex_documents (MCP tool)


Incremental Update

Purpose: Update only changed files (fast, efficient)

Process:

  1. Receive list of modified file paths
  2. Delete old chunks for those files (query by file_path)
  3. Re-process only modified files
  4. Insert new chunks (batched)
  5. Return updated chunk count

Performance:

  • 1 file (~60 chunks): ~12 seconds (60 × 200ms)
  • 3 files (~180 chunks): ~36 seconds
  • 10 files (~600 chunks): ~120 seconds (2 minutes)

Comparison:

| Files Changed | Full Reindex | Incremental | Speedup |
| --- | --- | --- | --- |
| 1 file | 26 minutes | 12 seconds | 130x |
| 3 files | 26 minutes | 36 seconds | 43x |
| 10 files | 26 minutes | 2 minutes | 13x |
| 50+ files | 26 minutes | ~10 minutes | 2.6x |

When to Use:

  • Daily documentation updates (1-5 files)
  • After editing specific documents
  • Continuous workflow (update as you edit)
  • Any change affecting <20% of collection

Tool: update_files (MCP tool)

Implementation Detail:

from typing import List

def update_files(file_paths: List[str]):
    for file_path in file_paths:
        # Delete old chunks for this file (looked up by file_path metadata)
        old_ids = get_chunk_ids_for_file(file_path)
        collection.delete(ids=old_ids)

        # Re-chunk and embed the modified file
        new_chunks = process_file(file_path)  # chunk dicts: {"content", "metadata"}
        new_embeddings = model.encode([c["content"] for c in new_chunks])

        # Batch insert new chunks (respecting ChromaDB's batch limit)
        for i in range(0, len(new_chunks), BATCH_SIZE):
            batch = new_chunks[i:i + BATCH_SIZE]
            collection.add(
                embeddings=new_embeddings[i:i + BATCH_SIZE].tolist(),
                documents=[c["content"] for c in batch],
                metadatas=[c["metadata"] for c in batch],
                ids=[f"{file_path}::{i + j}" for j in range(len(batch))],
            )

Performance Characteristics

Latency Breakdown

Search Query (End-to-End):

MCP Request received          0ms
├─ Query embedding           50ms  (sentence-transformers)
├─ ChromaDB search          120ms  (HNSW approximate search)
├─ Result formatting         10ms  (JSON + metadata extraction)
└─ MCP Response sent        180ms  TOTAL

Indexing (Per Chunk):

Chunk processing              5ms   (parsing, metadata)
├─ Text embedding           200ms   (sentence-transformers)
└─ ChromaDB insert            1ms   (batched, amortized)
TOTAL per chunk            ~206ms

Scaling Characteristics:

| Collection Size | Search Time | Index Time | Storage |
| --- | --- | --- | --- |
| 1,000 chunks | 150ms | 3.5 min | 2MB |
| 5,000 chunks | 170ms | 17 min | 10MB |
| 10,000 chunks | 200ms | 35 min | 20MB |
| 50,000 chunks | 250ms | 175 min | 100MB |

Observations:

  • Search time: O(log N) - scales well
  • Index time: O(N) - linear with document count
  • Storage: O(N) - ~2KB per chunk

Design Trade-offs

1. Quality vs Speed (Embedding Model)

Decision: all-MiniLM-L6-v2 (fast) over all-mpnet-base-v2 (accurate)

Trade-off Analysis:

  • Quality delta: 5% (85% → 90%)
  • Speed delta: 2.2x slower (200ms → 450ms)
  • Impact on use case: Business documents (not research papers)
  • Conclusion: 85% quality sufficient, speed more valuable

Would Reconsider If:

  • Using for academic/research documents (quality critical)
  • Search latency >1 second acceptable
  • Collection size <1,000 chunks (speed less important)

2. Accuracy vs Storage (HNSW Index)

Decision: HNSW approximate search over exact brute-force

Trade-off Analysis:

  • Accuracy: 95% recall (vs 100% brute force)
  • Speed: 100-200ms (vs 5-10 seconds brute force)
  • Storage: +20% overhead (index structure)
  • Impact: Missing 5% of "good" results acceptable
  • Conclusion: 50x speed gain worth 5% accuracy loss

Would Reconsider If:

  • Collection <500 chunks (brute force fast enough)
  • Perfect recall required (legal discovery, compliance)
  • Latency requirements >500ms

3. Context vs Storage (Chunk Overlap)

Decision: 20% overlap (200 of 1,000 chars)

Trade-off Analysis:

  • Storage increase: +20% (14.5MB → 17.4MB)
  • Context preservation: Significant improvement
  • Redundancy: Acceptable for better results
  • Impact: Better search results at marginal storage cost
  • Conclusion: 3MB extra storage justified by quality

Would Reconsider If:

  • Storage extremely limited (<100MB total)
  • Documents have perfect section boundaries (no mid-thought splits)
  • Query patterns favor exact-match over semantic

4. Privacy vs Quality (Local vs Cloud)

Decision: Local embeddings (sentence-transformers) over OpenAI API

Trade-off Analysis:

  • Quality delta: 7% (85% → 92%)
  • Cost delta: $0 → ~$50/month (7,396 chunks, periodic reindex)
  • Privacy: Full control vs data uploaded
  • Latency: Consistent vs network-dependent
  • Impact: Business documents contain sensitive information
  • Conclusion: Privacy and cost savings outweigh quality delta

Would Reconsider If:

  • Public documents only (no privacy concerns)
  • Budget allows API costs
  • 92% quality critical for use case

5. Flexibility vs Simplicity (File Format Support)

Decision: Markdown-only over multi-format (PDF, DOCX, HTML)

Trade-off Analysis:

  • Simplicity: Single parser vs multiple
  • Reliability: High vs varied (format quirks)
  • Maintenance: Low vs high (format changes)
  • Coverage: ~95% of use case (most docs are markdown)
  • Impact: 5% of documents require manual conversion
  • Conclusion: Focus on core use case, accept manual conversion

Would Reconsider If:

  • PDF/DOCX sources >30% of collection
  • Automated pipeline requires multi-format
  • Parsing libraries mature and stable

6. Real-time vs Manual (Index Updates)

Decision: Manual reindex trigger over file-watching

Trade-off Analysis:

  • Complexity: Low (explicit trigger) vs high (file watcher, race conditions)
  • Latency: Seconds (on-demand) vs instant (automatic)
  • Resource usage: On-demand vs continuous monitoring
  • Impact: 30-second delay acceptable for use case
  • Conclusion: Manual trigger simpler and sufficient

Would Reconsider If:

  • Real-time collaboration (multi-user editing)
  • High-frequency updates (>10/hour)
  • Zero-latency requirement

Summary

Key Architectural Decisions

  1. 4-Layer Separation: Clean boundaries enable independent optimization
  2. Local-First: Zero cost, full privacy, offline capability
  3. Section-Based Chunking: Preserves document structure for coherent results
  4. Optimized Model: all-MiniLM-L6-v2 balances speed, quality, size
  5. HNSW Indexing: 50x speed improvement at 5% accuracy cost
  6. Incremental Updates: 13-130x faster than full reindex for common use case

Performance Envelope

| Metric | Current | Tested | Limit |
| --- | --- | --- | --- |
| Documents | 124 | 200 | ~1,000 |
| Chunks | 7,396 | 10,000 | ~50,000 |
| Search Time | 180ms | 250ms | <500ms |
| Storage | 14.5MB | 20MB | ~100MB |
| Index Time | 26 min | 35 min | ~3 hours |

Future Optimization Opportunities

  1. GPU Acceleration: 10-50x faster embedding (if GPU available)
  2. Quantization: 4x storage reduction (int8 vs float32) at 1% quality loss
  3. Hybrid Search: Combine semantic + keyword for better recall
  4. Query Caching: Memoize common queries (dashboards, reports)
  5. Metadata Filtering: Pre-filter by date, category before semantic search (see the sketch after this list)
  6. Batch Queries: Process multiple queries in single embedding pass
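
For metadata filtering (item 5), ChromaDB already accepts a where filter on query(); a hedged sketch, assuming a category field were added to chunk metadata at indexing time (it is not part of the current chunk schema):

results = collection.query(
    query_embeddings=[model.encode("revenue tracking", normalize_embeddings=True).tolist()],
    n_results=5,
    # "category" is a hypothetical metadata key; the current schema stores
    # file_path, section_title, header_path, and chunk indices instead
    where={"category": "financial"},
)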

Document Version: 1.0
Author: Jordan Minor
Last Updated: November 2025
Production Status: Deployed at MRMINOR LLC
Performance: 70-90% token efficiency improvement, <500ms search latency