Version: 1.0
Author: Jordan Minor
Last Updated: November 2025
- System Overview
- 4-Layer Architecture
- Chunking Strategy
- Embedding Model Selection
- Vector Database Configuration
- Update Mechanisms
- Performance Characteristics
- Design Trade-offs
Transform a markdown document collection into a semantically searchable knowledge base for AI-assistant context management.
- Zero-cost operation (no API calls)
- Sub-second search performance (<500ms)
- Minimal context overhead (search = 0 tokens)
- Maintainable updates (incremental, not full reindex)
- Privacy-first (all local, no data upload)
In Scope:
- Markdown (.md) files only
- Section-based chunking (header hierarchy)
- Semantic similarity search
- Local inference (no cloud dependencies)
- Incremental updates (modified files only)
Out of Scope:
- Real-time file watching (manual reindex trigger)
- Non-markdown formats (PDF, DOCX, HTML)
- Cross-document relationship mapping
- Multi-language support (English optimized)
- Distributed deployment (single-machine only)
Separation of concerns: Raw data → Processing → Intelligence → Interface
Each layer has single responsibility, enabling independent optimization and testing.
Purpose: Source of truth for business documentation
Structure:
```
G:\My Drive\MRMINOR\
├── 01-Strategy/          # Business planning (9 docs)
├── 04-Technical/         # Technical specs (17 docs)
├── 08-Data/              # Operational protocols (17 docs)
├── 10-Financial/         # Financial tracking (10 docs)
├── 11-Legal-Compliance/  # Legal docs (6 docs)
└── [other folders...]    # 77 total documents
```
Total: 124 markdown files
Characteristics:
- Standard markdown syntax (headers, lists, code blocks)
- Hierarchical headers (# ## ### ####)
- Mix of structured data (tables) and prose
- File sizes: 2KB - 50KB (median: 15KB)
Design Decision: Keep source files unchanged (read-only from system perspective)
- Rationale: Maintain single source of truth
- Benefit: No risk of data corruption
- Trade-off: Requires reindex on file changes
Purpose: Break documents into searchable units while preserving context
Processing Pipeline:
- File Discovery - Recursive scan of directory tree
- Markdown Parsing - Extract headers and content sections
- Chunking - Split into 1,000-character blocks with 200-char overlap
- Metadata Extraction - Capture file path, section headers, hierarchy
- Validation - Ensure no empty chunks, verify metadata completeness
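As an illustration of stage 1, a minimal file-discovery step might look like the sketch below (the root path mirrors the layer-1 directory; the function name is hypothetical, not the production code):

```python
from pathlib import Path

DOCS_ROOT = Path(r"G:\My Drive\MRMINOR")  # layer-1 source tree (treated as read-only)

def discover_markdown_files(root: Path) -> list[Path]:
    """Stage 1: recursively collect every .md file under the docs root."""
    return sorted(p for p in root.rglob("*.md") if p.is_file())

files = discover_markdown_files(DOCS_ROOT)
print(f"Found {len(files)} markdown files")  # expected: 124 for the current collection
```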
Chunk Structure:
```json
{
  "content": "Actual text content (1000 chars max)",
  "metadata": {
    "file_path": "G:\\My Drive\\MRMINOR\\01-Strategy\\...",
    "section_title": "Revenue Tracking Protocol",
    "header_path": "Strategy > Financial > Revenue",
    "chunk_index": 3,
    "total_chunks": 12
  }
}
```

Chunking Parameters:
- CHUNK_SIZE: 1,000 characters
- CHUNK_OVERLAP: 200 characters (20%)
- MIN_CHUNK_SIZE: 100 characters (discard smaller)
- MAX_CHUNK_SIZE: 1,500 characters (split larger)
Design Decision: Section-based chunking with overlap
- Rationale: Preserves logical document structure
- Benefit: Search results include meaningful context
- Trade-off: Slightly more storage vs character-only chunking
Example:
Document: 5,000 characters
Sections: 4 (by ## headers)
Output: 6 chunks
- Section 1: 1 chunk (800 chars)
- Section 2: 2 chunks (1,200 chars → split)
- Section 3: 2 chunks (1,800 chars → split)
- Section 4: 1 chunk (600 chars)
Purpose: Convert text chunks into mathematical representations for similarity comparison
Embedding Model: all-MiniLM-L6-v2
- Architecture: Sentence Transformer (BERT-based)
- Dimensions: 384 (dense vectors)
- Size: 80MB (local download)
- Speed: ~200ms per chunk (CPU inference)
- Quality: 85% accuracy on semantic similarity tasks
Embedding Process:
- Load pre-trained model (one-time, cached locally)
- Tokenize chunk text (WordPiece tokenization)
- Forward pass through transformer (6 layers)
- Mean pooling of token embeddings
- L2 normalization (unit vector)
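For reference, the same five steps can be reproduced with the underlying Hugging Face `transformers` API (a hedged sketch for illustration; the production system uses `sentence-transformers`, which performs the pooling and normalization internally):

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)   # WordPiece tokenizer
model = AutoModel.from_pretrained(MODEL_ID)           # 6-layer transformer

def embed(texts: list[str]) -> torch.Tensor:
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state             # (batch, tokens, 384)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling over real tokens
    return torch.nn.functional.normalize(pooled, p=2, dim=1)  # L2 normalization (unit vectors)
```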
Vector Properties:
```
Input:  "Revenue tracking requires monthly reconciliation"
Output: [0.023, -0.156, 0.089, ..., 0.234]   # 384 dimensions
Norm:   1.0 (normalized for cosine similarity)
```

Storage in ChromaDB:
- Vector: 384 × 4 bytes = 1.5KB per chunk
- Metadata: ~500 bytes per chunk
- Total: ~2KB per chunk
- 7,396 chunks = ~14.5MB storage
Design Decision: Local inference vs API
- Chosen: Local (sentence-transformers)
- Alternative: OpenAI embeddings API
- Rationale: Zero cost, privacy, offline capability
- Trade-off: Slightly lower quality (85% vs 90%) acceptable for use case
Purpose: Enable AI assistant to query knowledge base via Model Context Protocol (MCP)
MCP Server Implementation:
```python
# Server structure (simplified)
@server.call_tool()
async def search_markdown(query: str, num_results: int = 5):
    # 1. Embed query (same model as chunks)
    query_embedding = model.encode(query)
    # 2. Query ChromaDB (cosine similarity)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=num_results
    )
    # 3. Format results with metadata
    return format_results(results)
```

Search Flow:
- Client sends MCP request: search_markdown("revenue tracking")
- Server embeds query → 384-dim vector
- ChromaDB performs cosine similarity search
- Top N results ranked by similarity score
- Results formatted with file path + section context
- Response sent via MCP (stdio transport)
Result Format:
```json
{
  "file": "G:\\My Drive\\MRMINOR\\10-Financial\\revenue-tracking.md",
  "section": "Monthly Reconciliation Process",
  "header_path": "Financial > Revenue > Reconciliation",
  "content": "Revenue tracking requires monthly reconciliation...",
  "similarity": 0.847
}
```

Performance Characteristics:
- Query embedding: ~50ms
- ChromaDB search: ~100-200ms
- Result formatting: ~10ms
- Total latency: 160-260ms (< 500ms target)
Design Decision: MCP vs REST API
- Chosen: MCP (Model Context Protocol)
- Alternative: REST API + HTTP server
- Rationale: Native Claude Desktop integration, stdio transport (no ports)
- Trade-off: MCP ecosystem required vs universal HTTP
Philosophy: Respect document structure for coherent search results
Algorithm:
- Parse markdown headers (# ## ### ####)
- Identify section boundaries
- Extract section content
- Apply character-based chunking if section > CHUNK_SIZE
- Add overlap between adjacent chunks
- Preserve section title in metadata
Example Document:

```markdown
# Financial Management

## Revenue Tracking
Monthly reconciliation process requires...
[800 characters]

## Expense Categories
Operating expenses are categorized...
[1,500 characters → splits into 2 chunks]

### Travel Expenses
Travel costs include...
[400 characters]
```

Chunking Output:
- Chunk 1: "Revenue Tracking" section (800 chars, no split)
- Chunk 2: "Expense Categories" section part 1 (1,000 chars)
- Chunk 3: "Expense Categories" section part 2 (700 chars, 200 overlap)
- Chunk 4: "Travel Expenses" subsection (400 chars)
Overlap Strategy:
Purpose: Prevent information loss at chunk boundaries
Implementation:
- Last 200 characters of Chunk N → First 200 characters of Chunk N+1
- Ensures sentences aren't cut mid-thought
- Maintains context continuity for embedding quality
Example:
Chunk 1: "...systems must validate input. Security protocols require..."
[overlap region: "Security protocols require..."]
Chunk 2: "Security protocols require multi-factor authentication..."
Why 20% Overlap?
- 10% overlap: Tested, insufficient for sentence completion
- 20% overlap: Optimal (captures sentence context)
- 30% overlap: Tested, marginal benefit vs storage cost
- Conclusion: 20% balances context preservation and efficiency
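A minimal sketch of the splitting rule described above, assuming the section text has already been extracted (constants mirror the chunking parameters; the production splitter may differ in detail):

```python
CHUNK_SIZE = 1_000
CHUNK_OVERLAP = 200
MIN_CHUNK_SIZE = 100

def split_section(text: str) -> list[str]:
    """Split one section into overlapping character chunks."""
    if len(text) <= CHUNK_SIZE:
        return [text] if len(text) >= MIN_CHUNK_SIZE else []
    chunks, start = [], 0
    while True:
        end = min(start + CHUNK_SIZE, len(text))
        chunks.append(text[start:end])
        if end == len(text):            # the last chunk reached the end of the section
            break
        start = end - CHUNK_OVERLAP     # step back 200 chars so context repeats
    return chunks

# Example: a 1,500-char section yields two chunks of 1,000 and 700 chars (200-char overlap)
```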
Alternative Approaches Considered:
- Fixed-length chunking (rejected)
  - Pro: Simple implementation
  - Con: Splits mid-sentence, loses context
  - Example: "The process requires auth[SPLIT]entication and validation"
- Sentence-based chunking (rejected)
  - Pro: Natural boundaries
  - Con: Highly variable chunk sizes (50-2,000 chars)
  - Impact: Inconsistent embedding quality
- Paragraph-based chunking (rejected)
  - Pro: Logical units
  - Con: Markdown doesn't enforce paragraph structure
  - Reality: Many docs use lists, not paragraphs
Requirements:
- Speed: <500ms per query (target: <300ms)
- Quality: >80% semantic similarity accuracy
- Size: <500MB (reasonable local storage)
- Cost: Zero (no API calls)
- Maintenance: Stable, well-supported
| Model | Dimensions | Size | Speed* | Quality** | Selected |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80MB | 200ms | 85% | ✅ YES |
| all-mpnet-base-v2 | 768 | 420MB | 450ms | 90% | ❌ NO |
| paraphrase-MiniLM-L3-v2 | 384 | 61MB | 150ms | 75% | ❌ NO |
| sentence-t5-base | 768 | 220MB | 600ms | 87% | ❌ NO |
| OpenAI text-embedding-3-small | 1536 | API | 300ms | 92% | ❌ NO |
* Speed measured on Intel i7-10th gen CPU
** Quality from MTEB benchmark (semantic similarity tasks)
Chosen because:
- Speed: 200ms meets <300ms target
- Quality: 85% exceeds 80% threshold
- Size: 80MB easily fits local storage
- Balance: Best speed/quality/size trade-off
- Proven: 50M+ downloads, actively maintained
Why not all-mpnet-base-v2?
- 5% quality improvement (90% vs 85%)
- 2.2x slower (450ms vs 200ms)
- 5.2x larger (420MB vs 80MB)
- Conclusion: Marginal quality gain not worth speed/size cost
Why not OpenAI API?
- Highest quality (92%)
- Fast API response (300ms)
- Blockers: Monthly cost ($$$), requires internet, privacy concerns
- Use case doesn't justify API dependency
all-MiniLM-L6-v2 Technical Details:
- Base: Distilled from BERT
- Layers: 6 transformer layers
- Hidden Size: 384 dimensions
- Attention Heads: 12
- Parameters: 22M (lightweight)
- Training: Sentence similarity datasets (NLI, STS, etc.)
- Tokenizer: WordPiece (30K vocab)
Why Sentence-Transformers Framework?
- Designed specifically for semantic similarity
- Optimized mean pooling of token embeddings
- Cosine similarity built-in
- Widely adopted (vs raw BERT models)
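With the framework, the whole embedding pipeline collapses to a few lines (a sketch; `normalize_embeddings=True` reproduces the L2 normalization described earlier):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# Tokenization, forward pass, mean pooling and normalization all happen inside encode()
embeddings = model.encode(
    ["Revenue tracking requires monthly reconciliation"],
    normalize_embeddings=True,
)
print(embeddings.shape)  # (1, 384)
```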
Why ChromaDB?
| Criteria | ChromaDB | Pinecone | Weaviate | Qdrant |
|---|---|---|---|---|
| Deployment | Local | Cloud | Self-host/Cloud | Self-host |
| Cost | $0 | $70+/month | $0-$25/month | $0 |
| Setup | pip install | Account + API | Docker | Docker |
| Latency | 100-200ms | 100-500ms | 100-300ms | 100-200ms |
| Privacy | Fully local | Data uploaded | Local option | Local option |
| Python Integration | Native | SDK | SDK | SDK |
| Maturity | Growing | Mature | Mature | Growing |
Decision: ChromaDB
- Rationale: Zero cost, fully local, simple setup, good-enough performance
- Trade-off: Less mature than Pinecone, but meets all requirements
Collection Settings:
```python
collection = client.get_or_create_collection(
    name="mrminor_docs",
    metadata={"hnsw:space": "cosine"},  # Cosine similarity
    embedding_function=None  # We provide embeddings
)
```

Distance Metric: Cosine Similarity
- Formula: cosine(A, B) = (A · B) / (||A|| × ||B||)
- Range: -1 to 1 (1 = identical, 0 = orthogonal, -1 = opposite)
- Our vectors: L2 normalized (||A|| = 1), so cosine = dot product
Why Cosine vs Euclidean?
- Cosine: Measures angle between vectors (semantic similarity)
- Euclidean: Measures absolute distance (magnitude matters)
- Example:
- "big dog" vs "large dog" → High cosine (same direction)
- "big dog" vs "big dog big dog" → Low Euclidean (different magnitudes)
- Conclusion: Cosine is correct for semantic search
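Because the stored vectors are unit-normalized, the full formula reduces to a plain dot product; a small numpy check makes the equivalence concrete (illustrative only):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a, b = rng.random(384), rng.random(384)
a_unit = a / np.linalg.norm(a)   # L2 normalization, as done at embedding time
b_unit = b / np.linalg.norm(b)

# For unit vectors, cosine similarity and the dot product agree
assert abs(cosine(a_unit, b_unit) - float(np.dot(a_unit, b_unit))) < 1e-9
```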
Storage Backend:
- Default: DuckDB (embedded SQL database)
- Location: ./chroma_db/ directory
- Persistence: Automatic on collection.add()
- Size: ~2KB per chunk × 7,396 chunks = ~14.5MB
Indexing Algorithm: HNSW (Hierarchical Navigable Small World)
- Purpose: Fast approximate nearest neighbor search
- Complexity: O(log N) search time (vs O(N) brute force)
- Accuracy: >95% recall for top-10 results
- Trade-off: Slight accuracy loss for major speed gain
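For contrast, the exact O(N) baseline that HNSW replaces is just a scored pass over every stored vector (a sketch assuming the unit-normalized embeddings sit in a single numpy matrix, which ChromaDB does not expose directly):

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact nearest neighbors: score every chunk, then sort. O(N) per query."""
    scores = vectors @ query            # cosine similarity, since all vectors are unit-length
    return np.argsort(-scores)[:k]      # indices of the k most similar chunks
```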
Batch Size Configuration:
```python
# Original (caused failures)
collection.add(
    embeddings=all_embeddings,  # 7,396 vectors
    documents=all_texts,
    metadatas=all_metadata,
    ids=all_ids
)
# Error: ChromaDB max batch size = 5,461

# Fixed (batched)
BATCH_SIZE = 1000
for i in range(0, len(embeddings), BATCH_SIZE):
    batch = embeddings[i:i+BATCH_SIZE]
    collection.add(...)
# Success: 8 batches, no failures
```

Lesson Learned: Always batch large operations
- Discovery: Production bug during incremental update testing
- Root Cause: ChromaDB undocumented batch limit
- Solution: Implement batching with BATCH_SIZE = 1000
- Result: Reliable updates for any collection size
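The batching pattern generalizes to a small helper (a sketch; Python 3.12's `itertools.batched` offers an equivalent for arbitrary iterables). The `batched` helper referenced in the incremental-update code later in this document can be implemented this way:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")
BATCH_SIZE = 1000  # comfortably under ChromaDB's observed limit of 5,461

def batched(items: Sequence[T], batch_size: int = BATCH_SIZE) -> Iterator[Sequence[T]]:
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```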
Purpose: Complete rebuild of vector database
Process:
- Delete existing collection
- Scan all .md files in directory tree
- Parse, chunk, and embed each file
- Batch insert to ChromaDB
- Verify chunk count and collection health
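A hedged outline of the rebuild, wiring the five steps above together (helper names such as `discover_markdown_files` and `chunk_file` stand in for the earlier sketches and are not the production functions):

```python
def full_reindex(client, model, root) -> int:
    """Drop and rebuild the entire vector collection from the markdown tree."""
    try:
        client.delete_collection("mrminor_docs")     # step 1: remove the old index
    except Exception:
        pass                                         # first run: nothing to delete
    collection = client.get_or_create_collection(
        name="mrminor_docs", metadata={"hnsw:space": "cosine"}
    )
    for path in discover_markdown_files(root):       # step 2: scan all .md files
        chunks = chunk_file(path)                    # step 3: parse, chunk, extract metadata
        if not chunks:
            continue
        texts = [c["content"] for c in chunks]
        embeddings = model.encode(texts, normalize_embeddings=True)
        collection.add(                              # step 4: per-file adds stay under the batch limit
            ids=[f"{path}::{i}" for i in range(len(chunks))],
            embeddings=embeddings.tolist(),
            documents=texts,
            metadatas=[c["metadata"] for c in chunks],
        )
    return collection.count()                        # step 5: verify chunk count
```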
Performance:
- 124 files, 7,396 chunks
- Embedding time: 7,396 × 200ms = 1,479s ≈ 25 minutes
- ChromaDB insert: 8 seconds (batched)
- Total: ~26 minutes
When to Use:
- Initial setup (first time)
- Major document reorganization
- ChromaDB corruption or errors
- File count changes dramatically (>10% of collection)
Tool: reindex_documents (MCP tool)
Purpose: Update only changed files (fast, efficient)
Process:
- Receive list of modified file paths
- Delete old chunks for those files (query by file_path)
- Re-process only modified files
- Insert new chunks (batched)
- Return updated chunk count
Performance:
- 1 file (~60 chunks): ~12 seconds (60 × 200ms)
- 3 files (~180 chunks): ~36 seconds
- 10 files (~600 chunks): ~120 seconds (2 minutes)
Comparison:
| Files Changed | Full Reindex | Incremental | Speedup |
|---|---|---|---|
| 1 file | 26 minutes | 12 seconds | 130x |
| 3 files | 26 minutes | 36 seconds | 43x |
| 10 files | 26 minutes | 2 minutes | 13x |
| 50+ files | 26 minutes | ~10 minutes | 2.6x |
When to Use:
- Daily documentation updates (1-5 files)
- After editing specific documents
- Continuous workflow (update as you edit)
- Any change affecting <20% of collection
Tool: update_files (MCP tool)
Implementation Detail:
```python
from typing import List

def update_files(file_paths: List[str]):
    for file_path in file_paths:
        # Delete old chunks
        old_ids = get_chunk_ids_for_file(file_path)
        collection.delete(ids=old_ids)
        # Re-chunk and embed
        new_chunks = process_file(file_path)
        new_embeddings = model.encode(new_chunks)
        # Batch insert new chunks
        for batch in batched(new_embeddings, BATCH_SIZE):
            collection.add(...)
```

Search Query (End-to-End):
MCP Request received 0ms
├─ Query embedding 50ms (sentence-transformers)
├─ ChromaDB search 120ms (HNSW approximate search)
├─ Result formatting 10ms (JSON + metadata extraction)
└─ MCP Response sent 180ms TOTAL
Indexing (Per Chunk):
Chunk processing 5ms (parsing, metadata)
├─ Text embedding 200ms (sentence-transformers)
└─ ChromaDB insert 1ms (batched, amortized)
TOTAL per chunk ~206ms
Scaling Characteristics:
| Collection Size | Search Time | Index Time | Storage |
|---|---|---|---|
| 1,000 chunks | 150ms | 3.5 min | 2MB |
| 5,000 chunks | 170ms | 17 min | 10MB |
| 10,000 chunks | 200ms | 35 min | 20MB |
| 50,000 chunks | 250ms | 175 min | 100MB |
Observations:
- Search time: O(log N) - scales well
- Index time: O(N) - linear with document count
- Storage: O(N) - ~2KB per chunk
Decision: all-MiniLM-L6-v2 (fast) over all-mpnet-base-v2 (accurate)
Trade-off Analysis:
- Quality delta: 5% (85% → 90%)
- Speed delta: 2.2x slower (200ms → 450ms)
- Impact on use case: Business documents (not research papers)
- Conclusion: 85% quality sufficient, speed more valuable
Would Reconsider If:
- Using for academic/research documents (quality critical)
- Search latency >1 second acceptable
- Collection size <1,000 chunks (speed less important)
Decision: HNSW approximate search over exact brute-force
Trade-off Analysis:
- Accuracy: 95% recall (vs 100% brute force)
- Speed: 100-200ms (vs 5-10 seconds brute force)
- Storage: +20% overhead (index structure)
- Impact: Missing 5% of "good" results acceptable
- Conclusion: 50x speed gain worth 5% accuracy loss
Would Reconsider If:
- Collection <500 chunks (brute force fast enough)
- Perfect recall required (legal discovery, compliance)
- Latency requirements >500ms
Decision: 20% overlap (200 of 1,000 chars)
Trade-off Analysis:
- Storage increase: +20% (14.5MB → 17.4MB)
- Context preservation: Significant improvement
- Redundancy: Acceptable for better results
- Impact: Better search results at marginal storage cost
- Conclusion: 3MB extra storage justified by quality
Would Reconsider If:
- Storage extremely limited (<100MB total)
- Documents have perfect section boundaries (no mid-thought splits)
- Query patterns favor exact-match over semantic
Decision: Local embeddings (sentence-transformers) over OpenAI API
Trade-off Analysis:
- Quality delta: 7% (85% → 92%)
- Cost delta: $0 → ~$50/month (7,396 chunks, periodic reindex)
- Privacy: Full control vs data uploaded
- Latency: Consistent vs network-dependent
- Impact: Business documents contain sensitive information
- Conclusion: Privacy and cost savings outweigh quality delta
Would Reconsider If:
- Public documents only (no privacy concerns)
- Budget allows API costs
- 92% quality critical for use case
Decision: Markdown-only over multi-format (PDF, DOCX, HTML)
Trade-off Analysis:
- Simplicity: Single parser vs multiple
- Reliability: High vs varied (format quirks)
- Maintenance: Low vs high (format changes)
- Coverage: ~95% of use case (most docs are markdown)
- Impact: 5% of documents require manual conversion
- Conclusion: Focus on core use case, accept manual conversion
Would Reconsider If:
- PDF/DOCX sources >30% of collection
- Automated pipeline requires multi-format
- Parsing libraries mature and stable
Decision: Manual reindex trigger over file-watching
Trade-off Analysis:
- Complexity: Low (explicit trigger) vs high (file watcher, race conditions)
- Latency: Seconds (on-demand) vs instant (automatic)
- Resource usage: On-demand vs continuous monitoring
- Impact: 30-second delay acceptable for use case
- Conclusion: Manual trigger simpler and sufficient
Would Reconsider If:
- Real-time collaboration (multi-user editing)
- High-frequency updates (>10/hour)
- Zero-latency requirement
- 4-Layer Separation: Clean boundaries enable independent optimization
- Local-First: Zero cost, full privacy, offline capability
- Section-Based Chunking: Preserves document structure for coherent results
- Optimized Model: all-MiniLM-L6-v2 balances speed, quality, size
- HNSW Indexing: 50x speed improvement at 5% accuracy cost
- Incremental Updates: 13-130x faster than full reindex for common use case
| Metric | Current | Tested | Limit |
|---|---|---|---|
| Documents | 124 | 200 | ~1,000 |
| Chunks | 7,396 | 10,000 | ~50,000 |
| Search Time | 180ms | 250ms | <500ms |
| Storage | 14.5MB | 20MB | ~100MB |
| Index Time | 26 min | 35 min | ~3 hours |
- GPU Acceleration: 10-50x faster embedding (if GPU available)
- Quantization: 4x storage reduction (int8 vs float32) at 1% quality loss
- Hybrid Search: Combine semantic + keyword for better recall
- Query Caching: Memoize common queries (dashboards, reports)
- Metadata Filtering: Pre-filter by date, category before semantic search
- Batch Queries: Process multiple queries in single embedding pass
Production Status: Deployed at MRMINOR LLC
Performance: 70-90% token efficiency improvement, <500ms search latency