Implement Hierarchical Vector Search for Insights System #152

@jerrod-storm

Description

Overview

Currently, the insights system uses a flat vector search approach where every insight's full content (topic + name + overview + details) is embedded into a single vector. This approach has performance and relevance limitations when searching large insight collections.

Proposed Solution: Hierarchical Vector Search

Implement a three-level hierarchical vector index that mirrors the natural structure of the insights system:

Topic (embedding of topic name + summary)
  ├── Insight 1 (embedding of name + overview)  
  │   └── Details (embedding of full details)
  ├── Insight 2 (embedding of name + overview)
  │   └── Details (embedding of full details)
  └── ...

Benefits

  1. Performance: an estimated 5-10x faster searches through progressive filtering of the candidate set
  2. Relevance: Better keyword matching with topic/title embeddings
  3. Scalability: Efficient elimination of irrelevant content categories
  4. Flexibility: Different embedding strategies for different content types

Implementation Approach

New Data Models

// New: Topic-level vectors
pub struct TopicVector {
  pub topic_id: String,           // "backend-api"
  pub topic_name: String,         // "Backend API"
  pub topic_embedding: Vec<f32>,  // Embedding of topic name + topic summary
  pub insight_count: i32,         // How many insights in this topic
  pub created_at: String,
  pub updated_at: String,
}

// Enhanced: Multi-level insight vectors
pub struct HierarchicalInsightRecord {
  pub insight_id: String,         // "backend-api:authentication"  
  pub topic_id: String,           // "backend-api"
  pub name: String,
  pub overview: String,
  pub details: String,
  
  // Multiple embeddings for different search strategies
  pub title_embedding: Vec<f32>,    // Name + overview (keyword-optimized)
  pub content_embedding: Vec<f32>,  // Full details (semantic-optimized)
  
  pub created_at: String,
  pub updated_at: String,
}
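The composite `insight_id` ("topic_id:insight-name") ties the two models together. As a sketch of how that ID might be derived and split (the `make_insight_id` and `split_insight_id` helpers are illustrative assumptions, not existing code):

```rust
// Hypothetical helpers: build the composite insight ID used in the
// models above, and split it back into its topic and name parts.
fn make_insight_id(topic_id: &str, name: &str) -> String {
    // Lowercase and hyphenate the name so IDs stay URL-safe.
    let slug: String = name
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { '-' })
        .collect();
    format!("{topic_id}:{slug}")
}

fn split_insight_id(insight_id: &str) -> Option<(&str, &str)> {
    insight_id.split_once(':')
}

fn main() {
    let id = make_insight_id("backend-api", "Authentication");
    assert_eq!(id, "backend-api:authentication");
    assert_eq!(
        split_insight_id(&id),
        Some(("backend-api", "authentication"))
    );
    println!("{id}");
}
```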

Three-Level Search Strategy

pub async fn hierarchical_search(
  &self,
  query: &str,
  limit: usize,
) -> Result<Vec<EmbeddingSearchResult>> {
  let query_embedding = create_embedding(query).await?;
  
  // Level 1: Find relevant topics (fast, broad filter)
  let relevant_topics = search_topic_vectors(&query_embedding, limit * 2).await?;
  
  // Level 2: Find relevant insights within those topics (focused)
  let relevant_insights = search_insight_vectors(
    &query_embedding, 
    &relevant_topics.iter().map(|t| &t.topic_id).collect::<Vec<_>>(),
    limit
  ).await?;
  
  // Level 3: Search detailed content only for promising insights
  search_content_vectors(&query_embedding, &relevant_insights, limit).await
}
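To make the progressive-filtering idea concrete, here is a minimal synchronous sketch with toy 3-dimensional embeddings: Level 1 drops topics below a similarity threshold, Level 2 only ranks insights inside the surviving topics. The `cosine_sim` function, the threshold, and the hard-coded data are all illustrative assumptions:

```rust
// Toy cosine similarity; real embeddings would come from the model.
fn cosine_sim(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // (topic_id, topic_embedding)
    let topics = vec![
        ("backend-api", vec![1.0, 0.0, 0.0]),
        ("frontend-ui", vec![0.0, 1.0, 0.0]),
    ];
    // (insight_id, topic_id, title_embedding)
    let insights = vec![
        ("backend-api:auth", "backend-api", vec![0.9, 0.1, 0.0]),
        ("backend-api:caching", "backend-api", vec![0.7, 0.0, 0.3]),
        ("frontend-ui:theming", "frontend-ui", vec![0.0, 1.0, 0.0]),
    ];
    let query = vec![1.0, 0.1, 0.0];

    // Level 1: keep only topics above a similarity threshold.
    let hot_topics: Vec<&str> = topics
        .iter()
        .filter(|(_, e)| cosine_sim(&query, e) > 0.5)
        .map(|(id, _)| *id)
        .collect();

    // Level 2: rank insights, but only within the surviving topics.
    let mut hits: Vec<(&str, f32)> = insights
        .iter()
        .filter(|(_, t, _)| hot_topics.contains(t))
        .map(|(id, _, e)| (*id, cosine_sim(&query, e)))
        .collect();
    hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    assert_eq!(hits[0].0, "backend-api:auth");
    println!("{hits:?}");
}
```

The "frontend-ui" topic is eliminated at Level 1, so its insight embeddings are never compared against the query — that early pruning is where the speedup comes from.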

Migration Checklist

Phase 1: Infrastructure Setup

  • Create new TopicVector model and LanceDB table
  • Create new HierarchicalInsightRecord model
  • Add topic summary generation (automatically derive from existing insights)
  • Implement topic embedding generation
  • Add migration utility to populate topic vectors from existing data
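The topic summary generation step could be as simple as joining the topic's insight overviews and capping the length before embedding. A minimal sketch, where the function name and the length cap are assumptions:

```rust
// Illustrative Phase 1 sketch: derive a topic summary by joining the
// topic's insight overviews and truncating to a maximum length.
fn derive_topic_summary(overviews: &[&str], max_len: usize) -> String {
    let mut summary = overviews.join(" ");
    if summary.len() > max_len {
        // Back up to a char boundary so we never split a UTF-8 sequence.
        let mut cut = max_len;
        while !summary.is_char_boundary(cut) {
            cut -= 1;
        }
        summary.truncate(cut);
    }
    summary
}

fn main() {
    let summary =
        derive_topic_summary(&["REST auth flows.", "Rate limiting rules."], 500);
    assert_eq!(summary, "REST auth flows. Rate limiting rules.");
    assert_eq!(derive_topic_summary(&["abcdef"], 3), "abc");
    println!("{summary}");
}
```

A real implementation might instead summarize with an LLM, but a deterministic derivation like this is cheap to regenerate whenever insights change (see Technical Considerations below).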

Phase 2: Enhanced Insight Vectors

  • Modify InsightRecord to support multiple embeddings
  • Update schema to include both title_embedding and content_embedding
  • Implement separate embedding generation for title vs content
  • Add migration utility to split existing insight embeddings
  • Update insight storage to generate both embedding types
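Separate embedding generation for title vs content mostly comes down to producing two different input strings per insight. A hedged sketch of that split (the helper name and the "name + overview" concatenation format are assumptions):

```rust
// Hypothetical Phase 2 helper: one insight yields two embedding inputs,
// a short keyword-oriented title string and the full details text.
fn embedding_inputs(name: &str, overview: &str, details: &str) -> (String, String) {
    // title_embedding input: name + overview, good for keyword matching
    let title_text = format!("{name}. {overview}");
    // content_embedding input: full details, good for semantic matching
    let content_text = details.to_string();
    (title_text, content_text)
}

fn main() {
    let (title, content) = embedding_inputs(
        "Authentication",
        "JWT-based auth flow",
        "Tokens are issued by the gateway...",
    );
    assert_eq!(title, "Authentication. JWT-based auth flow");
    assert_eq!(content, "Tokens are issued by the gateway...");
    println!("{title}");
}
```

Each string would then be passed through the same embedding model to populate `title_embedding` and `content_embedding` respectively.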

Phase 3: Hierarchical Search Implementation

  • Implement topic-level search functions
  • Implement insight-level search with topic filtering
  • Implement content-level search with insight filtering
  • Create combined hierarchical search function
  • Add performance benchmarking and comparison with existing search
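The benchmarking bullet could start from a tiny timing harness like the one below. The closures are stand-in workloads (a full scan vs. a topic-filtered subset), not the real search functions:

```rust
use std::time::Instant;

// Rough benchmarking harness sketch for comparing the two search paths.
fn time_it<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

fn main() {
    // Stand-ins: a flat scan over all records vs. a scan over the
    // topic-filtered subset (about 10% of the data).
    let all: Vec<u64> = (0..1_000_000).collect();
    let subset: Vec<u64> = (0..100_000).collect();
    let a = time_it("flat", || all.iter().sum::<u64>());
    let b = time_it("hierarchical", || subset.iter().sum::<u64>());
    assert!(a > b); // sanity check on the stand-in workloads only
}
```

For real numbers, a proper benchmark crate (e.g. criterion) with recorded p50/p95 latencies would be more trustworthy than ad-hoc `Instant` timing.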

Phase 4: API Integration

  • Add hierarchical search endpoint alongside existing search
  • Update search handlers to support both search types
  • Add configuration option to choose search strategy
  • Update CLI to support hierarchical search parameters
  • Add search performance metrics and logging
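The configuration option for choosing a search strategy might be modeled as a small enum parsed from config or CLI input. The enum and the string values here are assumptions, not an existing API:

```rust
use std::str::FromStr;

// Hypothetical Phase 4 config value selecting the search strategy.
#[derive(Debug, PartialEq)]
enum SearchStrategy {
    Flat,
    Hierarchical,
}

impl FromStr for SearchStrategy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "flat" => Ok(SearchStrategy::Flat),
            "hierarchical" => Ok(SearchStrategy::Hierarchical),
            other => Err(format!("unknown search strategy: {other}")),
        }
    }
}

fn main() {
    let strategy: SearchStrategy = "hierarchical".parse().unwrap();
    assert_eq!(strategy, SearchStrategy::Hierarchical);
    assert!("fuzzy".parse::<SearchStrategy>().is_err());
    println!("{strategy:?}");
}
```

Keeping both variants selectable at runtime supports the backward-compatibility requirement during migration.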

Phase 5: Optimization & Migration

  • Compare performance between flat and hierarchical search
  • Tune similarity thresholds for each search level
  • Implement adaptive search (fall back to content search if topic/insight searches return few results)
  • Gradually migrate default search to hierarchical approach
  • Add comprehensive test coverage for hierarchical search
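The adaptive-search bullet can be expressed as a thin wrapper: try the hierarchical path first and fall back to flat content search when it returns too few hits. A sketch with closures standing in for the real search functions (the wrapper name and threshold are assumptions):

```rust
// Hypothetical Phase 5 adaptive wrapper around the two search paths.
fn adaptive_search<T>(
    hierarchical: impl Fn() -> Vec<T>,
    flat_fallback: impl Fn() -> Vec<T>,
    min_results: usize,
) -> Vec<T> {
    let hits = hierarchical();
    if hits.len() >= min_results {
        hits
    } else {
        // Topic/insight filtering was too aggressive; widen the net.
        flat_fallback()
    }
}

fn main() {
    // Hierarchical finds only one hit; with min_results = 3 we fall back.
    let out = adaptive_search(|| vec![1], || vec![1, 2, 3, 4], 3);
    assert_eq!(out, vec![1, 2, 3, 4]);

    // Enough hits: the fallback never runs.
    let out = adaptive_search(|| vec![1, 2, 3], || vec![9, 9, 9, 9], 3);
    assert_eq!(out, vec![1, 2, 3]);
    println!("{out:?}");
}
```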

Phase 6: Cleanup (Optional)

  • Deprecate flat search approach (if hierarchical proves superior)
  • Remove single embedding field from InsightRecord
  • Clean up legacy search code
  • Update documentation and examples

Technical Considerations

  1. Backward Compatibility: Keep existing search functionality during migration
  2. Storage Overhead: Multiple embeddings per insight will increase storage ~2-3x
  3. Index Management: Topic summaries need to be regenerated when insights change
  4. Performance Monitoring: Track search latency and relevance metrics during migration
  5. Embedding Consistency: Ensure all embedding types use the same model version
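Consideration 5 (embedding consistency) could be enforced by tagging every stored embedding with the model version that produced it and refusing to mix versions at query time. A sketch, where the struct, field, and function names are assumptions:

```rust
// Hypothetical consistency guard: embeddings from different model
// versions live in incompatible vector spaces and must not be mixed.
struct StoredEmbedding {
    model_version: String,
    vector: Vec<f32>,
}

fn check_consistent(query_version: &str, stored: &[StoredEmbedding]) -> Result<(), String> {
    for e in stored {
        if e.model_version != query_version {
            return Err(format!(
                "embedding model mismatch: query uses {query_version}, stored record uses {}",
                e.model_version
            ));
        }
    }
    Ok(())
}

fn main() {
    let stored = vec![StoredEmbedding {
        model_version: "v2".into(),
        vector: vec![0.1, 0.2],
    }];
    assert!(check_consistent("v2", &stored).is_ok());
    assert!(check_consistent("v3", &stored).is_err());
    println!("ok");
}
```

A mismatch would then surface as a hard error during migration rather than as silently degraded search relevance.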

Success Criteria

  • Search performance improves by 5-10x for large insight collections
  • Search relevance improves, especially for keyword-based queries
  • System remains backward compatible during migration
  • Migration can be completed incrementally without service disruption

Related Files

  • crates/insights/src/server/services/lancedb/models.rs - Data models
  • crates/insights/src/server/services/lancedb/search.rs - Search implementation
  • crates/insights/src/server/services/lancedb/records.rs - Arrow schema
  • crates/insights/src/server/handlers/insights.rs - API handlers

This enhancement would significantly improve the insights system's search performance and relevance, especially as the insight collection grows larger.
