perf: Implement response caching for frequently asked questions #18

@doughayden

Description

Performance Optimization

Implement intelligent caching for frequently asked questions to improve response times and reduce API costs.

Level of Effort: 🟡 Medium (3-4 days)

  • Cache implementation: 2 days for caching logic and storage
  • Cache strategy: 1 day for intelligent cache management
  • Testing and tuning: 1 day for performance validation

Current Performance Characteristics

Response time breakdown:

  • Discovery Engine API call: 2-5 seconds (majority of time)
  • Response processing: 100-500ms
  • BigQuery logging: 50-200ms
  • Total: 2.5-6 seconds per query

Caching opportunities:

  • Identical questions from different users
  • Similar questions with equivalent answers
  • Popular topics that get asked repeatedly
  • FAQ-style questions with stable answers

Proposed Caching Strategy

1. Multi-Layer Caching

Layer 1: Exact Match Cache

```python
# src/answer_app/cache_manager.py
from typing import Optional


class ResponseCache:
    def __init__(self, redis_client=None, ttl_seconds: int = 3600):
        self.redis_client = redis_client
        self.memory_cache: dict = {}
        self.ttl_seconds = ttl_seconds

    async def get_cached_response(self, query_hash: str) -> Optional["CachedResponse"]:
        """Return the cached response for an exact query match, or None."""

    async def cache_response(self, query: str, response: dict, metadata: dict) -> None:
        """Cache a successful response together with its metadata."""

    def generate_query_hash(self, query: str, user_context: dict = None) -> str:
        """Generate a consistent hash for query caching."""
```
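As a concrete illustration of Layer 1, a minimal in-memory exact-match cache with TTL might look like the sketch below. The class name, normalization, and lazy eviction are assumptions for illustration, not the final design; note the use of `hashlib` rather than the built-in `hash()`, which is not stable across processes.

```python
import hashlib
import time


class ExactMatchCache:
    """Minimal in-memory exact-match cache with TTL (illustrative sketch)."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl_seconds = ttl_seconds
        self._store: dict = {}

    @staticmethod
    def query_hash(query: str) -> str:
        # hashlib gives stable keys across processes, unlike built-in hash()
        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self.query_hash(query))
        if entry is None:
            return None
        expires_at, response = entry
        if time.monotonic() > expires_at:
            del self._store[self.query_hash(query)]  # lazily evict expired entries
            return None
        return response

    def put(self, query: str, response: dict) -> None:
        key = self.query_hash(query)
        self._store[key] = (time.monotonic() + self.ttl_seconds, response)
```

Because keys are normalized before hashing, trivially different phrasings ("What is VAT?" vs. "what is vat?  ") hit the same entry.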

Layer 2: Semantic Similarity Cache

```python
from typing import List, Optional


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.85):
        self.similarity_threshold = similarity_threshold
        self.query_embeddings: dict = {}

    async def find_similar_cached_query(self, query: str) -> Optional["CachedResponse"]:
        """Find a semantically similar cached query, if any."""

    async def compute_query_embedding(self, query: str) -> List[float]:
        """Generate an embedding for semantic similarity comparison."""
```
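The similarity lookup reduces to a nearest-neighbour search over query embeddings. A minimal sketch using plain cosine similarity follows; in practice `compute_query_embedding` would call an embedding model, so the vectors here are stand-ins.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_best_match(query_vec, cached: dict, threshold: float = 0.85):
    """Return the cached query whose embedding is most similar, if above threshold."""
    best_key, best_score = None, threshold
    for key, vec in cached.items():
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_key, best_score = key, score
    return best_key
```

A linear scan is fine at small scale; beyond a few thousand cached embeddings, an approximate nearest-neighbour index would be the natural upgrade.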

2. Cache Storage Options

Option A: Redis (Recommended for production)

  • Fast in-memory caching
  • Distributed caching across instances
  • Automatic expiration handling
  • Persistence options available

Option B: In-Memory Cache (Development)

  • Simple implementation for testing
  • No external dependencies
  • Limited to single instance

Option C: BigQuery Cache Table

  • Persistent cache storage
  • Query-based cache management
  • Integration with existing BigQuery setup
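To keep the three storage options interchangeable, a small common interface helps. A sketch using `typing.Protocol` — the method names and the `MemoryBackend` class are assumptions for illustration, not the project's API:

```python
from typing import Optional, Protocol


class CacheBackend(Protocol):
    """Interface any of the three storage options could implement."""

    def get(self, key: str) -> Optional[dict]: ...
    def set(self, key: str, value: dict, ttl_seconds: int) -> None: ...


class MemoryBackend:
    """Option B: a single-instance dict backend for development."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> Optional[dict]:
        return self._data.get(key)

    def set(self, key: str, value: dict, ttl_seconds: int) -> None:
        # TTL is ignored in this development sketch
        self._data[key] = value
```

Redis and BigQuery backends would implement the same two methods, so the provider can be swapped via configuration without touching call sites.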

3. Intelligent Cache Management

Cache key strategy:

```python
import hashlib


def generate_cache_key(query: str, user_context: dict) -> str:
    """Generate a cache key that accounts for user context."""
    # Normalize query text for consistent lookups
    normalized_query = normalize_query(query)

    # Include relevant context in the key
    context_hash = hash_user_context(user_context)

    # hashlib is stable across processes and restarts, unlike built-in hash()
    query_hash = hashlib.sha256(normalized_query.encode("utf-8")).hexdigest()
    return f"query:{query_hash}:ctx:{context_hash}"


def normalize_query(query: str) -> str:
    """Normalize a query for consistent caching."""
    # Lowercase and trim whitespace; punctuation stripping and
    # synonym handling can be layered on later
    return query.lower().strip()
```

Cache invalidation strategy:

  • Time-based expiration (TTL)
  • Manual invalidation for updated content
  • LRU eviction for memory management
  • Version-based invalidation for content updates
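Of these policies, LRU eviction is the easiest to get subtly wrong; Python's `OrderedDict` makes a size-bounded LRU straightforward. A sketch, not the project's implementation:

```python
from collections import OrderedDict


class LRUCache:
    """Size-bounded cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int = 10000):
        self.max_entries = max_entries
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry
```

In production this would be combined with TTL expiry, so an entry disappears at whichever limit it hits first.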

4. Cache Analytics and Monitoring

Cache performance metrics:

```python
class CacheMetrics:
    def track_cache_hit(self, query_type: str, response_time_saved: float):
        """Track successful cache hits."""

    def track_cache_miss(self, query_type: str, reason: str):
        """Track cache misses and the reasons for them."""

    def get_cache_statistics(self) -> dict:
        """Return cache performance statistics."""
        return {
            "hit_rate": self.calculate_hit_rate(),
            "avg_response_time_saved": self.avg_time_saved,
            "popular_queries": self.get_top_cached_queries(),
            "cache_size": self.get_cache_size(),
        }
```
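The hit-rate and time-saved calculations behind those statistics can be as simple as a pair of counters. A minimal working sketch (field names are assumptions):

```python
class SimpleCacheMetrics:
    """Minimal hit/miss tracking behind a statistics API like the one above."""

    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.time_saved_total = 0.0

    def track_cache_hit(self, query_type: str, response_time_saved: float) -> None:
        self.hits += 1
        self.time_saved_total += response_time_saved

    def track_cache_miss(self, query_type: str, reason: str) -> None:
        self.misses += 1

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def avg_time_saved(self) -> float:
        return self.time_saved_total / self.hits if self.hits else 0.0
```

These counters map directly onto the `alert_on_low_hit_rate` threshold in the configuration below.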

Implementation Areas

Backend Components:

  • src/answer_app/cache_manager.py: Core caching logic
  • src/answer_app/semantic_cache.py: Similarity-based caching
  • src/answer_app/cache_metrics.py: Cache performance tracking
  • src/answer_app/main.py: Cache middleware integration

Configuration:

  • src/answer_app/config.yaml: Cache configuration options
  • Cache deployment: Redis or alternative cache storage

Infrastructure:

  • terraform/modules/cache/: Redis deployment (if using Redis)
  • Monitoring: Cache performance dashboards

Configuration Options

Add to config.yaml:

```yaml
caching:
  enabled: true
  provider: "redis"  # redis, memory, bigquery

  redis:
    host: "localhost"
    port: 6379
    db: 0
    password: null

  cache_settings:
    default_ttl: 3600  # 1 hour
    max_cache_size: 10000  # entries
    similarity_threshold: 0.85

  cache_policies:
    exact_match_ttl: 3600
    semantic_match_ttl: 1800
    popular_query_ttl: 7200

  monitoring:
    log_cache_hits: true
    track_performance: true
    alert_on_low_hit_rate: 0.3
```
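Once the YAML is parsed (e.g. with `yaml.safe_load`), applying safe defaults keeps missing keys from crashing the app. A sketch — key names follow the snippet above, the defaults themselves are assumptions:

```python
def resolve_cache_settings(config: dict) -> dict:
    """Apply safe defaults to the parsed 'caching' block of config.yaml."""
    caching = dict(config.get("caching") or {})
    caching.setdefault("enabled", False)      # caching stays off unless configured
    caching.setdefault("provider", "memory")  # safest default for development
    settings = dict(caching.get("cache_settings") or {})
    settings.setdefault("default_ttl", 3600)
    settings.setdefault("max_cache_size", 10000)
    caching["cache_settings"] = settings
    return caching
```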

Cache Warming Strategies

1. Popular Query Pre-loading

  • Identify frequently asked questions from BigQuery logs
  • Pre-populate cache with common queries
  • Update cache during low-traffic periods
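Identifying the candidates can happen offline from the existing logs. The BigQuery fetch itself is elided here; this just shows the ranking step, with light normalization matching the cache-key strategy:

```python
from collections import Counter


def top_queries(logged_queries: list, n: int = 20) -> list:
    """Rank logged queries by frequency after light normalization."""
    counts = Counter(q.strip().lower() for q in logged_queries)
    return counts.most_common(n)
```

Feeding the top-N results through the normal answer path during a low-traffic window pre-populates the cache before users ask.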

2. Proactive Caching

  • Cache responses for trending topics
  • Pre-generate responses for FAQ content
  • Background cache refresh for expiring popular items

Testing Strategy

Performance Testing:

  • Cache hit rate measurement
  • Response time improvement validation
  • Memory usage monitoring
  • Cache invalidation testing

Functionality Testing:

  • Exact match caching accuracy
  • Semantic similarity matching
  • Cache expiration behavior
  • Cache consistency across instances
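Cache expiration is the easiest of these behaviours to pin down in a unit test if the clock is injectable, so tests never have to sleep. A sketch — the `TTLCache` class and `clock` parameter are assumptions about a testable design, not the project's code:

```python
class TTLCache:
    """Tiny TTL cache with an injectable clock, so expiry is testable without sleeping."""

    def __init__(self, ttl_seconds: float, clock):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # callable returning the current time in seconds
        self._store: dict = {}

    def put(self, key, value) -> None:
        self._store[key] = (self.clock() + self.ttl_seconds, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() > expires_at:
            del self._store[key]
            return None
        return value


def test_entry_expires_after_ttl():
    now = [0.0]
    cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
    cache.put("q", "answer")
    assert cache.get("q") == "answer"  # still fresh
    now[0] = 11.0                      # advance the fake clock past the TTL
    assert cache.get("q") is None      # entry has expired
```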

Expected Performance Improvements

With 50% cache hit rate:

  • Average response time: 2.5-6s → ~1.5-3s (cached queries return in well under a second, roughly halving the average)
  • Discovery Engine API calls: 50% reduction
  • Cost savings: ~50% reduction in Discovery Engine costs
  • Improved user experience for repeat questions

With semantic caching:

  • Additional 20-30% cache hit rate for similar questions
  • Better handling of question variations
  • Reduced API costs for semantically similar queries

Acceptance Criteria

  • Exact match caching implemented and working
  • Semantic similarity caching (optional advanced feature)
  • Configurable cache TTL and size limits
  • Cache performance monitoring and metrics
  • Cache hit rate > 40% for typical usage patterns
  • Response time improvement measurable
  • Cache invalidation working correctly
  • Documentation for cache configuration and tuning

Priority

Low - Performance optimization that provides value but isn't critical at current scale.

When to Implement

This becomes more valuable when:

  • Application handles >100 queries/day consistently
  • Discovery Engine API costs become significant
  • Response time optimization becomes important
  • Users frequently ask similar questions
  • Traffic patterns show repeated query patterns
