perf: Implement BigQuery batch operations for high-traffic scenarios #13

@doughayden

Performance Optimization

Implement batch operations for BigQuery insertions to improve performance and reduce API call overhead during high-traffic periods.

Level of Effort: 🟡 Medium (2-3 days)

  • Implementation: 1.5-2 days for batch logic and configuration
  • Testing: 0.5-1 day for performance and reliability testing

Current Implementation

File: src/answer_app/utils.py (lines 299-330)

async def bq_insert_row_data(self, table_id: str, row_data: dict) -> None:
    """Insert a single row into BigQuery table."""
    # Current: Single row insertion for each request

Current limitations:

  • Each query or feedback submission generates an individual BigQuery insert
  • High API call volume during peak usage
  • Potential rate limiting issues
  • Inefficient per-row API usage where rows could be sent together in batches

Performance Impact Analysis

Current state:

  • 1 API call per user query
  • 1 API call per feedback submission
  • No batching or queuing mechanism

Expected improvements with batching:

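  • An estimated 50-80% reduction in BigQuery API calls, depending on traffic volume, batch size, and flush interval
  • Better handling of traffic spikes
  • Reduced latency for individual requests, since most requests only enqueue a row instead of waiting on an insert call
  • Potentially lower BigQuery costs for high-volume usage (streaming inserts are billed by data volume rather than per call, so savings depend on the ingestion method)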
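To sanity-check the estimated reduction in API calls, here is a rough back-of-the-envelope model. It assumes rows arrive evenly spaced over the hour and that a flush happens either when the batch is full or when the flush interval elapses with rows waiting. The function names `estimated_api_calls` and `reduction_percent` are hypothetical and not part of the codebase.

```python
import math


def estimated_api_calls(
    rows_per_hour: int, batch_size: int = 100, flush_interval: int = 30
) -> int:
    """Rough estimate of insert calls per hour with batching (evenly spaced arrivals)."""
    if rows_per_hour <= 0:
        return 0
    # Flushes triggered by a full batch.
    size_flushes = math.ceil(rows_per_hour / batch_size)
    # Flushes triggered by the timer (one per interval, if rows are waiting).
    timer_flushes = 3600 // flush_interval
    # Every flush carries at least one row, so calls can never exceed rows.
    return min(rows_per_hour, max(size_flushes, timer_flushes))


def reduction_percent(
    rows_per_hour: int, batch_size: int = 100, flush_interval: int = 30
) -> float:
    """Percentage of API calls saved compared with one call per row."""
    calls = estimated_api_calls(rows_per_hour, batch_size, flush_interval)
    return 100.0 * (1 - calls / rows_per_hour)


if __name__ == "__main__":
    for rate in (100, 240, 600, 100_000):
        print(rate, estimated_api_calls(rate), round(reduction_percent(rate)))
```

Under these assumptions and the default settings (batch size 100, flush interval 30 seconds), the model gives no reduction at 100 rows/hour, about 50% at 240 rows/hour, and about 80% at 600 rows/hour. Real traffic is bursty, so actual savings may differ.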

Recommended Implementation

1. Batch Queue System

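import time


class BigQueryBatcher:
    def __init__(self, batch_size: int = 100, flush_interval: int = 30):
        self.batch_size = batch_size          # Rows that trigger a size-based flush
        self.flush_interval = flush_interval  # Seconds between forced flushes
        self.queue: list[tuple[str, dict]] = []
        self.last_flush = time.monotonic()

    async def add_row(self, table_id: str, row_data: dict) -> None:
        """Add row to batch queue."""
        self.queue.append((table_id, row_data))

        if len(self.queue) >= self.batch_size:
            await self.flush()

    async def flush(self) -> None:
        """Flush batch to BigQuery."""
        if not self.queue:
            return

        # Take the pending rows and reset the queue so it does not grow forever
        batch, self.queue = self.queue, []
        self.last_flush = time.monotonic()

        # Group by table_id and batch insert
        # Implementation details...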
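One way the grouping step left open in `flush()` could look is sketched below. This is a minimal illustration, not the project's implementation: the class name `GroupingBatcher` and the injected `insert_rows` coroutine are hypothetical stand-ins for whatever BigQuery call `UtilHandler` ends up making. Rows that fail to insert are not re-queued here.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

# Hypothetical signature: insert many rows into one table.
InsertFn = Callable[[str, list], Awaitable[None]]


class GroupingBatcher:
    def __init__(self, insert_rows: InsertFn, batch_size: int = 100) -> None:
        self._insert_rows = insert_rows
        self.batch_size = batch_size
        self.queue: list = []
        # Serializes flushes so two callers never send the same rows.
        self._lock = asyncio.Lock()

    async def add_row(self, table_id: str, row_data: dict) -> None:
        self.queue.append((table_id, row_data))
        if len(self.queue) >= self.batch_size:
            await self.flush()

    async def flush(self) -> None:
        async with self._lock:
            if not self.queue:
                return
            # Swap the queue out first so new rows can arrive during the insert.
            pending, self.queue = self.queue, []

            # Group rows by destination table, preserving arrival order.
            grouped = defaultdict(list)
            for table_id, row in pending:
                grouped[table_id].append(row)

            # One insert call per table instead of one per row.
            for table_id, rows in grouped.items():
                await self._insert_rows(table_id, rows)
```

Injecting the insert function keeps the batcher testable without a BigQuery client and leaves the choice of ingestion API open.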

2. Configuration Options

Add to config.yaml:

bigquery:
  batch_size: 100          # Rows per batch
  flush_interval: 30       # Seconds between forced flushes
  max_queue_size: 1000     # Maximum queue size before blocking
  enable_batching: true    # Feature flag
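The settings above could be read into a small validated object at startup. The sketch below assumes the YAML file has already been parsed into a dict by whatever loader the app uses; `BatchingConfig` and `load_batching_config` are hypothetical names, and the validation rules are suggestions rather than requirements from this issue.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BatchingConfig:
    batch_size: int = 100         # Rows per batch
    flush_interval: int = 30      # Seconds between forced flushes
    max_queue_size: int = 1000    # Maximum queue size before blocking
    enable_batching: bool = True  # Feature flag

    def __post_init__(self) -> None:
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if self.flush_interval <= 0:
            raise ValueError("flush_interval must be positive")
        if self.max_queue_size < self.batch_size:
            raise ValueError("max_queue_size must be >= batch_size")


def load_batching_config(config: dict) -> BatchingConfig:
    """Build a BatchingConfig from the parsed config, ignoring unrelated keys."""
    section = config.get("bigquery") or {}
    known = {f.name for f in fields(BatchingConfig)}
    return BatchingConfig(**{k: v for k, v in section.items() if k in known})
```

Missing keys fall back to the defaults shown in the YAML, so existing deployments without a `bigquery` section keep working.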

3. Graceful Degradation

  • Fall back to single inserts if batching fails
  • Implement proper error handling for partial batch failures
  • Add monitoring for batch performance
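The fallback and partial-failure handling above might be sketched as follows. The two callables are hypothetical stand-ins for the real BigQuery calls. The per-row error shape (a list of mappings with an `index` key) mirrors what `insert_rows_json` in the `google-cloud-bigquery` client returns, but treat that as an assumption to verify against the client version in use.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

# Hypothetical callables standing in for the real BigQuery calls.
BatchInsert = Callable[[list], Awaitable[list]]  # returns per-row error mappings
SingleInsert = Callable[[dict], Awaitable[None]]


async def insert_with_fallback(
    rows: list, insert_batch: BatchInsert, insert_single: SingleInsert
) -> list:
    """Try one batch insert, then retry failed rows one at a time.

    Returns the rows that could not be written at all.
    """
    try:
        errors = await insert_batch(rows)
    except Exception:
        # Whole batch failed: fall back to single inserts for every row.
        logger.exception("Batch insert failed; falling back to single inserts")
        retry = list(rows)
    else:
        # Partial failure: only rows named in the error list need a retry.
        retry = [rows[error["index"]] for error in errors]

    lost = []
    for row in retry:
        try:
            await insert_single(row)
        except Exception:
            logger.exception("Single insert failed for a row")
            lost.append(row)
    return lost
```

Note that if the batch call fails after BigQuery has already accepted some rows (for example on a timeout), retrying every row may create duplicates, so a real implementation would likely need row IDs or another deduplication strategy.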

Implementation Areas

Files to Modify:

  • src/answer_app/utils.py: Add batching logic to UtilHandler
  • src/answer_app/config.yaml: Add BigQuery batching configuration
  • src/answer_app/main.py: Initialize batcher and handle graceful shutdown

New Components:

  • BigQuery batch manager class
  • Background flush scheduler
  • Batch monitoring and metrics
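A background flush scheduler with a graceful shutdown path could be sketched with plain `asyncio`, as below. `FlushScheduler` is a hypothetical name, and `flush` is any coroutine function that sends the queued rows (for example the batcher's `flush` method). How this hooks into the startup and shutdown of `main.py` depends on the web framework and is not shown.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class FlushScheduler:
    def __init__(
        self, flush: Callable[[], Awaitable[None]], flush_interval: float = 30.0
    ) -> None:
        self._flush = flush
        self._interval = flush_interval
        self._stop: Optional[asyncio.Event] = None
        self._task: Optional[asyncio.Task] = None

    def start(self) -> None:
        """Start the periodic flush loop; must be called from a running event loop."""
        self._stop = asyncio.Event()
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        while not self._stop.is_set():
            try:
                # Sleep for one interval, but wake immediately on stop().
                await asyncio.wait_for(self._stop.wait(), timeout=self._interval)
            except asyncio.TimeoutError:
                # Interval elapsed without a stop request: time to flush.
                await self._flush()

    async def stop(self) -> None:
        """Stop the loop without interrupting an in-flight flush, then flush once more."""
        if self._task is not None:
            self._stop.set()
            await self._task
            self._task = None
        # Final flush so rows queued since the last tick are not lost on shutdown.
        await self._flush()
```

Signalling with an event instead of cancelling the task means a flush that is already running finishes before shutdown proceeds, at the cost of waiting for that flush to complete.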

Configuration Strategy

Development Environment:

  • Smaller batch sizes for faster feedback
  • Shorter flush intervals for debugging

Production Environment:

  • Larger batch sizes for efficiency
  • Longer flush intervals for fewer API calls, at the cost of a longer delay before rows appear in BigQuery

Testing Requirements

  • Batch queue functionality
  • Flush timing and triggers
  • Error handling for partial failures
  • Performance benchmarking
  • Memory usage monitoring
  • Graceful shutdown behavior

Acceptance Criteria

  • Batch queue system implemented
  • Configurable batch size and flush interval
  • Backward compatibility with single inserts
  • Proper error handling and logging
  • Performance metrics and monitoring
  • Graceful application shutdown
  • Documentation for configuration options

Priority

Medium - Performance optimization that becomes more valuable as traffic increases.

When to Implement

This optimization becomes more valuable when:

  • The application consistently handles more than 100 queries/hour
  • BigQuery costs become a concern
  • API rate limiting becomes an issue
  • Traffic regularly arrives in bursts that can be grouped into batches
