This guide maps common symptoms to potential causes and solutions using Longbow's metrics and logs.
Symptom: DoPut operations are taking longer than expected.
Check Metrics:
- `longbow_wal_writes_total`: Is the rate consistent?
- `longbow_wal_bytes_written_total`: Are you writing unusually large batches?
Potential Causes:
- Slow Disk: The WAL requires high IOPS. Ensure `LONGBOW_DATA_PATH` is on an SSD.
- Large Batches: Extremely large Arrow batches can cause GC pauses. Try reducing batch size (an alerting sketch for spotting oversized batches follows below).
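If you want an early warning instead of manual checks, a Prometheus alerting rule along these lines can flag unusually large WAL batches. This is a sketch assuming a standard Prometheus setup; the 8 MiB threshold and 5m window are illustrative, not Longbow recommendations:

```yaml
groups:
  - name: longbow-wal
    rules:
      - alert: LongbowLargeWalBatches
        # Average bytes per WAL write over the last 5 minutes.
        expr: |
          rate(longbow_wal_bytes_written_total[5m])
            / rate(longbow_wal_writes_total[5m]) > 8 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Longbow WAL batches average more than 8 MiB; consider smaller DoPut batches."
```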
Symptom: Pod is getting OOMKilled or memory usage is climbing indefinitely.
Check Metrics:
- `longbow_vector_index_size`: Is the index growing as expected?
- `longbow_memory_fragmentation_ratio`: Is the Go runtime retaining memory?
Potential Causes:
- Snapshot Lag: If snapshots are failing, the WAL grows and memory isn't freed. Check `longbow_snapshot_operations_total{status="error"}`.
- Configuration: Ensure `LONGBOW_MAX_MEMORY` is set to a value lower than your container's memory limit (see the deployment sketch below).
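For reference, a deployment excerpt that keeps `LONGBOW_MAX_MEMORY` below the container's hard limit might look like the following. The 6Gi/8Gi split is only an example, and the exact value format accepted by `LONGBOW_MAX_MEMORY` may differ (bytes vs. `Gi`); check your configuration reference:

```yaml
# Kubernetes Deployment (or equivalent Helm values) excerpt.
containers:
  - name: longbow
    env:
      # Keep Longbow's own cap comfortably below the container limit
      # so the Go runtime and Arrow buffers have headroom.
      - name: LONGBOW_MAX_MEMORY
        value: "6Gi"
    resources:
      limits:
        memory: 8Gi
```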
Symptom: Memory usage suddenly doubles, leading to OOM kills, even when data ingestion rate is stable.
Check Metrics:
- `longbow_learned_index_adaptations_total{status="running"}`: Is a background index swap in progress?
- `longbow_store_memory_usage_bytes`: Identify the onset of the spike.
Cause: Longbow's Adaptive Learned Index and Auto-Sharding mechanisms build replacement indices in the background to ensure zero-downtime search. This process temporarily doubles the memory footprint of the index being replaced.
Solution:
- Increase Buffer: Ensure `LONGBOW_MAX_MEMORY` is set with at least a 50% buffer above your steady-state index size.
- Limit Concurrent Migrations: Avoid triggering multiple collection migrations simultaneously.
- Disable Auto-Adaptation: If memory is critical, disable automatic switching via config:

```yaml
learned_index:
  adaptation:
    enable_adaptation: false
```
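If you keep adaptation enabled, you can at least be warned when a swap coincides with a memory spike. A rough Prometheus sketch, assuming the metrics above behave as described (the ~80% growth threshold and 15m window are illustrative):

```yaml
groups:
  - name: longbow-index-adaptation
    rules:
      - alert: LongbowAdaptationMemorySpike
        # Store memory grew by more than ~80% over 15 minutes while an
        # index adaptation was reported in the same window.
        expr: |
          (
            longbow_store_memory_usage_bytes
              > 1.8 * (longbow_store_memory_usage_bytes offset 15m)
          )
          and on (instance)
          (increase(longbow_learned_index_adaptations_total{status="running"}[15m]) > 0)
        labels:
          severity: warning
        annotations:
          summary: "Longbow memory nearly doubled during a background index adaptation."
```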
Symptom: Longbow takes a long time to become ready after a restart.
Check Metrics:
- `longbow_wal_replay_duration_seconds`: High values indicate a large WAL.
Solution:
- Decrease `LONGBOW_SNAPSHOT_INTERVAL` (see the sketch below). A shorter interval means a smaller WAL to replay on startup, as older data is already in Parquet.
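A minimal env excerpt for shortening the interval; the 5m value and the duration format are illustrative, so confirm the accepted format in your configuration reference:

```yaml
# Deployment/Helm env excerpt: snapshot more often so the WAL stays small.
env:
  - name: LONGBOW_SNAPSHOT_INTERVAL
    value: "5m"
```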
Symptom: Pod crashes with `open /data/wal.log: permission denied`.
Cause: The application runs as a non-root user (UID 1000) while `/data` is owned by root or the filesystem is read-only.
Solution:
- Ensure `persistence.wal.enabled` is `true` in Helm values to mount a PersistentVolume.
- Verify `podSecurityContext.fsGroup` is set to `2000` (or similar) so the volume is writable by the app user (see the values sketch below).
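A minimal Helm values sketch combining both points; key names beyond `persistence.wal.enabled` and `podSecurityContext.fsGroup` are assumptions based on common chart conventions:

```yaml
# values.yaml excerpt
persistence:
  wal:
    enabled: true   # mounts a PersistentVolume for the WAL path
podSecurityContext:
  fsGroup: 2000     # makes the mounted volume group-writable for the UID 1000 app user
```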
Symptom: `panic: Failed to process config: converting '6.7108864e+07' to type int`.
Cause: Helm passes large numeric values as floating-point scientific notation if not explicitly quoted.
Solution:
- Quote all large integer values in `values.yaml` (e.g., `maxRecvMsgSize: "67108864"`), as in the excerpt below.
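For example (the `grpc` parent key is illustrative; the important part is quoting the number so YAML keeps it as a string rather than rendering a float):

```yaml
# values.yaml excerpt
grpc:
  maxRecvMsgSize: "67108864"   # 64 MiB, quoted to avoid scientific notation
```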
Symptom: `longbow_replication_lag_seconds` is high (> 30s) on follower nodes.
Cause: Network variance, slow disk I/O on the follower, or high write throughput overwhelming the replication stream.
Solution:
- Check follower disk IOPS and CPU.
- Ensure network connectivity between the Leader and Follower is stable (check `longbow_gossip_pings_total{direction="failed"}`).
- If the lag persists, consider scaling out with more shards to distribute write load (an alert sketch follows below).
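A simple alert on the lag metric, assuming a standard Prometheus setup; the 30s threshold mirrors the symptom above:

```yaml
groups:
  - name: longbow-replication
    rules:
      - alert: LongbowReplicationLagHigh
        expr: longbow_replication_lag_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Follower {{ $labels.instance }} is more than 30s behind the leader."
```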
Symptom: Log shows `WARN GPU initialization failed, using CPU-only`.
Cause:
- CUDA/Metal: Missing drivers or unsupported hardware.
- Memory: Insufficient GPU memory (OOM).
- Permissions: Access to GPU device denied.
Solution:
- Verify NVIDIA drivers/CUDA toolkit (Linux) or macOS version (Apple Silicon).
- Check `nvidia-smi` or `powermetrics` (macOS).
- Ensure `GPU_ENABLED=true` is set (see the values sketch below).
Symptom: `longbow_s3_operations_total{status="error"}` is increasing.
Cause: Expired AWS credentials, bucket policy denials, or network timeouts.
Solution:
- Verify `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` (see the Secret-backed example below).
- Check IAM permissions for `s3:PutObject` and `s3:GetObject`.
- Inspect logs for specific S3 error codes (e.g., `403 Forbidden`, `503 Slow Down`).
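Rather than baking keys into the pod spec, a common pattern is to source them from a Kubernetes Secret. A sketch, where the Secret name `longbow-s3-credentials` and its keys are hypothetical:

```yaml
# Deployment/Helm env excerpt
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: longbow-s3-credentials   # hypothetical Secret name
        key: access-key-id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: longbow-s3-credentials
        key: secret-access-key
```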