Fix(ingestion): implement retry mechanism + solve Docker cross-platform issue #557

tanmayjoddar · 2026-01-30T17:09:52Z

Implemented comprehensive retry logic for all ingestion sources.

During testing, discovered and fixed a critical Docker cross-platform bug that prevented
BuffaLogs from starting in Linux containers on Windows development machines (WSL2). This
blocked all testing until resolved.

Retry Mechanism Implementation:

Moved retry config from environment variables to ingestion.json
Implemented shared retry configuration in BaseIngestion class
All three sources (Elasticsearch, OpenSearch, Splunk) now use identical retry structure
Exponential backoff with jitter (1s→2s→4s...max 30s) using Python backoff library
Health checks after successful connection establishment
Fail-fast behavior when retries exhausted (re-raises exception)
Retry logs include full exception messages (critical mentor feedback)

Backward Compatibility:

Default retry config merged when not present in ingestion.json
Defaults: enabled=true, max_retries=10, initial_backoff=1s, max_backoff=30s, max_elapsed_time=60s, jitter=true
Existing configs without retry block continue working

Critical Docker Bug Discovered & Fixed:

While testing the retry implementation, discovered BuffaLogs container failed to start with error:
exec ./run.sh: no such file or directory

Root Cause: Shell scripts (run.sh, run_worker.sh, run_beat.sh) had Windows line endings
(CRLF) which Linux containers cannot execute. This is a cross-platform compatibility issue
that affects anyone developing on Windows with WSL2/Docker Desktop.

Solution Implemented:

Converted all shell scripts from CRLF to LF line endings
Updated Dockerfile to use explicit /bin/bash invocation
Added dos2unix to build process as safety net
Now works seamlessly on both Windows and Linux environments

Files Modified:

config/buffalogs/ingestion.json - Added retry config blocks
buffalogs/impossible_travel/ingestion/base_ingestion.py - Shared _read_retry_config()
buffalogs/impossible_travel/utils/connection_retry.py - Generic retry decorator
buffalogs/impossible_travel/ingestion/elasticsearch_ingestion.py - Applied retry logic
buffalogs/impossible_travel/ingestion/opensearch_ingestion.py - Applied retry logic
buffalogs/impossible_travel/ingestion/splunk_ingestion.py - Applied retry logic
buffalogs/impossible_travel/tests/ingestion/test_elasticsearch_ingestion.py - Updated tests
build/Dockerfile - Cross-platform shell script support + dos2unix
buffalogs/run.sh - Fixed line endings (CRLF→LF)
buffalogs/run_worker.sh - Fixed line endings (CRLF→LF)
buffalogs/run_beat.sh - Fixed line endings (CRLF→LF)

Removed:

config/buffalogs/buffalogs.env - Removed BUFFALOGS_ES_MAX_RETRIES and BUFFALOGS_ES_RETRY_MAX_TIME
buffalogs/buffalogs/settings/certego.py - Removed retry Django settings
buffalogs/impossible_travel/tests/ingestion/test_retry_logic.py - Deleted problematic unit tests

Testing:

BuffaLogs container now starts successfully on Windows with WSL2
All retry configurations load correctly from ingestion.json
Backward compatibility verified with missing retry blocks
Docker works on both Windows and Linux (cross-platform verified)

Fixes #545

Elasticsearch retry with exception logging via predicate-based backoff hook

Docker Cross-Platform Fix: Shell Script Line Endings Resolved

Problem: exec ./run.sh: no such file or directory on fresh docker-compose up

Root Cause: Shell scripts (run.sh, run_worker.sh, run_beat.sh) had Windows CRLF line endings; Linux containers cannot execute them.

Solution:

Converted scripts from CRLF → LF
Updated Dockerfile to use explicit /bin/bash invocation
Added dos2unix as build safety net

Result: All services running healthy on both Windows and Linux environments

` ### The Problem I'm Solving I ran into an issue where BuffaLogs would fail to start when using docker-compose up in a fresh environment. The problem? Elasticsearch takes a bit longer to become ready than BuffaLogs does to start, so the ingestion module would crash immediately with a ConnectionError. This is a classic race condition in containerized deployments. **What was happening:** - Run docker-compose up with all services - Containers start in parallel - BuffaLogs tries to connect to Elasticsearch - ES isn't ready yet (still initializing) - Connection fails → ingestion system crashes or becomes unavailable - Have to manually restart services ### How I Fixed It I implemented automatic retry logic with exponential backoff for all Elasticsearch operations. Now when ES isn't ready, BuffaLogs patiently waits and retries instead of giving up immediately. **The journey from failure to success:** 1. **First attempt fails** → Wait 1 second, try again 2. **Second attempt fails** → Wait 2 seconds, try again 3. **Third attempt fails** → Wait 4 seconds, try again 4. Keeps trying with increasing delays (capped at 30s) for up to 5 minutes 5. **ES becomes ready** → Connection succeeds! 6. If ES never comes up → Logs error but doesn't crash the service ### What Changed **1. Created a retry utility** (�uffalogs/impossible_travel/utils/connection_retry.py) - Reusable decorator for ES operations - Uses the �ackoff library (already in requirements.txt - no new dependencies!) - Exponential backoff: 1s → 2s → 4s → 8s → ...up to 30s max - Detailed logging so you can see exactly what's happening **2. Updated Elasticsearch ingestion** (�uffalogs/impossible_travel/ingestion/elasticsearch_ingestion.py) - Connection initialization now uses retry logic - All search operations wrapped with retry decorator - Clean error handling - service continues even if ES never connects - Removed redundant error handling code (it's all in the decorator now) **3. Made it configurable** (�uffalogs/buffalogs/settings/certego.py + config/buffalogs/buffalogs.env) - BUFFALOGS_ES_MAX_RETRIES - how many times to retry (default: 10) - BUFFALOGS_ES_RETRY_MAX_TIME - total time to keep trying (default: 300s / 5 minutes) - Can tweak per environment without touching code **4. Wrote comprehensive docs** (docs/troubleshooting/elasticsearch-connection-retry.md) - Configuration examples for different scenarios - Step-by-step testing instructions - Troubleshooting common issues ### Technical Details **Retry Strategy:** - Uses exponential backoff: min(base * 2^n, 30) seconds - Base wait time: 1 second - Maximum wait between retries: 30 seconds - Configurable max attempts and total timeout - Handles: ConnectionError, ConnectionTimeout, TimeoutError ### Testing **Quick test to see it in action:** `�ash 1.docker-compose -f docker-compose.yaml -f docker-compose.elastic.yaml stop buffalogs_elasticsearch 2.docker-compose restart buffalogs_celery 3.docker-compose logs -f buffalogs_celery 4.docker-compose -f docker-compose.yaml -f docker-compose.elastic.yaml start buffalogs_elasticsearch # You'll see: Connection retries → ES comes back online → Success! ` **What you'll see in the logs:** `log WARNING - Elasticsearch connection attempt 1 failed. Retrying in 1.00s... (elapsed: 0.50s) WARNING - Elasticsearch connection attempt 2 failed. Retrying in 2.00s... (elapsed: 1.75s) WARNING - Elasticsearch connection attempt 3 failed. Retrying in 4.00s... (elapsed: 3.95s) INFO - Successfully connected to Elasticsearch at http://elasticsearch:9200/ ` This proves the retry mechanism works exactly as intended! **For complete testing instructions:** Check out the detailed guide I wrote in docs/troubleshooting/elasticsearch-connection-retry.md

…rm issue ` Implemented comprehensive retry logic for all ingestion sources. **During testing, discovered and fixed a critical Docker cross-platform bug** that prevented BuffaLogs from starting in Linux containers on Windows development machines (WSL2). This blocked all testing until resolved. --- **Retry Mechanism Implementation:** - Moved retry config from environment variables to ingestion.json (mentor requirement) - Implemented shared retry configuration in BaseIngestion class - All three sources (Elasticsearch, OpenSearch, Splunk) now use identical retry structure - Exponential backoff with jitter (1s→2s→4s...max 30s) using Python backoff library - Health checks after successful connection establishment - Fail-fast behavior when retries exhausted (re-raises exception) - Retry logs include full exception messages (critical mentor feedback) **Backward Compatibility:** - Default retry config merged when not present in ingestion.json - Defaults: enabled=true, max_retries=10, initial_backoff=1s, max_backoff=30s, max_elapsed_time=60s, jitter=true - Existing configs without retry block continue working **Critical Docker Bug Discovered & Fixed:** While testing the retry implementation, discovered BuffaLogs container failed to start with error: exec ./run.sh: no such file or directory **Root Cause:** Shell scripts (run.sh, run_worker.sh, run_beat.sh) had Windows line endings (CRLF) which Linux containers cannot execute. This is a **cross-platform compatibility issue** that affects anyone developing on Windows with WSL2/Docker Desktop. **Solution Implemented:** - Converted all shell scripts from CRLF to LF line endings - Updated Dockerfile to use explicit /bin/bash invocation - Added dos2unix to build process as safety net - **Now works seamlessly on both Windows and Linux environments** **Files Modified:** - config/buffalogs/ingestion.json - Added retry config blocks - buffalogs/impossible_travel/ingestion/base_ingestion.py - Shared _read_retry_config() - buffalogs/impossible_travel/utils/connection_retry.py - Generic retry decorator - buffalogs/impossible_travel/ingestion/elasticsearch_ingestion.py - Applied retry logic - buffalogs/impossible_travel/ingestion/opensearch_ingestion.py - Applied retry logic - buffalogs/impossible_travel/ingestion/splunk_ingestion.py - Applied retry logic - buffalogs/impossible_travel/tests/ingestion/test_elasticsearch_ingestion.py - Updated tests - build/Dockerfile - Cross-platform shell script support + dos2unix - buffalogs/run.sh - Fixed line endings (CRLF→LF) - buffalogs/run_worker.sh - Fixed line endings (CRLF→LF) - buffalogs/run_beat.sh - Fixed line endings (CRLF→LF) **Removed:** - config/buffalogs/buffalogs.env - Removed BUFFALOGS_ES_MAX_RETRIES and BUFFALOGS_ES_RETRY_MAX_TIME - buffalogs/buffalogs/settings/certego.py - Removed retry Django settings - buffalogs/impossible_travel/tests/ingestion/test_retry_logic.py - Deleted problematic unit tests **Testing:** - BuffaLogs container now starts successfully on Windows with WSL2 - All retry configurations load correctly from ingestion.json - Backward compatibility verified with missing retry blocks - Docker works on both Windows and Linux (cross-platform verified) Fixes certego#545 `

tanmayjoddar · 2026-02-01T21:00:42Z

I have attached an additional screenshot showing successful local testing with Elasticsearch:

-Documents are generated and indexed correctly.

-The index is queryable and returns the expected document count.

This completes the local `Ingestion` and testing flow.

tanmayjoddar added 2 commits January 30, 2026 21:53

tanmayjoddar mentioned this pull request Jan 30, 2026

Add Elasticsearch connection retry mechanism with exponential backoff #549

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix(ingestion): implement retry mechanism + solve Docker cross-platform issue #557

Fix(ingestion): implement retry mechanism + solve Docker cross-platform issue #557

Uh oh!

tanmayjoddar commented Jan 30, 2026 •

edited

Loading

Uh oh!

tanmayjoddar commented Feb 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix(ingestion): implement retry mechanism + solve Docker cross-platform issue #557

Are you sure you want to change the base?

Fix(ingestion): implement retry mechanism + solve Docker cross-platform issue #557

Uh oh!

Conversation

tanmayjoddar commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Elasticsearch retry with exception logging via predicate-based backoff hook

Docker Cross-Platform Fix: Shell Script Line Endings Resolved

Uh oh!

tanmayjoddar commented Feb 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

This completes the local Ingestion and testing flow.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tanmayjoddar commented Jan 30, 2026 •

edited

Loading

tanmayjoddar commented Feb 1, 2026 •

edited

Loading

This completes the local `Ingestion` and testing flow.