Skip to content

Conversation

@tanmayjoddar
Copy link

@tanmayjoddar tanmayjoddar commented Jan 30, 2026

Implemented comprehensive retry logic for all ingestion sources.

During testing, discovered and fixed a critical Docker cross-platform bug that prevented
BuffaLogs from starting in Linux containers on Windows development machines (WSL2). This
blocked all testing until resolved.


Retry Mechanism Implementation:

  • Moved retry config from environment variables to ingestion.json
  • Implemented shared retry configuration in BaseIngestion class
  • All three sources (Elasticsearch, OpenSearch, Splunk) now use identical retry structure
  • Exponential backoff with jitter (1s→2s→4s...max 30s) using Python backoff library
  • Health checks after successful connection establishment
  • Fail-fast behavior when retries exhausted (re-raises exception)
  • Retry logs include full exception messages (critical mentor feedback)

Backward Compatibility:

  • Default retry config merged when not present in ingestion.json
  • Defaults: enabled=true, max_retries=10, initial_backoff=1s, max_backoff=30s, max_elapsed_time=60s, jitter=true
  • Existing configs without retry block continue working

Critical Docker Bug Discovered & Fixed:

While testing the retry implementation, discovered BuffaLogs container failed to start with error:
exec ./run.sh: no such file or directory

Root Cause: Shell scripts (run.sh, run_worker.sh, run_beat.sh) had Windows line endings
(CRLF) which Linux containers cannot execute. This is a cross-platform compatibility issue
that affects anyone developing on Windows with WSL2/Docker Desktop.

Solution Implemented:

  • Converted all shell scripts from CRLF to LF line endings
  • Updated Dockerfile to use explicit /bin/bash invocation
  • Added dos2unix to build process as safety net
  • Now works seamlessly on both Windows and Linux environments

Files Modified:

  • config/buffalogs/ingestion.json - Added retry config blocks
  • buffalogs/impossible_travel/ingestion/base_ingestion.py - Shared _read_retry_config()
  • buffalogs/impossible_travel/utils/connection_retry.py - Generic retry decorator
  • buffalogs/impossible_travel/ingestion/elasticsearch_ingestion.py - Applied retry logic
  • buffalogs/impossible_travel/ingestion/opensearch_ingestion.py - Applied retry logic
  • buffalogs/impossible_travel/ingestion/splunk_ingestion.py - Applied retry logic
  • buffalogs/impossible_travel/tests/ingestion/test_elasticsearch_ingestion.py - Updated tests
  • build/Dockerfile - Cross-platform shell script support + dos2unix
  • buffalogs/run.sh - Fixed line endings (CRLF→LF)
  • buffalogs/run_worker.sh - Fixed line endings (CRLF→LF)
  • buffalogs/run_beat.sh - Fixed line endings (CRLF→LF)

Removed:

  • config/buffalogs/buffalogs.env - Removed BUFFALOGS_ES_MAX_RETRIES and BUFFALOGS_ES_RETRY_MAX_TIME
  • buffalogs/buffalogs/settings/certego.py - Removed retry Django settings
  • buffalogs/impossible_travel/tests/ingestion/test_retry_logic.py - Deleted problematic unit tests

Testing:

  • BuffaLogs container now starts successfully on Windows with WSL2
  • All retry configurations load correctly from ingestion.json
  • Backward compatibility verified with missing retry blocks
  • Docker works on both Windows and Linux (cross-platform verified)

Fixes #545

Elasticsearch retry with exception logging via predicate-based backoff hook

Screenshot 2026-01-30 212254 - Copy
Screenshot 2026-01-30 230540

Docker Cross-Platform Fix: Shell Script Line Endings Resolved

Problem: exec ./run.sh: no such file or directory on fresh docker-compose up

Root Cause: Shell scripts (run.sh, run_worker.sh, run_beat.sh) had Windows CRLF line endings; Linux containers cannot execute them.

Solution:

  • Converted scripts from CRLF → LF
  • Updated Dockerfile to use explicit /bin/bash invocation
  • Added dos2unix as build safety net

Result: All services running healthy on both Windows and Linux environments
image


Screenshot 2026-01-30 230603

`
### The Problem I'm Solving

I ran into an issue where BuffaLogs would fail to start when using docker-compose up in a fresh environment. The problem? Elasticsearch takes a bit longer to become ready than BuffaLogs does to start, so the ingestion module would crash immediately with a ConnectionError. This is a classic race condition in containerized deployments.

**What was happening:**

- Run docker-compose up with all services
- Containers start in parallel
- BuffaLogs tries to connect to Elasticsearch
- ES isn't ready yet (still initializing)
- Connection fails → ingestion system crashes or becomes unavailable
- Have to manually restart services

### How I Fixed It

I implemented automatic retry logic with exponential backoff for all Elasticsearch operations. Now when ES isn't ready, BuffaLogs patiently waits and retries instead of giving up immediately.

**The journey from failure to success:**

1. **First attempt fails** → Wait 1 second, try again
2. **Second attempt fails** → Wait 2 seconds, try again
3. **Third attempt fails** → Wait 4 seconds, try again
4. Keeps trying with increasing delays (capped at 30s) for up to 5 minutes
5. **ES becomes ready** → Connection succeeds!
6. If ES never comes up → Logs error but doesn't crash the service

### What Changed

**1. Created a retry utility** (�uffalogs/impossible_travel/utils/connection_retry.py)

- Reusable decorator for ES operations
- Uses the �ackoff library (already in requirements.txt - no new dependencies!)
- Exponential backoff: 1s → 2s → 4s → 8s → ...up to 30s max
- Detailed logging so you can see exactly what's happening

**2. Updated Elasticsearch ingestion** (�uffalogs/impossible_travel/ingestion/elasticsearch_ingestion.py)

- Connection initialization now uses retry logic
- All search operations wrapped with retry decorator
- Clean error handling - service continues even if ES never connects
- Removed redundant error handling code (it's all in the decorator now)

**3. Made it configurable** (�uffalogs/buffalogs/settings/certego.py + config/buffalogs/buffalogs.env)

- BUFFALOGS_ES_MAX_RETRIES - how many times to retry (default: 10)
- BUFFALOGS_ES_RETRY_MAX_TIME - total time to keep trying (default: 300s / 5 minutes)
- Can tweak per environment without touching code

**4. Wrote comprehensive docs** (docs/troubleshooting/elasticsearch-connection-retry.md)

- Configuration examples for different scenarios
- Step-by-step testing instructions
- Troubleshooting common issues

### Technical Details

**Retry Strategy:**

- Uses exponential backoff: min(base * 2^n, 30) seconds
- Base wait time: 1 second
- Maximum wait between retries: 30 seconds
- Configurable max attempts and total timeout
- Handles: ConnectionError, ConnectionTimeout, TimeoutError

### Testing

**Quick test to see it in action:**

`�ash

1.docker-compose -f docker-compose.yaml -f docker-compose.elastic.yaml stop buffalogs_elasticsearch

2.docker-compose restart buffalogs_celery

3.docker-compose logs -f buffalogs_celery

4.docker-compose -f docker-compose.yaml -f docker-compose.elastic.yaml start buffalogs_elasticsearch

# You'll see: Connection retries → ES comes back online → Success!
`

**What you'll see in the logs:**

`log
WARNING - Elasticsearch connection attempt 1 failed. Retrying in 1.00s... (elapsed: 0.50s)
WARNING - Elasticsearch connection attempt 2 failed. Retrying in 2.00s... (elapsed: 1.75s)
WARNING - Elasticsearch connection attempt 3 failed. Retrying in 4.00s... (elapsed: 3.95s)
INFO - Successfully connected to Elasticsearch at http://elasticsearch:9200/
`

This proves the retry mechanism works exactly as intended!

**For complete testing instructions:** Check out the detailed guide I wrote in docs/troubleshooting/elasticsearch-connection-retry.md
…rm issue

`
Implemented comprehensive retry logic for all ingestion sources.

**During testing, discovered and fixed a critical Docker cross-platform bug** that prevented
BuffaLogs from starting in Linux containers on Windows development machines (WSL2). This
blocked all testing until resolved.

---

**Retry Mechanism Implementation:**

- Moved retry config from environment variables to ingestion.json (mentor requirement)
- Implemented shared retry configuration in BaseIngestion class
- All three sources (Elasticsearch, OpenSearch, Splunk) now use identical retry structure
- Exponential backoff with jitter (1s→2s→4s...max 30s) using Python backoff library
- Health checks after successful connection establishment
- Fail-fast behavior when retries exhausted (re-raises exception)
- Retry logs include full exception messages (critical mentor feedback)

**Backward Compatibility:**
- Default retry config merged when not present in ingestion.json
- Defaults: enabled=true, max_retries=10, initial_backoff=1s, max_backoff=30s, max_elapsed_time=60s, jitter=true
- Existing configs without retry block continue working

**Critical Docker Bug Discovered & Fixed:**

While testing the retry implementation, discovered BuffaLogs container failed to start with error:
exec ./run.sh: no such file or directory

**Root Cause:** Shell scripts (run.sh, run_worker.sh, run_beat.sh) had Windows line endings
(CRLF) which Linux containers cannot execute. This is a **cross-platform compatibility issue**
that affects anyone developing on Windows with WSL2/Docker Desktop.

**Solution Implemented:**
- Converted all shell scripts from CRLF to LF line endings
- Updated Dockerfile to use explicit /bin/bash invocation
- Added dos2unix to build process as safety net
- **Now works seamlessly on both Windows and Linux environments**

**Files Modified:**
- config/buffalogs/ingestion.json - Added retry config blocks
- buffalogs/impossible_travel/ingestion/base_ingestion.py - Shared _read_retry_config()
- buffalogs/impossible_travel/utils/connection_retry.py - Generic retry decorator
- buffalogs/impossible_travel/ingestion/elasticsearch_ingestion.py - Applied retry logic
- buffalogs/impossible_travel/ingestion/opensearch_ingestion.py - Applied retry logic
- buffalogs/impossible_travel/ingestion/splunk_ingestion.py - Applied retry logic
- buffalogs/impossible_travel/tests/ingestion/test_elasticsearch_ingestion.py - Updated tests
- build/Dockerfile - Cross-platform shell script support + dos2unix
- buffalogs/run.sh - Fixed line endings (CRLF→LF)
- buffalogs/run_worker.sh - Fixed line endings (CRLF→LF)
- buffalogs/run_beat.sh - Fixed line endings (CRLF→LF)

**Removed:**
- config/buffalogs/buffalogs.env - Removed BUFFALOGS_ES_MAX_RETRIES and BUFFALOGS_ES_RETRY_MAX_TIME
- buffalogs/buffalogs/settings/certego.py - Removed retry Django settings
- buffalogs/impossible_travel/tests/ingestion/test_retry_logic.py - Deleted problematic unit tests

**Testing:**
-  BuffaLogs container now starts successfully on Windows with WSL2
-  All retry configurations load correctly from ingestion.json
-  Backward compatibility verified with missing retry blocks
-  Docker works on both Windows and Linux (cross-platform verified)

Fixes certego#545
`
@tanmayjoddar
Copy link
Author

tanmayjoddar commented Feb 1, 2026

I have attached an additional screenshot showing successful local testing with Elasticsearch:

-Documents are generated and indexed correctly.

-The index is queryable and returns the expected document count.

This completes the local Ingestion and testing flow.

Screenshot 2026-02-02 022354

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add retry mechanism and error handling for Elasticsearch connection failures

1 participant