Analyzed and fixed six memory management issues in AIOps (three critical, three high severity) that could cause memory leaks and OOM errors in production.
| # | Issue | File | Severity | Fix |
|---|---|---|---|---|
| 1 | Unbounded stampede locks dictionary | cache.py | 🔴 Critical | LRU eviction, max 1000 locks |
| 2 | Unbounded RateLimiter calls list | cache.py | 🟡 High | Bounded to 2x max_calls, thread-safe |
| 3 | Unbounded workflows storage | orchestrator.py | 🔴 Critical | LRU eviction, max 100 workflows |
| 4 | Uncleaned DAG execution tasks | orchestrator.py | 🟡 High | Proper cleanup in finally block |
| 5 | Unbounded agent instance cache | registry.py | 🔴 Critical | LRU eviction, max 50 instances |
| 6 | Missing context managers | Multiple | 🟡 High | Added `__enter__`/`__exit__` support |
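The LRU-eviction fixes in issues 1, 3, and 5 all follow the same pattern. Here is a minimal sketch of that pattern using `collections.OrderedDict`; the class and method names are illustrative, not the actual AIOps code:

```python
from collections import OrderedDict


class BoundedLRUDict:
    """Dict that evicts its least-recently-used entry once max_size is reached."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        # Accessing a key marks it as most recently used
        value = self._data[key]
        self._data.move_to_end(key)
        return value

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_size:
            # Evict the oldest (least recently used) entry
            self._data.popitem(last=False)

    def __len__(self):
        return len(self._data)
```

Because `OrderedDict` keeps insertion order and supports `move_to_end`/`popitem(last=False)` in O(1), the cache can never grow past `max_size` regardless of how many distinct keys are written.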
- ✅ aiops/core/cache.py - Fixed 3 issues
- ✅ aiops/core/semantic_cache.py - Added context managers
- ✅ aiops/agents/orchestrator.py - Fixed 2 issues
- ✅ aiops/agents/registry.py - Fixed 1 issue
All fixes were verified with a comprehensive test suite:
- ✅ Stampede locks bounded
- ✅ RateLimiter bounded and thread-safe
- ✅ SemanticCache bounded with LRU
- ✅ Workflow history bounded
- ✅ Agent registry bounded
- ✅ Context managers working
- ✅ All files compile successfully
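The RateLimiter fix (issue 2) can be sketched in a few lines: keep call timestamps in a `deque` with `maxlen=2 * max_calls` so history can never grow unboundedly, and guard it with a lock for thread safety. The names below are hypothetical, not the actual AIOps `RateLimiter` API:

```python
import threading
import time
from collections import deque


class BoundedRateLimiter:
    """Allows at most max_calls per period; call history is capped in memory."""

    def __init__(self, max_calls=10, period=1.0):
        self.max_calls = max_calls
        self.period = period
        # maxlen bounds memory to 2x max_calls, mirroring the fix above
        self._calls = deque(maxlen=2 * max_calls)
        self._lock = threading.Lock()

    def allow(self):
        now = time.monotonic()
        with self._lock:
            # Drop timestamps that have aged out of the current window
            while self._calls and now - self._calls[0] > self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return True
            return False

    def clear(self):
        with self._lock:
            self._calls.clear()
```

The `deque(maxlen=...)` bound is a safety net: even if expired timestamps were never pruned, appends silently discard the oldest entries instead of growing the list.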
**Before** (unbounded growth):

```python
# Could grow to millions of entries
_stampede_locks = {}  # Unbounded!
workflows = {}  # Unbounded!
_instances = {}  # Unbounded!
```

**After** (bounded with LRU eviction):

```python
# Bounded with LRU eviction
_stampede_locks = {}  # Max 1000 with LRU
workflows = OrderedDict()  # Max 100 with LRU
_instances = OrderedDict()  # Max 50 with LRU
```

**Automatic cleanup** with context managers:

```python
async with Cache() as cache:
    cache.set("key", "value")
# Cleanup happens automatically

async with SemanticCache() as scache:
    await scache.aset("prompt", "result")
# Cleanup happens automatically
```

**Tune limits** for your workload:

```python
orchestrator = AgentOrchestrator(max_workflow_history=200)
registry = AgentRegistry(max_cached_instances=100)
semantic_cache = SemanticCache(max_entries=500)
```

**Manual clearing** when needed:

```python
limiter.clear()  # Clear rate limit history
orchestrator.clear_workflows()  # Clear workflow history
registry.clear_cache()  # Clear agent instances
```

Recommendations:

- Monitor cache sizes in production
- Configure limits based on your workload
- Use context managers for automatic cleanup
- Profile memory usage periodically
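The context-manager support added in issue 6 comes down to implementing the async enter/exit protocol so cleanup runs even when the body raises. A minimal sketch of the idea, with an illustrative `ManagedCache` class rather than the actual AIOps `Cache`:

```python
import asyncio


class ManagedCache:
    """Async context manager that guarantees cleanup on exit."""

    def __init__(self):
        self._data = {}
        self.closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Runs even if the body raised, so resources are always released
        await self.close()
        return False  # do not suppress exceptions

    def set(self, key, value):
        self._data[key] = value

    async def close(self):
        self._data.clear()
        self.closed = True


async def demo():
    async with ManagedCache() as cache:
        cache.set("key", "value")
        assert not cache.closed  # still open inside the block
    return cache  # closed by __aexit__ on the way out
```

Returning `False` from `__aexit__` is the conventional choice: cleanup always happens, but any exception from the block still propagates to the caller.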
- 🎯 No more memory leaks from unbounded caches
- 🎯 Configurable limits for production tuning
- 🎯 Proper cleanup with context managers
- 🎯 Thread-safe operations throughout
- 🎯 Zero breaking changes to existing code
See MEMORY_MANAGEMENT_REPORT.md for detailed analysis and code changes.