Epic 15: Multi-Server Management - Phase 1 Complete ✅

Phase 3 (Scale) - Epic 15: Fleet Dashboard
Status: Phase 1 Complete (Week 1-2)
Date: January 26, 2026

✅ Completed Components

1. Database Schema (Migration Executed)

Tables Created: 6 core tables + 24 partitions + 2 views

Core Tables

managed_servers (17 columns)
- Server registry with health tracking
- Multi-tenant with organization_id FK
- JSONB for tags and metadata
- 8 indexes for performance
server_groups (9 columns)
- Logical grouping of servers
- Filter criteria (tags, regions, environments)
- 2 indexes
server_group_members (3 columns)
- Many-to-many relationship
- Composite PK (server_id, group_id)
- 2 indexes
server_metrics_aggregated (17 columns, PARTITIONED)
- Time-series metrics storage
- RANGE partitioned by timestamp
- 12 monthly partitions for 2026
- Tracks: CPU, memory, disk, network, services, LLM, users
- 2 indexes
server_health_checks (11 columns, PARTITIONED)
- Health check history
- RANGE partitioned by timestamp
- 12 monthly partitions for 2026
- Component-level health (database, Redis, services)
- 2 indexes
fleet_operations (15 columns)
- Bulk operation audit log
- JSONB for parameters and results
- Status tracking (pending, running, completed, failed)
- 3 indexes

Views

v_fleet_summary: Organization-level server counts by status and health
v_server_health_overview: Latest health check per server (DISTINCT ON)

Migration Status: ✅ Executed successfully via Docker
Partitions: 24 total (12 metrics + 12 health checks for 2026)
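
The monthly RANGE partitions described above could be generated rather than written by hand. The following is a minimal sketch, not the contents of the actual migration: the helper name monthly_partition_ddl and the partition naming scheme (<parent>_YYYY_MM) are assumptions made for illustration.

```python
from datetime import date


def monthly_partition_ddl(parent: str, year: int) -> list[str]:
    """Return one CREATE TABLE ... PARTITION OF statement per month of `year`."""
    statements = []
    for month in range(1, 13):
        start = date(year, month, 1)
        # The upper bound of a RANGE partition is exclusive, so it is the
        # first day of the following month (January of next year for December).
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        # Assumed naming scheme; the real migration may name partitions differently.
        name = f"{parent}_{year}_{month:02d}"
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
        )
    return statements


# 12 statements per partitioned table, 24 in total for 2026.
ddl = monthly_partition_ddl("server_metrics_aggregated", 2026)
ddl += monthly_partition_ddl("server_health_checks", 2026)
```

Because each upper bound equals the next partition's lower bound, the twelve ranges cover the year without gaps or overlaps.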

2. Backend Manager (830 lines)

File: backend/multi_server_manager.py

MultiServerManager Class

Initialization:

__init__(db_pool) - Takes asyncpg connection pool
initialize() - Sets up aiohttp session for server communication
cleanup() - Closes HTTP session

Server Management (5 methods):

register_server() - Register new managed server with health check
update_server() - Update server configuration (name, region, env, tags, status)
delete_server() - Remove server (cascades to metrics/health checks)
get_server(server_id) - Fetch server details
list_servers() - Query servers with filtering (status, health, region, env, tags)

Health Checks (3 methods):

_perform_health_check() - Execute health check on managed server
- Calls /health endpoint on managed server
- Tracks response time
- Records database, Redis, services health
- Updates server health_status and last_seen_at
check_all_servers_health() - Check all active servers in org
get_server_health_history() - Historical health check data
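
To illustrate how component-level results (database, Redis, services) might roll up into a single health_status, here is a hedged sketch. The function name overall_health, the status labels "healthy" and "degraded", and the rule that a database failure is critical are all assumptions; only "critical" and "unreachable" appear elsewhere in this document.

```python
def overall_health(components: dict[str, bool], reachable: bool = True) -> str:
    """Collapse per-component results into one status string.

    `components` maps a component name (e.g. "database", "redis",
    "services") to True when that component reported healthy.
    """
    if not reachable:
        # The /health call itself failed or timed out.
        return "unreachable"
    failed = [name for name, ok in components.items() if not ok]
    if not failed:
        return "healthy"
    # Assumed policy: losing the database is critical, anything else is degraded.
    if "database" in failed:
        return "critical"
    return "degraded"


status = overall_health({"database": True, "redis": False, "services": True})
```

A pure function like this is easy to unit test separately from the HTTP call that gathers the component results.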

Metrics Collection (2 methods):

collect_server_metrics() - Fetch metrics from managed server
- Calls /api/v1/metrics/current endpoint
- Stores in server_metrics_aggregated table
- Tracks: resources, services, LLM, users
get_server_metrics() - Historical metrics query

Server Groups (6 methods):

create_group() - Create logical server group
get_group(group_id) - Fetch group details
list_groups(org_id) - All groups in organization
add_server_to_group() - Add server to group
remove_server_from_group() - Remove server from group
get_group_servers(group_id) - All servers in group
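
Groups carry filter criteria (tags, regions, environments). The sketch below shows one possible way to evaluate such criteria against a server record. The matching semantics (all listed tags required, any listed region or environment accepted, empty criteria match everything) and the dictionary keys are assumptions, not the behavior of the actual manager.

```python
def matches_group(server: dict, criteria: dict) -> bool:
    """Return True if `server` satisfies every non-empty criterion."""
    required_tags = set(criteria.get("tags", []))
    regions = criteria.get("regions", [])
    environments = criteria.get("environments", [])

    # Assumed rule: the server must carry all of the group's tags.
    if required_tags and not required_tags.issubset(server.get("tags", [])):
        return False
    # Assumed rule: region and environment only need to match one listed value.
    if regions and server.get("region") not in regions:
        return False
    if environments and server.get("environment") not in environments:
        return False
    return True


server = {
    "name": "prod-east-1",
    "region": "us-east-1",
    "environment": "production",
    "tags": ["production", "critical"],
}
in_group = matches_group(server, {"tags": ["critical"], "regions": ["us-east-1"]})
```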

Fleet Operations (3 methods):

create_fleet_operation() - Create bulk operation record
update_fleet_operation_status() - Update operation progress
get_fleet_operation() - Fetch operation details
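
The four operation statuses (pending, running, completed, failed) suggest a small state machine. The following sketch validates transitions before a status update is written. The allowed-transition table and the function name next_status are assumptions for illustration; the real update_fleet_operation_status() may accept any value.

```python
# Assumed transition table: completed and failed are terminal states.
ALLOWED_TRANSITIONS = {
    "pending": {"running", "failed"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def next_status(current: str, new: str) -> str:
    """Return `new` if the transition is allowed, otherwise raise ValueError."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current!r}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


status = next_status("pending", "running")
status = next_status(status, "completed")
```

Rejecting illegal transitions early keeps the audit log in fleet_operations consistent even if two callers update the same operation.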

Fleet Summary (1 method):

get_fleet_summary() - Organization-level fleet overview
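
As an illustration of what an organization-level overview contains, the sketch below computes server counts by status and by health in plain Python. It mirrors the description of v_fleet_summary but is not the view's SQL; the result layout and the sample status values are assumptions.

```python
from collections import Counter, defaultdict


def fleet_summary(servers: list[dict]) -> dict:
    """Group server counts by organization, then by status and health."""
    summary = defaultdict(
        lambda: {"total": 0, "by_status": Counter(), "by_health": Counter()}
    )
    for server in servers:
        entry = summary[server["organization_id"]]
        entry["total"] += 1
        entry["by_status"][server["status"]] += 1
        entry["by_health"][server["health_status"]] += 1
    return dict(summary)


# Sample rows; the status and health values are illustrative only.
rows = [
    {"organization_id": "org-1", "status": "active", "health_status": "healthy"},
    {"organization_id": "org-1", "status": "active", "health_status": "critical"},
    {"organization_id": "org-2", "status": "inactive", "health_status": "unknown"},
]
summary = fleet_summary(rows)
```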

3. REST API (680 lines)

File: backend/multi_server_api.py

Endpoint Categories

Server Management (5 endpoints):

POST /api/v1/fleet/servers - Register server (admin)
GET /api/v1/fleet/servers - List servers with filtering
GET /api/v1/fleet/servers/{id} - Get server details
PATCH /api/v1/fleet/servers/{id} - Update server (admin)
DELETE /api/v1/fleet/servers/{id} - Delete server (admin)

Health Checks (3 endpoints):

POST /api/v1/fleet/servers/{id}/health-check - Trigger health check
POST /api/v1/fleet/health-check/all - Check all servers
GET /api/v1/fleet/servers/{id}/health-history - Health history

Metrics (2 endpoints):

GET /api/v1/fleet/servers/{id}/metrics - Get metrics (time range, period)
POST /api/v1/fleet/servers/{id}/metrics/collect - Collect metrics (admin)

Server Groups (6 endpoints):

POST /api/v1/fleet/groups - Create group (admin)
GET /api/v1/fleet/groups - List all groups
GET /api/v1/fleet/groups/{id} - Get group details
GET /api/v1/fleet/groups/{id}/servers - Servers in group
POST /api/v1/fleet/groups/members - Add server to group (admin)
DELETE /api/v1/fleet/groups/members - Remove server from group (admin)

Fleet Operations (2 endpoints):

POST /api/v1/fleet/operations - Create bulk operation (admin)
GET /api/v1/fleet/operations/{id} - Get operation status

Fleet Summary (1 endpoint):

GET /api/v1/fleet/summary - Organization fleet overview

Pydantic Models (8 models):

ServerRegistrationRequest
ServerUpdateRequest
ServerGroupRequest
GroupMembershipRequest
FleetOperationRequest
HealthCheckResponse
ServerResponse
FleetSummaryResponse

Security:

All endpoints require authentication
Admin-only endpoints: register, update, and delete server; collect metrics; create group; add or remove group members; create fleet operation
Organization-level access control (can only access own servers)
API token hashing (SHA-256) for managed server credentials
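
A minimal sketch of SHA-256 token hashing using only the Python standard library is shown below. The function names hash_token and verify_token are hypothetical; the actual code in multi_server_api.py may differ.

```python
import hashlib
import hmac


def hash_token(token: str) -> str:
    """Return the hex-encoded SHA-256 digest of an API token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def verify_token(candidate: str, stored_hash: str) -> bool:
    """Compare a presented token against a stored hash in constant time."""
    return hmac.compare_digest(hash_token(candidate), stored_hash)


stored = hash_token("secret_token_here")
ok = verify_token("secret_token_here", stored)
```

A hash is one-way: it lets the control plane verify a token that is presented to it, but the stored value cannot be sent to a managed server as a credential. That is consistent with the plan listed under Known Limitations to add encrypted storage for tokens the control plane must present during health checks.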

4. Integration

File: backend/server.py

Import added (line ~248):

from multi_server_api import router as fleet_router

Router registration (line ~990):

app.include_router(fleet_router)
logger.info("🚢 Fleet Management API registered at /api/v1/fleet (Epic 15)")

Status: ✅ Integrated into main FastAPI app

📊 Implementation Statistics

Component	Lines of Code	Files
Database Schema	350 (migration)	1 migration file
Backend Manager	830	multi_server_manager.py
REST API	680	multi_server_api.py
Server Integration	5	server.py (2 edits)
Total	1,865 lines	4 files (1 modified)

Database Objects:

6 tables
24 partitions (monthly for 2026)
2 views
15+ indexes
4 foreign key constraints

API Endpoints: 19 total

5 Server Management
3 Health Checks
2 Metrics
6 Server Groups
2 Fleet Operations
1 Fleet Summary

🔧 Technical Highlights

Architecture Decisions

Pull-Based Model: Control plane polls managed servers
- Avoids firewall/networking complexity
- Managed servers don't need to know about control plane
- API tokens for authentication
Partitioned Tables: Time-series data partitioned by month
- 12 partitions for metrics (2026-01 through 2026-12)
- 12 partitions for health checks
- Improves query performance for time-based queries
- Enables efficient data retention policies
JSONB Flexibility:
- tags - Array of strings for filtering
- metadata - Arbitrary key-value data
- parameters - Operation-specific config
- results - Operation results storage
- Enables extensibility without schema changes
Async/Await: Full async implementation
- asyncpg for database
- aiohttp for HTTP communication
- FastAPI for async endpoints
Security:
- API tokens hashed with SHA-256
- Organization-level isolation
- Role-based access (admin for mutations)
- Foreign key constraints for data integrity
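
The pull-based model can be sketched as a control plane that polls every server concurrently and records a failure for any server that errors or times out. This is an illustrative outline under stated assumptions: poll_fleet and the injected fetch coroutine are hypothetical names, and the real manager uses an aiohttp session where this sketch takes a generic callable.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def poll_fleet(
    servers: list[str],
    fetch: Callable[[str], Awaitable[Any]],
    timeout: float = 5.0,
) -> dict[str, Any]:
    """Poll all servers concurrently; map each server to a result or an exception."""

    async def poll_one(server: str) -> tuple[str, Any]:
        try:
            # One slow server must not block the rest, so each call is bounded.
            return server, await asyncio.wait_for(fetch(server), timeout)
        except Exception as exc:  # includes asyncio.TimeoutError
            return server, exc

    pairs = await asyncio.gather(*(poll_one(s) for s in servers))
    return dict(pairs)


async def fake_fetch(url: str) -> dict:
    """Stand-in for an HTTP GET against a managed server's /health endpoint."""
    if "down" in url:
        raise ConnectionError("server unreachable")
    return {"status": "ok"}


results = asyncio.run(
    poll_fleet(["https://prod-east-1.example.com", "https://down.example.com"], fake_fetch)
)
```

Returning exceptions as values keeps one unreachable server from aborting the whole polling round.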

Performance Optimizations

Indexes (15+ total):
- Single-column indexes on common filters (status, health, region, env)
- Composite indexes on time-series queries (server_id + timestamp)
- GIN index on JSONB tags
- Supports efficient filtering and aggregation
Partitioning Strategy:
- RANGE partitioning by timestamp
- Monthly partitions for 2026
- Partition pruning for time-range queries
- Easier data archival (drop old partitions)
Connection Pooling:
- Uses asyncpg connection pool
- Reuses database connections
- HTTP session reuse via aiohttp
Views:
- v_fleet_summary - Pre-aggregated server counts
- v_server_health_overview - Latest health per server
- Reduces query complexity in application code
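
The "latest health check per server" semantics that v_server_health_overview provides with DISTINCT ON can be expressed in plain Python as follows. This is a sketch of the semantics only, not the view definition; the row keys server_id and timestamp are assumed column names.

```python
from datetime import datetime


def latest_per_server(checks: list[dict]) -> dict[str, dict]:
    """Keep only the most recent health check row for each server_id."""
    latest: dict[str, dict] = {}
    for check in checks:
        server_id = check["server_id"]
        current = latest.get(server_id)
        if current is None or check["timestamp"] > current["timestamp"]:
            latest[server_id] = check
    return latest


checks = [
    {"server_id": "a", "timestamp": datetime(2026, 1, 26, 10, 0), "status": "healthy"},
    {"server_id": "a", "timestamp": datetime(2026, 1, 26, 10, 5), "status": "critical"},
    {"server_id": "b", "timestamp": datetime(2026, 1, 26, 9, 0), "status": "healthy"},
]
overview = latest_per_server(checks)
```

Doing this selection in a view means the application does not have to fetch the full history just to show current state.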

🚧 Known Limitations (To Address in Later Phases)

API Token Storage: Currently uses placeholder tokens
- Need secure storage/retrieval mechanism
- Consider encryption at rest
- Rotate tokens periodically
Health Check Frequency: No background worker yet
- Currently manual/on-demand
- Phase 2 will add 30-second background worker
Metrics Collection: No background worker yet
- Currently manual/on-demand
- Phase 2 will add 60-second background worker
Partition Management: Manual partition creation
- Only 2026 partitions created
- Need automated partition creation for future months
- Need partition cleanup/archival strategy
Fleet Operations: Framework in place, no execution logic
- Tables and endpoints exist
- Actual operation execution (restart, update, etc.) in Phase 4
Error Handling: Basic error handling
- Need retry logic for failed health checks
- Circuit breaker for unreachable servers
- Better error reporting
Alerting: No integration with Smart Alerts (Epic 13)
- Should trigger alerts on critical health status
- Should alert on operation failures
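
One possible shape for the circuit breaker mentioned under Error Handling is sketched below. The class name, thresholds, and the injectable clock are assumptions for illustration; nothing like this exists in Phase 1.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a server after repeated failures, then retry after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a health check may be attempted now."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


# Usage with a fake clock so the behavior is deterministic.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after=30.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
blocked = not breaker.allow()
now[0] = 31.0
retry_allowed = breaker.allow()
```

Keeping one breaker per managed server would let the health worker skip servers that are known to be unreachable instead of waiting for a timeout every cycle.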

📋 Next Steps (Phase 2: Week 2-3)

Background Workers (2 workers)

fleet_health_worker.py
- Runs every 30 seconds
- Checks health of all active servers
- Updates health_status in managed_servers
- Records results in server_health_checks
fleet_metrics_worker.py
- Runs every 60 seconds
- Collects metrics from all active servers
- Stores in server_metrics_aggregated
- Handles failures gracefully
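
Both workers share the same shape: run a task, survive its failures, and wait for the next interval. A hedged sketch of that loop follows. The helper run_periodically and the stop event are hypothetical, since the Phase 2 worker files do not exist yet.

```python
import asyncio
from typing import Awaitable, Callable


async def run_periodically(
    task: Callable[[], Awaitable[None]],
    interval: float,
    stop: asyncio.Event,
) -> None:
    """Run `task` every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        try:
            await task()
        except Exception as exc:
            # One failed iteration must not kill the worker.
            print(f"worker iteration failed: {exc!r}")
        try:
            # Sleep for the interval, but wake up immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def demo() -> int:
    """Run three fast iterations, then stop; returns the number of runs."""
    runs = 0
    stop = asyncio.Event()

    async def check_all_servers() -> None:
        nonlocal runs
        runs += 1
        if runs == 3:
            stop.set()

    await run_periodically(check_all_servers, interval=0.01, stop=stop)
    return runs


iterations = asyncio.run(demo())
```

In the planned workers the interval would be 30 seconds for health checks and 60 seconds for metrics, with the task calling check_all_servers_health() or collect_server_metrics().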

Enhancements

Secure Token Storage
- Encrypt API tokens at rest
- Decrypt on-the-fly for health checks
- Token rotation mechanism
Alerting Integration
- Trigger Smart Alerts (Epic 13) on critical health
- Alert on prolonged unreachable status
- Alert on operation failures
Automatic Partition Management
- Create partitions for upcoming months
- Archive old partitions
- Monitor partition sizes
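
Automatic partition management mostly comes down to month arithmetic: which partitions must exist for the coming months, and which month marks the archival cutoff. The sketch below computes both. The function names, the partition naming scheme (<parent>_YYYY_MM), and the default look-ahead and retention values are assumptions.

```python
from datetime import date


def add_months(day: date, months: int) -> date:
    """Return the first day of the month `months` away from `day`."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def plan_partitions(
    parent: str,
    today: date,
    months_ahead: int = 3,
    retain_months: int = 12,
) -> tuple[list[str], date]:
    """Return partition names to ensure exist and the archival cutoff date."""
    to_create = []
    for offset in range(months_ahead + 1):
        month = add_months(today, offset)
        # Assumed naming scheme; adjust to match the real migration.
        to_create.append(f"{parent}_{month.year}_{month.month:02d}")
    # Partitions whose range ends on or before this date are archival candidates.
    cutoff = add_months(today, -retain_months)
    return to_create, cutoff


names, cutoff = plan_partitions("server_health_checks", date(2026, 11, 15), months_ahead=2)
```

Running this on a schedule would cover the year boundary, which is where the 2026-only partitions created in Phase 1 run out.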

🎯 Success Criteria

Phase 1 Achievements ✅

Database schema designed and implemented
All 6 tables created with proper constraints
24 partitions created for time-series data
2 views for aggregated data
Backend manager with 20+ methods
REST API with 19 endpoints
Integrated into main FastAPI app
No compilation/syntax errors
Migration executed successfully

Testing Status

Unit tests for MultiServerManager
Integration tests for REST API
Load tests for partitioned queries
Health check endpoint verification
Metrics collection endpoint verification

All items above are still pending; testing is planned for Phase 5 (Week 5-6).

📦 Deliverables

Code Files

alembic/versions/20260126_1500_create_fleet_management_tables.py
- Database migration (executed ✅)
- Creates all tables, indexes, partitions, views
backend/multi_server_manager.py
- Core business logic
- 830 lines, 20+ methods
- Async implementation
backend/multi_server_api.py
- REST API endpoints
- 680 lines, 19 endpoints
- Pydantic models for validation
backend/server.py (modified)
- Imported fleet_router
- Registered at /api/v1/fleet

Documentation

EPIC_15_MULTI_SERVER.md
- Complete specification (750 lines)
- Architecture diagrams
- Implementation plan
EPIC_15_PHASE_1_COMPLETE.md (this file)
- Phase 1 summary
- Implementation statistics
- Next steps

🔗 Related Epics

Epic 13: Smart Alerts - Will integrate for health alerting
Epic 14: Cost Optimization - Will aggregate costs across fleet
Epic 6.1: Colonel Atlas - Will provide fleet management assistance
Epic 7.1: Edge Devices - Managed servers can be edge nodes
Epic 8.1: Webhooks - Can trigger webhooks on fleet events

📖 API Documentation

Base URL: /api/v1/fleet

Authentication: Bearer token (all endpoints)

Rate Limiting: Standard API rate limits apply

Example Usage:

# Register a server
curl -X POST https://ops-center.example.com/api/v1/fleet/servers \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "prod-east-1",
    "hostname": "prod-east-1.example.com",
    "api_url": "https://prod-east-1.example.com",
    "api_token": "secret_token_here",
    "region": "us-east-1",
    "environment": "production",
    "tags": ["production", "critical"]
  }'

# List all servers
curl https://ops-center.example.com/api/v1/fleet/servers \
  -H "Authorization: Bearer $TOKEN"

# Get fleet summary
curl https://ops-center.example.com/api/v1/fleet/summary \
  -H "Authorization: Bearer $TOKEN"

# Trigger health check (SERVER_ID is a placeholder for the ID of a registered server)
curl -X POST "https://ops-center.example.com/api/v1/fleet/servers/$SERVER_ID/health-check" \
  -H "Authorization: Bearer $TOKEN"

# Get server metrics (last 24 hours: 1440 one-minute samples)
curl "https://ops-center.example.com/api/v1/fleet/servers/$SERVER_ID/metrics?period=1m&limit=1440" \
  -H "Authorization: Bearer $TOKEN"
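
For callers that prefer Python over curl, the registration request can be built with the standard library. This sketch only constructs the request object and does not send it; build_register_request is a hypothetical helper, and the payload fields are copied from the curl example above.

```python
import json
import urllib.request


def build_register_request(base_url: str, token: str, server: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that registers a server."""
    return urllib.request.Request(
        url=f"{base_url}/api/v1/fleet/servers",
        data=json.dumps(server).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_register_request(
    "https://ops-center.example.com",
    "example-token",  # placeholder, not a real credential
    {
        "name": "prod-east-1",
        "hostname": "prod-east-1.example.com",
        "api_url": "https://prod-east-1.example.com",
        "api_token": "secret_token_here",
        "region": "us-east-1",
        "environment": "production",
        "tags": ["production", "critical"],
    },
)
# To send it: urllib.request.urlopen(request) against a reachable Ops-Center instance.
```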

🏆 Conclusion

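Phase 1 Complete: The database schema, backend manager, and REST API are implemented and integrated into the main FastAPI app. The foundation for multi-server management is in place; background workers, fleet operation execution, and tests follow in later phases.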

Next Phase: Phase 2 will add background workers for automated health checks and metrics collection, enabling real-time fleet monitoring.

Timeline: On track for 6-week Epic 15 implementation (Phase 1: 2 weeks complete)

Epic 15: Multi-Server Management - Building enterprise-grade fleet orchestration for Ops-Center
