Phase 3 (Scale) - Epic 15: Fleet Dashboard
Status: Phase 1 Complete (Week 1-2)
Date: January 26, 2026
Tables Created: 6 core tables + 24 partitions + 2 views
- managed_servers (17 columns)
  - Server registry with health tracking
  - Multi-tenant with organization_id FK
  - JSONB for tags and metadata
  - 8 indexes for performance
- server_groups (9 columns)
  - Logical grouping of servers
  - Filter criteria (tags, regions, environments)
  - 2 indexes
- server_group_members (3 columns)
  - Many-to-many relationship
  - Composite PK (server_id, group_id)
  - 2 indexes
- server_metrics_aggregated (17 columns, PARTITIONED)
  - Time-series metrics storage
  - RANGE partitioned by timestamp
  - 12 monthly partitions for 2026
  - Tracks: CPU, memory, disk, network, services, LLM, users
  - 2 indexes
- server_health_checks (11 columns, PARTITIONED)
  - Health check history
  - RANGE partitioned by timestamp
  - 12 monthly partitions for 2026
  - Component-level health (database, Redis, services)
  - 2 indexes
- fleet_operations (15 columns)
  - Bulk operation audit log
  - JSONB for parameters and results
  - Status tracking (pending, running, completed, failed)
  - 3 indexes
- v_fleet_summary: Organization-level server counts by status and health
- v_server_health_overview: Latest health check per server (DISTINCT ON)
Migration Status: ✅ Executed successfully via Docker
Partitions: 24 total (12 metrics + 12 health checks for 2026)
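The monthly RANGE partitions described above follow a regular pattern, so their DDL can be generated rather than written by hand. The following is a minimal sketch, not the statements used in the actual migration: the `<parent>_<year>_<month>` naming scheme and the helper name `monthly_partition_ddl` are assumptions made for illustration.

```python
from datetime import date


def monthly_partition_ddl(parent: str, year: int) -> list[str]:
    """Build one CREATE TABLE ... PARTITION OF statement per month of `year`."""
    statements = []
    for month in range(1, 13):
        start = date(year, month, 1)
        # The upper bound is exclusive, so it is the first day of the next month.
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        name = f"{parent}_{year}_{month:02d}"  # assumed naming scheme
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
        )
    return statements


# 12 statements per partitioned table, 24 in total for the two tables.
for table in ("server_metrics_aggregated", "server_health_checks"):
    for statement in monthly_partition_ddl(table, 2026):
        print(statement)
```

Because each upper bound equals the next partition's lower bound, the ranges cover the year without gaps or overlaps.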
File: backend/multi_server_manager.py
Initialization:
- __init__(db_pool) - Takes asyncpg connection pool
- initialize() - Sets up aiohttp session for server communication
- cleanup() - Closes HTTP session
Server Management (7 methods):
- register_server() - Register new managed server with health check
- update_server() - Update server configuration (name, region, env, tags, status)
- delete_server() - Remove server (cascades to metrics/health checks)
- get_server(server_id) - Fetch server details
- list_servers() - Query servers with filtering (status, health, region, env, tags)
Health Checks (3 methods):
- _perform_health_check() - Execute health check on managed server
  - Calls /health endpoint on managed server
  - Tracks response time
  - Records database, Redis, services health
  - Updates server health_status and last_seen_at
- check_all_servers_health() - Check all active servers in org
- get_server_health_history() - Historical health check data
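To illustrate how checking every server can run concurrently while still timing each call, here is a minimal sketch using only the standard library. It is not the code in multi_server_manager.py: the real manager uses aiohttp, whereas this sketch takes the HTTP call as an injected coroutine (`fetch`), and the names `HealthResult`, `check_one`, `check_all`, the `"components"` response key, and the status strings are all assumptions.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable

# Assumed stand-in for the HTTP call: takes a server's api_url and returns
# the parsed JSON body of its /health endpoint.
FetchHealth = Callable[[str], Awaitable[dict]]


@dataclass
class HealthResult:
    server_id: str
    status: str              # "healthy" or "unreachable" in this sketch
    response_time_ms: float
    components: dict         # e.g. database / Redis / services health


async def check_one(server_id: str, api_url: str, fetch: FetchHealth,
                    timeout_s: float = 5.0) -> HealthResult:
    started = time.perf_counter()
    try:
        body = await asyncio.wait_for(fetch(api_url), timeout=timeout_s)
        status, components = "healthy", body.get("components", {})
    except Exception:
        # Timeouts and connection errors both count as unreachable here.
        status, components = "unreachable", {}
    elapsed_ms = (time.perf_counter() - started) * 1000
    return HealthResult(server_id, status, elapsed_ms, components)


async def check_all(servers: dict[str, str], fetch: FetchHealth,
                    max_concurrency: int = 10) -> list[HealthResult]:
    """Check every server concurrently, capped by a semaphore."""
    limit = asyncio.Semaphore(max_concurrency)

    async def guarded(server_id: str, api_url: str) -> HealthResult:
        async with limit:
            return await check_one(server_id, api_url, fetch)

    return await asyncio.gather(*(guarded(s, u) for s, u in servers.items()))
```

Capping concurrency keeps one slow or unreachable server from holding up the rest without opening an unbounded number of connections at once.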
Metrics Collection (2 methods):
- collect_server_metrics() - Fetch metrics from managed server
  - Calls /api/v1/metrics/current endpoint
  - Stores in server_metrics_aggregated table
  - Tracks: resources, services, LLM, users
- get_server_metrics() - Historical metrics query
Server Groups (6 methods):
- create_group() - Create logical server group
- get_group(group_id) - Fetch group details
- list_groups(org_id) - All groups in organization
- add_server_to_group() - Add server to group
- remove_server_from_group() - Remove server from group
- get_group_servers(group_id) - All servers in group
Fleet Operations (3 methods):
- create_fleet_operation() - Create bulk operation record
- update_fleet_operation_status() - Update operation progress
- get_fleet_operation() - Fetch operation details
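Since operation records move through the four statuses tracked in fleet_operations (pending, running, completed, failed), a status update can be guarded by a small transition table. This is a sketch only: the source lists the states but not which moves are allowed, so the `ALLOWED` rules and the names `OperationStatus` and `next_status` are assumptions.

```python
from enum import Enum


class OperationStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition rules: completed and failed are treated as terminal.
ALLOWED = {
    OperationStatus.PENDING: {OperationStatus.RUNNING, OperationStatus.FAILED},
    OperationStatus.RUNNING: {OperationStatus.COMPLETED, OperationStatus.FAILED},
    OperationStatus.COMPLETED: set(),
    OperationStatus.FAILED: set(),
}


def next_status(current: str, requested: str) -> OperationStatus:
    """Return the new status, or raise ValueError for an illegal transition."""
    cur, req = OperationStatus(current), OperationStatus(requested)
    if req not in ALLOWED[cur]:
        raise ValueError(f"cannot move operation from {cur.value} to {req.value}")
    return req
```

Rejecting illegal moves at this layer keeps the audit log consistent even if two callers try to update the same operation.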
Fleet Summary (1 method):
get_fleet_summary()- Organization-level fleet overview
File: backend/multi_server_api.py
Server Management (5 endpoints):
- POST /api/v1/fleet/servers - Register server (admin)
- GET /api/v1/fleet/servers - List servers with filtering
- GET /api/v1/fleet/servers/{id} - Get server details
- PATCH /api/v1/fleet/servers/{id} - Update server (admin)
- DELETE /api/v1/fleet/servers/{id} - Delete server (admin)
Health Checks (3 endpoints):
- POST /api/v1/fleet/servers/{id}/health-check - Trigger health check
- POST /api/v1/fleet/health-check/all - Check all servers
- GET /api/v1/fleet/servers/{id}/health-history - Health history
Metrics (2 endpoints):
- GET /api/v1/fleet/servers/{id}/metrics - Get metrics (time range, period)
- POST /api/v1/fleet/servers/{id}/metrics/collect - Collect metrics (admin)
Server Groups (6 endpoints):
- POST /api/v1/fleet/groups - Create group (admin)
- GET /api/v1/fleet/groups - List all groups
- GET /api/v1/fleet/groups/{id} - Get group details
- GET /api/v1/fleet/groups/{id}/servers - Servers in group
- POST /api/v1/fleet/groups/members - Add server to group (admin)
- DELETE /api/v1/fleet/groups/members - Remove server from group (admin)
Fleet Operations (2 endpoints):
- POST /api/v1/fleet/operations - Create bulk operation (admin)
- GET /api/v1/fleet/operations/{id} - Get operation status
Fleet Summary (1 endpoint):
- GET /api/v1/fleet/summary - Organization fleet overview
Pydantic Models (11 models):
- ServerRegistrationRequest
- ServerUpdateRequest
- ServerGroupRequest
- GroupMembershipRequest
- FleetOperationRequest
- HealthCheckResponse
- ServerResponse
- FleetSummaryResponse
Security:
- All endpoints require authentication
- Admin-only endpoints: register, update, delete, groups, operations
- Organization-level access control (can only access own servers)
- API token hashing (SHA-256) for managed server credentials
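As an illustration of SHA-256 token hashing, the sketch below uses only the standard library. The helper names (`generate_token`, `hash_token`, `verify_token`) are hypothetical and not taken from multi_server_api.py.

```python
import hashlib
import hmac
import secrets


def generate_token() -> str:
    """Create a random URL-safe token (32 bytes of entropy)."""
    return secrets.token_urlsafe(32)


def hash_token(token: str) -> str:
    """Return the hex SHA-256 digest that would be stored instead of the token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def verify_token(presented: str, stored_hash: str) -> bool:
    """Compare in constant time to avoid leaking how many characters matched."""
    return hmac.compare_digest(hash_token(presented), stored_hash)
```

A hash can only verify a token that someone presents; it cannot be turned back into the token. Where the control plane itself has to send the token to a managed server, reversible storage is needed, which this document lists as Phase 2 work (encrypt at rest, decrypt on the fly).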
File: backend/server.py
Import added (line ~248):
from multi_server_api import router as fleet_router
Router registration (line ~990):
app.include_router(fleet_router)
logger.info("🚢 Fleet Management API registered at /api/v1/fleet (Epic 15)")
Status: ✅ Integrated into main FastAPI app
| Component | Lines of Code | Files |
|---|---|---|
| Database Schema | 350 (migration) | 1 migration file |
| Backend Manager | 830 | multi_server_manager.py |
| REST API | 680 | multi_server_api.py |
| Server Integration | 5 | server.py (2 edits) |
| Total | 1,865 lines | 4 files (3 new, 1 modified) |
Database Objects:
- 6 tables
- 24 partitions (monthly for 2026)
- 2 views
- 15+ indexes
- 4 foreign key constraints
API Endpoints: 19 total
- 5 Server Management
- 3 Health Checks
- 2 Metrics
- 6 Server Groups
- 2 Fleet Operations
- 1 Fleet Summary
- Pull-Based Model: Control plane polls managed servers
  - Avoids firewall/networking complexity
  - Managed servers don't need to know about control plane
  - API tokens for authentication
- Partitioned Tables: Time-series data partitioned by month
  - 12 partitions for metrics (2026-01 through 2026-12)
  - 12 partitions for health checks
  - Improves query performance for time-based queries
  - Enables efficient data retention policies
- JSONB Flexibility:
  - tags - Array of strings for filtering
  - metadata - Arbitrary key-value data
  - parameters - Operation-specific config
  - results - Operation results storage
  - Enables extensibility without schema changes
- Async/Await: Full async implementation
  - asyncpg for database
  - aiohttp for HTTP communication
  - FastAPI for async endpoints
- Security:
  - API tokens hashed with SHA-256
  - Organization-level isolation
  - Role-based access (admin for mutations)
  - Foreign key constraints for data integrity
- Indexes (15+ total):
  - Single-column indexes on common filters (status, health, region, env)
  - Composite indexes on time-series queries (server_id + timestamp)
  - GIN index on JSONB tags
  - Supports efficient filtering and aggregation
- Partitioning Strategy:
  - RANGE partitioning by timestamp
  - Monthly partitions for 2026
  - Partition pruning for time-range queries
  - Easier data archival (drop old partitions)
- Connection Pooling:
  - Uses asyncpg connection pool
  - Reuses database connections
  - HTTP session reuse via aiohttp
- Views:
  - v_fleet_summary - Pre-aggregated server counts
  - v_server_health_overview - Latest health per server
  - Reduces query complexity in application code
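The "latest health per server" view relies on DISTINCT ON, which keeps the first row of each group in sort order. The sketch below reproduces that selection in plain Python to show what the view computes. The `status` field, the sample rows, and the function name `latest_per_server` are assumptions; `server_id` and `timestamp` are the columns named in this document.

```python
from datetime import datetime


def latest_per_server(checks: list[dict]) -> dict[str, dict]:
    """Keep the newest row per server_id, like DISTINCT ON (server_id)
    combined with ORDER BY server_id, timestamp DESC."""
    latest: dict[str, dict] = {}
    for row in checks:
        seen = latest.get(row["server_id"])
        if seen is None or row["timestamp"] > seen["timestamp"]:
            latest[row["server_id"]] = row
    return latest


# Illustrative rows only; not real health check data.
rows = [
    {"server_id": "a", "timestamp": datetime(2026, 1, 26, 10, 0), "status": "healthy"},
    {"server_id": "a", "timestamp": datetime(2026, 1, 26, 10, 5), "status": "degraded"},
    {"server_id": "b", "timestamp": datetime(2026, 1, 26, 9, 59), "status": "healthy"},
]
overview = latest_per_server(rows)
print({server: row["status"] for server, row in overview.items()})
```

Doing this selection in a view means the application reads one row per server instead of scanning the health check history itself.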
- API Token Storage: Currently uses placeholder tokens
  - Need secure storage/retrieval mechanism
  - Consider encryption at rest
  - Rotate tokens periodically
- Health Check Frequency: No background worker yet
  - Currently manual/on-demand
  - Phase 2 will add 30-second background worker
- Metrics Collection: No background worker yet
  - Currently manual/on-demand
  - Phase 2 will add 60-second background worker
- Partition Management: Manual partition creation
  - Only 2026 partitions created
  - Need automated partition creation for future months
  - Need partition cleanup/archival strategy
- Fleet Operations: Framework in place, no execution logic
  - Tables and endpoints exist
  - Actual operation execution (restart, update, etc.) in Phase 4
- Error Handling: Basic error handling
  - Need retry logic for failed health checks
  - Circuit breaker for unreachable servers
  - Better error reporting
- Alerting: No integration with Smart Alerts (Epic 13)
  - Should trigger alerts on critical health status
  - Should alert on operation failures
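The circuit breaker mentioned under Error Handling is not implemented yet. As a rough idea of what it could look like, here is a minimal per-server sketch. The class name, thresholds, and method names are all assumptions, and the clock is injectable so the behaviour can be exercised without waiting.

```python
import time


class CircuitBreaker:
    """Stop calling a server after repeated failures, then retry after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, probes are allowed again; a failed probe
        # re-opens the circuit and restarts the cool-down.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A health check loop would call `allow_request()` before contacting a server and report the outcome with `record_success()` or `record_failure()`, so unreachable servers stop consuming a timeout on every cycle.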
- fleet_health_worker.py
  - Runs every 30 seconds
  - Checks health of all active servers
  - Updates health_status in managed_servers
  - Records results in server_health_checks
- fleet_metrics_worker.py
  - Runs every 60 seconds
  - Collects metrics from all active servers
  - Stores in server_metrics_aggregated
  - Handles failures gracefully
- Secure Token Storage
  - Encrypt API tokens at rest
  - Decrypt on-the-fly for health checks
  - Token rotation mechanism
- Alerting Integration
  - Trigger Smart Alerts (Epic 13) on critical health
  - Alert on prolonged unreachable status
  - Alert on operation failures
- Automatic Partition Management
  - Create partitions for upcoming months
  - Archive old partitions
  - Monitor partition sizes
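The two planned workers share the same shape: run a task on an interval, survive a failed cycle, and stop cleanly. The workers do not exist yet, so the following is only a sketch of that loop; `run_periodically` and its parameters are hypothetical names, and the task is passed in rather than tied to any manager method.

```python
import asyncio
from typing import Awaitable, Callable


async def run_periodically(task: Callable[[], Awaitable[None]],
                           interval_s: float,
                           stop: asyncio.Event) -> None:
    """Run `task`, wait `interval_s` seconds, and repeat until `stop` is set."""
    while not stop.is_set():
        try:
            await task()
        except Exception as exc:
            # One failed cycle should not kill the worker.
            print(f"worker cycle failed: {exc!r}")
        try:
            # Sleep for the interval, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass
```

With this shape, the health worker would pass a 30-second interval and the metrics worker a 60-second one. The interval is measured from the end of one cycle to the start of the next, so a slow cycle delays the following one instead of overlapping with it.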
- Database schema designed and implemented
- All 6 tables created with proper constraints
- 24 partitions created for time-series data
- 2 views for aggregated data
- Backend manager with 20+ methods
- REST API with 19 endpoints
- Integrated into main FastAPI app
- No compilation/syntax errors
- Migration executed successfully
- Unit tests for MultiServerManager
- Integration tests for REST API
- Load tests for partitioned queries
- Health check endpoint verification
- Metrics collection endpoint verification
Testing planned for Phase 5 (Week 5-6)
- alembic/versions/20260126_1500_create_fleet_management_tables.py
  - Database migration (executed ✅)
  - Creates all tables, indexes, partitions, views
- backend/multi_server_manager.py
  - Core business logic
  - 830 lines, 20+ methods
  - Async implementation
- backend/multi_server_api.py
  - REST API endpoints
  - 680 lines, 19 endpoints
  - Pydantic models for validation
- backend/server.py (modified)
  - Imported fleet_router
  - Registered at /api/v1/fleet
- EPIC_15_MULTI_SERVER.md
  - Complete specification (750 lines)
  - Architecture diagrams
  - Implementation plan
- EPIC_15_PHASE_1_COMPLETE.md (this file)
  - Phase 1 summary
  - Implementation statistics
  - Next steps
- Epic 13: Smart Alerts - Will integrate for health alerting
- Epic 14: Cost Optimization - Will aggregate costs across fleet
- Epic 6.1: Colonel Atlas - Will provide fleet management assistance
- Epic 7.1: Edge Devices - Managed servers can be edge nodes
- Epic 8.1: Webhooks - Can trigger webhooks on fleet events
Base URL: /api/v1/fleet
Authentication: Bearer token (all endpoints)
Rate Limiting: Standard API rate limits apply
Example Usage:
# Register a server
curl -X POST https://ops-center.example.com/api/v1/fleet/servers \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "prod-east-1",
"hostname": "prod-east-1.example.com",
"api_url": "https://prod-east-1.example.com",
"api_token": "secret_token_here",
"region": "us-east-1",
"environment": "production",
"tags": ["production", "critical"]
}'
# List all servers
curl https://ops-center.example.com/api/v1/fleet/servers \
-H "Authorization: Bearer $TOKEN"
# Get fleet summary
curl https://ops-center.example.com/api/v1/fleet/summary \
-H "Authorization: Bearer $TOKEN"
# Trigger health check
curl -X POST https://ops-center.example.com/api/v1/fleet/servers/{server_id}/health-check \
-H "Authorization: Bearer $TOKEN"
# Get server metrics (last 24 hours)
curl "https://ops-center.example.com/api/v1/fleet/servers/{server_id}/metrics?period=1m&limit=1440" \
-H "Authorization: Bearer $TOKEN"
Phase 1 Complete: Database schema, backend manager, and REST API are fully implemented and integrated. The foundation for multi-server management is now in place.
Next Phase: Phase 2 will add background workers for automated health checks and metrics collection, enabling real-time fleet monitoring.
Timeline: On track for 6-week Epic 15 implementation (Phase 1: 2 weeks complete)
Epic 15: Multi-Server Management - Building enterprise-grade fleet orchestration for Ops-Center