📈 Nimaora - Scaling Architecture

Deep Dive into Handling 20,000+ Concurrent Users

How we architect for 1,500-2,000 RPS with sub-second latency

📋 Table of Contents

Executive Summary
Scaling Challenges
Architecture Decisions
Application Layer Scaling
Database Optimization
Caching Strategy
Queue System Design
Real-time Communication
Infrastructure Scaling
Performance Benchmarks
Future Improvements

Executive Summary

Nimaora CodeBattle is designed to handle competitive programming battles at scale:

Challenge	Solution	Technology
20,000+ concurrent users	Horizontal Pod Autoscaling	Kubernetes HPA
1,500-2,000 RPS	High-performance runtime	Laravel Octane + Swoole
Real-time leaderboard	O(log N) operations	Redis Sorted Sets
Instant attack notifications	Priority queues + WebSocket	RabbitMQ + Reverb
Session management	Distributed sessions	Redis + sticky sessions
Database bottleneck	Connection pooling	PgBouncer

Scaling Challenges

The Problem

A typical coding competition with 20,000 participants generates:

Traffic Pattern During Competition:
├── Peak Join Rate: 1,000 users/minute at start
├── Answer Submissions: ~10 per user = 200,000 total
├── Leaderboard Queries: ~30 per user = 600,000 total
├── Attack Actions: ~5 per user = 100,000 total
├── Heartbeats: 1 per 30s × 20,000 × 60min = 2,400,000 total
└── Total Requests: ~3.3 million in 60 minutes
    Average RPS: ~920 (Peak: 2,000+)

Specific Challenges

Challenge	Impact	Severity
Thundering herd at battle start	All users join simultaneously	Critical
Leaderboard hotspot	Frequent reads/writes to same data	Critical
Attack processing	Real-time notification requirements	High
Session management	Prevent duplicate logins	High
WebSocket scaling	20K+ persistent connections	High
Database connections	Connection pool exhaustion	Medium

Architecture Decisions

Why We Chose This Stack

Component	Choice	Reasoning
Backend	Laravel 12 + Octane	Familiar ecosystem, Swoole performance
Frontend	Next.js 15	React 19, Server Components, Edge runtime
Database	PostgreSQL 16	ACID compliance, advanced indexing
Cache	Redis 7	Sorted sets, pub/sub, clustering
Queue	RabbitMQ	Reliability, priority queues, clustering
WebSocket	Laravel Reverb	Native Laravel integration
Load Balancer	Traefik	Dynamic configuration, WebSocket support

Design Principles

Stateless Application Layer - Any pod can handle any request
Data Near Compute - Cache hot data in Redis, not database
Async by Default - Non-blocking I/O, queue heavy operations
Graceful Degradation - Circuit breakers, fallbacks

Application Layer Scaling

Laravel Octane with Swoole

┌─────────────────────────────────────────────────────────────┐
│                    Laravel Octane Pod                        │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Swoole HTTP Server                      │    │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐   │    │
│  │  │ Worker  │ │ Worker  │ │ Worker  │ │ Worker  │   │    │
│  │  │   1     │ │   2     │ │   3     │ │   N     │   │    │
│  │  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘   │    │
│  │       │           │           │           │         │    │
│  │  ┌────┴───────────┴───────────┴───────────┴────┐   │    │
│  │  │           Coroutine Pool (10K+)             │   │    │
│  │  └─────────────────────────────────────────────┘   │    │
│  └─────────────────────────────────────────────────────┘    │
│                           │                                  │
│  ┌────────────────────────┼────────────────────────────┐    │
│  │     Persistent Connections (Connection Pool)        │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │    │
│  │  │PostgreSQL│  │  Redis   │  │ RabbitMQ │          │    │
│  │  │  Pool    │  │  Pool    │  │  Pool    │          │    │
│  │  └──────────┘  └──────────┘  └──────────┘          │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

Configuration

// config/octane.php
return [
    'server' => 'swoole',
    'workers' => env('OCTANE_WORKERS', 'auto'),
    'task_workers' => env('OCTANE_TASK_WORKERS', 'auto'),
    'max_requests' => env('OCTANE_MAX_REQUESTS', 10000),
    'tick' => true,
    'tables' => [
        'battles' => [
            'columns' => [
                ['name' => 'participant_count', 'type' => 'int'],
            ],
            'rows' => 100,
        ],
    ],
];

Horizontal Pod Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "150"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 200
          periodSeconds: 15
        - type: Pods
          value: 20
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

Scaling Behavior

Event	Action	Time
CPU > 60%	Double pods	15 seconds
RPS > 150/pod	Add 20 pods	15 seconds
Load decrease	Reduce 10%	60 seconds (after 5min stable)

Database Optimization

Schema Design

CREATE TABLE battle_participants (
    id BIGSERIAL PRIMARY KEY,
    battle_id INTEGER NOT NULL REFERENCES battles(id),
    username VARCHAR(50) NOT NULL,
    session_id VARCHAR(100),
    points INTEGER DEFAULT 0,
    shields INTEGER DEFAULT 0,
    arrows INTEGER DEFAULT 0,
    is_active BOOLEAN DEFAULT TRUE,
    last_activity_at TIMESTAMP,
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    
    CONSTRAINT unique_battle_username UNIQUE (battle_id, username)
);

CREATE INDEX idx_participants_leaderboard 
    ON battle_participants (battle_id, points DESC);

CREATE INDEX idx_participants_session 
    ON battle_participants (battle_id, session_id) 
    WHERE is_active = TRUE;

CREATE INDEX idx_participants_active 
    ON battle_participants (battle_id, is_active, last_activity_at);

Query Optimization

SELECT username, points, shields, arrows
FROM battle_participants
WHERE battle_id = $1
ORDER BY points DESC
LIMIT 100;

EXPLAIN ANALYZE:
Index Scan using idx_participants_leaderboard on battle_participants
  Index Cond: (battle_id = 1)
  Rows: 100
  Time: 0.8ms

PgBouncer Connection Pooling

┌─────────────────────────────────────────────────────────────┐
│                     Application Pods                         │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐     ┌─────────┐       │
│  │  Pod 1  │ │  Pod 2  │ │  Pod 3  │ ... │  Pod N  │       │
│  └────┬────┘ └────┬────┘ └────┬────┘     └────┬────┘       │
│       │           │           │               │             │
│       └───────────┴─────┬─────┴───────────────┘             │
│                         │                                    │
│                         ▼                                    │
│            ┌────────────────────────┐                       │
│            │       PgBouncer        │                       │
│            │                        │                       │
│            │  Max Client: 10,000    │                       │
│            │  Pool Size: 100        │                       │
│            │  Mode: Transaction     │                       │
│            └───────────┬────────────┘                       │
│                        │                                     │
│                        ▼                                     │
│            ┌────────────────────────┐                       │
│            │     PostgreSQL 16      │                       │
│            │                        │                       │
│            │  Max Connections: 1000 │                       │
│            │  Shared Buffers: 4GB   │                       │
│            └────────────────────────┘                       │
└─────────────────────────────────────────────────────────────┘

PostgreSQL Tuning

max_connections = 1000
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 32MB
maintenance_work_mem = 512MB
checkpoint_completion_target = 0.9
wal_buffers = 64MB
random_page_cost = 1.1
effective_io_concurrency = 200
max_parallel_workers_per_gather = 4
max_parallel_workers = 8

Caching Strategy

Redis Sorted Sets for Leaderboard

┌─────────────────────────────────────────────────────────────┐
│                 Redis Sorted Set                             │
│                                                              │
│  Key: battle:1:leaderboard                                  │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Score (Points) │ Member (Username)                   │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │      450        │ ali_programmer                      │   │
│  │      425        │ sara_coder                          │   │
│  │      380        │ reza_dev                            │   │
│  │      375        │ mina_tech                           │   │
│  │      ...        │ ...                                 │   │
│  │      0          │ new_user_20000                      │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Operations:                                                 │
│  ├── ZADD: O(log N) - Add/update score                      │
│  ├── ZREVRANGE: O(log N + M) - Get top M                    │
│  ├── ZREVRANK: O(log N) - Get user rank                     │
│  └── ZINCRBY: O(log N) - Increment score                    │
└─────────────────────────────────────────────────────────────┘

Implementation

class LeaderboardCache
{
    private const KEY_PREFIX = 'leaderboard:';

    public function updateScore(int $battleId, string $username, int $score): void
    {
        $key = $this->getKey($battleId);
        Redis::zadd($key, $score, $username);
    }

    public function incrementScore(int $battleId, string $username, int $increment): int
    {
        $key = $this->getKey($battleId);
        return (int) Redis::zincrby($key, $increment, $username);
    }

    public function getTop(int $battleId, int $limit = 100): array
    {
        $key = $this->getKey($battleId);
        $result = Redis::zrevrange($key, 0, $limit - 1, 'WITHSCORES');
        return $this->formatLeaderboard($result);
    }

    public function getRank(int $battleId, string $username): ?int
    {
        $key = $this->getKey($battleId);
        $rank = Redis::zrevrank($key, $username);
        return $rank !== null ? $rank + 1 : null;
    }

    public function getAroundRank(int $battleId, string $username, int $range = 5): array
    {
        $key = $this->getKey($battleId);
        $rank = Redis::zrevrank($key, $username);
        
        if ($rank === null) return [];

        $start = max(0, $rank - $range);
        $end = $rank + $range;

        return Redis::zrevrange($key, $start, $end, 'WITHSCORES');
    }
}

Multi-Layer Caching

┌─────────────────────────────────────────────────────────────┐
│                    Cache Hierarchy                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Layer 1: CDN Edge Cache                                    │
│  ├── Static assets (JS, CSS, images)                        │
│  ├── TTL: 1 year (versioned)                                │
│  └── Hit rate: 95%+                                         │
│                           │                                  │
│                           ▼                                  │
│  Layer 2: Application Cache (Swoole APCu)                   │
│  ├── Configuration, routes                                  │
│  ├── TTL: Request lifetime                                  │
│  └── Hit rate: 100% (warm pods)                             │
│                           │                                  │
│                           ▼                                  │
│  Layer 3: Redis Cache                                       │
│  ├── Leaderboards (sorted sets)                             │
│  ├── Sessions                                               │
│  ├── Participant data                                       │
│  ├── TTL: 2-60 seconds (varies)                             │
│  └── Hit rate: 85%+                                         │
│                           │                                  │
│                           ▼                                  │
│  Layer 4: Database Query Cache                              │
│  ├── Prepared statements                                    │
│  └── PostgreSQL buffer cache                                │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Cache Invalidation Strategy

Data Type	Invalidation Trigger	Method
Leaderboard	Score change	Immediate update
Participant data	Any mutation	Event-driven
Session	Heartbeat timeout	TTL expiry
Questions	Never during battle	Pre-cached

Queue System Design

Priority Queue Architecture

┌─────────────────────────────────────────────────────────────┐
│                     RabbitMQ Cluster                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Exchange: nimaora.direct                                   │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Queue: nimaora.attacks          Priority: 10        │   │
│  │  ├── Max workers: 30                                 │   │
│  │  ├── Messages: Attack processing                     │   │
│  │  └── SLA: < 100ms processing                         │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Queue: nimaora.broadcast        Priority: 8         │   │
│  │  ├── Max workers: 100                                │   │
│  │  ├── Messages: WebSocket broadcasts                  │   │
│  │  └── SLA: < 200ms delivery                           │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Queue: nimaora.leaderboard      Priority: 5         │   │
│  │  ├── Max workers: 20                                 │   │
│  │  ├── Messages: Leaderboard updates                   │   │
│  │  └── SLA: < 500ms                                    │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Queue: nimaora.default          Priority: 5         │   │
│  │  ├── Max workers: 50                                 │   │
│  │  └── Messages: General jobs                          │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Queue: nimaora.notifications    Priority: 3         │   │
│  │  ├── Max workers: 15                                 │   │
│  │  └── Messages: User notifications                    │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Laravel Horizon Configuration

'environments' => [
    'production' => [
        'supervisor-attacks' => [
            'connection' => 'rabbitmq',
            'queue' => ['attacks'],
            'balance' => 'auto',
            'minProcesses' => 5,
            'maxProcesses' => 30,
            'tries' => 3,
            'timeout' => 15,
            'nice' => -5,
        ],
        'supervisor-broadcast' => [
            'connection' => 'rabbitmq',
            'queue' => ['broadcast'],
            'balance' => 'auto',
            'minProcesses' => 10,
            'maxProcesses' => 100,
            'tries' => 1,
            'timeout' => 10,
            'nice' => -3,
        ],
        'supervisor-leaderboard' => [
            'connection' => 'rabbitmq',
            'queue' => ['leaderboard'],
            'balance' => 'auto',
            'minProcesses' => 3,
            'maxProcesses' => 20,
            'tries' => 3,
            'timeout' => 30,
        ],
        'supervisor-default' => [
            'connection' => 'rabbitmq',
            'queue' => ['default'],
            'balance' => 'simple',
            'minProcesses' => 5,
            'maxProcesses' => 50,
            'tries' => 3,
            'timeout' => 60,
        ],
        'supervisor-notifications' => [
            'connection' => 'rabbitmq',
            'queue' => ['notifications'],
            'balance' => 'simple',
            'minProcesses' => 2,
            'maxProcesses' => 15,
            'tries' => 3,
            'timeout' => 60,
            'nice' => 5,
        ],
    ],
],

Attack Processing Flow

┌─────────────────────────────────────────────────────────────┐
│                   Attack Processing Flow                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. User clicks Attack                                       │
│     │                                                        │
│     ▼                                                        │
│  2. HTTP Request → AttackController                          │
│     │  └── Validate attacker has arrows                      │
│     │  └── Validate target exists                            │
│     │  └── Validate target has points > 0                    │
│     │                                                        │
│     ▼                                                        │
│  3. Synchronous Processing (< 50ms)                          │
│     │  └── Decrement attacker arrows                         │
│     │  └── Process attack on target                          │
│     │      ├── If target has shield → use shield             │
│     │      └── Else → deduct 1 point                         │
│     │  └── Create attack record                              │
│     │                                                        │
│     ▼                                                        │
│  4. Queue Async Tasks                                        │
│     │  ├── BroadcastLeaderboardUpdate (if points changed)    │
│     │  └── Broadcast AttackReceived to target                │
│     │                                                        │
│     ▼                                                        │
│  5. Return Response to Attacker                              │
│     │  └── { blocked: false, points_deducted: 1 }            │
│     │                                                        │
│  ═══════════════════════════════════════════════════════════ │
│  │                                                           │
│  │  ASYNC (Queue Workers)                                    │
│  │                                                           │
│     ▼                                                        │
│  6. Process AttackReceived Event                             │
│     │  └── Laravel Reverb broadcasts to target's channel     │
│     │                                                        │
│     ▼                                                        │
│  7. Target receives WebSocket notification                   │
│     └── Modal shows: "You were attacked by {username}!"      │
│                                                              │
│  Total Time: < 100ms (user perception)                       │
└─────────────────────────────────────────────────────────────┘

Real-time Communication

WebSocket Architecture

┌─────────────────────────────────────────────────────────────┐
│                  WebSocket Architecture                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                 Load Balancer (Traefik)              │    │
│  │                                                      │    │
│  │  Sticky Sessions: nimaora_ws cookie                  │    │
│  │  Health Check: /app/websocket-health                 │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                  │
│         ┌─────────────────┼─────────────────┐               │
│         ▼                 ▼                 ▼               │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐          │
│  │  Reverb 1  │   │  Reverb 2  │   │  Reverb 3  │          │
│  │            │   │            │   │            │          │
│  │ 7K conn    │   │ 7K conn    │   │ 6K conn    │          │
│  └─────┬──────┘   └─────┬──────┘   └─────┬──────┘          │
│        │                │                │                  │
│        └────────────────┼────────────────┘                  │
│                         │                                    │
│                         ▼                                    │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Redis Pub/Sub Backbone                  │    │
│  │                                                      │    │
│  │  Channels:                                           │    │
│  │  ├── battle.{id} (public leaderboard)               │    │
│  │  ├── presence-battle.{id} (online users)            │    │
│  │  └── private-participant.{id} (attack alerts)       │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Channel Types

Channel	Pattern	Purpose	Subscribers
Public	`battle.{id}`	Leaderboard updates	All participants
Presence	`presence-battle.{id}`	Online tracking	All participants
Private	`private-participant.{id}`	Attack notifications	Single user

Event Broadcasting

class AttackReceived implements ShouldBroadcast
{
    public function broadcastOn(): array
    {
        return [
            new PrivateChannel('participant.' . $this->attack->target_id),
        ];
    }

    public function broadcastAs(): string
    {
        return 'attack.received';
    }

    public function broadcastWith(): array
    {
        return [
            'attacker' => $this->attack->attacker->username,
            'blocked' => $this->attack->shield_blocked,
            'points_lost' => $this->attack->points_deducted,
            'timestamp' => $this->attack->created_at->toISOString(),
        ];
    }
}

Infrastructure Scaling

Kubernetes Resource Allocation

Backend Pod:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1000m"
  replicas: 10-100

WebSocket Pod:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "256Mi"
    cpu: "500m"
  replicas: 5-50

Horizon Pod:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "2Gi"
    cpu: "2000m"
  replicas: 3-20

Resource Distribution

Total Resources at Peak (20K users):
├── Backend: 100 pods × 1 CPU = 100 CPU cores
├── WebSocket: 20 pods × 0.5 CPU = 10 CPU cores
├── Horizon: 10 pods × 2 CPU = 20 CPU cores
├── Frontend: 15 pods × 1 CPU = 15 CPU cores
├── PostgreSQL: 4 CPU (primary)
├── Redis: 4 CPU (master + 2 replicas)
├── RabbitMQ: 2 CPU
└── Total: ~155 CPU cores

Performance Benchmarks

Load Test Results

Test Profile: stress (50K RPS target)
Duration: 45 minutes
Results:

┌─────────────────────────────────────────────────────────────┐
│                    Performance Summary                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Throughput                                                  │
│  ├── Total Requests: 4,897,126                              │
│  ├── RPS Average: 1,814                                     │
│  ├── RPS Peak: 2,234                                        │
│  └── Success Rate: 99.88%                                   │
│                                                              │
│  Latency                                                     │
│  ├── P50: 23.4ms                                            │
│  ├── P90: 89.3ms                                            │
│  ├── P95: 156ms                                             │
│  ├── P99: 423ms                                             │
│  └── Max: 4.89s                                             │
│                                                              │
│  Custom Metrics                                              │
│  ├── Join Success Rate: 99.89%                              │
│  ├── Answer Success Rate: 99.92%                            │
│  ├── Attack Success Rate: 97.23%                            │
│  └── Leaderboard P95: 89ms                                  │
│                                                              │
│  Infrastructure                                              │
│  ├── Backend Pods: 78 (autoscaled from 10)                  │
│  ├── WebSocket Connections: 21,456                          │
│  ├── Database Connections: 412 (via PgBouncer)              │
│  └── Redis Memory: 2.1GB                                    │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Performance Comparison

Metric	Without Optimization	With Optimization	Improvement
Max RPS	200	2,000+	10x
P95 Latency	2.5s	156ms	16x
DB Connections	10,000	100 (pooled)	100x
Leaderboard Query	850ms	34ms	25x
WebSocket Scale	1,000	20,000+	20x

Future Improvements

Short-term (1-3 months)

Improvement	Benefit	Effort
Redis Cluster	Higher throughput	Medium
Read replicas	Scale reads	Medium
Rate limiting per user	Prevent abuse	Low
Circuit breaker tuning	Better resilience	Low

Medium-term (3-6 months)

Improvement	Benefit	Effort
Event sourcing	Replay, audit	High
CQRS pattern	Separate read/write	High
GraphQL subscriptions	Efficient real-time	Medium
Edge computing	Lower latency	Medium

Long-term (6-12 months)

Improvement	Benefit	Effort
Multi-region	Geographic distribution	Very High
ML-based scaling	Predictive autoscaling	High
Custom WebSocket server	Ultimate performance	Very High
Conflict-free replicated data	Consistency without locks	Very High

🚀 Built for Performance | 📈 Designed for Scale | ⚡ Optimized for Speed

FilesExpand file tree

SCALING_ARCHITECTURE.md

Latest commit

History