Skip to content

Commit cf870a7

Browse files
committed
fix: resolved frontend glitches and added more features, refined system diagram to reflect current state in README
1 parent a699ed3 commit cf870a7

34 files changed

Lines changed: 1915 additions & 392 deletions

.claude/settings.local.json

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,20 @@
2020
"Bash(.venv/Scripts/python.exe -m pip:*)",
2121
"Bash(.venv/Scripts/python.exe:*)",
2222
"Bash(where:*)",
23-
"Bash(ls:*)"
23+
"Bash(ls:*)",
24+
"mcp__excalidraw__read_diagram_guide",
25+
"mcp__excalidraw__clear_canvas",
26+
"mcp__excalidraw__batch_create_elements",
27+
"mcp__excalidraw__set_viewport",
28+
"mcp__excalidraw__get_canvas_screenshot",
29+
"mcp__excalidraw__update_element",
30+
"mcp__excalidraw__export_scene",
31+
"mcp__excalidraw__export_to_image",
32+
"mcp__excalidraw__delete_element",
33+
"mcp__excalidraw__describe_scene",
34+
"mcp__excalidraw__create_element",
35+
"mcp__excalidraw__query_elements",
36+
"mcp__excalidraw__import_scene"
2437
]
2538
},
2639
"enableAllProjectMcpServers": true,

README.md

Lines changed: 79 additions & 56 deletions
Original file line numberDiff line numberDiff line change
@@ -9,62 +9,64 @@ WalStream is a production-ready CDC (Change Data Capture) system that:
99
- Captures PostgreSQL WAL changes in real-time using `wal2json` format-version 2
1010
- Dual-writes to Redis Streams (low-latency buffer) and Kafka (durable archive)
1111
- Provides speed-controlled replay with idempotency guarantees
12-
- Offers a REST API control plane with JWT authentication
12+
- Offers a REST API control plane with JWT user auth and API-key service auth
1313
- Includes real-time WebSocket event streaming
1414
- Uses async-native libraries throughout (aiokafka, redis.asyncio, grpc.aio)
1515

1616
## Architecture
1717

18+
![WalStream CDC System Design](architecture.png)
19+
1820
```
19-
PostgreSQL (WAL with wal2json v2)
20-
|
21-
v
22-
Ingestor ----+----> Redis Stream (live buffer, ~1hr, MAXLEN trimmed)
23-
| |
24-
| +----> Kafka Topic (durable archive, 7 days, idempotent producer)
25-
|
26-
v
27-
Prometheus Metrics (:9090)
28-
29-
+------------------+
30-
| Replay Router |
31-
| Redis < 1hr age |
32-
| Kafka >= 1hr age |
33-
+--------+---------+
34-
|
35-
v
36-
FastAPI Control Plane (:8000)
37-
|
38-
+---> Job Workers (lease-based, checkpoint/resume)
39-
| |
40-
| v
41-
| +------------------+
42-
| | gRPC Replayer |
43-
| | (centralized |
44-
| | deduplication) |
45-
| +--------+---------+
46-
| |
47-
| v
48-
| Target Database
49-
|
50-
+---> WebSocket Events (/api/v1/events/ws)
51-
+---> REST API (/api/v1/docs)
52-
+---> Prometheus Metrics (:9091)
21+
Source PostgreSQL (:5432, WAL + wal2json v2)
22+
-> Ingestor
23+
-> Redis Stream (live buffer, MAXLEN trimmed)
24+
-> Kafka Topic (durable archive)
25+
-> Prometheus metrics (:9090)
26+
27+
Worker Service (`app.workers.manager`)
28+
-> Replay Router (Redis < 1h / Kafka >= 1h)
29+
-> ReplayEvent RPC calls
30+
31+
gRPC Replayer (:50051, :9092 metrics)
32+
-> Target Database
33+
34+
FastAPI Control API (:8000, :9091 metrics)
35+
-> REST API (/api/v1/*)
36+
-> WebSocket Events (/api/v1/events/ws)
37+
-> AuthN/AuthZ (JWT + API keys, RBAC, audit, rate limiting)
38+
39+
Control PostgreSQL (:5433; jobs, users, audit logs, API keys, dedup)
40+
<-> FastAPI Control API
41+
<-> Worker Service
42+
<-> gRPC Replayer (dedup state)
43+
44+
Frontend Dashboard (:3000)
45+
-> Control API (REST + WebSocket)
46+
-> Metrics proxy UI (/api/metrics)
47+
48+
Prometheus (:9099)
49+
-> Scrapes ingestor/control/replayer metrics
50+
-> Alertmanager (:9093)
51+
-> Grafana (:3001)
5352
```
5453

55-
## Architectural Decisions (As of February 15, 2026)
54+
## Architectural Decisions
5655

57-
- **FastAPI control plane over Django**: Legacy Django scaffolding was removed and all control-plane features live in `control/app/`.
5856
- **Centralized dedup in Replayer**: Workers do not maintain local dedup state; idempotency is enforced in `replayer/server.py`.
5957
- **At-least-once replay semantics**: Replayer uses `exists -> apply -> mark_processed` so failed applies are retried instead of being pre-marked as duplicates.
6058
- **Fail-fast job progression on unreplayable events**: Workers retry each event up to `MAX_EVENT_RETRIES` times after the initial attempt; if still failing, the job is marked `FAILED` without advancing checkpoint beyond the failed event.
6159
- **Dual replay source strategy**: Replay router serves recent/small windows from Redis and historical/large windows from Kafka.
6260
- **Source-pinned resume behavior**: Resumed jobs reuse their original `replay_source` to avoid cross-source checkpoint mismatches (for example Redis stream IDs interpreted as Kafka offsets).
6361
- **Per-partition Kafka checkpoints**: Kafka progress is stored in JSON (`checkpoint`) keyed by partition, while `last_processed_id` is retained as a fallback string checkpoint.
6462
- **Safe Kafka partition seek semantics**: On resume, uncheckpointed partitions seek to `start_ms`; if no offset exists at/after `start_ms`, they seek to partition end to avoid replaying out-of-window historical data.
63+
- **Cross-partition Kafka end-boundary safety**: Replay does not stop globally on the first out-of-range message; partitions are completed independently to avoid dropping in-range records from other partitions.
6564
- **Lease-based worker ownership**: Job execution uses DB-backed leases plus renewal to prevent concurrent workers from processing the same job.
65+
- **Lease expiry recovery for orphaned jobs**: Workers can reclaim expired-lease `RUNNING` jobs after crashes, not just `QUEUED` jobs.
66+
- **Read/write session separation by path**: Auth and health/readiness/event-stream read paths use read sessions; API key `last_used_at` updates are handled with explicit write context.
67+
- **Control-plane operational gauges**: `walstream_jobs_active` and `walstream_job_lease_expired` are exported for replay stall and lease-integrity alerting.
6668
- **Async I/O across services**: `aiokafka`, `redis.asyncio`, `grpc.aio`, and async SQLAlchemy are used to keep ingestion and replay non-blocking.
67-
- **JWT auth for REST and WebSocket**: API routes and WebSocket stream are token-authenticated to keep monitoring and control endpoints private.
69+
- **Hybrid auth model**: User-facing clients use JWT; service clients can authenticate with `X-API-Key`; health probes remain public.
6870

6971
## Frontend Dashboard
7072

@@ -87,7 +89,7 @@ Frontend (Next.js 14 + TypeScript)
8789
- **Framework**: Next.js 14 with App Router
8890
- **Language**: TypeScript
8991
- **UI Components**: shadcn/ui + Tailwind CSS
90-
- **State Management**: Zustand (auth, events)
92+
- **State Management**: Zustand (auth, events, ui)
9193
- **Server State**: TanStack Query (React Query)
9294
- **Charts**: Recharts
9395
- **Virtualization**: @tanstack/react-virtual
@@ -100,6 +102,7 @@ Frontend (Next.js 14 + TypeScript)
100102
- **Speed-Controlled Replay**: Configurable 0.1x to 100x replay speed
101103
- **Durable Job Execution**: Lease-based workers with checkpoint/resume
102104
- **REST API**: FastAPI-based control plane with OpenAPI docs
105+
- **Security Controls**: RBAC, API keys, audit logging, and Redis-backed rate limiting
103106
- **WebSocket Streaming**: Real-time event monitoring
104107
- **Prometheus Metrics**: Full observability for all components
105108
- **Async-Native**: aiokafka, redis.asyncio, grpc.aio for non-blocking I/O
@@ -134,6 +137,7 @@ WalStream uses **two PostgreSQL databases**:
134137
git clone <repo-url>
135138
cd CDC-FastAPI
136139
cp .env.example .env
140+
cp frontend/.env.example frontend/.env.local
137141
```
138142

139143
### 2. Start Infrastructure
@@ -186,8 +190,10 @@ The Docker setup automatically:
186190
- Configures PostgreSQL with `wal_level=logical`
187191
- Runs `db/init.sql` to create tables and replication user
188192

189-
> **Note**: The default `postgres:16-alpine` image does NOT include wal2json.
190-
> For production, use an image with wal2json pre-installed, or build a custom image:
193+
> **Note**: In this repo, `docker-compose.yml` already builds the source
194+
> PostgreSQL service from `db/Dockerfile`, which installs `wal2json`.
195+
> If you run PostgreSQL outside this compose stack, ensure `wal2json` is
196+
> installed on that server:
191197
> ```dockerfile
192198
> FROM postgres:16
193199
> RUN apt-get update && apt-get install -y postgresql-16-wal2json && rm -rf /var/lib/apt/lists/*
@@ -398,6 +404,12 @@ REDIS_URL=redis://localhost:6379/0
398404
KAFKA_BROKER=localhost:29092
399405
```
400406

407+
Create frontend local env from template:
408+
409+
```bash
410+
cp frontend/.env.example frontend/.env.local
411+
```
412+
401413
### 7. Start Services (Local Development)
402414

403415
```bash
@@ -492,6 +504,10 @@ curl -X POST http://localhost:8000/api/v1/auth/register \
492504
# Get token
493505
TOKEN=$(curl -X POST http://localhost:8000/api/v1/auth/token \
494506
-d "username=admin&password=secret123" | jq -r '.access_token')
507+
508+
# Service-to-service auth (if an API key is provisioned)
509+
curl http://localhost:8000/api/v1/jobs \
510+
-H "X-API-Key: $API_KEY"
495511
```
496512

497513
### Replay Jobs
@@ -548,18 +564,19 @@ CDC-FastAPI/
548564
│ │ │ └── api/ # API routes (metrics proxy)
549565
│ │ ├── components/ # React components
550566
│ │ │ ├── ui/ # shadcn/ui components
551-
│ │ │ ├── layout/ # Sidebar, Header
567+
│ │ │ ├── layout/ # Sidebar, Header, ThemeToggle
552568
│ │ │ ├── auth/ # AuthGuard
553569
│ │ │ ├── errors/ # ErrorBoundary, ErrorFallback
554570
│ │ │ ├── jobs/ # Job management
555571
│ │ │ ├── events/ # Event stream
556572
│ │ │ └── metrics/ # Charts & gauges
557573
│ │ ├── hooks/ # Custom React hooks
558-
│ │ ├── stores/ # Zustand stores (auth, events)
559-
│ │ ├── lib/ # API client, utilities
574+
│ │ ├── stores/ # Zustand stores (auth, events, ui)
575+
│ │ ├── lib/ # API client, constants, utilities
560576
│ │ ├── types/ # TypeScript types
561-
│ │ └── middleware.ts # Auth route protection
577+
│ │ └── middleware.ts # Auth route protection
562578
│ ├── Dockerfile # Multi-stage Docker build
579+
│ ├── .env.example # Frontend environment template
563580
│ └── package.json
564581
565582
├── walstream-proto/ # Protobuf definitions & Pydantic models
@@ -585,12 +602,16 @@ CDC-FastAPI/
585602
│ │ │ ├── jobs.py # Replay job management
586603
│ │ │ ├── events.py # WebSocket streaming
587604
│ │ │ └── health.py # Health checks
605+
│ │ ├── middleware/ # Auth context, audit, and rate limiting
588606
│ │ ├── models/ # SQLAlchemy ORM models
589607
│ │ │ ├── replay_job.py # ReplayJob with lease support
590608
│ │ │ ├── user.py # User model
591-
│ │ │ └── dedup_state.py # Deduplication state
609+
│ │ │ ├── dedup_state.py # Deduplication state
610+
│ │ │ ├── api_key.py # API key model
611+
│ │ │ └── audit_log.py # Audit log model
592612
│ │ ├── services/ # Business logic
593613
│ │ │ ├── replay_router.py # Redis/Kafka source routing
614+
│ │ │ ├── job_service.py # Job lifecycle transaction logic
594615
│ │ │ └── dedup_store.py # Legacy dedup utility (not in active replay path)
595616
│ │ └── workers/ # Background workers
596617
│ │ ├── job_worker.py # Durable job execution
@@ -680,6 +701,8 @@ docker-compose --profile monitoring up -d
680701
| `walstream_events_failed_total` | Failed replay attempts |
681702
| `walstream_ingest_latency_seconds` | WAL-to-publish latency |
682703
| `walstream_replay_latency_seconds` | Per-event replay latency |
704+
| `walstream_jobs_active{state="queued|running"}` | Active replay jobs by state |
705+
| `walstream_job_lease_expired{state="running"}` | Running jobs with expired worker leases |
683706

684707
### SLO Targets & Alert Thresholds
685708

@@ -883,16 +906,16 @@ See `google_python_style_guide.md` for a local reference.
883906
- [x] Error boundaries (`ErrorBoundary`, `ErrorFallback`, per-route `error.tsx`)
884907
- [x] `not-found.tsx` — custom 404 page
885908
- [x] Docker config (multi-stage `Dockerfile`, `.dockerignore`)
886-
- [ ] Stale-cookie session handling UX alignment between middleware guards and API 401 handling
887-
- [ ] Degraded readiness/503 UI states for dashboard health cards and overview widgets
888-
- [ ] **Dark/light mode toggle** (`ThemeToggle.tsx`) — Tailwind dark mode is configured but no toggle UI
889-
- [ ] **Mobile responsive sidebar**no hamburger menu for small screens
890-
- [ ] **Extracted dashboard components**`OverviewCards`, `RecentJobs`, `HealthStatus`, `QuickActions` are inline in `page.tsx`, not reusable components
891-
- [ ] **UI store** (`stores/uiStore.ts`) — sidebar collapse, theme state
892-
- [ ] **Utility files**`lib/utils/formatters.ts`, `lib/constants.ts`
893-
- [ ] **Frontend `.env.example`** — environment template for developers
894-
- [ ] **`useDebounce` hook** — for search input debouncing
895-
- [ ] **Dynamic imports** for chart components (code splitting)
909+
- [x] Stale-cookie session handling UX alignment between middleware guards and API 401 handling
910+
- [x] Degraded readiness/503 UI states for dashboard health cards and overview widgets
911+
- [x] **Dark/light mode toggle** (`ThemeToggle.tsx`) — `next-themes` with system/light/dark support
912+
- [x] **Mobile responsive sidebar** — hamburger menu with overlay on small screens
913+
- [x] **Extracted dashboard components**`OverviewCards`, `RecentJobs`, `HealthStatus` extracted from `page.tsx`
914+
- [x] **UI store** (`stores/uiStore.ts`) — sidebar open/close state for mobile
915+
- [x] **Utility files**`lib/utils/formatters.ts`, `lib/constants.ts`
916+
- [x] **Frontend `.env.example`** — environment template for developers
917+
- [x] **`useDebounce` hook** — for search input debouncing in EventFilters
918+
- [x] **Dynamic imports** for chart components (code splitting with `next/dynamic`)
896919

897920
### Frontend — Testing
898921

0 commit comments

Comments
 (0)