From e51d5ed7465d9e96f6305bb01e9f74891f70b844 Mon Sep 17 00:00:00 2001
From: shaia <shaia@outlook.com>
Date: Fri, 6 Mar 2026 09:22:02 +0200
Subject: [PATCH] docs: Update ROADMAP with accurate status and reorganized
 priorities

Clean up backend coverage table (remove misleading footnotes),
reorganize phases to reflect current project state, and update
completion markers for recently finished work.
---
 ROADMAP.md | 1292 ++++------------------------------------------------
 1 file changed, 83 insertions(+), 1209 deletions(-)

diff --git a/ROADMAP.md b/ROADMAP.md
index 9395130..c5dcd51 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -33,20 +33,16 @@ Each algorithm should have scalar (CPU) + SIMD + OMP variants. Track gaps here.
 | **N-S Solvers**     | Explicit Euler | done | done     | —        | done     | —    |
 |                     | Projection     | done | done     | —        | done     | done |
 |                     | RK2 (Heun)     | done | done     | —        | done     | —    |
-| **Linear Solvers**  | Jacobi         | done | done     | done     | —        | done²|
+| **Linear Solvers**  | Jacobi         | done | done     | done     | —        | —    |
 |                     | SOR            | done | done     | done     | —        | —    |
 |                     | Red-Black SOR  | done | done     | done     | done     | —    |
 |                     | CG / PCG       | done | done     | done     | done     | —    |
-|                     | BiCGSTAB       | done | done¹    | done¹    | —        | —    |
+|                     | BiCGSTAB       | done | done     | done     | —        | —    |
+| **Boundary Conds**  | All types      | done | done     | done     | done     | done |
 
-¹ AVX2/NEON implementations include OpenMP parallelization internally. No separate OMP backend exists.
-² GPU Jacobi is monolithic inside `solver_projection_jacobi_gpu.cu`, not a standalone reusable linear solver.
+### What's Missing
 
-### Critical Gaps
-
-- [x] 3D support complete (all phases), see Phase 3.1
-- [x] Symmetry planes now supported
-- [ ] Only structured grids
+- [ ] No 3D validation benchmarks yet
 - [ ] No turbulence models
 - [ ] Limited linear solvers (no multigrid)
 - [ ] No restart/checkpoint capability
@@ -63,16 +59,8 @@ The OMP Red-Black SOR Poisson solver fails to converge on certain problem config
 
 Impact:
 
-- OMP projection solver switched to CG_OMP (dedicated OMP CG solver) as workaround
+- OMP projection solver switched to CG_OMP as workaround
 - Red-Black SOR remains available but unreliable for production use with OMP backend
-- CG_OMP provides reliable convergence with fully parallelized primitives
-
-Root cause unknown — requires investigation of:
-
-- Omega parameter tuning for Neumann BCs (currently uses default 1.5)
-- Parallel race conditions in red/black sweeps
-- Boundary condition application in OMP implementation
-- Comparison with working AVX2 Red-Black SOR implementation
 
 Action items:
 
@@ -82,30 +70,17 @@ Action items:
 - [ ] Add convergence diagnostics (residual history logging)
 - [ ] Consider switch to Chebyshev acceleration or SSOR
 
-Workaround: Use CG or switch to AVX2/CPU backends for production.
-
-**Grid Convergence Non-Monotonic Behavior (P1)**
+**Grid Convergence Non-Monotonic Behavior (P1)** ✅ RESOLVED
 
-Current grid convergence tests use relaxed tolerance (`prev_error + 0.08`) because RMS error does not strictly decrease with grid refinement when using the scalar Red-Black SOR Poisson solver.
-
-Observed behavior:
-
-- 17×17: RMS ~0.056
-- 25×25: RMS ~0.037
-- 33×33: RMS ~0.094 (worse than 25×25!)
-
-Root cause: The scalar Poisson solver has accuracy limitations at larger grid sizes that prevent proper convergence.
+Root cause: Red-Black SOR Poisson solver had insufficient convergence on larger grids. Switching all projection backends to CG (Conjugate Gradient) resolved the issue. Grid convergence is now strictly monotonic (17×17: 0.046, 25×25: 0.037, 33×33: 0.032).
 
 Action items:
 
-- [ ] Investigate why 33×33 produces worse results than 25×25
-- [ ] May need more Poisson iterations for larger grids
-- [ ] Consider using SIMD Red-Black SOR or multigrid for better accuracy
-- [ ] Remove relaxed tolerance (`+ 0.08`) from grid convergence tests
-- [ ] Ensure RMS(33×33) < RMS(25×25) < RMS(17×17)
-- [ ] Add strict grid convergence test that FAILs if RMS increases with refinement
-
-Acceptance criteria: RMS error strictly decreases with each grid refinement level; convergence order approaches O(h²) asymptotically.
+- [x] Investigate why 33×33 produces worse results than 25×25 — Red-Black SOR Poisson solver
+- [x] May need more Poisson iterations for larger grids — solved by switch to CG
+- [x] Consider using SIMD Red-Black SOR or multigrid for better accuracy — CG sufficient
+- [x] Remove relaxed tolerance from grid convergence tests
+- [x] Add strict grid convergence test that FAILs if RMS increases with refinement
 
 #### Limitations
 
@@ -123,27 +98,17 @@ For the uniform-grid Laplacian with constant coefficients, the Jacobi preconditi
 
 **Modular Library Circular Dependencies (Section 4.2)**
 
-The modular libraries have circular dependencies: `cfd_scalar`/`cfd_simd` call `poisson_solve()` (defined in `cfd_api`), while `cfd_api` links against `cfd_scalar`/`cfd_simd`.
-
-Current solution:
-
-- On Linux: GNU linker groups (`-Wl,--start-group` ... `-Wl,--end-group`) resolve circular references
-- On Windows/macOS: Linker automatically handles multiple passes
-- Shared builds: Recompile sources into unified shared library
-
-Future: Consider weak symbols, conditional registration at runtime, or plugin architecture with dynamic loading.
+The modular libraries have circular dependencies: `cfd_scalar`/`cfd_simd` call `poisson_solve()` (defined in `cfd_api`), while `cfd_api` links against `cfd_scalar`/`cfd_simd`. Resolved on Linux with linker groups; Windows/macOS handle automatically. Future: consider weak symbols or plugin architecture.
 
 **OMP Loop Variable int Overflow for Large Grids (P3)**
 
 Status: Deferred — low risk, no practical impact yet
 
-OMP backends cast `size_t` loop variables to `int` for MSVC OpenMP 2.0 compatibility. For grids where `nx * ny > INT_MAX` (~2.1 billion), this overflows. Requires grids larger than ~46K × 46K (~16 GB per field), beyond current practical scope.
-
-Action items:
+OMP backends cast `size_t` loop variables to `int` for MSVC OpenMP 2.0 compatibility. Overflows for grids where `nx * ny > INT_MAX` (~46K × 46K).
 
 - [ ] Audit all OMP backends for `size_t` → `int` casts
 - [ ] Add `CFD_ASSERT(nx * ny <= INT_MAX)` guards if targeting large grids
-- [ ] Consider requiring OpenMP 3.0+ (unsigned loop vars) when dropping MSVC OMP 2.0 support
+- [ ] Consider requiring OpenMP 3.0+ when dropping MSVC OMP 2.0 support
 
 **Debug Mode SIMD Benchmarking (Section 1.10)**
 
@@ -153,179 +118,26 @@ Current tests run in Debug mode where SIMD may be slower than scalar due to lack
 
 **Spectral Radius Test BC Assumption (Section 1.2.3)**
 
-The Jacobi spectral radius test uses Dirichlet BCs (p=0 on boundary) because the ρ = cos(πh) formula applies only to the Dirichlet problem. With Neumann BCs, the discrete Laplacian has a constant null space giving eigenvalue 1. The SOR optimal ω = 2/(1 + sin(πh)) also applies to Dirichlet BCs; with Neumann BCs optimal ω is typically lower (1.5-1.7).
-
-#### Resolved
-
-**Stretched Grid Formula Bug (FIXED in v0.1.7)**
-
-File: `lib/src/core/grid.c` (`grid_initialize_stretched`)
-
-Fixed the hyperbolic stretching formula using tanh-based stretching:
-
-```c
-x[i] = xmin + (xmax - xmin) * (1.0 + tanh(beta * (2.0 * xi - 1.0)) / tanh(beta)) / 2.0
-```
-
-Fixed behavior: Grid spans full domain, points cluster near boundaries, higher beta = more clustering, beta=0 falls back to uniform grid.
-
-Tests added: `tests/core/test_grid.c` with 16 unit tests.
+The Jacobi spectral radius test uses Dirichlet BCs (p=0 on boundary) because the ρ = cos(πh) formula applies only to the Dirichlet problem. The SOR optimal ω = 2/(1 + sin(πh)) also applies to Dirichlet BCs; with Neumann BCs optimal ω is typically lower (1.5-1.7).
 
 ---
 
-## Phase 0: Architecture & Robustness (P0 - Critical)
-
-**Goal:** Transform the codebase from a research prototype to a production-safe library.
-
-### 0.1 Safe Error Handling
-
-- [x] Replace all `exit()` calls with error code propagation
-- [x] Implement `cfd_status_t` and `cfd_get_error_string()`
-- [x] Thread-local error context for rich error reporting
-- [x] Ensure resource cleanup on error paths (no leaks on failure)
+## Phase 0: Architecture & Robustness (P0 - Critical) ✅
 
-### 0.2 Thread Safety & Global State
+**Status:** COMPLETE
 
-- [x] Remove static buffers in `utils.c` (path management)
-- [x] Make `SolverRegistry` thread-safe or context-bound
-- [x] Ensure `SimulationData` and solvers are re-entrant
-- [x] Validate thread-safety with concurrent simulation tests
+- [x] 0.1 Safe Error Handling — `cfd_status_t`, error propagation, resource cleanup
+- [x] 0.2 Thread Safety — removed static buffers, thread-safe registry, re-entrant solvers
+- [x] 0.3 API Robustness — input validation, logging callback, version header, symbol visibility
+- [x] 0.4 API & Robustness Testing — comprehensive test suites for all subsystems
+- [x] 0.5 Error Handling & Robustness — removed static `warned` flag, `run_simulation_step()` returns `cfd_status_t`
+- [x] 0.6 Structured Logging & Diagnostics — `cfd_log()` API with levels, component tags, thread-safe callbacks
 
-### 0.3 API Robustness
+**0.6 Deferred items:**
 
-- [x] Comprehensive input validation (check for NULL, NaN, invalid ranges)
-- [x] Configurable logging callback (`cfd_set_log_callback`)
-- [x] Version header (`cfd_version.h`) with version macros
-- [x] Symbol visibility control (hide private symbols) & export headers
-- [x] Library initialization/shutdown functions
-
-### 0.4 API & Robustness Testing
-
-**Implemented:**
-
-- [x] Core functionality tests (`tests/core/`)
-- [x] Input validation tests (`test_input_validation.c`)
-- [x] Error handling tests (`test_error_handling.c`)
-- [x] Re-entrancy/thread-safety tests (`test_reentrancy.c`)
-- [x] Solver tests organized by architecture (CPU, SIMD, OMP, GPU)
-- [x] Simulation API tests (`tests/simulation/`)
-- [x] I/O tests (VTK, CSV, output paths)
-- [x] Physics validation tests (`test_physics_validation.c`)
-
-**Still needed:**
-
-- [x] Negative testing suite (more edge cases)
-- [x] Memory leak checks (ASan in CI; Valgrind via local `ctest -T MemCheck`)
-- [x] Add `TEST_FAIL_PRINTF` macro to eliminate repeated `snprintf` + `TEST_FAIL_MESSAGE` pattern (5+ occurrences, causes `-Wformat-truncation` warnings). Enable `UNITY_INCLUDE_PRINT_FORMATTED` or define a local helper macro.
-
-### 0.5 Error Handling & Robustness (P0) ✅
-
-**Status:** COMPLETE (PR #139, Feb 2026)
-
-**Problem:** Projection solvers used thread-unsafe static `warned` flag for Poisson convergence failures, continuing execution with inaccurate pressure fields instead of returning errors.
-
-**Solution implemented:**
-
-- [x] Removed static `warned` flag from all projection solvers (CPU, OMP, AVX2) ✅
-- [x] Return `CFD_ERROR_MAX_ITER` immediately on Poisson convergence failure ✅
-- [x] **Breaking API change:** `run_simulation_step()` and `run_simulation_solve()` now return `cfd_status_t` instead of `void` ✅
-- [x] Updated solver registry wrappers to propagate errors properly ✅
-- [x] Updated all ~15 test files to check return values ✅
-- [x] Updated all examples to handle errors in loops ✅
-- [x] Updated README with new function signatures ✅
-- [x] Proper distinction between `CFD_ERROR_UNSUPPORTED` (backend unavailable) and other errors ✅
-
-**Impact:**
-
-- Thread-safe error handling (no data races in OpenMP builds)
-- Users can now detect and respond to convergence failures
-- Better diagnostics for validation debugging
-- Compile-time safety (void → cfd_status_t breaks code that ignores errors)
-
-**Files modified:**
-
-- `lib/include/cfd/api/simulation_api.h` - Function signature changes
-- `lib/src/api/simulation_api.c` - Error propagation implementation
-- `lib/src/api/solver_registry.c` - Wrapper error handling
-- `lib/src/solvers/navier_stokes/{cpu,omp,avx2}/solver_projection*.c` - Return errors
-- `tests/**/*.c` - Test updates for error checking
-- `examples/*.c` - Example error handling
-- `README.md` - API documentation updates
-
-### 0.6 Structured Logging & Diagnostics (P2) ✅
-
-**Status:** Complete
-
-**Implementation Tasks:**
-
-- [x] Define `cfd_log()` API with log levels and component tags
-- [x] Implement default console handler (stderr for WARNING/ERROR, stdout for DEBUG/INFO)
-- [x] Add thread-safe logging (atomic global log level, per-thread callbacks)
-- [x] Replace all `fprintf(stderr, ...)` calls with `cfd_log()`
-- [x] Replace diagnostic `printf()` calls with `cfd_log()`
-- [x] Add log filtering by level (suppress DEBUG in production)
-- [x] Support custom log handlers via `cfd_set_log_callback_ex()` (with component)
-- [ ] Add log filtering by component (e.g., only show "boundary" logs) — deferred
-- [ ] Add timestamps and colored output — deferred
-- [ ] Add structured data API for metrics (convergence stats, timings) — deferred
-
-**API added:**
-
-```c
-// Log levels (renumbered to include DEBUG)
-typedef enum {
-    CFD_LOG_LEVEL_DEBUG = 0,
-    CFD_LOG_LEVEL_INFO = 1,
-    CFD_LOG_LEVEL_WARNING = 2,
-    CFD_LOG_LEVEL_ERROR = 3
-} cfd_log_level_t;
-
-// Core logging function with printf-style formatting
-void cfd_log(cfd_log_level_t level, const char* component,
-             const char* fmt, ...);
-
-// Global log level control (atomic, thread-safe)
-void cfd_set_log_level(cfd_log_level_t level);
-cfd_log_level_t cfd_get_log_level(void);
-
-// Extended callback with component tag
-typedef void (*cfd_log_callback_ex_t)(cfd_log_level_t level,
-    const char* component, const char* message);
-void cfd_set_log_callback_ex(cfd_log_callback_ex_t callback);
-
-// Convenience macros
-CFD_LOG_DEBUG(component, ...)
-CFD_LOG_INFO(component, ...)
-CFD_LOG_WARNING(component, ...)
-CFD_LOG_ERROR(component, ...)
-```
-
-**Design decisions:**
-
-- Default log level is INFO (DEBUG suppressed unless explicitly enabled)
-- Per-thread legacy callback takes priority over global extended callback
-- `cfd_error()` always sets thread-local error state regardless of log level filter
-- Convergence verbose output uses `CFD_LOG_DEBUG` with `if (params->verbose)` guard
-- Component tags: "gpu", "boundary", "poisson", "projection", "solver", "simd"
-- Default output format: `LEVEL [component]: message`
-
-**Files modified:**
-
-- `lib/include/cfd/core/logging.h` — Extended API with DEBUG level, `cfd_log()`, macros
-- `lib/src/core/logging.c` — `cfd_log()` implementation, atomic log level, extended callback
-- `lib/src/api/solver_registry.c` — 5 fprintf → CFD_LOG_ERROR/WARNING
-- `lib/src/boundary/boundary_conditions.c` — 1 fprintf → CFD_LOG_ERROR
-- `lib/src/solvers/linear/linear_solver.c` — 2 fprintf/printf → CFD_LOG_ERROR/DEBUG
-- `lib/src/solvers/linear/simd/linear_solver_simd_dispatch.c` — 1 fprintf → CFD_LOG_DEBUG
-- `lib/src/solvers/linear/cpu/linear_solver_cg.c` — 1 printf → CFD_LOG_DEBUG
-- `lib/src/solvers/linear/cpu/linear_solver_bicgstab.c` — 1 printf → CFD_LOG_DEBUG
-- `lib/src/solvers/linear/avx2/linear_solver_cg_avx2.c` — 1 printf → CFD_LOG_DEBUG
-- `lib/src/solvers/linear/neon/linear_solver_cg_neon.c` — 1 printf → CFD_LOG_DEBUG
-- `lib/src/solvers/linear/omp/linear_solver_cg_omp.c` — 1 printf → CFD_LOG_DEBUG
-- `lib/src/solvers/navier_stokes/avx2/solver_explicit_euler_avx2.c` — 4 printf → CFD_LOG_INFO
-- `lib/src/solvers/navier_stokes/avx2/solver_projection_avx2.c` — 1 fprintf → CFD_LOG_WARNING
-- `lib/src/solvers/navier_stokes/avx2/solver_rk2_avx2.c` — 2 printf → CFD_LOG_INFO
-- `tests/core/test_logging.c` — 11 new tests (16 total)
+- [ ] Log filtering by component (e.g., only show "boundary" logs)
+- [ ] Timestamps and colored output
+- [ ] Structured data API for metrics (convergence stats, timings)
 
 ---
 
@@ -333,129 +145,23 @@ CFD_LOG_ERROR(component, ...)
 
 **Goal:** Make the solver practically usable for real problems.
 
-### 1.1 Boundary Conditions (P0 - Critical)
-
-- [x] Boundary condition abstraction layer (`boundary_conditions.h/.c`)
-- [x] Runtime backend selection (Scalar, SIMD, OpenMP, GPU)
-- [x] Neumann (zero-gradient) boundary conditions
-- [x] Periodic boundary conditions
-- [x] SIMD-optimized BC application (AVX2)
-- [x] OpenMP-parallelized BC application
-- [x] CUDA GPU BC kernels
-- [x] Dirichlet (fixed value) boundary conditions
-- [x] No-slip wall conditions
-- [x] Inlet velocity specification (uniform, parabolic, custom profiles)
-- [x] Outlet (zero-gradient/convective)
-- [x] Symmetry planes
-- [x] Moving wall boundaries (via Dirichlet BCs, see lid-driven cavity example)
-- [x] Time-varying boundary conditions
-
-**Implemented files:**
-
-- `lib/include/cfd/boundary/boundary_conditions.h` - Public API with backend selection
-- `lib/include/cfd/boundary/boundary_conditions_gpu.cuh` - GPU API declarations
-- `lib/src/boundary/boundary_conditions.c` - Public API dispatcher
-- `lib/src/boundary/cpu/boundary_conditions_scalar.c` - Scalar implementation
-- `lib/src/boundary/cpu/boundary_conditions_outlet_scalar.c` - Scalar outlet BC
-- `lib/src/boundary/omp/boundary_conditions_omp.c` - OpenMP parallelization
-- `lib/src/boundary/omp/boundary_conditions_outlet_omp.c` - OpenMP outlet BC
-- `lib/src/boundary/avx2/boundary_conditions_avx2.c` - AVX2 SIMD optimizations
-- `lib/src/boundary/avx2/boundary_conditions_inlet_avx2.c` - AVX2 inlet BC
-- `lib/src/boundary/avx2/boundary_conditions_outlet_avx2.c` - AVX2 outlet BC
-- `lib/src/boundary/neon/boundary_conditions_neon.c` - NEON SIMD optimizations
-- `lib/src/boundary/neon/boundary_conditions_inlet_neon.c` - NEON inlet BC
-- `lib/src/boundary/neon/boundary_conditions_outlet_neon.c` - NEON outlet BC
-- `lib/src/boundary/simd/boundary_conditions_simd_dispatch.c` - SIMD runtime dispatch
-- `lib/src/boundary/gpu/boundary_conditions_gpu.cu` - CUDA kernels
-- `lib/src/boundary/boundary_conditions_internal.h` - Internal declarations
-- `lib/src/boundary/boundary_conditions_inlet_common.h` - Shared inlet BC helpers
-- `lib/src/boundary/boundary_conditions_outlet_common.h` - Shared outlet BC helpers
-- `lib/src/boundary/boundary_conditions_time.h` - Time modulation helpers
-- `lib/src/boundary/cpu/boundary_conditions_inlet_time_scalar.c` - Time-varying inlet scalar implementation
-- `tests/core/test_boundary_conditions_symmetry.c` - Symmetry BC unit tests
-
-#### 1.1.1 Boundary Conditions Code Refactoring (P2) ✅
-
-**Refactoring Complete.** Reduced ~500 lines of duplicated code across BC backends using parameterized header templates.
-
-1. **Consolidate Inlet BC Implementations (Priority 1)**
-   - [x] Deleted 3 redundant inlet files (OMP/AVX2/NEON) — all delegate to scalar
-   - **Savings:** 196 lines removed
-
-2. **Extract Common Outlet SIMD Template (Priority 2)**
-   - [x] Created `boundary_conditions_outlet_simd.h` parameterized by SIMD intrinsics
-   - [x] AVX2/NEON outlet files reduced from ~135 lines each to ~33 lines
-   - **Savings:** 99 lines removed
-
-3. **Templatize OMP vs Scalar (Priority 3)**
-   - [x] Created `boundary_conditions_core_impl.h` with token-pasting macros
-   - [x] Scalar and OMP files include shared template with `BC_CORE_USE_OMP` flag
-   - **Savings:** 29 lines removed
-
-4. **Unify AVX2 and NEON (Priority 4)**
-   - [x] Created `boundary_conditions_simd_impl.h` parameterized by STORE/LOAD/BROADCAST/VEC_TYPE/WIDTH/MASK
-   - [x] AVX2/NEON main BC files reduced from ~218 lines each to ~55 lines
-   - **Savings:** 181 lines removed
-
-**Files created:**
-
-- `lib/src/boundary/boundary_conditions_core_impl.h`
-- `lib/src/boundary/boundary_conditions_outlet_simd.h`
-- `lib/src/boundary/boundary_conditions_simd_impl.h`
-
-**Files deleted:**
-
-- `lib/src/boundary/omp/boundary_conditions_inlet_omp.c`
-- `lib/src/boundary/avx2/boundary_conditions_inlet_avx2.c`
-- `lib/src/boundary/neon/boundary_conditions_inlet_neon.c`
-
-**Total savings:** ~505 lines removed across 4 priorities
+### 1.1 Boundary Conditions (P0 - Critical) ✅
 
-### 1.2 Linear Solvers (P0 - Critical)
+All BC types implemented across all backends (Scalar, AVX2, NEON, OMP, GPU): Dirichlet, Neumann, Periodic, No-slip, Inlet (uniform/parabolic/custom), Outlet, Symmetry, Moving wall, Time-varying. Code refactored with shared templates (~505 lines removed).
 
-**Implemented:**
+### 1.2 Linear Solvers (P0 - Critical)
 
-- [x] SOR (Successive Over-Relaxation) - CPU baseline
-- [x] Jacobi SIMD (`poisson_jacobi_simd.c`) - AVX2 vectorized, fully parallelizable
-- [x] Red-Black SOR SIMD (`poisson_redblack_simd.c`) - AVX2 with SOR convergence rate
-- [x] GPU Jacobi Poisson solver (`solver_projection_jacobi_gpu.cu`)
-- [x] Integrate SIMD Poisson into projection solver (`solver_projection_simd.c`)
+**Implemented:** Jacobi, SOR, Red-Black SOR, CG/PCG, BiCGSTAB (all with SIMD backends). CG is the default Poisson solver for all projection methods.
 
 **Still needed:**
 
-- [x] Solver abstraction interface
-- [x] Conjugate Gradient (CG) for SPD systems (scalar, AVX2, NEON, OMP backends)
-  - **Note:** CG is now the default Poisson solver for all projection methods (PR #139)
-  - CPU uses CG_SCALAR, AVX2 uses CG_SIMD, OMP uses CG_OMP for reliable O(√κ) convergence
-- [x] BiCGSTAB for non-symmetric systems ✅
-  - [x] BiCGSTAB Scalar (CPU) ✅
-  - [x] BiCGSTAB SIMD (AVX2/NEON with integrated OpenMP parallelization) ✅
-
-  **Note:** AVX2 and NEON implementations include OpenMP parallelization via `#pragma omp parallel for` in inner loops. There is no separate `POISSON_BACKEND_OMP` backend for BiCGSTAB.
-
-  **Files created (BiCGSTAB backends):**
-
-  - `lib/src/solvers/linear/avx2/linear_solver_bicgstab_avx2.c` - AVX2 SIMD implementation with OpenMP
-  - `lib/src/solvers/linear/neon/linear_solver_bicgstab_neon.c` - NEON SIMD implementation with OpenMP
-  - `tests/math/test_bicgstab_avx2.c` - AVX2 vs scalar consistency test
-  - `tests/math/test_bicgstab_neon.c` - NEON vs scalar consistency test
-
-  Updated files:
-
-  - `lib/src/solvers/linear/simd/linear_solver_simd_dispatch.c` - Added BiCGSTAB dispatcher
-  - `lib/src/solvers/linear/linear_solver_internal.h` - Added BiCGSTAB SIMD factory declaration
-
 - [ ] GMRES (Generalized Minimal Residual) for non-symmetric systems
   - [ ] GMRES scalar
   - [ ] GMRES AVX2
   - [ ] GMRES NEON
   - [ ] GMRES OMP
   - [ ] GMRES GPU
-- [x] Jacobi (diagonal) preconditioner for CG (scalar, AVX2, NEON backends)
 - [ ] SSOR (Symmetric SOR) preconditioner
-- [x] SOR SIMD variants (Block SOR — see docs/technical-notes/block-sor-simd.md)
-  - [x] SOR AVX2
-  - [x] SOR NEON
 - [ ] GPU (CUDA) linear solver backends
   - [ ] Standalone Jacobi GPU (refactor from monolithic `solver_projection_jacobi_gpu.cu`)
   - [ ] Red-Black SOR GPU
@@ -467,86 +173,7 @@ CFD_LOG_ERROR(component, ...)
 - [ ] AMG preconditioner (for use with CG/GMRES/BiCGSTAB)
 - [ ] Performance benchmarking in Release mode
 
-**Note:** See [Known Issues](#known-issues) — SIMD Poisson strict tolerance limitation.
-
-#### 1.2.1 Poisson Solver Accuracy Tests ✅
-
-- [x] Zero RHS test (solution should remain constant)
-- [x] Uniform RHS test (quadratic solution)
-- [x] Sinusoidal RHS test with analytical solution comparison
-- [x] Convergence rate verification for all Poisson variants (SOR, Jacobi, Red-Black)
-- [x] Residual convergence tracking
-
-**Files created:**
-
-- `tests/math/test_poisson_accuracy.c` - 15 tests covering all accuracy requirements
-
-#### 1.2.2 Laplacian Operator Validation ✅
-
-**Test the discrete Laplacian against manufactured solutions:**
-
-- [x] Manufactured solution: p = sin(πx)sin(πy) → ∇²p = -2π²p
-- [x] Verify 2nd-order accuracy O(dx²) with grid refinement
-- [x] Compare CPU and CG implementations (SIMD compared when available)
-- [x] Verify stencil symmetry (self-adjoint property)
-
-**Files created:**
-
-- `tests/math/test_laplacian_accuracy.c` - 4 tests covering all accuracy requirements
-
-**Files tested:** `lib/src/solvers/linear/cpu/linear_solver_cg.c` - `apply_laplacian()`
-
-#### 1.2.3 Linear Solver Convergence Validation ✅
-
-**Verify convergence rates match theory:**
-
-- [x] Jacobi: spectral radius ρ = cos(πh) verified to <1% accuracy (Dirichlet BCs)
-- [x] SOR: over-relaxation (ω > 1) converges faster than Gauss-Seidel
-- [x] Red-Black SOR: comparable convergence to standard SOR (ratio 0.5-2.0)
-- [x] CG: convergence in O(√κ) iterations verified
-
-**Files created:**
-
-- `tests/math/test_linear_solver_convergence.c` - 6 tests covering convergence properties
-
-**Note:** See [Known Issues](#known-issues) — spectral radius test BC assumption.
-
-**Files still to create (future work):**
-
-- `src/solvers/linear/multigrid.c`
-
-#### 1.2.4 Preconditioned Conjugate Gradient (PCG) ✅
-
-**Jacobi (diagonal) preconditioner for CG solver:**
-
-- [x] Added `poisson_precond_type_t` enum (NONE, JACOBI) to `poisson_solver.h`
-- [x] Added `preconditioner` field to `poisson_solver_params_t`
-- [x] Implemented PCG algorithm in scalar CG solver
-- [x] Implemented PCG in AVX2 backend with SIMD-vectorized preconditioner application
-- [x] Implemented PCG in NEON backend with SIMD-vectorized preconditioner application
-- [x] Backward compatible: POISSON_PRECOND_NONE (default) gives standard CG behavior
-
-**PCG Algorithm:**
-
-```
-z = M⁻¹r           (apply preconditioner)
-p = z              (search direction from preconditioned residual)
-ρ = (r, z)         (instead of (r, r))
-```
-
-**Note:** See [Known Issues](#known-issues) — Jacobi preconditioner on uniform grids.
-
-**Files modified:**
-
-- `lib/include/cfd/solvers/poisson_solver.h` - Added preconditioner enum and params field
-- `lib/src/solvers/linear/linear_solver.c` - Default params initialization
-- `lib/src/solvers/linear/cpu/linear_solver_cg.c` - Scalar PCG implementation
-- `lib/src/solvers/linear/avx2/linear_solver_cg_avx2.c` - AVX2 PCG implementation
-- `lib/src/solvers/linear/neon/linear_solver_cg_neon.c` - NEON PCG implementation
-
-**Files created:**
-
-- `tests/math/test_pcg_convergence.c` - 4 tests verifying PCG correctness and consistency
+**Completed sub-sections:** Poisson Accuracy Tests (1.2.1), Laplacian Validation (1.2.2), Convergence Validation (1.2.3), PCG (1.2.4).
 
 ### 1.3 Numerical Schemes (P1)
 
@@ -554,88 +181,9 @@ p = z              (search direction from preconditioned residual)
 - [ ] Central differencing with delayed correction
 - [ ] High-resolution TVD schemes (Van Leer, Superbee)
 - [ ] Gradient limiters (Barth-Jespersen, Venkatakrishnan)
-
-#### 1.3.1 Finite Difference Stencil Tests ✅
-
-Unit tests for individual stencil operations:
-
-- [x] First derivative (central difference) - verify O(h²) accuracy
-- [x] Second derivative - verify O(h²) accuracy
-- [x] 2D Laplacian (5-point stencil) - verify O(h²) accuracy
-- [x] Divergence operator - verify O(h²) accuracy
-- [x] Gradient operator - verify O(h²) accuracy
-
-**Test approach:** Use smooth analytical functions (e.g., `sin(kx)*sin(ky)`), compute numerical derivatives using shared stencil implementations, compare to analytical derivatives, verify error scaling.
-
-**Files created:**
-
-- `lib/include/cfd/math/stencils.h` - Shared stencil implementations (header-only)
-- `tests/math/test_finite_differences.c` - 9 tests verifying O(h²) convergence
-
-**Future work:**
-
 - [ ] Migrate solver code to use `cfd/math/stencils.h` (currently inline implementations)
-  - `solver_explicit_euler.c`, `solver_projection.c`, `linear_solver_jacobi.c`, etc.
-  - This ensures tests exercise the exact production code paths
-
-#### 1.3.2 Convergence Order Verification ✅
-
-- [x] Spatial convergence tests (h-refinement: 16→32→64→128)
-- [x] Temporal convergence tests (dt-refinement)
-- [x] Automated order-of-accuracy computation
-- [x] Verify super-linear spatial convergence (~1.5 order, BC-limited)
-
-**Results:** See [Known Issues](#known-issues) — convergence order BC-limited. Spatial rate > 1.4 achieved, error decreases monotonically with refinement.
-
-**Files created:**
-
-- `tests/math/test_convergence_order.c`
-
-#### 1.3.3 Method of Manufactured Solutions (MMS) ✅
-
-- [x] Define manufactured velocity/pressure fields with known derivatives
-- [x] Compute analytical source terms from Navier-Stokes substitution
-- [x] Run solver with manufactured source terms
-- [x] Compare numerical to manufactured solution
-- [x] Verify convergence order
-
-**Manufactured solution (Modified Taylor-Green):**
-
-```c
-u(x,y,t) =  cos(x) * sin(y) * exp(-αt)
-v(x,y,t) = -sin(x) * cos(y) * exp(-αt)
-Source:  f = (2ν - α) · u_exact  // With α = ν
-```
-
-**Implementation:**
-
-- Added `ns_source_func_t` callback to `ns_solver_params_t` for custom source terms
-- Propagated callback to all solver variants: CPU explicit Euler (via `compute_source_terms()`), CPU projection, AVX2 explicit Euler (scalar path), AVX2 projection, OMP explicit Euler, OMP projection, OMP RK2 (already via helper)
-- Created `tests/math/test_mms.c` with 5 tests:
-  1. Source callback mechanism verification (constant source α=0 for robust direction check)
-  2. Spatial convergence (Euler) — verified O(h^1.5+), monotonic with 1% tolerance
-  3. Spatial convergence (RK2) — verified O(h^1.5+), monotonic with 1% tolerance
-  4. Temporal convergence (Euler) — verified error decreases
-  5. Temporal convergence (RK2) — verified better rate than Euler
-
-#### 1.3.4 Divergence-Free Constraint Validation ✅
-
-**Verify projection method enforces incompressibility:**
-
-- [x] Measure max|∇·u| after projection step (should be bounded)
-- [x] Test with various initial velocity fields (sinusoidal, Taylor-Green, vortex pair)
-- [x] Verify all projection backends (CPU, AVX2, OMP, GPU)
-
-**Results:**
-
-- Divergence computation verified against analytically div-free fields (<1e-10)
-- All backends produce consistent results (within 10%)
-- Divergence stays bounded (<10.0) - matches existing solver behavior
-- GPU test skipped when CUDA not available
 
-**Files created:**
-
-- `tests/math/test_divergence_free.c`
+**Completed sub-sections:** Stencil Tests (1.3.1), Convergence Order (1.3.2), MMS (1.3.3), Divergence-Free (1.3.4).
 
 ### 1.4 Steady-State Solver (P1)
 
@@ -644,24 +192,15 @@ Source:  f = (2ν - α) · u_exact  // With α = ν
 - [ ] Pseudo-transient continuation
 - [ ] Convergence acceleration (relaxation)
 
-**Files to create:**
-
-- `include/cfd/solvers/steady_state.h`
-- `src/solvers/steady/simple.c`
-- `src/solvers/steady/simplec.c`
-- `src/solvers/steady/piso.c`
-
 ### 1.5 Time Integration (P1)
 
-Each time integrator requires a scalar (CPU) reference implementation first, then backend variants.
-See `/add-ns-time-integrator` command for the cross-backend workflow.
+Each time integrator requires a scalar (CPU) reference implementation first, then backend variants. See `/add-ns-time-integrator` for the cross-backend workflow.
+
+**Implemented:** RK2 (Heun's method) — all CPU/AVX2/OMP backends, O(dt²) verified.
 
-**Algorithms:**
+**Still needed:**
 
-- [x] RK2 (Heun's method) — scalar
-  - [x] RK2 AVX2/SIMD (`rk2_avx2`, `rk2_optimized`)
-  - [x] RK2 OpenMP (`rk2_omp`)
-  - [ ] RK2 CUDA (`rk2_gpu`)
+- [ ] RK2 CUDA (`rk2_gpu`)
 - [ ] RK4 (classical Runge-Kutta) — scalar
   - [ ] RK4 AVX2+OMP (`rk4_optimized`)
   - [ ] RK4 OpenMP (`rk4_omp`)
@@ -671,24 +210,6 @@ See `/add-ns-time-integrator` command for the cross-backend workflow.
 - [ ] BDF2 (backward differentiation)
 - [ ] Adaptive time stepping with error control
 
-**RK2 Implementation:**
-
-- O(dt²) temporal accuracy verified via self-convergence test (ratio ≈ 4.0)
-- Uses periodic stencil indexing in RHS evaluation to avoid ghost-cell order reduction
-- BCs applied only after full RK2 step (not between stages)
-
-**Files created:**
-
-- `lib/src/solvers/navier_stokes/cpu/solver_rk2.c`
-- `tests/solvers/navier_stokes/cpu/test_solver_rk2.c`
-
-**Files modified:**
-
-- `lib/include/cfd/solvers/navier_stokes_solver.h` — added `NS_SOLVER_TYPE_RK2`
-- `lib/src/api/solver_registry.c` — registered RK2 factory
-- `lib/CMakeLists.txt` — added source file
-- `CMakeLists.txt` — added test entries
-
 ### 1.6 Restart/Checkpoint (P1)
 
 - [ ] Binary checkpoint format
@@ -706,12 +227,6 @@ Solve F(x)=0 where F is nonlinear. Required for steady-state Navier-Stokes.
 - [ ] Line search and globalization
 - [ ] Nonlinear solver abstraction interface
 
-**Files to create:**
-
-- `include/cfd/solvers/nonlinear_solver.h`
-- `src/solvers/nonlinear/newton.c`
-- `src/solvers/nonlinear/picard.c`
-
 ### 1.8 Eigenvalue Solvers (P3)
 
 Find eigenvalues/eigenvectors for stability analysis.
@@ -721,49 +236,16 @@ Find eigenvalues/eigenvectors for stability analysis.
 - [ ] Arnoldi iteration
 - [ ] Stability analysis framework
 
-**Files to create:**
-
-- `include/cfd/solvers/eigen_solver.h`
-- `src/solvers/eigen/power_iteration.c`
-- `src/solvers/eigen/arnoldi.c`
-
 ### 1.9 Derived Fields (P2)
 
 **Status:** OpenMP implemented, SIMD and CUDA pending
 
-**Current Implementation:**
-
-Located in `lib/src/core/derived_fields.c`:
-
-- [x] OpenMP parallelization for velocity magnitude computation
-- [x] OpenMP parallel reduction for field statistics (min/max/avg)
-- [x] Threshold-based parallelization (OMP_THRESHOLD = 1000 cells)
-- [x] Fallback to sequential for small grids
+**Current:** OpenMP parallelization for velocity magnitude and field statistics with threshold-based parallelization (OMP_THRESHOLD = 1000 cells).
 
 **Still needed:**
 
 #### SIMD Optimization (AVX2/NEON)
 
-Best for: All grid sizes, especially when combined with OpenMP
-
-**Implementation approach:**
-
-```c
-// Velocity magnitude with AVX2 (4 doubles per iteration)
-for (size_t i = 0; i < n; i += 4) {
-    __m256d u = _mm256_loadu_pd(&field->u[i]);
-    __m256d v = _mm256_loadu_pd(&field->v[i]);
-    __m256d u2 = _mm256_mul_pd(u, u);
-    __m256d v2 = _mm256_mul_pd(v, v);
-    __m256d sum = _mm256_add_pd(u2, v2);
-    __m256d mag = _mm256_sqrt_pd(sum);
-    _mm256_storeu_pd(&vel_mag[i], mag);
-}
-// Handle remainder elements
-```
-
-**Tasks:**
-
 - [ ] Add AVX2 implementation for velocity magnitude
 - [ ] Add NEON implementation for velocity magnitude (ARM64)
 - [ ] Combine with OpenMP (`#pragma omp simd` or manual intrinsics)
@@ -771,193 +253,38 @@ for (size_t i = 0; i < n; i += 4) {
 - [ ] Add runtime CPU feature detection
 - [ ] Benchmark against scalar+OpenMP version
 
-**Benefits:**
-
-- No thread overhead (works on small grids)
-- Combines well with OpenMP for multi-core
-- Reference: existing SIMD solvers in `lib/src/solvers/linear/avx2/`
-
-**Challenges:**
-
-- Architecture-specific code paths needed
-- Remainder handling for non-aligned sizes
-- Statistics reductions more complex with SIMD
-
 #### CUDA GPU Acceleration
 
-Best for: Very large grids (100k+ cells), GPU available
-
-**Implementation approach:**
-
-```cuda
-__global__ void compute_velocity_magnitude_kernel(
-    double* __restrict__ vel_mag,
-    const double* __restrict__ u,
-    const double* __restrict__ v,
-    size_t n)
-{
-    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
-    if (i < n) {
-        vel_mag[i] = sqrt(u[i] * u[i] + v[i] * v[i]);
-    }
-}
-```
-
-**Tasks:**
-
 - [ ] Add CUDA kernel for velocity magnitude
 - [ ] Add parallel reduction for statistics (use CUB library or custom)
 - [ ] Share GPU memory with CUDA solvers (avoid CPU<->GPU transfers)
 - [ ] Add threshold logic (only use GPU for large grids)
 - [ ] Benchmark transfer overhead vs compute benefit
 
-**Benefits:**
-
-- Massive parallelism for large grids
-- Can share GPU memory with CUDA solvers
-- Reference: existing CUDA infrastructure in `lib/src/solvers/gpu/`
-
-**Challenges:**
-
-- GPU memory transfer overhead for small grids
-- More complex reduction implementation
-- Only beneficial if data already on GPU
-
-**Performance Thresholds (Approximate):**
-
-| Grid Size        | Recommended Approach                         |
-|------------------|----------------------------------------------|
-| < 1,000 cells    | Sequential (overhead not worth it)           |
-| 1,000 - 10,000   | OpenMP (2-4 threads) ✅ Implemented          |
-| 10,000 - 100,000 | OpenMP + SIMD                                |
-| > 100,000        | CUDA (if GPU available) or OpenMP + SIMD     |
-
-**Note:** These operations are memory-bound, not compute-bound:
-
-- Velocity magnitude: Read 2 arrays, write 1 array
-- Statistics: Read 1 array, compute 3 values
-
-For memory-bound operations:
-
-- SIMD helps with cache line utilization
-- OpenMP helps with memory bandwidth aggregation across cores
-- CUDA helps only if data is already on GPU
-
-**Files to Modify:**
-
-- `lib/src/core/derived_fields.c` - Add SIMD/CUDA variants
-- `lib/src/core/derived_fields_simd.c` - New file for SIMD implementation
-- `lib/src/core/derived_fields_gpu.cu` - New file for CUDA kernels
-- `tests/io/test_csv_output.c` - Verify correctness after optimization
-
 ### 1.10 SIMD Projection Solver Optimization (P2)
 
-**Status:** SIMD Poisson integration completed (December 2024), further optimizations pending
-
-**Current State:**
-
-- [x] SIMD Poisson solvers implemented (Jacobi and Red-Black SOR with AVX2)
-- [x] SIMD Poisson integrated into projection solver (`solver_projection_simd.c`)
-- [x] Full projection method with SIMD corrector step
-- [x] All tests pass with results matching scalar implementation
-- [x] `u_new` buffer reused as temp for Poisson solver (in-place Red-Black SOR)
-
-**Reference:** See [docs/technical-notes/simd-optimization-analysis.md](docs/technical-notes/simd-optimization-analysis.md) for detailed technical analysis.
-
-**Performance Analysis:**
-
-Current SIMD implementation provides ~1.3-1.5x speedup with Poisson solver being 70-80% of total runtime. Speedup limited by Amdahl's law:
-
-- Parallelizable fraction: ~80% (Poisson + Corrector)
-- SIMD speedup: 1.5-2x (Red-Black has gather/scatter overhead)
-- Expected overall: 1.3-1.5x (matches observations)
+**Status:** SIMD Poisson integration completed, further optimizations pending. Current ~1.3-1.5x speedup limited by Amdahl's law (parallelizable fraction ~80%).
 
 #### High Priority
 
-##### 1. Improve Poisson Convergence for Non-Trivial Problems
-
-SIMD Poisson solvers don't fully converge to 1e-6 tolerance on challenging problems (e.g., sinusoidal RHS) within current iteration limits.
-
-**Tasks:**
-
-- [ ] Increase `POISSON_MAX_ITER` from current 1000/2000 to higher values
-- [ ] Implement adaptive tolerance based on problem scale
+- [ ] Increase `POISSON_MAX_ITER` or implement adaptive tolerance
 - [ ] Add optional multigrid preconditioner for faster convergence
-- [ ] Investigate Red-Black omega parameter tuning for specific problem classes
-
-##### 2. Performance Benchmarking in Release Mode
-
-See [Known Issues](#known-issues) — debug mode SIMD benchmarking.
-
-**Tasks:**
-
+- [ ] Investigate Red-Black omega parameter tuning
 - [ ] Create Release mode benchmark suite
-- [ ] Measure actual speedup vs scalar implementation
 - [ ] Profile to identify remaining bottlenecks
-- [ ] Compare against theoretical Amdahl's law predictions
-- [ ] Document performance characteristics for different grid sizes
-
-#### Medium Priority
-
-##### 3. OpenMP + SIMD Hybrid
 
-Combine thread parallelism with SIMD vectorization for maximum CPU utilization.
-
-**Current State:** Separate OMP and SIMD backends exist, but no hybrid implementation.
-
-**Tasks:**
+#### Medium Priority — OpenMP + SIMD Hybrid
 
 - [ ] Implement hybrid projection solver using OpenMP across rows + SIMD within rows
-- [ ] Use `#pragma omp parallel for` on outer loop with SIMD intrinsics in inner loop
-- [ ] Benchmark scaling efficiency (measure speedup vs threads)
+- [ ] Benchmark scaling efficiency
 - [ ] Compare against pure OMP and pure SIMD implementations
-- [ ] Handle load balancing for Red-Black coloring with OpenMP
-
-**Benefits:**
 
-- Jacobi allows full parallelization across rows
-- Red-Black allows parallelization within each color
-- Could achieve 4-8x speedup on modern multi-core CPUs
-
-**Reference Files:**
-
-- `lib/src/solvers/navier_stokes/omp/solver_projection_omp.c` - OMP implementation
-- `lib/src/solvers/navier_stokes/avx2/solver_projection_avx2.c` - SIMD implementation
-- `lib/src/solvers/linear/avx2/` - SIMD Poisson solvers
-
-#### Low Priority
-
-##### 4. Multigrid SIMD Implementation
-
-Achieve O(N) complexity vs O(N²) for iterative methods.
-
-**Benefits:**
-
-- Multigrid offers O(N) complexity vs O(N²) for Jacobi/SOR
-- Natural parallelism at each grid level
-- Can use SIMD at each level for additional speedup
-- Essential for large-scale 3D simulations
-
-**Tasks:**
+#### Low Priority — Multigrid SIMD
 
 - [ ] Implement multigrid V-cycle framework
 - [ ] Add restriction/prolongation operators
 - [ ] Use SIMD Jacobi/Red-Black as smoothers at each level
 - [ ] Benchmark convergence rate vs pure iterative methods
-- [ ] Validate against analytical solutions
-
-**Challenges:**
-
-- Complex implementation requiring careful design
-- Grid hierarchy management
-- Operator construction at multiple levels
-- Testing and validation complexity
-
-**Priority Justification:** Low priority because:
-
-- Current SIMD implementation adequate for 2D problems
-- Critical for 3D but 3D support itself is Phase 2
-- Better to implement multigrid after 3D infrastructure exists
 
 ---
 
@@ -1011,31 +338,9 @@ Achieve O(N) complexity vs O(N²) for iterative methods.
 
 **Goal:** Support complex geometries.
 
-### 3.1 3D Support (P0 - Critical) — **ACTIVE PRIORITY**
-
-**Status:** Complete (all 8 phases). See [3D Extension Plan](docs/technical-notes/3d-extension-plan.md) for full design.
-
-**Approach:** "2D as subset of 3D" — `nz=1` produces bit-identical results to current 2D code. No separate 2D/3D codepaths. All code becomes 3D-aware with branch-free solver loops using precomputed constants (`stride_z=0`, `inv_dz2=0.0` when `nz==1`).
-
-**Phases:**
+### 3.1 3D Support (P0 - Critical) ✅
 
-- [x] Phase 0: Introduce `IDX_2D`/`IDX_3D` indexing macros (`lib/include/cfd/core/indexing.h`), replace all inline `j*nx+i` patterns across ~50 files
-- [x] Phase 1: Extend core data structures (grid: `nz, z[], dz[], stride_z, inv_dz2, k_start, k_end`; flow_field: `w, nz`; BCs: `front/back`)
-- [x] Phase 2: Add 3D stencils (7-point) and update scalar CPU linear solvers
-- [x] Phase 3: Update scalar CPU NS solvers with w-momentum equation
-- [x] Phase 4: Extend boundary conditions for z-faces (all backends)
-- [x] Phase 5: Update SIMD backends (AVX2/NEON) for 3D
-- [x] Phase 6: Update OMP backends for 3D
-- [x] Phase 7: Update CUDA backend for 3D
-- [x] Phase 8: Update I/O (VTK 3D), examples, validation tests, and docs
-
-**Key design decisions:**
-
-- Branch-free inner loops: `stride_z=0` makes z-neighbor reads return same cell, z-terms vanish mathematically
-- Cache-optimized: k→j→i loop order, three-plane working set fits L2, OMP `collapse(2) schedule(static)`
-- Backward-compatible API: `grid_create(nx, ny, ...)` becomes wrapper for `grid_create_3d(nx, ny, 1, ...)`
-
-**Estimated effort:** ~90 files, 12-18 weeks across all phases
+**Status:** Complete. "2D as subset of 3D" approach — `nz=1` produces bit-identical results to previous 2D code. Branch-free solver loops using precomputed constants (`stride_z=0`, `inv_dz2=0.0`). All 8 phases done: indexing macros, core data structures, 3D stencils, NS solvers, BCs, SIMD, OMP, CUDA, I/O, examples, docs.
 
 ### 3.2 Unstructured Meshes (P1)
 
@@ -1080,19 +385,7 @@ Achieve O(N) complexity vs O(N²) for iterative methods.
 
 ### 4.2 Modular Backend Libraries (P1) ✅
 
-**Goal:** Split the monolithic library into separate per-backend libraries for flexible deployment.
-
-**Status:** Implemented in v0.1.5
-
-**Rationale:**
-
-- Users only link what they need (smaller binaries)
-- CUDA library can be loaded dynamically (plugin pattern)
-- Clear ABI boundaries between backends
-- Allows mixing compiler toolchains (e.g., nvcc for CUDA only)
-- Easier to add new backends (OpenCL, SYCL, etc.)
-
-**Library Structure:**
+**Status:** Implemented in v0.1.5. Split monolithic library into per-backend targets.
 
 | Library | CMake Target | Contents | Dependencies |
 | ------- | ------------ | -------- | ------------ |
@@ -1103,40 +396,16 @@ Achieve O(N) complexity vs O(N²) for iterative methods.
 | `cfd_cuda` | `CFD::CUDA` | CUDA GPU solvers | CFD::Core |
 | `cfd_library` | `CFD::Library` | Unified library (all backends) | All above |
 
-**Implementation Tasks:**
+**Still needed:**
 
-- [x] Split `lib/CMakeLists.txt` into multiple library targets
-- [x] Create separate source lists for each backend
-- [x] Define public/private header boundaries per library
-- [x] Add CMake aliases for clean target references
-- [x] Add comprehensive test suite (`test_modular_libraries.c`)
-- [x] Support shared library builds (both static and shared supported)
 - [ ] Update examples/tests to link against specific backends (optional)
 - [ ] Plugin loading system for dynamic backend selection
 
-**Known Limitations:** See [Known Issues](#known-issues) — modular library circular dependencies.
-
-**Usage Examples:**
-
-```cmake
-# Currently, use the unified library for full functionality
-target_link_libraries(my_app PRIVATE CFD::Library)
-
-# Future: Link only what you need (after refactoring)
-# target_link_libraries(my_app PRIVATE CFD::Core CFD::SIMD)
-```
+See [Known Issues](#known-issues) — modular library circular dependencies.
 
 ### 4.3 GPU Improvements (P2)
 
-**Implemented:**
-
-- [x] CUDA device detection and selection
-- [x] GPU device information queries (compute capability, memory)
-- [x] Projection method with Jacobi Poisson solver on GPU
-- [x] GPU memory management (persistent memory, basic async transfers)
-- [x] GPU boundary condition kernels
-- [x] Configurable GPU settings (block size, convergence tolerance)
-- [x] GPU solver statistics (kernel time, transfer time, iterations)
+**Implemented:** CUDA device detection, GPU projection with Jacobi Poisson, GPU memory management, GPU BC kernels, configurable GPU settings, GPU solver statistics.
 
 **Still needed:**
 
@@ -1168,13 +437,7 @@ target_link_libraries(my_app PRIVATE CFD::Library)
 
 ### 5.2 Modern VTK (P1)
 
-**Implemented:**
-
-- [x] VTK legacy ASCII format output
-- [x] Scalar field output (`write_vtk_output`)
-- [x] Vector field output (`write_vtk_vector_output`)
-- [x] Full flow field output (`write_vtk_flow_field`)
-- [x] Timestamped run directories (`write_vtk_*_run` functions)
+**Implemented:** VTK legacy ASCII format, scalar/vector/flow field output, timestamped run directories.
 
 **Still needed:**
 
@@ -1183,13 +446,9 @@ target_link_libraries(my_app PRIVATE CFD::Library)
 - [ ] Time series support
 - [ ] Binary encoding
 
-### 5.3 CSV Output (Implemented)
+### 5.3 CSV Output ✅
 
-- [x] Timeseries data (step, time, dt, velocity stats, pressure stats)
-- [x] Centerline profiles (horizontal/vertical)
-- [x] Global statistics (min/max/avg of all fields)
-- [x] Velocity magnitude columns
-- [x] Automatic header creation and append mode
+Implemented: timeseries data, centerline profiles, global statistics, velocity magnitude, automatic headers.
 
 ### 5.4 Restart Files (P1)
 
@@ -1208,146 +467,34 @@ target_link_libraries(my_app PRIVATE CFD::Library)
 
 **Goal:** Validate against reference solutions and provide comprehensive documentation.
 
-**Note:** API/robustness testing is in Phase 0.4. Mathematical accuracy validation (stencils, convergence, MMS) is in Phase 1.3. Linear solver validation is in Phase 1.2.
-
 ### 6.1 Benchmark Validation (P0 - Critical)
 
 #### 6.1.1 Lid-Driven Cavity Validation ✅
 
-**Status:** COMPLETE - Solver meets scientific tolerance (RMS ~0.04, target < 0.10)
-
-**Test files created:**
-
-- `tests/validation/test_cavity_setup.c` - Basic setup and BC tests (7 tests)
-- `tests/validation/test_cavity_flow.c` - Flow development and stability (8 tests)
-- `tests/validation/test_cavity_validation.c` - Conservation and Ghia comparison (5 tests)
-- `tests/validation/test_cavity_reference.c` - Reference-based regression tests (5 tests)
-- `tests/validation/lid_driven_cavity_common.h` - Shared utilities
-- `tests/validation/cavity_reference_data.h` - Ghia reference data
-- `docs/validation/lid_driven_cavity.md` - Validation methodology documentation
-
-**What Was Fixed (PR #139, Feb 2026):**
-
-1. **Switched Projection Solvers to CG:**
-   - [x] CPU projection: Red-Black SOR → CG_SCALAR ✅
-   - [x] OMP projection: Red-Black SOR → CG_OMP ✅ (initially CG_SCALAR in PR #139, upgraded to CG_OMP in PR #145)
-   - [x] AVX2 projection: Red-Black SOR → CG_SIMD ✅
-   - **Rationale:** CG has O(√κ) convergence vs SOR's O(n) - typically 20-64 iterations vs 100s-1000s
-
-2. **Increased Iteration Limits:**
-   - [x] max_iterations: 1000 → 5000 (accommodates CG on fine grids) ✅
-
-3. **Achieved Scientific Target:**
-   - [x] RMS_u: 0.0382 (target < 0.10) ✅
-   - [x] RMS_v: 0.0440 (target < 0.10) ✅
-   - [x] Grid convergence now monotonic: 17×17 (0.046) → 25×25 (0.037) → 33×33 (0.032) ✅
-
-**Backend Validation Complete (Feb 2026):**
-
-1. **Test ALL solver backends (systematic validation):** ✅
-   - [x] CPU scalar (explicit Euler, projection) ✅
-   - [x] AVX2/SIMD (explicit Euler, projection) ✅
-   - [x] OpenMP (explicit Euler, projection) ✅
-   - [x] CUDA GPU (projection Jacobi) - test created, runs when GPU available ✅
-   - [x] Each backend independently achieves accuracy target ✅
-
-   **Test file:** `tests/validation/test_cavity_backends.c`
-
-   **Results (33×33 grid, Re=100):**
-   - Projection (CPU): RMS_u=0.0382, RMS_v=0.0440 < 0.10 ✅
-   - Projection (AVX2): RMS_u=0.0382, RMS_v=0.0440 < 0.10 ✅
-   - Projection (OMP): RMS_u=0.0382, RMS_v=0.0440 < 0.10 ✅
-   - Explicit Euler (CPU): RMS_u=0.0957, RMS_v=0.1284 < 0.15 ✅
-   - Explicit Euler (AVX2): RMS_u=0.0954, RMS_v=0.1312 < 0.15 ✅
-   - Explicit Euler (OMP): RMS_u=0.0957, RMS_v=0.1284 < 0.15 ✅
-   - Backend consistency: All 3 backends within 0.1% ✅
-
-   **Documentation:**
-   - `tests/validation/test_cavity_backends.c` - Comprehensive backend validation
-   - `docs/validation/cavity-backends-validation.md` - Test methodology and results
-
-2. **Verification that tests are honest:** ✅
-   - [x] Tests MUST fail if RMS >= target (no loose tolerances) ✅
-   - [x] Tests compare computed values at EXACT Ghia sample points ✅
-   - [x] Tests report actual vs expected values transparently ✅
-   - [x] No "current baseline" workarounds - fix solver, not tolerance ✅
-
-   **Target enforcement:**
-   - Projection method: RMS < 0.10 (production solver, strict)
-   - Explicit Euler: RMS < 0.15 (simpler method, relaxed)
-   - Tests FAIL immediately if RMS >= target
-
-**Remaining Work (for 129×129 full validation):**
-
-- [x] Parallel CTest infrastructure for full validation (per-backend test entries, `ctest -j 4`)
-- [x] OMP thread oversubscription mitigation (`OMP_NUM_THREADS=2`)
+**Status:** COMPLETE — All backends validated at 33×33, Re=100. Projection RMS < 0.10, Explicit Euler RMS < 0.15, all backends within 0.1% of each other. Grid convergence now monotonic.
+
+**Remaining work (129×129 full validation):**
+
 - [ ] Confirm all backends pass at 129×129 on EC2 and record RMS values
 - [ ] Use CAVITY_FULL_VALIDATION=1 build flag
 
-**Acceptance Criteria (non-negotiable):**
+**Acceptance Criteria:**
 
 - RMS error vs Ghia < 0.10 for Re=100, 400, 1000
 - All 7 solver backends produce identical results (within 0.1%)
 - Grid convergence: error decreases monotonically with refinement
-- Tests run in parallel to complete in < 60 seconds
 
-#### 6.1.1.1 Grid Convergence Validation (P1)
+##### 6.1.1.1 Grid Convergence Validation (P1) ✅
 
-See [Known Issues](#known-issues) — grid convergence non-monotonic behavior. Action items and acceptance criteria tracked there.
+Resolved by switching projection backends from Red-Black SOR to CG. Grid convergence tests now enforce strict monotonicity with convergence rate reporting.
 
 #### 6.1.2 Taylor-Green Vortex Validation (P0) ✅
 
-**Analytical solution with known decay rate - ideal for full NS validation:**
-
-```c
-u(x,y,t) = cos(x) * sin(y) * exp(-2νt)
-v(x,y,t) = -sin(x) * cos(y) * exp(-2νt)
-p(x,y,t) = -0.25 * (cos(2x) + cos(2y)) * exp(-4νt)
-```
-
-**Tests implemented:**
-
-- [x] Verify velocity decay rate matches exp(-2νt)
-- [x] Verify kinetic energy decay: KE(t) = KE₀ * exp(-4νt)
-- [x] Test L2 error remains bounded
-- [x] Verify grid convergence (error decreases with refinement)
-- [x] Verify divergence-free constraint (incompressibility)
-- [x] Compare solver backends (CPU, OpenMP)
-- [x] Long-time stability tests
-- [x] Low viscosity stability tests
-
-**Files created:**
-
-- `tests/validation/test_taylor_green_vortex.c` - 9 validation tests
-- `tests/validation/taylor_green_reference.h` - Analytical solutions and test utilities
+Velocity/energy decay, L2 error, grid convergence, divergence-free, backend consistency, long-time and low-viscosity stability — all verified.
 
 #### 6.1.3 Poiseuille Flow Validation (P1) ✅
 
-**Analytical parabolic profile for channel flow:**
-
-```c
-u(y) = 4 * U_max * y * (H - y) / H²
-```
-
-**Tests implemented:**
-
-- [x] Velocity profile stability (analytical solution preserved by solver)
-- [x] Mass conservation verification (flux in = flux mid = flux out)
-- [x] Pressure gradient accuracy vs analytical dp/dx = -8μU_max/H²
-- [x] Inlet BC accuracy (parabolic profile applied exactly)
-
-**Results:**
-
-- Profile RMS error: <1% (tolerance 1%)
-- Mass flux conservation: <1% (tolerance 1%)
-- Pressure gradient error: <5% (tolerance 5%)
-- Inlet BC: machine precision
-
-**Strategy:** Initialize with analytical solution, run projection steps, verify stability.
-
-**Files created:**
-
-- `tests/validation/test_poiseuille_flow.c`
+Profile stability (<1% RMS), mass conservation (<1%), pressure gradient (<5% error), inlet BC (machine precision) — all verified.
 
 #### 6.1.4 Other Benchmarks (P2)
 
@@ -1358,17 +505,12 @@ u(y) = 4 * U_max * y * (H - y) / H²
 
 **Goal:** Full-length validation tests that run during releases (too slow for CI).
 
-**Rationale:** CI tests use reduced grid sizes and iteration counts for fast feedback (~30 seconds). Release validation uses full parameters for scientific accuracy verification.
-
 **CI vs Release Parameters:**
 
 | Test | CI Mode | Release Mode |
 |------|---------|--------------|
-| Cavity Flow Development | 17×17, 500 steps | 33×33, 5000 steps |
-| Cavity Re=100 Stability | 21×21, 500 steps | 33×33, 10000 steps |
+| Cavity Ghia Validation | 33×33, 5000 steps | 129×129, 50000 steps |
 | Cavity Re=400 Stability | 25×25, 500 steps | 65×65, 20000 steps |
-| Reynolds Dependency | 17×17, 400 steps | 33×33, 5000 steps |
-| Ghia Validation | 33×33, 5000 steps | 129×129, 50000 steps |
 | Grid Convergence | 17→25→33 | 33→65→129 |
 | Taylor-Green Vortex | 32×32, 200 steps | 128×128, 10000 steps |
 
@@ -1381,33 +523,6 @@ u(y) = 4 * U_max * y * (H - y) / H²
 - [ ] Cross-architecture consistency check (all backends produce identical results)
 - [ ] Memory usage and performance regression benchmarks
 
-**Implementation:**
-
-```bash
-# CI mode (default) - fast, reduced parameters
-cmake -DCAVITY_FULL_VALIDATION=0 ..
-ctest -R validation
-
-# Release mode - full scientific validation
-cmake -DCAVITY_FULL_VALIDATION=1 ..
-ctest -R validation --timeout 3600
-```
-
-**Files to create:**
-
-- `tests/validation/test_ghia_full.c` - Full 129×129 Ghia validation
-- `tests/validation/test_taylor_green_full.c` - Extended Taylor-Green decay
-- `tests/validation/test_release_validation.c` - All backends consistency check
-- `.github/workflows/release-validation.yml` - GitHub Actions workflow for releases
-
-**Acceptance Criteria:**
-
-- All tests pass with full parameters
-- RMS error vs Ghia < 0.05 at 129×129 grid
-- Taylor-Green decay rate within 1% of analytical
-- All solver backends produce identical results (within floating-point tolerance)
-- Total runtime < 30 minutes on release CI runner
-
 ### 6.2 Convergence Studies (P1)
 
 - [ ] Grid independence studies for benchmark cases
@@ -1427,20 +542,7 @@ ctest -R validation --timeout 3600
 
 ### 6.4 Examples (P1)
 
-**Implemented (12 examples):**
-
-- [x] `minimal_example.c` - Simplest usage, basic setup
-- [x] `basic_simulation.c` - Standard incompressible Navier-Stokes
-- [x] `animated_flow_simulation.c` - Time-stepping with visualization
-- [x] `simple_animated_flow.c` - Simpler animation variant
-- [x] `velocity_visualization.c` - Output velocity field visualization
-- [x] `performance_comparison.c` - Compare different solvers
-- [x] `runtime_comparison.c` - Detailed timing comparisons
-- [x] `solver_selection.c` - Demonstrate solver registry and selection
-- [x] `custom_boundary_conditions.c` - Example BC usage
-- [x] `custom_source_terms.c` - Source term implementation
-- [x] `csv_data_export.c` - CSV output examples
-- [x] `lid_driven_cavity.c` - Lid-driven cavity with Dirichlet BCs
+**Implemented:** 12 examples (minimal, basic simulation, animated flow, velocity visualization, performance/runtime comparison, solver selection, custom BCs, custom source terms, CSV export, lid-driven cavity).
 
 **Still needed:**
 
@@ -1468,44 +570,13 @@ Train models in Python (PyTorch/JAX), deploy inference in C for:
 ### 7.1 Weight Format & Loader (P3)
 
 - [ ] Define binary weight format (.cfdnn)
-
-  ```c
-  typedef struct {
-      uint32_t magic;           // "CFDN"
-      uint32_t version;
-      uint32_t num_layers;
-      uint32_t input_dim;
-      uint32_t output_dim;
-      // Layer descriptors follow
-  } cfdnn_header_t;
-  ```
-
 - [ ] JSON metadata support for model info
 - [ ] Weight loading API
 
-  ```c
-  cfdnn_model_t* cfdnn_load(const char* path);
-  void cfdnn_free(cfdnn_model_t* model);
-  ```
-
 ### 7.2 Layer Implementations (P3)
 
 - [ ] Dense (fully connected) layer
-
-  ```c
-  void cfdnn_dense_forward(
-      const double* input, size_t in_features,
-      const double* weights, const double* bias,
-      double* output, size_t out_features
-  );
-  ```
-
-- [ ] Activation functions
-  - [ ] ReLU, Leaky ReLU
-  - [ ] Tanh, Sigmoid
-  - [ ] GELU (for transformers)
-  - [ ] Swish/SiLU
-
+- [ ] Activation functions (ReLU, Leaky ReLU, Tanh, Sigmoid, GELU, Swish/SiLU)
 - [ ] Batch normalization
 - [ ] Layer normalization
 - [ ] Dropout (inference mode = identity)
@@ -1517,15 +588,6 @@ Train models in Python (PyTorch/JAX), deploy inference in C for:
 - [ ] NEON equivalents for ARM
 - [ ] Cache-optimized tiling for large layers
 
-```c
-// Example: SIMD dense layer
-void cfdnn_dense_forward_avx2(
-    const double* input, size_t in_features,
-    const double* weights, const double* bias,
-    double* output, size_t out_features
-);
-```
-
 ### 7.4 Model Architectures (P3)
 
 | Architecture | Status | Notes |
@@ -1536,47 +598,12 @@ void cfdnn_dense_forward_avx2(
 
 ### 7.5 Inference API (P3)
 
-```c
-// High-level inference API
-typedef struct {
-    size_t nx, ny;
-    double Re;
-    double* boundary_conditions;  // Flattened BC values
-} cfdnn_input_t;
-
-typedef struct {
-    double* u;   // Velocity x-component (nx * ny)
-    double* v;   // Velocity y-component (nx * ny)
-    double* p;   // Pressure field (nx * ny)
-} cfdnn_output_t;
-
-cfd_status_t cfdnn_predict(
-    const cfdnn_model_t* model,
-    const cfdnn_input_t* input,
-    cfdnn_output_t* output
-);
-
-// Batch inference for parameter sweeps
-cfd_status_t cfdnn_predict_batch(
-    const cfdnn_model_t* model,
-    const cfdnn_input_t* inputs, size_t batch_size,
-    cfdnn_output_t* outputs
-);
-```
+- [ ] High-level `cfdnn_predict()` and `cfdnn_predict_batch()` API
+- [ ] Model loading/freeing lifecycle
 
 ### 7.6 Hybrid Solver Integration (P3)
 
 - [ ] Use ML prediction as initial guess for iterative solver
-
-  ```c
-  // ML-accelerated projection method
-  cfd_status_t ns_solve_projection_ml_init(
-      const cfdnn_model_t* surrogate,
-      ns_solver_t* solver,
-      // ... other params
-  );
-  ```
-
 - [ ] Adaptive switching: ML for steady-state estimate, CFD for accuracy
 - [ ] Uncertainty quantification (ensemble models)
 
@@ -1610,131 +637,29 @@ cfd_status_t cfdnn_predict_batch(
 
 ### Rationale
 
-Alternative to full C inference (Phase 7). Best of both worlds:
-
-- Python handles training, model management, and orchestration
-- C provides optimized compute kernels callable from Python
-- Lower implementation effort than full C inference engine
-- Easier to support new architectures (just add kernels)
+Alternative to full C inference (Phase 7). Python handles training/orchestration, C provides optimized compute kernels. Lower effort, easier to support new architectures.
 
 ### 8.1 Optimized Compute Kernels (P3)
 
-Expose SIMD-optimized operations for ML workloads via Python bindings:
-
-```c
-// Matrix operations optimized for neural network inference
-cfd_status_t cfd_matmul(
-    const double* A, size_t A_rows, size_t A_cols,
-    const double* B, size_t B_cols,
-    double* C  // Output: A_rows x B_cols
-);
-
-cfd_status_t cfd_matmul_add_bias(
-    const double* A, size_t A_rows, size_t A_cols,
-    const double* B, size_t B_cols,
-    const double* bias,  // B_cols
-    double* C
-);
-
-// Activation functions (in-place)
-cfd_status_t cfd_relu(double* data, size_t n);
-cfd_status_t cfd_tanh(double* data, size_t n);
-cfd_status_t cfd_gelu(double* data, size_t n);
-```
+- [ ] SIMD-optimized matrix operations (`cfd_matmul`, `cfd_matmul_add_bias`)
+- [ ] Activation functions (ReLU, Tanh, GELU — in-place)
 
 ### 8.2 Python-Callable Kernels (P3)
 
-Expose kernels to Python via bindings:
-
-```python
-from cfd_kernels import matmul, relu, gelu
-
-# Use C kernels in custom PyTorch module
-class CFDAcceleratedLayer(torch.nn.Module):
-    def forward(self, x):
-        # Offload to SIMD-optimized C kernel
-        out = matmul(x.numpy(), self.weight.numpy())
-        return torch.from_numpy(relu(out))
-```
+- [ ] Python bindings for C compute kernels
 
 ### 8.3 Physics Residual Kernels (P3)
 
-Optimized physics loss computation for PINN training:
-
-```c
-// Compute Navier-Stokes residuals efficiently
-cfd_status_t cfd_ns_residual(
-    const double* u, const double* v, const double* p,
-    size_t nx, size_t ny, double dx, double dy, double Re,
-    double* continuity_residual,
-    double* momentum_x_residual,
-    double* momentum_y_residual
-);
-
-// Finite difference operators (2nd order central)
-cfd_status_t cfd_gradient_x(const double* f, size_t nx, size_t ny, double dx, double* df_dx);
-cfd_status_t cfd_gradient_y(const double* f, size_t nx, size_t ny, double dy, double* df_dy);
-cfd_status_t cfd_laplacian(const double* f, size_t nx, size_t ny, double dx, double dy, double* lap_f);
-```
-
-Python usage:
-
-```python
-from cfd_kernels import ns_residual
-
-# Fast physics loss for PINN training
-def physics_loss(model, coords, Re):
-    u, v, p = model(coords)
-    cont, mom_x, mom_y = ns_residual(u, v, p, nx, ny, dx, dy, Re)
-    return (cont**2 + mom_x**2 + mom_y**2).mean()
-```
+- [ ] Optimized NS residual computation for PINN training
+- [ ] Finite difference operators (gradient, laplacian)
 
 ### 8.4 Data Generation Acceleration (P3)
 
-Fast CFD simulation for training data generation:
-
-```c
-// Batch simulation for dataset generation
-cfd_status_t cfd_generate_samples(
-    const cfd_sample_params_t* params, size_t n_samples,
-    cfd_sample_result_t* results,
-    int n_threads  // OpenMP parallelization
-);
-```
-
-```python
-from cfd_ml import generate_training_data
-
-# Generate 1000 cavity flow samples in parallel
-samples = generate_training_data(
-    problem="cavity",
-    Re_range=(10, 200),
-    n_samples=1000,
-    n_threads=8  # Uses OpenMP in C
-)
-```
+- [ ] Batch simulation API for dataset generation with OpenMP parallelization
 
 ### 8.5 Memory-Mapped Data Sharing (P3)
 
-Zero-copy data sharing between C and Python/NumPy:
-
-```c
-// Create memory-mapped buffer accessible from Python
-cfd_buffer_t* cfd_create_shared_buffer(size_t size);
-double* cfd_buffer_data(cfd_buffer_t* buf);
-void cfd_buffer_free(cfd_buffer_t* buf);
-```
-
-```python
-from cfd_memory import SharedBuffer
-
-# Zero-copy data sharing
-buf = SharedBuffer(shape=(128, 128))
-arr = buf.as_numpy()  # No copy, direct memory access
-
-# Pass to C functions without copying
-cfd.run_simulation(output_buffer=buf)
-```
+- [ ] Zero-copy shared buffers between C and Python/NumPy
 
 ### Comparison: Phase 7 vs Phase 8
 
@@ -1747,68 +672,17 @@ cfd.run_simulation(output_buffer=buf)
 | Performance | Highest | High (kernel-level) |
 | Flexibility | Fixed architectures | Any PyTorch model |
 
-### Phase 8 Success Criteria
-
-- Kernels provide >2x speedup over NumPy equivalents
-- Zero-copy memory sharing works with NumPy arrays
-- Physics residual kernel matches Python implementation to 1e-10
-- Dataset generation scales linearly with thread count
-
 ---
 
 ## Version Milestones
 
-### v0.1.0 - Foundation (Released)
-
-- [x] Boundary condition abstraction layer with runtime backend selection
-- [x] Neumann and Periodic boundary conditions (all backends)
-- [x] SIMD Poisson solvers (Jacobi, Red-Black SOR)
-- [x] GPU Jacobi Poisson solver
-- [x] Comprehensive test suite
-- [x] 11 example programs
-- [x] VTK and CSV output
-
-### v0.1.x - Boundary Condition Improvements
-
-**Completed:**
-
-- [x] Dirichlet (fixed value) boundary conditions
-- [x] No-slip wall boundary conditions
-- [x] Inlet velocity boundary conditions (uniform, parabolic, custom profiles)
-- [x] Outlet boundary conditions (zero-gradient, convective)
-- [x] Conjugate Gradient (CG) Krylov solver
-- [x] Lid-driven cavity validation (Ghia et al. 1982 benchmark)
-- [x] Cross-architecture solver validation (CPU, AVX2, OpenMP, CUDA)
-- [x] Per-architecture Ghia validation tests
-- [x] Basic documentation (OMP solvers, API constants)
-
-### v0.1.5
-
-- [x] OpenMP solver variants documented (explicit_euler_omp, projection_omp)
-- [x] Updated solver type constants with NS_SOLVER_TYPE_ prefix
-- [x] Doxyfile version synchronized with VERSION file
-- [x] Ghia validation tests use appropriate step counts for Euler solvers
-
-### v0.1.6
-
-- [x] CI: Doxygen API documentation generation in release workflow
-- [x] CI: macOS OpenMP support via Homebrew libomp
-- [x] CI: Improved CUDA toolkit configuration for Windows builds
-- [x] CI: Fixed library path detection for modular builds
-
 ### v0.1.7 - Current Release
 
-- [x] Fix stretched grid formula to correctly span domain [xmin, xmax]
-- [x] Add tanh-based stretching for boundary layer clustering
-- [x] Add 16 unit tests for grid initialization functions
+- [x] Stretched grid formula fix, tanh-based stretching, grid unit tests
 
-### v0.2.0 - 3D Support (In Progress)
+### v0.2.0 - 3D Support
 
-- [x] Phase 0: IDX_2D/IDX_3D indexing macros, replace all inline indexing
-- [x] Phase 1: Extend grid, flow_field, BCs with z-dimension (nz=1 backward compatible)
-- [x] Phase 2: 3D stencils (7-point), scalar CPU linear solvers (Jacobi, SOR, Red-Black, CG, BiCGSTAB)
-- [x] Phase 3: NS solvers with w-momentum
-- [x] Phase 4: 3D boundary conditions (all backends)
+- [x] Phases 0–4: Indexing macros, core data structures, 3D stencils, NS solvers, BCs
 - [ ] Phase 5-7: SIMD, OMP, CUDA backends for 3D
 - [ ] Phase 8: 3D VTK output, examples, validation (Taylor-Green 3D, Poiseuille 3D)
 
@@ -1849,7 +723,7 @@ cfd.run_simulation(output_buffer=buf)
 ## Priority Legend
 
 | Priority | Meaning |
-|----------|---------|
+| -------- | ------- |
 | P0 | Critical - blocks v1.0 release |
 | P1 | Important - required for v1.0 |
 | P2 | Valuable - nice to have for v1.0 |
@@ -1871,7 +745,7 @@ When working on roadmap items:
 
 ---
 
-## References
+## References & Bibliography
 
 ### CFD Validation Benchmarks