diff --git a/ROADMAP.md b/ROADMAP.md index 9395130..c5dcd51 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -33,20 +33,16 @@ Each algorithm should have scalar (CPU) + SIMD + OMP variants. Track gaps here. | **N-S Solvers** | Explicit Euler | done | done | — | done | — | | | Projection | done | done | — | done | done | | | RK2 (Heun) | done | done | — | done | — | -| **Linear Solvers** | Jacobi | done | done | done | — | done²| +| **Linear Solvers** | Jacobi | done | done | done | — | — | | | SOR | done | done | done | — | — | | | Red-Black SOR | done | done | done | done | — | | | CG / PCG | done | done | done | done | — | -| | BiCGSTAB | done | done¹ | done¹ | — | — | +| | BiCGSTAB | done | done | done | — | — | +| **Boundary Conds** | All types | done | done | done | done | done | -¹ AVX2/NEON implementations include OpenMP parallelization internally. No separate OMP backend exists. -² GPU Jacobi is monolithic inside `solver_projection_jacobi_gpu.cu`, not a standalone reusable linear solver. +### What's Missing -### Critical Gaps - -- [x] 3D support complete (all phases), see Phase 3.1 -- [x] Symmetry planes now supported -- [ ] Only structured grids +- [ ] No 3D validation benchmarks yet - [ ] No turbulence models - [ ] Limited linear solvers (no multigrid) - [ ] No restart/checkpoint capability @@ -63,16 +59,8 @@ The OMP Red-Black SOR Poisson solver fails to converge on certain problem config Impact: -- OMP projection solver switched to CG_OMP (dedicated OMP CG solver) as workaround +- OMP projection solver switched to CG_OMP as workaround - Red-Black SOR remains available but unreliable for production use with OMP backend -- CG_OMP provides reliable convergence with fully parallelized primitives - -Root cause unknown — requires investigation of: - -- Omega parameter tuning for Neumann BCs (currently uses default 1.5) -- Parallel race conditions in red/black sweeps -- Boundary condition application in OMP implementation -- Comparison with working AVX2 Red-Black SOR implementation Action items: @@ -82,30 +70,17 @@ Action items: - [ ] Add convergence diagnostics (residual history logging) - [ ] Consider switch to Chebyshev acceleration or SSOR -Workaround: Use CG or switch to AVX2/CPU backends for production. - -**Grid Convergence Non-Monotonic Behavior (P1)** +**Grid Convergence Non-Monotonic Behavior (P1)** ✅ RESOLVED -Current grid convergence tests use relaxed tolerance (`prev_error + 0.08`) because RMS error does not strictly decrease with grid refinement when using the scalar Red-Black SOR Poisson solver. - -Observed behavior: - -- 17×17: RMS ~0.056 -- 25×25: RMS ~0.037 -- 33×33: RMS ~0.094 (worse than 25×25!) - -Root cause: The scalar Poisson solver has accuracy limitations at larger grid sizes that prevent proper convergence. +Root cause: Red-Black SOR Poisson solver had insufficient convergence on larger grids. Switching all projection backends to CG (Conjugate Gradient) resolved the issue. Grid convergence is now strictly monotonic (17×17: 0.046, 25×25: 0.037, 33×33: 0.032). Action items: -- [ ] Investigate why 33×33 produces worse results than 25×25 -- [ ] May need more Poisson iterations for larger grids -- [ ] Consider using SIMD Red-Black SOR or multigrid for better accuracy -- [ ] Remove relaxed tolerance (`+ 0.08`) from grid convergence tests -- [ ] Ensure RMS(33×33) < RMS(25×25) < RMS(17×17) -- [ ] Add strict grid convergence test that FAILs if RMS increases with refinement - -Acceptance criteria: RMS error strictly decreases with each grid refinement level; convergence order approaches O(h²) asymptotically. +- [x] Investigate why 33×33 produces worse results than 25×25 — Red-Black SOR Poisson solver +- [x] May need more Poisson iterations for larger grids — solved by switch to CG +- [x] Consider using SIMD Red-Black SOR or multigrid for better accuracy — CG sufficient +- [x] Remove relaxed tolerance from grid convergence tests +- [x] Add strict grid convergence test that FAILs if RMS increases with refinement #### Limitations @@ -123,27 +98,17 @@ For the uniform-grid Laplacian with constant coefficients, the Jacobi preconditi **Modular Library Circular Dependencies (Section 4.2)** -The modular libraries have circular dependencies: `cfd_scalar`/`cfd_simd` call `poisson_solve()` (defined in `cfd_api`), while `cfd_api` links against `cfd_scalar`/`cfd_simd`. - -Current solution: - -- On Linux: GNU linker groups (`-Wl,--start-group` ... `-Wl,--end-group`) resolve circular references -- On Windows/macOS: Linker automatically handles multiple passes -- Shared builds: Recompile sources into unified shared library - -Future: Consider weak symbols, conditional registration at runtime, or plugin architecture with dynamic loading. +The modular libraries have circular dependencies: `cfd_scalar`/`cfd_simd` call `poisson_solve()` (defined in `cfd_api`), while `cfd_api` links against `cfd_scalar`/`cfd_simd`. Resolved on Linux with linker groups; Windows/macOS handle automatically. Future: consider weak symbols or plugin architecture. **OMP Loop Variable int Overflow for Large Grids (P3)** Status: Deferred — low risk, no practical impact yet -OMP backends cast `size_t` loop variables to `int` for MSVC OpenMP 2.0 compatibility. For grids where `nx * ny > INT_MAX` (~2.1 billion), this overflows. Requires grids larger than ~46K × 46K (~16 GB per field), beyond current practical scope. - -Action items: +OMP backends cast `size_t` loop variables to `int` for MSVC OpenMP 2.0 compatibility. Overflows for grids where `nx * ny > INT_MAX` (~46K × 46K). - [ ] Audit all OMP backends for `size_t` → `int` casts - [ ] Add `CFD_ASSERT(nx * ny <= INT_MAX)` guards if targeting large grids -- [ ] Consider requiring OpenMP 3.0+ (unsigned loop vars) when dropping MSVC OMP 2.0 support +- [ ] Consider requiring OpenMP 3.0+ when dropping MSVC OMP 2.0 support **Debug Mode SIMD Benchmarking (Section 1.10)** @@ -153,179 +118,26 @@ Current tests run in Debug mode where SIMD may be slower than scalar due to lack **Spectral Radius Test BC Assumption (Section 1.2.3)** -The Jacobi spectral radius test uses Dirichlet BCs (p=0 on boundary) because the ρ = cos(πh) formula applies only to the Dirichlet problem. With Neumann BCs, the discrete Laplacian has a constant null space giving eigenvalue 1. The SOR optimal ω = 2/(1 + sin(πh)) also applies to Dirichlet BCs; with Neumann BCs optimal ω is typically lower (1.5-1.7). - -#### Resolved - -**Stretched Grid Formula Bug (FIXED in v0.1.7)** - -File: `lib/src/core/grid.c` (`grid_initialize_stretched`) - -Fixed the hyperbolic stretching formula using tanh-based stretching: - -```c -x[i] = xmin + (xmax - xmin) * (1.0 + tanh(beta * (2.0 * xi - 1.0)) / tanh(beta)) / 2.0 -``` - -Fixed behavior: Grid spans full domain, points cluster near boundaries, higher beta = more clustering, beta=0 falls back to uniform grid. - -Tests added: `tests/core/test_grid.c` with 16 unit tests. +The Jacobi spectral radius test uses Dirichlet BCs (p=0 on boundary) because the ρ = cos(πh) formula applies only to the Dirichlet problem. The SOR optimal ω = 2/(1 + sin(πh)) also applies to Dirichlet BCs; with Neumann BCs optimal ω is typically lower (1.5-1.7). --- -## Phase 0: Architecture & Robustness (P0 - Critical) - -**Goal:** Transform the codebase from a research prototype to a production-safe library. - -### 0.1 Safe Error Handling - -- [x] Replace all `exit()` calls with error code propagation -- [x] Implement `cfd_status_t` and `cfd_get_error_string()` -- [x] Thread-local error context for rich error reporting -- [x] Ensure resource cleanup on error paths (no leaks on failure) +## Phase 0: Architecture & Robustness (P0 - Critical) ✅ -### 0.2 Thread Safety & Global State +**Status:** COMPLETE -- [x] Remove static buffers in `utils.c` (path management) -- [x] Make `SolverRegistry` thread-safe or context-bound -- [x] Ensure `SimulationData` and solvers are re-entrant -- [x] Validate thread-safety with concurrent simulation tests +- [x] 0.1 Safe Error Handling — `cfd_status_t`, error propagation, resource cleanup +- [x] 0.2 Thread Safety — removed static buffers, thread-safe registry, re-entrant solvers +- [x] 0.3 API Robustness — input validation, logging callback, version header, symbol visibility +- [x] 0.4 API & Robustness Testing — comprehensive test suites for all subsystems +- [x] 0.5 Error Handling & Robustness — removed static `warned` flag, `run_simulation_step()` returns `cfd_status_t` +- [x] 0.6 Structured Logging & Diagnostics — `cfd_log()` API with levels, component tags, thread-safe callbacks -### 0.3 API Robustness +**0.6 Deferred items:** -- [x] Comprehensive input validation (check for NULL, NaN, invalid ranges) -- [x] Configurable logging callback (`cfd_set_log_callback`) -- [x] Version header (`cfd_version.h`) with version macros -- [x] Symbol visibility control (hide private symbols) & export headers -- [x] Library initialization/shutdown functions - -### 0.4 API & Robustness Testing - -**Implemented:** - -- [x] Core functionality tests (`tests/core/`) -- [x] Input validation tests (`test_input_validation.c`) -- [x] Error handling tests (`test_error_handling.c`) -- [x] Re-entrancy/thread-safety tests (`test_reentrancy.c`) -- [x] Solver tests organized by architecture (CPU, SIMD, OMP, GPU) -- [x] Simulation API tests (`tests/simulation/`) -- [x] I/O tests (VTK, CSV, output paths) -- [x] Physics validation tests (`test_physics_validation.c`) - -**Still needed:** - -- [x] Negative testing suite (more edge cases) -- [x] Memory leak checks (ASan in CI; Valgrind via local `ctest -T MemCheck`) -- [x] Add `TEST_FAIL_PRINTF` macro to eliminate repeated `snprintf` + `TEST_FAIL_MESSAGE` pattern (5+ occurrences, causes `-Wformat-truncation` warnings). Enable `UNITY_INCLUDE_PRINT_FORMATTED` or define a local helper macro. - -### 0.5 Error Handling & Robustness (P0) ✅ - -**Status:** COMPLETE (PR #139, Feb 2026) - -**Problem:** Projection solvers used thread-unsafe static `warned` flag for Poisson convergence failures, continuing execution with inaccurate pressure fields instead of returning errors. - -**Solution implemented:** - -- [x] Removed static `warned` flag from all projection solvers (CPU, OMP, AVX2) ✅ -- [x] Return `CFD_ERROR_MAX_ITER` immediately on Poisson convergence failure ✅ -- [x] **Breaking API change:** `run_simulation_step()` and `run_simulation_solve()` now return `cfd_status_t` instead of `void` ✅ -- [x] Updated solver registry wrappers to propagate errors properly ✅ -- [x] Updated all ~15 test files to check return values ✅ -- [x] Updated all examples to handle errors in loops ✅ -- [x] Updated README with new function signatures ✅ -- [x] Proper distinction between `CFD_ERROR_UNSUPPORTED` (backend unavailable) and other errors ✅ - -**Impact:** - -- Thread-safe error handling (no data races in OpenMP builds) -- Users can now detect and respond to convergence failures -- Better diagnostics for validation debugging -- Compile-time safety (void → cfd_status_t breaks code that ignores errors) - -**Files modified:** - -- `lib/include/cfd/api/simulation_api.h` - Function signature changes -- `lib/src/api/simulation_api.c` - Error propagation implementation -- `lib/src/api/solver_registry.c` - Wrapper error handling -- `lib/src/solvers/navier_stokes/{cpu,omp,avx2}/solver_projection*.c` - Return errors -- `tests/**/*.c` - Test updates for error checking -- `examples/*.c` - Example error handling -- `README.md` - API documentation updates - -### 0.6 Structured Logging & Diagnostics (P2) ✅ - -**Status:** Complete - -**Implementation Tasks:** - -- [x] Define `cfd_log()` API with log levels and component tags -- [x] Implement default console handler (stderr for WARNING/ERROR, stdout for DEBUG/INFO) -- [x] Add thread-safe logging (atomic global log level, per-thread callbacks) -- [x] Replace all `fprintf(stderr, ...)` calls with `cfd_log()` -- [x] Replace diagnostic `printf()` calls with `cfd_log()` -- [x] Add log filtering by level (suppress DEBUG in production) -- [x] Support custom log handlers via `cfd_set_log_callback_ex()` (with component) -- [ ] Add log filtering by component (e.g., only show "boundary" logs) — deferred -- [ ] Add timestamps and colored output — deferred -- [ ] Add structured data API for metrics (convergence stats, timings) — deferred - -**API added:** - -```c -// Log levels (renumbered to include DEBUG) -typedef enum { - CFD_LOG_LEVEL_DEBUG = 0, - CFD_LOG_LEVEL_INFO = 1, - CFD_LOG_LEVEL_WARNING = 2, - CFD_LOG_LEVEL_ERROR = 3 -} cfd_log_level_t; - -// Core logging function with printf-style formatting -void cfd_log(cfd_log_level_t level, const char* component, - const char* fmt, ...); - -// Global log level control (atomic, thread-safe) -void cfd_set_log_level(cfd_log_level_t level); -cfd_log_level_t cfd_get_log_level(void); - -// Extended callback with component tag -typedef void (*cfd_log_callback_ex_t)(cfd_log_level_t level, - const char* component, const char* message); -void cfd_set_log_callback_ex(cfd_log_callback_ex_t callback); - -// Convenience macros -CFD_LOG_DEBUG(component, ...) -CFD_LOG_INFO(component, ...) -CFD_LOG_WARNING(component, ...) -CFD_LOG_ERROR(component, ...) -``` - -**Design decisions:** - -- Default log level is INFO (DEBUG suppressed unless explicitly enabled) -- Per-thread legacy callback takes priority over global extended callback -- `cfd_error()` always sets thread-local error state regardless of log level filter -- Convergence verbose output uses `CFD_LOG_DEBUG` with `if (params->verbose)` guard -- Component tags: "gpu", "boundary", "poisson", "projection", "solver", "simd" -- Default output format: `LEVEL [component]: message` - -**Files modified:** - -- `lib/include/cfd/core/logging.h` — Extended API with DEBUG level, `cfd_log()`, macros -- `lib/src/core/logging.c` — `cfd_log()` implementation, atomic log level, extended callback -- `lib/src/api/solver_registry.c` — 5 fprintf → CFD_LOG_ERROR/WARNING -- `lib/src/boundary/boundary_conditions.c` — 1 fprintf → CFD_LOG_ERROR -- `lib/src/solvers/linear/linear_solver.c` — 2 fprintf/printf → CFD_LOG_ERROR/DEBUG -- `lib/src/solvers/linear/simd/linear_solver_simd_dispatch.c` — 1 fprintf → CFD_LOG_DEBUG -- `lib/src/solvers/linear/cpu/linear_solver_cg.c` — 1 printf → CFD_LOG_DEBUG -- `lib/src/solvers/linear/cpu/linear_solver_bicgstab.c` — 1 printf → CFD_LOG_DEBUG -- `lib/src/solvers/linear/avx2/linear_solver_cg_avx2.c` — 1 printf → CFD_LOG_DEBUG -- `lib/src/solvers/linear/neon/linear_solver_cg_neon.c` — 1 printf → CFD_LOG_DEBUG -- `lib/src/solvers/linear/omp/linear_solver_cg_omp.c` — 1 printf → CFD_LOG_DEBUG -- `lib/src/solvers/navier_stokes/avx2/solver_explicit_euler_avx2.c` — 4 printf → CFD_LOG_INFO -- `lib/src/solvers/navier_stokes/avx2/solver_projection_avx2.c` — 1 fprintf → CFD_LOG_WARNING -- `lib/src/solvers/navier_stokes/avx2/solver_rk2_avx2.c` — 2 printf → CFD_LOG_INFO -- `tests/core/test_logging.c` — 11 new tests (16 total) +- [ ] Log filtering by component (e.g., only show "boundary" logs) +- [ ] Timestamps and colored output +- [ ] Structured data API for metrics (convergence stats, timings) --- @@ -333,129 +145,23 @@ CFD_LOG_ERROR(component, ...) **Goal:** Make the solver practically usable for real problems. -### 1.1 Boundary Conditions (P0 - Critical) - -- [x] Boundary condition abstraction layer (`boundary_conditions.h/.c`) -- [x] Runtime backend selection (Scalar, SIMD, OpenMP, GPU) -- [x] Neumann (zero-gradient) boundary conditions -- [x] Periodic boundary conditions -- [x] SIMD-optimized BC application (AVX2) -- [x] OpenMP-parallelized BC application -- [x] CUDA GPU BC kernels -- [x] Dirichlet (fixed value) boundary conditions -- [x] No-slip wall conditions -- [x] Inlet velocity specification (uniform, parabolic, custom profiles) -- [x] Outlet (zero-gradient/convective) -- [x] Symmetry planes -- [x] Moving wall boundaries (via Dirichlet BCs, see lid-driven cavity example) -- [x] Time-varying boundary conditions - -**Implemented files:** - -- `lib/include/cfd/boundary/boundary_conditions.h` - Public API with backend selection -- `lib/include/cfd/boundary/boundary_conditions_gpu.cuh` - GPU API declarations -- `lib/src/boundary/boundary_conditions.c` - Public API dispatcher -- `lib/src/boundary/cpu/boundary_conditions_scalar.c` - Scalar implementation -- `lib/src/boundary/cpu/boundary_conditions_outlet_scalar.c` - Scalar outlet BC -- `lib/src/boundary/omp/boundary_conditions_omp.c` - OpenMP parallelization -- `lib/src/boundary/omp/boundary_conditions_outlet_omp.c` - OpenMP outlet BC -- `lib/src/boundary/avx2/boundary_conditions_avx2.c` - AVX2 SIMD optimizations -- `lib/src/boundary/avx2/boundary_conditions_inlet_avx2.c` - AVX2 inlet BC -- `lib/src/boundary/avx2/boundary_conditions_outlet_avx2.c` - AVX2 outlet BC -- `lib/src/boundary/neon/boundary_conditions_neon.c` - NEON SIMD optimizations -- `lib/src/boundary/neon/boundary_conditions_inlet_neon.c` - NEON inlet BC -- `lib/src/boundary/neon/boundary_conditions_outlet_neon.c` - NEON outlet BC -- `lib/src/boundary/simd/boundary_conditions_simd_dispatch.c` - SIMD runtime dispatch -- `lib/src/boundary/gpu/boundary_conditions_gpu.cu` - CUDA kernels -- `lib/src/boundary/boundary_conditions_internal.h` - Internal declarations -- `lib/src/boundary/boundary_conditions_inlet_common.h` - Shared inlet BC helpers -- `lib/src/boundary/boundary_conditions_outlet_common.h` - Shared outlet BC helpers -- `lib/src/boundary/boundary_conditions_time.h` - Time modulation helpers -- `lib/src/boundary/cpu/boundary_conditions_inlet_time_scalar.c` - Time-varying inlet scalar implementation -- `tests/core/test_boundary_conditions_symmetry.c` - Symmetry BC unit tests - -#### 1.1.1 Boundary Conditions Code Refactoring (P2) ✅ - -**Refactoring Complete.** Reduced ~500 lines of duplicated code across BC backends using parameterized header templates. - -1. **Consolidate Inlet BC Implementations (Priority 1)** - - [x] Deleted 3 redundant inlet files (OMP/AVX2/NEON) — all delegate to scalar - - **Savings:** 196 lines removed - -2. **Extract Common Outlet SIMD Template (Priority 2)** - - [x] Created `boundary_conditions_outlet_simd.h` parameterized by SIMD intrinsics - - [x] AVX2/NEON outlet files reduced from ~135 lines each to ~33 lines - - **Savings:** 99 lines removed - -3. **Templatize OMP vs Scalar (Priority 3)** - - [x] Created `boundary_conditions_core_impl.h` with token-pasting macros - - [x] Scalar and OMP files include shared template with `BC_CORE_USE_OMP` flag - - **Savings:** 29 lines removed - -4. **Unify AVX2 and NEON (Priority 4)** - - [x] Created `boundary_conditions_simd_impl.h` parameterized by STORE/LOAD/BROADCAST/VEC_TYPE/WIDTH/MASK - - [x] AVX2/NEON main BC files reduced from ~218 lines each to ~55 lines - - **Savings:** 181 lines removed - -**Files created:** - -- `lib/src/boundary/boundary_conditions_core_impl.h` -- `lib/src/boundary/boundary_conditions_outlet_simd.h` -- `lib/src/boundary/boundary_conditions_simd_impl.h` - -**Files deleted:** - -- `lib/src/boundary/omp/boundary_conditions_inlet_omp.c` -- `lib/src/boundary/avx2/boundary_conditions_inlet_avx2.c` -- `lib/src/boundary/neon/boundary_conditions_inlet_neon.c` - -**Total savings:** ~505 lines removed across 4 priorities +### 1.1 Boundary Conditions (P0 - Critical) ✅ -### 1.2 Linear Solvers (P0 - Critical) +All BC types implemented across all backends (Scalar, AVX2, NEON, OMP, GPU): Dirichlet, Neumann, Periodic, No-slip, Inlet (uniform/parabolic/custom), Outlet, Symmetry, Moving wall, Time-varying. Code refactored with shared templates (~505 lines removed). -**Implemented:** +### 1.2 Linear Solvers (P0 - Critical) -- [x] SOR (Successive Over-Relaxation) - CPU baseline -- [x] Jacobi SIMD (`poisson_jacobi_simd.c`) - AVX2 vectorized, fully parallelizable -- [x] Red-Black SOR SIMD (`poisson_redblack_simd.c`) - AVX2 with SOR convergence rate -- [x] GPU Jacobi Poisson solver (`solver_projection_jacobi_gpu.cu`) -- [x] Integrate SIMD Poisson into projection solver (`solver_projection_simd.c`) +**Implemented:** Jacobi, SOR, Red-Black SOR, CG/PCG, BiCGSTAB (all with SIMD backends). CG is the default Poisson solver for all projection methods. **Still needed:** -- [x] Solver abstraction interface -- [x] Conjugate Gradient (CG) for SPD systems (scalar, AVX2, NEON, OMP backends) - - **Note:** CG is now the default Poisson solver for all projection methods (PR #139) - - CPU uses CG_SCALAR, AVX2 uses CG_SIMD, OMP uses CG_OMP for reliable O(√κ) convergence -- [x] BiCGSTAB for non-symmetric systems ✅ - - [x] BiCGSTAB Scalar (CPU) ✅ - - [x] BiCGSTAB SIMD (AVX2/NEON with integrated OpenMP parallelization) ✅ - - **Note:** AVX2 and NEON implementations include OpenMP parallelization via `#pragma omp parallel for` in inner loops. There is no separate `POISSON_BACKEND_OMP` backend for BiCGSTAB. - - **Files created (BiCGSTAB backends):** - - - `lib/src/solvers/linear/avx2/linear_solver_bicgstab_avx2.c` - AVX2 SIMD implementation with OpenMP - - `lib/src/solvers/linear/neon/linear_solver_bicgstab_neon.c` - NEON SIMD implementation with OpenMP - - `tests/math/test_bicgstab_avx2.c` - AVX2 vs scalar consistency test - - `tests/math/test_bicgstab_neon.c` - NEON vs scalar consistency test - - Updated files: - - - `lib/src/solvers/linear/simd/linear_solver_simd_dispatch.c` - Added BiCGSTAB dispatcher - - `lib/src/solvers/linear/linear_solver_internal.h` - Added BiCGSTAB SIMD factory declaration - - [ ] GMRES (Generalized Minimal Residual) for non-symmetric systems - [ ] GMRES scalar - [ ] GMRES AVX2 - [ ] GMRES NEON - [ ] GMRES OMP - [ ] GMRES GPU -- [x] Jacobi (diagonal) preconditioner for CG (scalar, AVX2, NEON backends) - [ ] SSOR (Symmetric SOR) preconditioner -- [x] SOR SIMD variants (Block SOR — see docs/technical-notes/block-sor-simd.md) - - [x] SOR AVX2 - - [x] SOR NEON - [ ] GPU (CUDA) linear solver backends - [ ] Standalone Jacobi GPU (refactor from monolithic `solver_projection_jacobi_gpu.cu`) - [ ] Red-Black SOR GPU @@ -467,86 +173,7 @@ CFD_LOG_ERROR(component, ...) - [ ] AMG preconditioner (for use with CG/GMRES/BiCGSTAB) - [ ] Performance benchmarking in Release mode -**Note:** See [Known Issues](#known-issues) — SIMD Poisson strict tolerance limitation. - -#### 1.2.1 Poisson Solver Accuracy Tests ✅ - -- [x] Zero RHS test (solution should remain constant) -- [x] Uniform RHS test (quadratic solution) -- [x] Sinusoidal RHS test with analytical solution comparison -- [x] Convergence rate verification for all Poisson variants (SOR, Jacobi, Red-Black) -- [x] Residual convergence tracking - -**Files created:** - -- `tests/math/test_poisson_accuracy.c` - 15 tests covering all accuracy requirements - -#### 1.2.2 Laplacian Operator Validation ✅ - -**Test the discrete Laplacian against manufactured solutions:** - -- [x] Manufactured solution: p = sin(πx)sin(πy) → ∇²p = -2π²p -- [x] Verify 2nd-order accuracy O(dx²) with grid refinement -- [x] Compare CPU and CG implementations (SIMD compared when available) -- [x] Verify stencil symmetry (self-adjoint property) - -**Files created:** - -- `tests/math/test_laplacian_accuracy.c` - 4 tests covering all accuracy requirements - -**Files tested:** `lib/src/solvers/linear/cpu/linear_solver_cg.c` - `apply_laplacian()` - -#### 1.2.3 Linear Solver Convergence Validation ✅ - -**Verify convergence rates match theory:** - -- [x] Jacobi: spectral radius ρ = cos(πh) verified to <1% accuracy (Dirichlet BCs) -- [x] SOR: over-relaxation (ω > 1) converges faster than Gauss-Seidel -- [x] Red-Black SOR: comparable convergence to standard SOR (ratio 0.5-2.0) -- [x] CG: convergence in O(√κ) iterations verified - -**Files created:** - -- `tests/math/test_linear_solver_convergence.c` - 6 tests covering convergence properties - -**Note:** See [Known Issues](#known-issues) — spectral radius test BC assumption. - -**Files still to create (future work):** - -- `src/solvers/linear/multigrid.c` - -#### 1.2.4 Preconditioned Conjugate Gradient (PCG) ✅ - -**Jacobi (diagonal) preconditioner for CG solver:** - -- [x] Added `poisson_precond_type_t` enum (NONE, JACOBI) to `poisson_solver.h` -- [x] Added `preconditioner` field to `poisson_solver_params_t` -- [x] Implemented PCG algorithm in scalar CG solver -- [x] Implemented PCG in AVX2 backend with SIMD-vectorized preconditioner application -- [x] Implemented PCG in NEON backend with SIMD-vectorized preconditioner application -- [x] Backward compatible: POISSON_PRECOND_NONE (default) gives standard CG behavior - -**PCG Algorithm:** - -``` -z = M⁻¹r (apply preconditioner) -p = z (search direction from preconditioned residual) -ρ = (r, z) (instead of (r, r)) -``` - -**Note:** See [Known Issues](#known-issues) — Jacobi preconditioner on uniform grids. - -**Files modified:** - -- `lib/include/cfd/solvers/poisson_solver.h` - Added preconditioner enum and params field -- `lib/src/solvers/linear/linear_solver.c` - Default params initialization -- `lib/src/solvers/linear/cpu/linear_solver_cg.c` - Scalar PCG implementation -- `lib/src/solvers/linear/avx2/linear_solver_cg_avx2.c` - AVX2 PCG implementation -- `lib/src/solvers/linear/neon/linear_solver_cg_neon.c` - NEON PCG implementation - -**Files created:** - -- `tests/math/test_pcg_convergence.c` - 4 tests verifying PCG correctness and consistency +**Completed sub-sections:** Poisson Accuracy Tests (1.2.1), Laplacian Validation (1.2.2), Convergence Validation (1.2.3), PCG (1.2.4). ### 1.3 Numerical Schemes (P1) @@ -554,88 +181,9 @@ p = z (search direction from preconditioned residual) - [ ] Central differencing with delayed correction - [ ] High-resolution TVD schemes (Van Leer, Superbee) - [ ] Gradient limiters (Barth-Jespersen, Venkatakrishnan) - -#### 1.3.1 Finite Difference Stencil Tests ✅ - -Unit tests for individual stencil operations: - -- [x] First derivative (central difference) - verify O(h²) accuracy -- [x] Second derivative - verify O(h²) accuracy -- [x] 2D Laplacian (5-point stencil) - verify O(h²) accuracy -- [x] Divergence operator - verify O(h²) accuracy -- [x] Gradient operator - verify O(h²) accuracy - -**Test approach:** Use smooth analytical functions (e.g., `sin(kx)*sin(ky)`), compute numerical derivatives using shared stencil implementations, compare to analytical derivatives, verify error scaling. - -**Files created:** - -- `lib/include/cfd/math/stencils.h` - Shared stencil implementations (header-only) -- `tests/math/test_finite_differences.c` - 9 tests verifying O(h²) convergence - -**Future work:** - - [ ] Migrate solver code to use `cfd/math/stencils.h` (currently inline implementations) - - `solver_explicit_euler.c`, `solver_projection.c`, `linear_solver_jacobi.c`, etc. - - This ensures tests exercise the exact production code paths - -#### 1.3.2 Convergence Order Verification ✅ - -- [x] Spatial convergence tests (h-refinement: 16→32→64→128) -- [x] Temporal convergence tests (dt-refinement) -- [x] Automated order-of-accuracy computation -- [x] Verify super-linear spatial convergence (~1.5 order, BC-limited) - -**Results:** See [Known Issues](#known-issues) — convergence order BC-limited. Spatial rate > 1.4 achieved, error decreases monotonically with refinement. - -**Files created:** - -- `tests/math/test_convergence_order.c` - -#### 1.3.3 Method of Manufactured Solutions (MMS) ✅ - -- [x] Define manufactured velocity/pressure fields with known derivatives -- [x] Compute analytical source terms from Navier-Stokes substitution -- [x] Run solver with manufactured source terms -- [x] Compare numerical to manufactured solution -- [x] Verify convergence order - -**Manufactured solution (Modified Taylor-Green):** - -```c -u(x,y,t) = cos(x) * sin(y) * exp(-αt) -v(x,y,t) = -sin(x) * cos(y) * exp(-αt) -Source: f = (2ν - α) · u_exact // With α = ν -``` - -**Implementation:** - -- Added `ns_source_func_t` callback to `ns_solver_params_t` for custom source terms -- Propagated callback to all solver variants: CPU explicit Euler (via `compute_source_terms()`), CPU projection, AVX2 explicit Euler (scalar path), AVX2 projection, OMP explicit Euler, OMP projection, OMP RK2 (already via helper) -- Created `tests/math/test_mms.c` with 5 tests: - 1. Source callback mechanism verification (constant source α=0 for robust direction check) - 2. Spatial convergence (Euler) — verified O(h^1.5+), monotonic with 1% tolerance - 3. Spatial convergence (RK2) — verified O(h^1.5+), monotonic with 1% tolerance - 4. Temporal convergence (Euler) — verified error decreases - 5. Temporal convergence (RK2) — verified better rate than Euler - -#### 1.3.4 Divergence-Free Constraint Validation ✅ - -**Verify projection method enforces incompressibility:** - -- [x] Measure max|∇·u| after projection step (should be bounded) -- [x] Test with various initial velocity fields (sinusoidal, Taylor-Green, vortex pair) -- [x] Verify all projection backends (CPU, AVX2, OMP, GPU) - -**Results:** - -- Divergence computation verified against analytically div-free fields (<1e-10) -- All backends produce consistent results (within 10%) -- Divergence stays bounded (<10.0) - matches existing solver behavior -- GPU test skipped when CUDA not available -**Files created:** - -- `tests/math/test_divergence_free.c` +**Completed sub-sections:** Stencil Tests (1.3.1), Convergence Order (1.3.2), MMS (1.3.3), Divergence-Free (1.3.4). ### 1.4 Steady-State Solver (P1) @@ -644,24 +192,15 @@ Source: f = (2ν - α) · u_exact // With α = ν - [ ] Pseudo-transient continuation - [ ] Convergence acceleration (relaxation) -**Files to create:** - -- `include/cfd/solvers/steady_state.h` -- `src/solvers/steady/simple.c` -- `src/solvers/steady/simplec.c` -- `src/solvers/steady/piso.c` - ### 1.5 Time Integration (P1) -Each time integrator requires a scalar (CPU) reference implementation first, then backend variants. -See `/add-ns-time-integrator` command for the cross-backend workflow. +Each time integrator requires a scalar (CPU) reference implementation first, then backend variants. See `/add-ns-time-integrator` for the cross-backend workflow. + +**Implemented:** RK2 (Heun's method) — all CPU/AVX2/OMP backends, O(dt²) verified. -**Algorithms:** +**Still needed:** -- [x] RK2 (Heun's method) — scalar - - [x] RK2 AVX2/SIMD (`rk2_avx2`, `rk2_optimized`) - - [x] RK2 OpenMP (`rk2_omp`) - - [ ] RK2 CUDA (`rk2_gpu`) +- [ ] RK2 CUDA (`rk2_gpu`) - [ ] RK4 (classical Runge-Kutta) — scalar - [ ] RK4 AVX2+OMP (`rk4_optimized`) - [ ] RK4 OpenMP (`rk4_omp`) @@ -671,24 +210,6 @@ See `/add-ns-time-integrator` command for the cross-backend workflow. - [ ] BDF2 (backward differentiation) - [ ] Adaptive time stepping with error control -**RK2 Implementation:** - -- O(dt²) temporal accuracy verified via self-convergence test (ratio ≈ 4.0) -- Uses periodic stencil indexing in RHS evaluation to avoid ghost-cell order reduction -- BCs applied only after full RK2 step (not between stages) - -**Files created:** - -- `lib/src/solvers/navier_stokes/cpu/solver_rk2.c` -- `tests/solvers/navier_stokes/cpu/test_solver_rk2.c` - -**Files modified:** - -- `lib/include/cfd/solvers/navier_stokes_solver.h` — added `NS_SOLVER_TYPE_RK2` -- `lib/src/api/solver_registry.c` — registered RK2 factory -- `lib/CMakeLists.txt` — added source file -- `CMakeLists.txt` — added test entries - ### 1.6 Restart/Checkpoint (P1) - [ ] Binary checkpoint format @@ -706,12 +227,6 @@ Solve F(x)=0 where F is nonlinear. Required for steady-state Navier-Stokes. - [ ] Line search and globalization - [ ] Nonlinear solver abstraction interface -**Files to create:** - -- `include/cfd/solvers/nonlinear_solver.h` -- `src/solvers/nonlinear/newton.c` -- `src/solvers/nonlinear/picard.c` - ### 1.8 Eigenvalue Solvers (P3) Find eigenvalues/eigenvectors for stability analysis. @@ -721,49 +236,16 @@ Find eigenvalues/eigenvectors for stability analysis. - [ ] Arnoldi iteration - [ ] Stability analysis framework -**Files to create:** - -- `include/cfd/solvers/eigen_solver.h` -- `src/solvers/eigen/power_iteration.c` -- `src/solvers/eigen/arnoldi.c` - ### 1.9 Derived Fields (P2) **Status:** OpenMP implemented, SIMD and CUDA pending -**Current Implementation:** - -Located in `lib/src/core/derived_fields.c`: - -- [x] OpenMP parallelization for velocity magnitude computation -- [x] OpenMP parallel reduction for field statistics (min/max/avg) -- [x] Threshold-based parallelization (OMP_THRESHOLD = 1000 cells) -- [x] Fallback to sequential for small grids +**Current:** OpenMP parallelization for velocity magnitude and field statistics with threshold-based parallelization (OMP_THRESHOLD = 1000 cells). **Still needed:** #### SIMD Optimization (AVX2/NEON) -Best for: All grid sizes, especially when combined with OpenMP - -**Implementation approach:** - -```c -// Velocity magnitude with AVX2 (4 doubles per iteration) -for (size_t i = 0; i < n; i += 4) { - __m256d u = _mm256_loadu_pd(&field->u[i]); - __m256d v = _mm256_loadu_pd(&field->v[i]); - __m256d u2 = _mm256_mul_pd(u, u); - __m256d v2 = _mm256_mul_pd(v, v); - __m256d sum = _mm256_add_pd(u2, v2); - __m256d mag = _mm256_sqrt_pd(sum); - _mm256_storeu_pd(&vel_mag[i], mag); -} -// Handle remainder elements -``` - -**Tasks:** - - [ ] Add AVX2 implementation for velocity magnitude - [ ] Add NEON implementation for velocity magnitude (ARM64) - [ ] Combine with OpenMP (`#pragma omp simd` or manual intrinsics) @@ -771,193 +253,38 @@ for (size_t i = 0; i < n; i += 4) { - [ ] Add runtime CPU feature detection - [ ] Benchmark against scalar+OpenMP version -**Benefits:** - -- No thread overhead (works on small grids) -- Combines well with OpenMP for multi-core -- Reference: existing SIMD solvers in `lib/src/solvers/linear/avx2/` - -**Challenges:** - -- Architecture-specific code paths needed -- Remainder handling for non-aligned sizes -- Statistics reductions more complex with SIMD - #### CUDA GPU Acceleration -Best for: Very large grids (100k+ cells), GPU available - -**Implementation approach:** - -```cuda -__global__ void compute_velocity_magnitude_kernel( - double* __restrict__ vel_mag, - const double* __restrict__ u, - const double* __restrict__ v, - size_t n) -{ - size_t i = blockIdx.x * blockDim.x + threadIdx.x; - if (i < n) { - vel_mag[i] = sqrt(u[i] * u[i] + v[i] * v[i]); - } -} -``` - -**Tasks:** - - [ ] Add CUDA kernel for velocity magnitude - [ ] Add parallel reduction for statistics (use CUB library or custom) - [ ] Share GPU memory with CUDA solvers (avoid CPU<->GPU transfers) - [ ] Add threshold logic (only use GPU for large grids) - [ ] Benchmark transfer overhead vs compute benefit -**Benefits:** - -- Massive parallelism for large grids -- Can share GPU memory with CUDA solvers -- Reference: existing CUDA infrastructure in `lib/src/solvers/gpu/` - -**Challenges:** - -- GPU memory transfer overhead for small grids -- More complex reduction implementation -- Only beneficial if data already on GPU - -**Performance Thresholds (Approximate):** - -| Grid Size | Recommended Approach | -|------------------|----------------------------------------------| -| < 1,000 cells | Sequential (overhead not worth it) | -| 1,000 - 10,000 | OpenMP (2-4 threads) ✅ Implemented | -| 10,000 - 100,000 | OpenMP + SIMD | -| > 100,000 | CUDA (if GPU available) or OpenMP + SIMD | - -**Note:** These operations are memory-bound, not compute-bound: - -- Velocity magnitude: Read 2 arrays, write 1 array -- Statistics: Read 1 array, compute 3 values - -For memory-bound operations: - -- SIMD helps with cache line utilization -- OpenMP helps with memory bandwidth aggregation across cores -- CUDA helps only if data is already on GPU - -**Files to Modify:** - -- `lib/src/core/derived_fields.c` - Add SIMD/CUDA variants -- `lib/src/core/derived_fields_simd.c` - New file for SIMD implementation -- `lib/src/core/derived_fields_gpu.cu` - New file for CUDA kernels -- `tests/io/test_csv_output.c` - Verify correctness after optimization - ### 1.10 SIMD Projection Solver Optimization (P2) -**Status:** SIMD Poisson integration completed (December 2024), further optimizations pending - -**Current State:** - -- [x] SIMD Poisson solvers implemented (Jacobi and Red-Black SOR with AVX2) -- [x] SIMD Poisson integrated into projection solver (`solver_projection_simd.c`) -- [x] Full projection method with SIMD corrector step -- [x] All tests pass with results matching scalar implementation -- [x] `u_new` buffer reused as temp for Poisson solver (in-place Red-Black SOR) - -**Reference:** See [docs/technical-notes/simd-optimization-analysis.md](docs/technical-notes/simd-optimization-analysis.md) for detailed technical analysis. - -**Performance Analysis:** - -Current SIMD implementation provides ~1.3-1.5x speedup with Poisson solver being 70-80% of total runtime. Speedup limited by Amdahl's law: - -- Parallelizable fraction: ~80% (Poisson + Corrector) -- SIMD speedup: 1.5-2x (Red-Black has gather/scatter overhead) -- Expected overall: 1.3-1.5x (matches observations) +**Status:** SIMD Poisson integration completed, further optimizations pending. Current ~1.3-1.5x speedup limited by Amdahl's law (parallelizable fraction ~80%). #### High Priority -##### 1. Improve Poisson Convergence for Non-Trivial Problems - -SIMD Poisson solvers don't fully converge to 1e-6 tolerance on challenging problems (e.g., sinusoidal RHS) within current iteration limits. - -**Tasks:** - -- [ ] Increase `POISSON_MAX_ITER` from current 1000/2000 to higher values -- [ ] Implement adaptive tolerance based on problem scale +- [ ] Increase `POISSON_MAX_ITER` or implement adaptive tolerance - [ ] Add optional multigrid preconditioner for faster convergence -- [ ] Investigate Red-Black omega parameter tuning for specific problem classes - -##### 2. Performance Benchmarking in Release Mode - -See [Known Issues](#known-issues) — debug mode SIMD benchmarking. - -**Tasks:** - +- [ ] Investigate Red-Black omega parameter tuning - [ ] Create Release mode benchmark suite -- [ ] Measure actual speedup vs scalar implementation - [ ] Profile to identify remaining bottlenecks -- [ ] Compare against theoretical Amdahl's law predictions -- [ ] Document performance characteristics for different grid sizes - -#### Medium Priority - -##### 3. OpenMP + SIMD Hybrid -Combine thread parallelism with SIMD vectorization for maximum CPU utilization. - -**Current State:** Separate OMP and SIMD backends exist, but no hybrid implementation. - -**Tasks:** +#### Medium Priority — OpenMP + SIMD Hybrid - [ ] Implement hybrid projection solver using OpenMP across rows + SIMD within rows -- [ ] Use `#pragma omp parallel for` on outer loop with SIMD intrinsics in inner loop -- [ ] Benchmark scaling efficiency (measure speedup vs threads) +- [ ] Benchmark scaling efficiency - [ ] Compare against pure OMP and pure SIMD implementations -- [ ] Handle load balancing for Red-Black coloring with OpenMP - -**Benefits:** -- Jacobi allows full parallelization across rows -- Red-Black allows parallelization within each color -- Could achieve 4-8x speedup on modern multi-core CPUs - -**Reference Files:** - -- `lib/src/solvers/navier_stokes/omp/solver_projection_omp.c` - OMP implementation -- `lib/src/solvers/navier_stokes/avx2/solver_projection_avx2.c` - SIMD implementation -- `lib/src/solvers/linear/avx2/` - SIMD Poisson solvers - -#### Low Priority - -##### 4. Multigrid SIMD Implementation - -Achieve O(N) complexity vs O(N²) for iterative methods. - -**Benefits:** - -- Multigrid offers O(N) complexity vs O(N²) for Jacobi/SOR -- Natural parallelism at each grid level -- Can use SIMD at each level for additional speedup -- Essential for large-scale 3D simulations - -**Tasks:** +#### Low Priority — Multigrid SIMD - [ ] Implement multigrid V-cycle framework - [ ] Add restriction/prolongation operators - [ ] Use SIMD Jacobi/Red-Black as smoothers at each level - [ ] Benchmark convergence rate vs pure iterative methods -- [ ] Validate against analytical solutions - -**Challenges:** - -- Complex implementation requiring careful design -- Grid hierarchy management -- Operator construction at multiple levels -- Testing and validation complexity - -**Priority Justification:** Low priority because: - -- Current SIMD implementation adequate for 2D problems -- Critical for 3D but 3D support itself is Phase 2 -- Better to implement multigrid after 3D infrastructure exists --- @@ -1011,31 +338,9 @@ Achieve O(N) complexity vs O(N²) for iterative methods. **Goal:** Support complex geometries. -### 3.1 3D Support (P0 - Critical) — **ACTIVE PRIORITY** - -**Status:** Complete (all 8 phases). See [3D Extension Plan](docs/technical-notes/3d-extension-plan.md) for full design. - -**Approach:** "2D as subset of 3D" — `nz=1` produces bit-identical results to current 2D code. No separate 2D/3D codepaths. All code becomes 3D-aware with branch-free solver loops using precomputed constants (`stride_z=0`, `inv_dz2=0.0` when `nz==1`). - -**Phases:** +### 3.1 3D Support (P0 - Critical) ✅ -- [x] Phase 0: Introduce `IDX_2D`/`IDX_3D` indexing macros (`lib/include/cfd/core/indexing.h`), replace all inline `j*nx+i` patterns across ~50 files -- [x] Phase 1: Extend core data structures (grid: `nz, z[], dz[], stride_z, inv_dz2, k_start, k_end`; flow_field: `w, nz`; BCs: `front/back`) -- [x] Phase 2: Add 3D stencils (7-point) and update scalar CPU linear solvers -- [x] Phase 3: Update scalar CPU NS solvers with w-momentum equation -- [x] Phase 4: Extend boundary conditions for z-faces (all backends) -- [x] Phase 5: Update SIMD backends (AVX2/NEON) for 3D -- [x] Phase 6: Update OMP backends for 3D -- [x] Phase 7: Update CUDA backend for 3D -- [x] Phase 8: Update I/O (VTK 3D), examples, validation tests, and docs - -**Key design decisions:** - -- Branch-free inner loops: `stride_z=0` makes z-neighbor reads return same cell, z-terms vanish mathematically -- Cache-optimized: k→j→i loop order, three-plane working set fits L2, OMP `collapse(2) schedule(static)` -- Backward-compatible API: `grid_create(nx, ny, ...)` becomes wrapper for `grid_create_3d(nx, ny, 1, ...)` - -**Estimated effort:** ~90 files, 12-18 weeks across all phases +**Status:** Complete. "2D as subset of 3D" approach — `nz=1` produces bit-identical results to previous 2D code. Branch-free solver loops using precomputed constants (`stride_z=0`, `inv_dz2=0.0`). All 8 phases done: indexing macros, core data structures, 3D stencils, NS solvers, BCs, SIMD, OMP, CUDA, I/O, examples, docs. ### 3.2 Unstructured Meshes (P1) @@ -1080,19 +385,7 @@ Achieve O(N) complexity vs O(N²) for iterative methods. ### 4.2 Modular Backend Libraries (P1) ✅ -**Goal:** Split the monolithic library into separate per-backend libraries for flexible deployment. - -**Status:** Implemented in v0.1.5 - -**Rationale:** - -- Users only link what they need (smaller binaries) -- CUDA library can be loaded dynamically (plugin pattern) -- Clear ABI boundaries between backends -- Allows mixing compiler toolchains (e.g., nvcc for CUDA only) -- Easier to add new backends (OpenCL, SYCL, etc.) - -**Library Structure:** +**Status:** Implemented in v0.1.5. Split monolithic library into per-backend targets. | Library | CMake Target | Contents | Dependencies | | ------- | ------------ | -------- | ------------ | @@ -1103,40 +396,16 @@ Achieve O(N) complexity vs O(N²) for iterative methods. | `cfd_cuda` | `CFD::CUDA` | CUDA GPU solvers | CFD::Core | | `cfd_library` | `CFD::Library` | Unified library (all backends) | All above | -**Implementation Tasks:** +**Still needed:** -- [x] Split `lib/CMakeLists.txt` into multiple library targets -- [x] Create separate source lists for each backend -- [x] Define public/private header boundaries per library -- [x] Add CMake aliases for clean target references -- [x] Add comprehensive test suite (`test_modular_libraries.c`) -- [x] Support shared library builds (both static and shared supported) - [ ] Update examples/tests to link against specific backends (optional) - [ ] Plugin loading system for dynamic backend selection -**Known Limitations:** See [Known Issues](#known-issues) — modular library circular dependencies. - -**Usage Examples:** - -```cmake -# Currently, use the unified library for full functionality -target_link_libraries(my_app PRIVATE CFD::Library) - -# Future: Link only what you need (after refactoring) -# target_link_libraries(my_app PRIVATE CFD::Core CFD::SIMD) -``` +See [Known Issues](#known-issues) — modular library circular dependencies. ### 4.3 GPU Improvements (P2) -**Implemented:** - -- [x] CUDA device detection and selection -- [x] GPU device information queries (compute capability, memory) -- [x] Projection method with Jacobi Poisson solver on GPU -- [x] GPU memory management (persistent memory, basic async transfers) -- [x] GPU boundary condition kernels -- [x] Configurable GPU settings (block size, convergence tolerance) -- [x] GPU solver statistics (kernel time, transfer time, iterations) +**Implemented:** CUDA device detection, GPU projection with Jacobi Poisson, GPU memory management, GPU BC kernels, configurable GPU settings, GPU solver statistics. **Still needed:** @@ -1168,13 +437,7 @@ target_link_libraries(my_app PRIVATE CFD::Library) ### 5.2 Modern VTK (P1) -**Implemented:** - -- [x] VTK legacy ASCII format output -- [x] Scalar field output (`write_vtk_output`) -- [x] Vector field output (`write_vtk_vector_output`) -- [x] Full flow field output (`write_vtk_flow_field`) -- [x] Timestamped run directories (`write_vtk_*_run` functions) +**Implemented:** VTK legacy ASCII format, scalar/vector/flow field output, timestamped run directories. **Still needed:** @@ -1183,13 +446,9 @@ target_link_libraries(my_app PRIVATE CFD::Library) - [ ] Time series support - [ ] Binary encoding -### 5.3 CSV Output (Implemented) +### 5.3 CSV Output ✅ -- [x] Timeseries data (step, time, dt, velocity stats, pressure stats) -- [x] Centerline profiles (horizontal/vertical) -- [x] Global statistics (min/max/avg of all fields) -- [x] Velocity magnitude columns -- [x] Automatic header creation and append mode +Implemented: timeseries data, centerline profiles, global statistics, velocity magnitude, automatic headers. ### 5.4 Restart Files (P1) @@ -1208,146 +467,34 @@ target_link_libraries(my_app PRIVATE CFD::Library) **Goal:** Validate against reference solutions and provide comprehensive documentation. -**Note:** API/robustness testing is in Phase 0.4. Mathematical accuracy validation (stencils, convergence, MMS) is in Phase 1.3. Linear solver validation is in Phase 1.2. - ### 6.1 Benchmark Validation (P0 - Critical) #### 6.1.1 Lid-Driven Cavity Validation ✅ -**Status:** COMPLETE - Solver meets scientific tolerance (RMS ~0.04, target < 0.10) - -**Test files created:** - -- `tests/validation/test_cavity_setup.c` - Basic setup and BC tests (7 tests) -- `tests/validation/test_cavity_flow.c` - Flow development and stability (8 tests) -- `tests/validation/test_cavity_validation.c` - Conservation and Ghia comparison (5 tests) -- `tests/validation/test_cavity_reference.c` - Reference-based regression tests (5 tests) -- `tests/validation/lid_driven_cavity_common.h` - Shared utilities -- `tests/validation/cavity_reference_data.h` - Ghia reference data -- `docs/validation/lid_driven_cavity.md` - Validation methodology documentation - -**What Was Fixed (PR #139, Feb 2026):** - -1. **Switched Projection Solvers to CG:** - - [x] CPU projection: Red-Black SOR → CG_SCALAR ✅ - - [x] OMP projection: Red-Black SOR → CG_OMP ✅ (initially CG_SCALAR in PR #139, upgraded to CG_OMP in PR #145) - - [x] AVX2 projection: Red-Black SOR → CG_SIMD ✅ - - **Rationale:** CG has O(√κ) convergence vs SOR's O(n) - typically 20-64 iterations vs 100s-1000s - -2. **Increased Iteration Limits:** - - [x] max_iterations: 1000 → 5000 (accommodates CG on fine grids) ✅ - -3. **Achieved Scientific Target:** - - [x] RMS_u: 0.0382 (target < 0.10) ✅ - - [x] RMS_v: 0.0440 (target < 0.10) ✅ - - [x] Grid convergence now monotonic: 17×17 (0.046) → 25×25 (0.037) → 33×33 (0.032) ✅ - -**Backend Validation Complete (Feb 2026):** - -1. **Test ALL solver backends (systematic validation):** ✅ - - [x] CPU scalar (explicit Euler, projection) ✅ - - [x] AVX2/SIMD (explicit Euler, projection) ✅ - - [x] OpenMP (explicit Euler, projection) ✅ - - [x] CUDA GPU (projection Jacobi) - test created, runs when GPU available ✅ - - [x] Each backend independently achieves accuracy target ✅ - - **Test file:** `tests/validation/test_cavity_backends.c` - - **Results (33×33 grid, Re=100):** - - Projection (CPU): RMS_u=0.0382, RMS_v=0.0440 < 0.10 ✅ - - Projection (AVX2): RMS_u=0.0382, RMS_v=0.0440 < 0.10 ✅ - - Projection (OMP): RMS_u=0.0382, RMS_v=0.0440 < 0.10 ✅ - - Explicit Euler (CPU): RMS_u=0.0957, RMS_v=0.1284 < 0.15 ✅ - - Explicit Euler (AVX2): RMS_u=0.0954, RMS_v=0.1312 < 0.15 ✅ - - Explicit Euler (OMP): RMS_u=0.0957, RMS_v=0.1284 < 0.15 ✅ - - Backend consistency: All 3 backends within 0.1% ✅ - - **Documentation:** - - `tests/validation/test_cavity_backends.c` - Comprehensive backend validation - - `docs/validation/cavity-backends-validation.md` - Test methodology and results - -2. **Verification that tests are honest:** ✅ - - [x] Tests MUST fail if RMS >= target (no loose tolerances) ✅ - - [x] Tests compare computed values at EXACT Ghia sample points ✅ - - [x] Tests report actual vs expected values transparently ✅ - - [x] No "current baseline" workarounds - fix solver, not tolerance ✅ - - **Target enforcement:** - - Projection method: RMS < 0.10 (production solver, strict) - - Explicit Euler: RMS < 0.15 (simpler method, relaxed) - - Tests FAIL immediately if RMS >= target - -**Remaining Work (for 129×129 full validation):** - -- [x] Parallel CTest infrastructure for full validation (per-backend test entries, `ctest -j 4`) -- [x] OMP thread oversubscription mitigation (`OMP_NUM_THREADS=2`) +**Status:** COMPLETE — All backends validated at 33×33, Re=100. Projection RMS < 0.10, Explicit Euler RMS < 0.15, all backends within 0.1% of each other. Grid convergence now monotonic. + +**Remaining work (129×129 full validation):** + - [ ] Confirm all backends pass at 129×129 on EC2 and record RMS values - [ ] Use CAVITY_FULL_VALIDATION=1 build flag -**Acceptance Criteria (non-negotiable):** +**Acceptance Criteria:** - RMS error vs Ghia < 0.10 for Re=100, 400, 1000 - All 7 solver backends produce identical results (within 0.1%) - Grid convergence: error decreases monotonically with refinement -- Tests run in parallel to complete in < 60 seconds -#### 6.1.1.1 Grid Convergence Validation (P1) +##### 6.1.1.1 Grid Convergence Validation (P1) ✅ -See [Known Issues](#known-issues) — grid convergence non-monotonic behavior. Action items and acceptance criteria tracked there. +Resolved by switching projection backends from Red-Black SOR to CG. Grid convergence tests now enforce strict monotonicity with convergence rate reporting. #### 6.1.2 Taylor-Green Vortex Validation (P0) ✅ -**Analytical solution with known decay rate - ideal for full NS validation:** - -```c -u(x,y,t) = cos(x) * sin(y) * exp(-2νt) -v(x,y,t) = -sin(x) * cos(y) * exp(-2νt) -p(x,y,t) = -0.25 * (cos(2x) + cos(2y)) * exp(-4νt) -``` - -**Tests implemented:** - -- [x] Verify velocity decay rate matches exp(-2νt) -- [x] Verify kinetic energy decay: KE(t) = KE₀ * exp(-4νt) -- [x] Test L2 error remains bounded -- [x] Verify grid convergence (error decreases with refinement) -- [x] Verify divergence-free constraint (incompressibility) -- [x] Compare solver backends (CPU, OpenMP) -- [x] Long-time stability tests -- [x] Low viscosity stability tests - -**Files created:** - -- `tests/validation/test_taylor_green_vortex.c` - 9 validation tests -- `tests/validation/taylor_green_reference.h` - Analytical solutions and test utilities +Velocity/energy decay, L2 error, grid convergence, divergence-free, backend consistency, long-time and low-viscosity stability — all verified. #### 6.1.3 Poiseuille Flow Validation (P1) ✅ -**Analytical parabolic profile for channel flow:** - -```c -u(y) = 4 * U_max * y * (H - y) / H² -``` - -**Tests implemented:** - -- [x] Velocity profile stability (analytical solution preserved by solver) -- [x] Mass conservation verification (flux in = flux mid = flux out) -- [x] Pressure gradient accuracy vs analytical dp/dx = -8μU_max/H² -- [x] Inlet BC accuracy (parabolic profile applied exactly) - -**Results:** - -- Profile RMS error: <1% (tolerance 1%) -- Mass flux conservation: <1% (tolerance 1%) -- Pressure gradient error: <5% (tolerance 5%) -- Inlet BC: machine precision - -**Strategy:** Initialize with analytical solution, run projection steps, verify stability. - -**Files created:** - -- `tests/validation/test_poiseuille_flow.c` +Profile stability (<1% RMS), mass conservation (<1%), pressure gradient (<5% error), inlet BC (machine precision) — all verified. #### 6.1.4 Other Benchmarks (P2) @@ -1358,17 +505,12 @@ u(y) = 4 * U_max * y * (H - y) / H² **Goal:** Full-length validation tests that run during releases (too slow for CI). -**Rationale:** CI tests use reduced grid sizes and iteration counts for fast feedback (~30 seconds). Release validation uses full parameters for scientific accuracy verification. - **CI vs Release Parameters:** | Test | CI Mode | Release Mode | |------|---------|--------------| -| Cavity Flow Development | 17×17, 500 steps | 33×33, 5000 steps | -| Cavity Re=100 Stability | 21×21, 500 steps | 33×33, 10000 steps | +| Cavity Ghia Validation | 33×33, 5000 steps | 129×129, 50000 steps | | Cavity Re=400 Stability | 25×25, 500 steps | 65×65, 20000 steps | -| Reynolds Dependency | 17×17, 400 steps | 33×33, 5000 steps | -| Ghia Validation | 33×33, 5000 steps | 129×129, 50000 steps | | Grid Convergence | 17→25→33 | 33→65→129 | | Taylor-Green Vortex | 32×32, 200 steps | 128×128, 10000 steps | @@ -1381,33 +523,6 @@ u(y) = 4 * U_max * y * (H - y) / H² - [ ] Cross-architecture consistency check (all backends produce identical results) - [ ] Memory usage and performance regression benchmarks -**Implementation:** - -```bash -# CI mode (default) - fast, reduced parameters -cmake -DCAVITY_FULL_VALIDATION=0 .. -ctest -R validation - -# Release mode - full scientific validation -cmake -DCAVITY_FULL_VALIDATION=1 .. -ctest -R validation --timeout 3600 -``` - -**Files to create:** - -- `tests/validation/test_ghia_full.c` - Full 129×129 Ghia validation -- `tests/validation/test_taylor_green_full.c` - Extended Taylor-Green decay -- `tests/validation/test_release_validation.c` - All backends consistency check -- `.github/workflows/release-validation.yml` - GitHub Actions workflow for releases - -**Acceptance Criteria:** - -- All tests pass with full parameters -- RMS error vs Ghia < 0.05 at 129×129 grid -- Taylor-Green decay rate within 1% of analytical -- All solver backends produce identical results (within floating-point tolerance) -- Total runtime < 30 minutes on release CI runner - ### 6.2 Convergence Studies (P1) - [ ] Grid independence studies for benchmark cases @@ -1427,20 +542,7 @@ ctest -R validation --timeout 3600 ### 6.4 Examples (P1) -**Implemented (12 examples):** - -- [x] `minimal_example.c` - Simplest usage, basic setup -- [x] `basic_simulation.c` - Standard incompressible Navier-Stokes -- [x] `animated_flow_simulation.c` - Time-stepping with visualization -- [x] `simple_animated_flow.c` - Simpler animation variant -- [x] `velocity_visualization.c` - Output velocity field visualization -- [x] `performance_comparison.c` - Compare different solvers -- [x] `runtime_comparison.c` - Detailed timing comparisons -- [x] `solver_selection.c` - Demonstrate solver registry and selection -- [x] `custom_boundary_conditions.c` - Example BC usage -- [x] `custom_source_terms.c` - Source term implementation -- [x] `csv_data_export.c` - CSV output examples -- [x] `lid_driven_cavity.c` - Lid-driven cavity with Dirichlet BCs +**Implemented:** 12 examples (minimal, basic simulation, animated flow, velocity visualization, performance/runtime comparison, solver selection, custom BCs, custom source terms, CSV export, lid-driven cavity). **Still needed:** @@ -1468,44 +570,13 @@ Train models in Python (PyTorch/JAX), deploy inference in C for: ### 7.1 Weight Format & Loader (P3) - [ ] Define binary weight format (.cfdnn) - - ```c - typedef struct { - uint32_t magic; // "CFDN" - uint32_t version; - uint32_t num_layers; - uint32_t input_dim; - uint32_t output_dim; - // Layer descriptors follow - } cfdnn_header_t; - ``` - - [ ] JSON metadata support for model info - [ ] Weight loading API - ```c - cfdnn_model_t* cfdnn_load(const char* path); - void cfdnn_free(cfdnn_model_t* model); - ``` - ### 7.2 Layer Implementations (P3) - [ ] Dense (fully connected) layer - - ```c - void cfdnn_dense_forward( - const double* input, size_t in_features, - const double* weights, const double* bias, - double* output, size_t out_features - ); - ``` - -- [ ] Activation functions - - [ ] ReLU, Leaky ReLU - - [ ] Tanh, Sigmoid - - [ ] GELU (for transformers) - - [ ] Swish/SiLU - +- [ ] Activation functions (ReLU, Leaky ReLU, Tanh, Sigmoid, GELU, Swish/SiLU) - [ ] Batch normalization - [ ] Layer normalization - [ ] Dropout (inference mode = identity) @@ -1517,15 +588,6 @@ Train models in Python (PyTorch/JAX), deploy inference in C for: - [ ] NEON equivalents for ARM - [ ] Cache-optimized tiling for large layers -```c -// Example: SIMD dense layer -void cfdnn_dense_forward_avx2( - const double* input, size_t in_features, - const double* weights, const double* bias, - double* output, size_t out_features -); -``` - ### 7.4 Model Architectures (P3) | Architecture | Status | Notes | @@ -1536,47 +598,12 @@ void cfdnn_dense_forward_avx2( ### 7.5 Inference API (P3) -```c -// High-level inference API -typedef struct { - size_t nx, ny; - double Re; - double* boundary_conditions; // Flattened BC values -} cfdnn_input_t; - -typedef struct { - double* u; // Velocity x-component (nx * ny) - double* v; // Velocity y-component (nx * ny) - double* p; // Pressure field (nx * ny) -} cfdnn_output_t; - -cfd_status_t cfdnn_predict( - const cfdnn_model_t* model, - const cfdnn_input_t* input, - cfdnn_output_t* output -); - -// Batch inference for parameter sweeps -cfd_status_t cfdnn_predict_batch( - const cfdnn_model_t* model, - const cfdnn_input_t* inputs, size_t batch_size, - cfdnn_output_t* outputs -); -``` +- [ ] High-level `cfdnn_predict()` and `cfdnn_predict_batch()` API +- [ ] Model loading/freeing lifecycle ### 7.6 Hybrid Solver Integration (P3) - [ ] Use ML prediction as initial guess for iterative solver - - ```c - // ML-accelerated projection method - cfd_status_t ns_solve_projection_ml_init( - const cfdnn_model_t* surrogate, - ns_solver_t* solver, - // ... other params - ); - ``` - - [ ] Adaptive switching: ML for steady-state estimate, CFD for accuracy - [ ] Uncertainty quantification (ensemble models) @@ -1610,131 +637,29 @@ cfd_status_t cfdnn_predict_batch( ### Rationale -Alternative to full C inference (Phase 7). Best of both worlds: - -- Python handles training, model management, and orchestration -- C provides optimized compute kernels callable from Python -- Lower implementation effort than full C inference engine -- Easier to support new architectures (just add kernels) +Alternative to full C inference (Phase 7). Python handles training/orchestration, C provides optimized compute kernels. Lower effort, easier to support new architectures. ### 8.1 Optimized Compute Kernels (P3) -Expose SIMD-optimized operations for ML workloads via Python bindings: - -```c -// Matrix operations optimized for neural network inference -cfd_status_t cfd_matmul( - const double* A, size_t A_rows, size_t A_cols, - const double* B, size_t B_cols, - double* C // Output: A_rows x B_cols -); - -cfd_status_t cfd_matmul_add_bias( - const double* A, size_t A_rows, size_t A_cols, - const double* B, size_t B_cols, - const double* bias, // B_cols - double* C -); - -// Activation functions (in-place) -cfd_status_t cfd_relu(double* data, size_t n); -cfd_status_t cfd_tanh(double* data, size_t n); -cfd_status_t cfd_gelu(double* data, size_t n); -``` +- [ ] SIMD-optimized matrix operations (`cfd_matmul`, `cfd_matmul_add_bias`) +- [ ] Activation functions (ReLU, Tanh, GELU — in-place) ### 8.2 Python-Callable Kernels (P3) -Expose kernels to Python via bindings: - -```python -from cfd_kernels import matmul, relu, gelu - -# Use C kernels in custom PyTorch module -class CFDAcceleratedLayer(torch.nn.Module): - def forward(self, x): - # Offload to SIMD-optimized C kernel - out = matmul(x.numpy(), self.weight.numpy()) - return torch.from_numpy(relu(out)) -``` +- [ ] Python bindings for C compute kernels ### 8.3 Physics Residual Kernels (P3) -Optimized physics loss computation for PINN training: - -```c -// Compute Navier-Stokes residuals efficiently -cfd_status_t cfd_ns_residual( - const double* u, const double* v, const double* p, - size_t nx, size_t ny, double dx, double dy, double Re, - double* continuity_residual, - double* momentum_x_residual, - double* momentum_y_residual -); - -// Finite difference operators (2nd order central) -cfd_status_t cfd_gradient_x(const double* f, size_t nx, size_t ny, double dx, double* df_dx); -cfd_status_t cfd_gradient_y(const double* f, size_t nx, size_t ny, double dy, double* df_dy); -cfd_status_t cfd_laplacian(const double* f, size_t nx, size_t ny, double dx, double dy, double* lap_f); -``` - -Python usage: - -```python -from cfd_kernels import ns_residual - -# Fast physics loss for PINN training -def physics_loss(model, coords, Re): - u, v, p = model(coords) - cont, mom_x, mom_y = ns_residual(u, v, p, nx, ny, dx, dy, Re) - return (cont**2 + mom_x**2 + mom_y**2).mean() -``` +- [ ] Optimized NS residual computation for PINN training +- [ ] Finite difference operators (gradient, laplacian) ### 8.4 Data Generation Acceleration (P3) -Fast CFD simulation for training data generation: - -```c -// Batch simulation for dataset generation -cfd_status_t cfd_generate_samples( - const cfd_sample_params_t* params, size_t n_samples, - cfd_sample_result_t* results, - int n_threads // OpenMP parallelization -); -``` - -```python -from cfd_ml import generate_training_data - -# Generate 1000 cavity flow samples in parallel -samples = generate_training_data( - problem="cavity", - Re_range=(10, 200), - n_samples=1000, - n_threads=8 # Uses OpenMP in C -) -``` +- [ ] Batch simulation API for dataset generation with OpenMP parallelization ### 8.5 Memory-Mapped Data Sharing (P3) -Zero-copy data sharing between C and Python/NumPy: - -```c -// Create memory-mapped buffer accessible from Python -cfd_buffer_t* cfd_create_shared_buffer(size_t size); -double* cfd_buffer_data(cfd_buffer_t* buf); -void cfd_buffer_free(cfd_buffer_t* buf); -``` - -```python -from cfd_memory import SharedBuffer - -# Zero-copy data sharing -buf = SharedBuffer(shape=(128, 128)) -arr = buf.as_numpy() # No copy, direct memory access - -# Pass to C functions without copying -cfd.run_simulation(output_buffer=buf) -``` +- [ ] Zero-copy shared buffers between C and Python/NumPy ### Comparison: Phase 7 vs Phase 8 @@ -1747,68 +672,17 @@ cfd.run_simulation(output_buffer=buf) | Performance | Highest | High (kernel-level) | | Flexibility | Fixed architectures | Any PyTorch model | -### Phase 8 Success Criteria - -- Kernels provide >2x speedup over NumPy equivalents -- Zero-copy memory sharing works with NumPy arrays -- Physics residual kernel matches Python implementation to 1e-10 -- Dataset generation scales linearly with thread count - --- ## Version Milestones -### v0.1.0 - Foundation (Released) - -- [x] Boundary condition abstraction layer with runtime backend selection -- [x] Neumann and Periodic boundary conditions (all backends) -- [x] SIMD Poisson solvers (Jacobi, Red-Black SOR) -- [x] GPU Jacobi Poisson solver -- [x] Comprehensive test suite -- [x] 11 example programs -- [x] VTK and CSV output - -### v0.1.x - Boundary Condition Improvements - -**Completed:** - -- [x] Dirichlet (fixed value) boundary conditions -- [x] No-slip wall boundary conditions -- [x] Inlet velocity boundary conditions (uniform, parabolic, custom profiles) -- [x] Outlet boundary conditions (zero-gradient, convective) -- [x] Conjugate Gradient (CG) Krylov solver -- [x] Lid-driven cavity validation (Ghia et al. 1982 benchmark) -- [x] Cross-architecture solver validation (CPU, AVX2, OpenMP, CUDA) -- [x] Per-architecture Ghia validation tests -- [x] Basic documentation (OMP solvers, API constants) - -### v0.1.5 - -- [x] OpenMP solver variants documented (explicit_euler_omp, projection_omp) -- [x] Updated solver type constants with NS_SOLVER_TYPE_ prefix -- [x] Doxyfile version synchronized with VERSION file -- [x] Ghia validation tests use appropriate step counts for Euler solvers - -### v0.1.6 - -- [x] CI: Doxygen API documentation generation in release workflow -- [x] CI: macOS OpenMP support via Homebrew libomp -- [x] CI: Improved CUDA toolkit configuration for Windows builds -- [x] CI: Fixed library path detection for modular builds - ### v0.1.7 - Current Release -- [x] Fix stretched grid formula to correctly span domain [xmin, xmax] -- [x] Add tanh-based stretching for boundary layer clustering -- [x] Add 16 unit tests for grid initialization functions +- [x] Stretched grid formula fix, tanh-based stretching, grid unit tests -### v0.2.0 - 3D Support (In Progress) +### v0.2.0 - 3D Support -- [x] Phase 0: IDX_2D/IDX_3D indexing macros, replace all inline indexing -- [x] Phase 1: Extend grid, flow_field, BCs with z-dimension (nz=1 backward compatible) -- [x] Phase 2: 3D stencils (7-point), scalar CPU linear solvers (Jacobi, SOR, Red-Black, CG, BiCGSTAB) -- [x] Phase 3: NS solvers with w-momentum -- [x] Phase 4: 3D boundary conditions (all backends) +- [x] Phases 0–4: Indexing macros, core data structures, 3D stencils, NS solvers, BCs - [ ] Phase 5-7: SIMD, OMP, CUDA backends for 3D - [ ] Phase 8: 3D VTK output, examples, validation (Taylor-Green 3D, Poiseuille 3D) @@ -1849,7 +723,7 @@ cfd.run_simulation(output_buffer=buf) ## Priority Legend | Priority | Meaning | -|----------|---------| +| -------- | ------- | | P0 | Critical - blocks v1.0 release | | P1 | Important - required for v1.0 | | P2 | Valuable - nice to have for v1.0 | @@ -1871,7 +745,7 @@ When working on roadmap items: --- -## References +## References & Bibliography ### CFD Validation Benchmarks