Skip to content

PRISM segfaults when constrained_transport != no AND divergence_correction = poisson (heap corruption in q_name) #11

@szaghi

Description

@szaghi

PRISM segfaults when constrained_transport != no AND divergence_correction = poisson

Summary

Running PRISM-CPU with the two settings

[numerics]
constrained_transport    = DB        ; or D, B, BD
divergence_correction    = poisson

both active produces a SIGSEGV during initialization (release build, prism-cpu-gnu, GCC 15.2.0 + OpenMPI 5.0.10, 2 MPI ranks). Either toggle alone runs to completion. The release-mode crash surfaces inside __GI___libc_realloc (heap corruption), so the Fortran frames are absent without debug info. A debug build (prism-cpu-gnu-debug) catches a different and separate latent bug — see "Secondary finding" below.

This bug blocks adding a regression test for the FNL impose_ct_correction kernel refactor done in issue #10 Step 4 — the only kernel that the CT-correction code path exercises is currently unreached by any committed input.

Acceptance criteria

  • A PRISM-CPU run with constrained_transport = DB AND divergence_correction = poisson (otherwise identical to src/tests/prism/regression/rmf/input.ini) completes 10 iterations without crash.
  • The fix does not regress rmf (CPU + FNL digests + residuals).
  • A new regression case under src/tests/prism/regression/rmf-ct/ exercises the code path; golden captured per the standard CI-authority procedure.
  • The pre-existing WENO debug-mode crash (secondary finding below) is filed separately or fixed in the same PR with a note.

Reproduction

From repo root, with module load openmpi/5.0.10-gnu15.2.0 (or any GNU+OpenMPI toolchain matching the regression harness):

fobis build --mode prism-cpu-gnu

mkdir -p /tmp/ct-repro && cd /tmp/ct-repro
cp /path/to/adam/src/tests/prism/regression/rmf/input.ini .
sed -i 's/constrained_transport    = no/constrained_transport    = DB/' input.ini
sed -i 's/divergence_correction    = no/divergence_correction    = poisson/' input.ini
sed -i 's/output_basename        = rmf_regression/output_basename        = ct_repro/' input.ini

mpirun -np 2 /path/to/adam/exe/adam_prism_cpu

Observed:

[mpi-00000]  nodes number:   +1
[mpi-00001]  nodes number:   +1
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
#0  0x...      at sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1  0x...      in musable
#2  0x...      in __GI___libc_realloc
#3  ...        (no Fortran symbols in release build)
prterun ... rank 1 ... exited on signal 11 (Segmentation fault)

The single-toggle variants run cleanly:

constrained_transport divergence_correction Result
no no (= base rmf) OK — 10 iters
DB no OK — 10 iters
no poisson OK — 10 iters
DB poisson SIGSEGV in init

Root cause (primary hypothesis, needs confirmation)

The most likely culprit is an array-bound overrun in prism_common_object%initialize's contained io_initialize, at src/app/prism/common/adam_prism_common_object.F90:232-280.

The variable-name arrays self%q_name, self%dq_name, self%div_name, self%curl_name are allocated with size physics%nv at adam_prism_common_object.F90:125-128 (inside allocate_common):

allocate(self%q_name(   self%nv))
allocate(self%dq_name(  self%nv))
allocate(self%div_name( self%nv))
allocate(self%curl_name(self%nv))

physics%nv is computed in prism_physics_object%initialize at adam_prism_physics_object.F90:296-389. With either constrained_transport = DB alone OR divergence_correction = poisson alone, nv_cl = 0 and nv = 9 (verified by stdout: number of variables in q (nv): +9 in both single-toggle runs).

Then io_initialize at line 239-280 builds the names array using numerics%constrained_transport_D/B directly (NOT physics%nv_cl):

associate(add_phi=>numerics%constrained_transport_D, &
          add_psi=>numerics%constrained_transport_B)
             q_name = ['Dx ','Dy ','Dz ','Bx ','By ','Bz ']
if (add_phi) q_name = [q_name, 'phi']                ! +1 if CT_D
if (add_psi) q_name = [q_name, 'psi']                ! +1 if CT_B
             q_name = [q_name, 'Jx ','Jy ','Jz ']
if (add_rho) q_name = [q_name, 'rho']
do v=1, size(q_name,dim=1)
   self%q_name(v) = trim(q_name(v))                  ! writes v=10, v=11 → OOB
enddo

With constrained_transport = DB, q_name has 11 entries; the loop writes into self%q_name(10) and self%q_name(11), which are outside the allocated 9-entry array. Same pattern repeats for dq_name, div_name, curl_name.

This is silent heap corruption. Why it appears benign with CT=DB alone and lethal in combination with divergence_correction=poisson is timing-dependent — the corrupted memory happens to be touched (allocated/realloc'd) later only in the combined path. The bug is the same in both; one variant just gets lucky.

The structural fix is to derive add_phi/add_psi from physics%nv_cl so the names array size matches the allocated state-vector size — or symmetrically, make physics%nv track constrained_transport_* so nv_cl adds the right number of slots when CT is active.

Inspecting adam_prism_physics_object.F90:296-389 shows that nv_cl IS bumped under case(DIV_CORR_VAR_HYPER) (hyperbolic divergence correction) — see lines 315-369 where nv_cl=1 for one of CT_D/CT_B set and nv_cl=2 for both — but NOT under case(DIV_CORR_VAR_POISS) (line 297-314, nv_cl=0 unconditionally) and NOT under case default (line 371-389, also nv_cl=0). So CT only contributes to nv when hyperbolic correction is selected. With Poisson correction or no correction, q has no slots for phi/psi even though CT writes them into the names array.

This is a longstanding latent inconsistency: it suggests CT correction with Poisson divergence cleaning was never actually validated end-to-end, OR phi/psi are not supposed to be persistent state variables under Poisson and the names array is the only place they leak.

Fix plan

Step 1 — Confirm the root cause

  • Reproduce locally per "Reproduction" above and add a Fortran-mode debug trap: build with prism-cpu-gnu-debug after also addressing the secondary WENO debug bug below (or set -fcheck=bounds selectively on adam_prism_common_object). Inspect the backtrace at the OOB write in io_initialize.
  • Verify the hypothesis by adding error stop if size(q_name) > self%nv immediately before the loop — this should fire on CT=DB regardless of divergence-correction choice.
  • If the hypothesis confirms, the bug class is "PRISM has two ground-truth sources for nv (physics vs numerics) that disagree under Poisson + CT".

Step 2 — Decide the fix shape

Three options, in increasing scope:

  1. Minimal: gate add_phi/add_psi on physics%nv_cl rather than numerics%constrained_transport_*. Two-line change in io_initialize. Documents the relationship "phi/psi names exist only when phi/psi slots exist in q". Does NOT add phi/psi to q under Poisson — i.e. CT under Poisson silently runs as CT under "no correction" for the purposes of state storage. May break correctness if impose_ct_correction reads/writes from q(7,...) / q(8,...) slots that under Poisson hold Jx/Jy/Jz. Audit impose_ct_correction against var_Jx/Jy/Jz for slot collisions before choosing this.

  2. Structural: extend nv_cl bookkeeping to Poisson. In adam_prism_physics_object.F90:297-314 (case(DIV_CORR_VAR_POISS)), apply the same nv_cl = 0/1/2 logic that DIV_CORR_VAR_HYPER uses at lines 315-369, with var_Jx/Jy/Jz shifted accordingly. This makes q carry phi/psi slots under Poisson + CT, matching the names array shape. Requires verifying that downstream consumers (Poisson solver, RK update, IO) handle the wider state vector. Higher blast radius but architecturally cleaner.

  3. Comprehensive: refactor nv derivation to a single source. Have numerics own the add_phi/add_psi flags AND physics%nv_cl derive from them with no select case on correction type. Eliminates the divergence at the root. Largest blast radius.

Recommended starting point: option 1 with a select case guard or error stop if impose_ct_correction ever runs under Poisson, to make the silent gap explicit. Escalate to option 2 only when a use case for CT+Poisson is committed.

Step 3 — Test coverage

  • Add src/tests/prism/regression/rmf-ct/ with the smaller of the working CT combinations (CT=DB + divergence_correction=hyperbolic, if that one runs — needs verification, not yet tested in this investigation).
  • Capture goldens per the standard CI-authority procedure documented in src/tests/prism/regression/README.md.
  • Confirm the new case exercises impose_ct_correction by adding a runtime print or counter in the kernel for the first run.
  • Wire the new case into the existing run.sh discovery loop (no change needed — it auto-discovers any <case>/input.ini).

Out of scope

  • The hyperbolic-correction path (divergence_correction = hyperbolic) is not investigated here. It probably works because the existing nv_cl bookkeeping covers it.
  • This issue does NOT address whether CT correction is numerically correct, only that the setup path doesn't crash.

Secondary finding (file separately if not bundled)

A pre-existing bug surfaces under any debug-mode build (prism-cpu-gnu-debug, gfortran 15.2.0): even the base rmf regression input segfaults at src/lib/common/adam_weno_object.F90:140 inside the pure function description when called from weno%initialize at line 272. The backtrace:

#1  __adam_weno_object_MOD_description     at src/lib/common/adam_weno_object.F90:140
#2  __adam_weno_object_MOD_initialize      at src/lib/common/adam_weno_object.F90:272
#3  __adam_realm_object_MOD_initialize     at src/lib/common/adam_realm_object.F90:447

Line 140 builds a string via desc//mpih%myrankstr//'...'. The most likely root cause is that description is declared pure but reads the module-scope variable mpih%myrankstr (adam_mpih_global's singleton), which is illegal under strict Fortran 2008 purity rules and may trigger gfortran's bounds/sanity checks to mishandle the access. This blocks the use of debug-mode builds for ANY PRISM investigation, including the primary bug above — which is why this issue's reproduction sticks to release mode despite the resulting opacity in the backtrace.

Fix: either drop the pure annotation on description (the function clearly has side-effecting dependencies on module state), or refactor description to take myrankstr as an explicit dummy argument. Two-line change either way.

Estimated effort

  • Confirmation + minimal fix (option 1): half a session.
  • Structural fix (option 2): 1-2 sessions plus golden recapture for any existing tests that change shape.
  • Regression case + golden capture: half a session once option 1/2 is chosen.
  • Secondary WENO debug-mode fix: 30 minutes.

References

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions