PRISM segfaults when constrained_transport != no AND divergence_correction = poisson
Summary
Running PRISM-CPU with the two settings
[numerics]
constrained_transport = DB ; or D, B, BD
divergence_correction = poisson
both active produces a SIGSEGV during initialization (release build, prism-cpu-gnu, GCC 15.2.0 + OpenMPI 5.0.10, 2 MPI ranks). Either toggle alone runs to completion. The release-mode crash surfaces inside __GI___libc_realloc (heap corruption), so the Fortran frames are absent without debug info. A debug build (prism-cpu-gnu-debug) catches a different and separate latent bug — see "Secondary finding" below.
This bug blocks adding a regression test for the FNL impose_ct_correction kernel refactor done in issue #10 Step 4 — the only kernel that the CT-correction code path exercises is currently unreached by any committed input.
Acceptance criteria
Reproduction
From repo root, with module load openmpi/5.0.10-gnu15.2.0 (or any GNU+OpenMPI toolchain matching the regression harness):
fobis build --mode prism-cpu-gnu
mkdir -p /tmp/ct-repro && cd /tmp/ct-repro
cp /path/to/adam/src/tests/prism/regression/rmf/input.ini .
sed -i 's/constrained_transport = no/constrained_transport = DB/' input.ini
sed -i 's/divergence_correction = no/divergence_correction = poisson/' input.ini
sed -i 's/output_basename = rmf_regression/output_basename = ct_repro/' input.ini
mpirun -np 2 /path/to/adam/exe/adam_prism_cpu
Observed:
[mpi-00000] nodes number: +1
[mpi-00001] nodes number: +1
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
#0 0x... at sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x... in musable
#2 0x... in __GI___libc_realloc
#3 ... (no Fortran symbols in release build)
prterun ... rank 1 ... exited on signal 11 (Segmentation fault)
The single-toggle variants run cleanly:
constrained_transport |
divergence_correction |
Result |
no |
no (= base rmf) |
OK — 10 iters |
DB |
no |
OK — 10 iters |
no |
poisson |
OK — 10 iters |
DB |
poisson |
SIGSEGV in init |
Root cause (primary hypothesis, needs confirmation)
The most likely culprit is an array-bound overrun in prism_common_object%initialize's contained io_initialize, at src/app/prism/common/adam_prism_common_object.F90:232-280.
The variable-name arrays self%q_name, self%dq_name, self%div_name, self%curl_name are allocated with size physics%nv at adam_prism_common_object.F90:125-128 (inside allocate_common):
allocate(self%q_name( self%nv))
allocate(self%dq_name( self%nv))
allocate(self%div_name( self%nv))
allocate(self%curl_name(self%nv))
physics%nv is computed in prism_physics_object%initialize at adam_prism_physics_object.F90:296-389. With either constrained_transport = DB alone OR divergence_correction = poisson alone, nv_cl = 0 and nv = 9 (verified by stdout: number of variables in q (nv): +9 in both single-toggle runs).
Then io_initialize at line 239-280 builds the names array using numerics%constrained_transport_D/B directly (NOT physics%nv_cl):
associate(add_phi=>numerics%constrained_transport_D, &
add_psi=>numerics%constrained_transport_B)
q_name = ['Dx ','Dy ','Dz ','Bx ','By ','Bz ']
if (add_phi) q_name = [q_name, 'phi'] ! +1 if CT_D
if (add_psi) q_name = [q_name, 'psi'] ! +1 if CT_B
q_name = [q_name, 'Jx ','Jy ','Jz ']
if (add_rho) q_name = [q_name, 'rho']
do v=1, size(q_name,dim=1)
self%q_name(v) = trim(q_name(v)) ! writes v=10, v=11 → OOB
enddo
With constrained_transport = DB, q_name has 11 entries; the loop writes into self%q_name(10) and self%q_name(11), which are outside the allocated 9-entry array. Same pattern repeats for dq_name, div_name, curl_name.
This is silent heap corruption. Why it appears benign with CT=DB alone and lethal in combination with divergence_correction=poisson is timing-dependent — the corrupted memory happens to be touched (allocated/realloc'd) later only in the combined path. The bug is the same in both; one variant just gets lucky.
The structural fix is to derive add_phi/add_psi from physics%nv_cl so the names array size matches the allocated state-vector size — or symmetrically, make physics%nv track constrained_transport_* so nv_cl adds the right number of slots when CT is active.
Inspecting adam_prism_physics_object.F90:296-389 shows that nv_cl IS bumped under case(DIV_CORR_VAR_HYPER) (hyperbolic divergence correction) — see lines 315-369 where nv_cl=1 for one of CT_D/CT_B set and nv_cl=2 for both — but NOT under case(DIV_CORR_VAR_POISS) (line 297-314, nv_cl=0 unconditionally) and NOT under case default (line 371-389, also nv_cl=0). So CT only contributes to nv when hyperbolic correction is selected. With Poisson correction or no correction, q has no slots for phi/psi even though CT writes them into the names array.
This is a longstanding latent inconsistency: it suggests CT correction with Poisson divergence cleaning was never actually validated end-to-end, OR phi/psi are not supposed to be persistent state variables under Poisson and the names array is the only place they leak.
Fix plan
Step 1 — Confirm the root cause
Step 2 — Decide the fix shape
Three options, in increasing scope:
-
Minimal: gate add_phi/add_psi on physics%nv_cl rather than numerics%constrained_transport_*. Two-line change in io_initialize. Documents the relationship "phi/psi names exist only when phi/psi slots exist in q". Does NOT add phi/psi to q under Poisson — i.e. CT under Poisson silently runs as CT under "no correction" for the purposes of state storage. May break correctness if impose_ct_correction reads/writes from q(7,...) / q(8,...) slots that under Poisson hold Jx/Jy/Jz. Audit impose_ct_correction against var_Jx/Jy/Jz for slot collisions before choosing this.
-
Structural: extend nv_cl bookkeeping to Poisson. In adam_prism_physics_object.F90:297-314 (case(DIV_CORR_VAR_POISS)), apply the same nv_cl = 0/1/2 logic that DIV_CORR_VAR_HYPER uses at lines 315-369, with var_Jx/Jy/Jz shifted accordingly. This makes q carry phi/psi slots under Poisson + CT, matching the names array shape. Requires verifying that downstream consumers (Poisson solver, RK update, IO) handle the wider state vector. Higher blast radius but architecturally cleaner.
-
Comprehensive: refactor nv derivation to a single source. Have numerics own the add_phi/add_psi flags AND physics%nv_cl derive from them with no select case on correction type. Eliminates the divergence at the root. Largest blast radius.
Recommended starting point: option 1 with a select case guard or error stop if impose_ct_correction ever runs under Poisson, to make the silent gap explicit. Escalate to option 2 only when a use case for CT+Poisson is committed.
Step 3 — Test coverage
Out of scope
- The hyperbolic-correction path (
divergence_correction = hyperbolic) is not investigated here. It probably works because the existing nv_cl bookkeeping covers it.
- This issue does NOT address whether CT correction is numerically correct, only that the setup path doesn't crash.
Secondary finding (file separately if not bundled)
A pre-existing bug surfaces under any debug-mode build (prism-cpu-gnu-debug, gfortran 15.2.0): even the base rmf regression input segfaults at src/lib/common/adam_weno_object.F90:140 inside the pure function description when called from weno%initialize at line 272. The backtrace:
#1 __adam_weno_object_MOD_description at src/lib/common/adam_weno_object.F90:140
#2 __adam_weno_object_MOD_initialize at src/lib/common/adam_weno_object.F90:272
#3 __adam_realm_object_MOD_initialize at src/lib/common/adam_realm_object.F90:447
Line 140 builds a string via desc//mpih%myrankstr//'...'. The most likely root cause is that description is declared pure but reads the module-scope variable mpih%myrankstr (adam_mpih_global's singleton), which is illegal under strict Fortran 2008 purity rules and may trigger gfortran's bounds/sanity checks to mishandle the access. This blocks the use of debug-mode builds for ANY PRISM investigation, including the primary bug above — which is why this issue's reproduction sticks to release mode despite the resulting opacity in the backtrace.
Fix: either drop the pure annotation on description (the function clearly has side-effecting dependencies on module state), or refactor description to take myrankstr as an explicit dummy argument. Two-line change either way.
Estimated effort
- Confirmation + minimal fix (option 1): half a session.
- Structural fix (option 2): 1-2 sessions plus golden recapture for any existing tests that change shape.
- Regression case + golden capture: half a session once option 1/2 is chosen.
- Secondary WENO debug-mode fix: 30 minutes.
References
PRISM segfaults when
constrained_transport != noANDdivergence_correction = poissonSummary
Running PRISM-CPU with the two settings
both active produces a SIGSEGV during initialization (release build,
prism-cpu-gnu, GCC 15.2.0 + OpenMPI 5.0.10, 2 MPI ranks). Either toggle alone runs to completion. The release-mode crash surfaces inside__GI___libc_realloc(heap corruption), so the Fortran frames are absent without debug info. A debug build (prism-cpu-gnu-debug) catches a different and separate latent bug — see "Secondary finding" below.This bug blocks adding a regression test for the FNL
impose_ct_correctionkernel refactor done in issue #10 Step 4 — the only kernel that the CT-correction code path exercises is currently unreached by any committed input.Acceptance criteria
constrained_transport = DBANDdivergence_correction = poisson(otherwise identical tosrc/tests/prism/regression/rmf/input.ini) completes 10 iterations without crash.rmf(CPU + FNL digests + residuals).src/tests/prism/regression/rmf-ct/exercises the code path; golden captured per the standard CI-authority procedure.Reproduction
From repo root, with
module load openmpi/5.0.10-gnu15.2.0(or any GNU+OpenMPI toolchain matching the regression harness):Observed:
The single-toggle variants run cleanly:
constrained_transportdivergence_correctionnono(= basermf)DBnonopoissonDBpoissonRoot cause (primary hypothesis, needs confirmation)
The most likely culprit is an array-bound overrun in
prism_common_object%initialize's containedio_initialize, atsrc/app/prism/common/adam_prism_common_object.F90:232-280.The variable-name arrays
self%q_name,self%dq_name,self%div_name,self%curl_nameare allocated with sizephysics%nvatadam_prism_common_object.F90:125-128(insideallocate_common):physics%nvis computed inprism_physics_object%initializeatadam_prism_physics_object.F90:296-389. With eitherconstrained_transport = DBalone ORdivergence_correction = poissonalone,nv_cl = 0andnv = 9(verified by stdout:number of variables in q (nv): +9in both single-toggle runs).Then
io_initializeat line 239-280 builds the names array usingnumerics%constrained_transport_D/Bdirectly (NOTphysics%nv_cl):With
constrained_transport = DB,q_namehas 11 entries; the loop writes intoself%q_name(10)andself%q_name(11), which are outside the allocated 9-entry array. Same pattern repeats fordq_name,div_name,curl_name.This is silent heap corruption. Why it appears benign with
CT=DBalone and lethal in combination withdivergence_correction=poissonis timing-dependent — the corrupted memory happens to be touched (allocated/realloc'd) later only in the combined path. The bug is the same in both; one variant just gets lucky.The structural fix is to derive
add_phi/add_psifromphysics%nv_clso the names array size matches the allocated state-vector size — or symmetrically, makephysics%nvtrackconstrained_transport_*sonv_cladds the right number of slots when CT is active.Inspecting
adam_prism_physics_object.F90:296-389shows thatnv_clIS bumped undercase(DIV_CORR_VAR_HYPER)(hyperbolic divergence correction) — see lines 315-369 wherenv_cl=1for one of CT_D/CT_B set andnv_cl=2for both — but NOT undercase(DIV_CORR_VAR_POISS)(line 297-314,nv_cl=0unconditionally) and NOT undercase default(line 371-389, alsonv_cl=0). So CT only contributes tonvwhen hyperbolic correction is selected. With Poisson correction or no correction,qhas no slots forphi/psieven though CT writes them into the names array.This is a longstanding latent inconsistency: it suggests CT correction with Poisson divergence cleaning was never actually validated end-to-end, OR
phi/psiare not supposed to be persistent state variables under Poisson and the names array is the only place they leak.Fix plan
Step 1 — Confirm the root cause
prism-cpu-gnu-debugafter also addressing the secondary WENO debug bug below (or set-fcheck=boundsselectively onadam_prism_common_object). Inspect the backtrace at the OOB write inio_initialize.error stopifsize(q_name) > self%nvimmediately before the loop — this should fire onCT=DBregardless of divergence-correction choice.nv(physics vs numerics) that disagree under Poisson + CT".Step 2 — Decide the fix shape
Three options, in increasing scope:
Minimal: gate
add_phi/add_psionphysics%nv_clrather thannumerics%constrained_transport_*. Two-line change inio_initialize. Documents the relationship "phi/psi names exist only when phi/psi slots exist in q". Does NOT addphi/psito q under Poisson — i.e. CT under Poisson silently runs as CT under "no correction" for the purposes of state storage. May break correctness ifimpose_ct_correctionreads/writes fromq(7,...)/q(8,...)slots that under Poisson holdJx/Jy/Jz. Auditimpose_ct_correctionagainstvar_Jx/Jy/Jzfor slot collisions before choosing this.Structural: extend
nv_clbookkeeping to Poisson. Inadam_prism_physics_object.F90:297-314(case(DIV_CORR_VAR_POISS)), apply the samenv_cl = 0/1/2logic thatDIV_CORR_VAR_HYPERuses at lines 315-369, withvar_Jx/Jy/Jzshifted accordingly. This makesqcarryphi/psislots under Poisson + CT, matching the names array shape. Requires verifying that downstream consumers (Poisson solver, RK update, IO) handle the wider state vector. Higher blast radius but architecturally cleaner.Comprehensive: refactor
nvderivation to a single source. Havenumericsown theadd_phi/add_psiflags ANDphysics%nv_clderive from them with noselect caseon correction type. Eliminates the divergence at the root. Largest blast radius.Recommended starting point: option 1 with a
select caseguard orerror stopifimpose_ct_correctionever runs under Poisson, to make the silent gap explicit. Escalate to option 2 only when a use case for CT+Poisson is committed.Step 3 — Test coverage
src/tests/prism/regression/rmf-ct/with the smaller of the working CT combinations (CT=DB + divergence_correction=hyperbolic, if that one runs — needs verification, not yet tested in this investigation).src/tests/prism/regression/README.md.impose_ct_correctionby adding a runtime print or counter in the kernel for the first run.run.shdiscovery loop (no change needed — it auto-discovers any<case>/input.ini).Out of scope
divergence_correction = hyperbolic) is not investigated here. It probably works because the existingnv_clbookkeeping covers it.Secondary finding (file separately if not bundled)
A pre-existing bug surfaces under any debug-mode build (
prism-cpu-gnu-debug, gfortran 15.2.0): even the basermfregression input segfaults atsrc/lib/common/adam_weno_object.F90:140inside thepure function descriptionwhen called fromweno%initializeat line 272. The backtrace:Line 140 builds a string via
desc//mpih%myrankstr//'...'. The most likely root cause is thatdescriptionis declaredpurebut reads the module-scope variablempih%myrankstr(adam_mpih_global's singleton), which is illegal under strict Fortran 2008 purity rules and may trigger gfortran's bounds/sanity checks to mishandle the access. This blocks the use of debug-mode builds for ANY PRISM investigation, including the primary bug above — which is why this issue's reproduction sticks to release mode despite the resulting opacity in the backtrace.Fix: either drop the
pureannotation ondescription(the function clearly has side-effecting dependencies on module state), or refactordescriptionto takemyrankstras an explicit dummy argument. Two-line change either way.Estimated effort
References
grep -rinE "constrained_transport\s*=\s*(D|B|DB|BD)" src/tests/returns zero hits across the entiresrc/tests/tree.