Serialise libhdf5 entries in NetCDFWriter to avoid HDF5_jll thread-safety corruption#167
Serialise libhdf5 entries in NetCDFWriter to avoid HDF5_jll thread-safety corruption#167haakon-e wants to merge 1 commit into
NetCDFWriter to avoid HDF5_jll thread-safety corruption#167Conversation
HDF5_jll is built without `--enable-threadsafe` (libhdf5.settings: "Threadsafety: no"). NCDatasets and HDF5.jl each have their own ReentrantLock that serialises calls within their respective wrappers, but the two locks are independent. When a multi-threaded Julia process has one thread inside NCDatasets and another inside HDF5.jl — e.g. a ClimaDiagnostics NC append concurrent with a ClimaCore.InputOutput checkpoint write — both ccall into the same libhdf5 and corrupt its internal skip-list / property-list state. The failure mode is a "double free or corruption (fasttop)" + SIGSEGV in H5SL_insert / H5P_create_id inside libhdf5, or occasionally a silent deadlock, after ~20-40 NC write cycles. Fix: serialise every libhdf5 entry from NetCDFWriter through HDF5.API.liblock — the lock HDF5.jl already uses for its own thread-safety. One Julia-level barrier covers all libhdf5 calls in the process regardless of which wrapper made them. No perf impact at typical diagnostic cadences (microseconds of lock overhead per write). Adds HDF5 as a direct dependency (already a transitive dep via ClimaCore.InputOutput). Splits write_field! into an outer lock-acquiring wrapper and an inner _write_field_impl! so the existing body is unchanged.
ph-kev
left a comment
There was a problem hiding this comment.
I did a preliminary review.
I was wondering what was the use case in ClimaAtmos that led to the discovery of this bug?
| @@ -1,5 +1,6 @@ | |||
| margin = 80 | |||
| margin = 92 | |||
There was a problem hiding this comment.
What was the reason for wanting to update this? I am asking since it seems that the formatter didn't change anything unless it only affected the code that was added in this PR.
There was a problem hiding this comment.
Can revert. Just spotted it (different from other CliMA repo settings) and updated, but not needed in this PR.
There was a problem hiding this comment.
Let's remove it. It seems that there are inconsistencies with how the formatter is set up for the CliMA repos.
| - label: "Run thread-safety tests on CPU (8 threads)" | ||
| key: "cpu_mt_tests" |
There was a problem hiding this comment.
I think it would be better to move the test to runtests.jl and start Julia with more threads for the CPU and GPU tests on buildkite and maybe on GHA if that is possible too.
| end | ||
| end | ||
|
|
||
| function _write_field_impl!(writer::NetCDFWriter, field, diagnostic, u, p, t) |
There was a problem hiding this comment.
Can you add a docstring for this function? It doesn't need to be long, but it would be good to mention that the function is unsafe to call without a lock because of needing to access libhdf5.
| # Serialises every libhdf5 entry from this writer with every other | ||
| # libhdf5 user (HDF5.jl, etc.) in this Julia process. |
There was a problem hiding this comment.
I don't think this is true, since this solution only works for HDF5.jl and NCDatasets.jl. If there is another package that use libhdf5, then this solution won't work. If this is the case, can you update the note and said that this solution won't work if another package use libhdf5.
There was a problem hiding this comment.
I think this can also break if NCDatasets.jl is used elsewhere (e.g. by Oceananigans.jl) at the same time as HDF5.jl. This is more relevant for the coupled model (see CliMA/ClimaCoupler.jl#1700).
| if Threads.nthreads() < 2 | ||
| @info "Skipping MT race test — Threads.nthreads() < 2" |
There was a problem hiding this comment.
I think an early return is cleaner here
| if Threads.nthreads() < 2 | |
| @info "Skipping MT race test — Threads.nthreads() < 2" | |
| if Threads.nthreads() < 2 | |
| @info "Skipping MT race test — Threads.nthreads() < 2" | |
| return | |
| end |
and then, you can remove the else conditional.
| end | ||
| end | ||
|
|
||
| @testset "libhdf5 thread-safety lock" begin |
There was a problem hiding this comment.
This is duplicated unless this was intended. Can you remove this?
Thanks. I'm was running a cartesian simulation with ~16 columns in atmos where I saved data every 10 minutes (time step ~10 seconds) and reliably observed a segfault after ~40 writes. I have been able to reproduce with simpler examples (such as the attached test case) when I have a GPU, but not multi-threaded CPU. This remains a bit of a head scratcher. |
Summary
Fixes a Julia-multi-threaded crash in
NetCDFWriterwhen HDF5.jl and NCDatasets both make concurrent calls into the same non-threadsafelibhdf5shared object.Background
HDF5_jllbinaries on Julia's package manager are currently built without--enable-threadsafe. The ship config is:This is intentional. Enabling threadsafety is incompatible with enabling MPI parallel, which is the default build. But the consequence is that multiple Julia threads entering
libhdf5concurrently can corrupt the library's internal skip-list and property-list data structures.NCDatasets.jlandHDF5.jleach have their ownReentrantLock(NETCDF_LOCKandHDF5.API.liblock) that serialises calls within their respective wrappers. But the two locks are independent, so concurrent calls through different wrappers can still race.ClimaAtmos runs hit this:
ClimaDiagnostics.NetCDFWriterwrites NetCDF diagnostics every few seconds of simulated time (viaNCDatasets).ClimaAtmos.nan_checking_callbackwrites HDF5 checkpoints every 10 min of simulated time (viaClimaCore.InputOutput.HDF5Writer, which usesHDF5.jl).--threads=auto) schedules these on different threads.After 20–40 NC write cycles, we SIGSEGV with:
stack trace in
libhdf5.so→H5SL_insert/H5P_create_id.Reproducer
The PR adds
test/libhdf5_thread_safety.jl, which is both the regression test for this fix and a standalone reproducer. It drivesClimaDiagnostics.NetCDFWriterNC writes concurrently withClimaCore.InputOutput.HDF5Writercheckpoint writes fromThreads.nthreads()worker threads — exactly the cross-wrapper combination the productionnan_checking_callbackpath uses.From the root of a local
ClimaDiagnosticsclone:# Check out this branch, then: julia --project=. --threads=8 test/libhdf5_thread_safety.jlOn
main(pre-fix) this SIGSEGVs insidelibhdf5.so(H5SL_insert/H5P_create_id) well before the 40-cycle budget completes. On this branch it passes:The test also asserts
Writers.LIBHDF5_LOCK === HDF5.API.liblock, so a future refactor that silently pointsLIBHDF5_LOCKat a privateReentrantLock()fails this test deterministically even on a single-threaded CI box.Fix
Serialise every
libhdf5entry fromNetCDFWriterthrough the sameReentrantLockHDF5.jl already uses. This isHDF5.API.liblock— a public-enough API (used by HDF5.jl's own generated wrappers) that shadowing it in ClimaDiagnostics lets both wrappers share one barrier:Verified with the patched ClimaDiagnostics: 8-thread reproducer runs cleanly (no crash, no corruption).
Alternatives considered
--enable-threadsafevia Yggdrasil (incompatible with MPI parallel build; would require a new JLL variant). Worth pursuing independently; does not replace this PR.HDF5.API.liblockthrough and drop this shim. Filing a follow-up issue.This PR is the minimal correct fix that stays inside ClimaDiagnostics.
Dependencies
Adds
HDF5 = "0.17"as a direct dependency.HDF5.jlis already atransitive dependency via
ClimaCore.InputOutputin any ClimaDiagnosticsconsumer, so this does not add a new root package to the ecosystem.
Testing
test/libhdf5_thread_safety.jl: regression test; exercises the real cross-wrapper path.cpu_mt_testsruns the test with--threads=8on a CPU-only agent; existinggpu_testsstep was upgraded to--threads=8+slurm_cpus_per_task: 8to exercise the same race window on GPU.Perf impact
One
Base.lock/unlockacquisition per diagnostic write. At typical hourly diagnostic cadence over a 36 h sim that is 36 lock acquisitions totalling microseconds of overhead. Negligible.Files changed
Project.toml— addHDF5 v0.17to deps.src/netcdf_writer.jl— import HDF5; defineLIBHDF5_LOCK; splitwrite_field!into lock wrapper +_write_field_impl!; wrapsync(writer)body inlock(LIBHDF5_LOCK).NEWS.md— new v0.3.4 entry describing the fix.test/libhdf5_thread_safety.jl(new) — full regression test that drivesNetCDFWriterdiagnostic writes concurrently withClimaCore.InputOutput.HDF5Writercheckpoint writes fromThreads.nthreads()worker threads for 40 cycles. This is the exact cross-wrapper combination the production sim uses (thenan_checking_callbackpath). Before the fix, the test's@threadsloop SIGSEGVs inlibhdf5well before completion; after the fix, passes cleanly. Also assertsWriters.LIBHDF5_LOCK === HDF5.API.liblockso future regressions that silently point the lock at a privateReentrantLock()are caught.test/runtests.jl— include the new test file..buildkite/pipeline.yml:cpu_mt_tests— runsjulia --threads=8 test/libhdf5_thread_safety.jlon an 8-CPU agent. This step SIGSEGVs onmain(pre-fix) and passes on this PR. Guards the fix against regressions.gpu_testsstep — upgraded from default single-thread to--threads=8+slurm_cpus_per_task: 8, so the GPU CI path also exercises the MT race window. On GPU the shared lock is identical in effect (fix is device-agnostic).Follow-up upstream work (out of scope for this PR)
HDF5.jlto promoteAPI.liblockto a documented public export.NCDatasets.jlproposing an optional shared-lock kwarg.Yggdrasilrequesting a threadsafeHDF5_jllvariant for non-MPI users.