
Serialise libhdf5 entries in NetCDFWriter to avoid HDF5_jll thread-safety corruption#167

Draft
haakon-e wants to merge 1 commit into main from he/netcdfwriter-serialise-libhdf5


@haakon-e haakon-e commented Apr 24, 2026

Summary

Fixes a crash in NetCDFWriter under multi-threaded Julia, triggered when HDF5.jl and NCDatasets make concurrent calls into the same non-threadsafe libhdf5 shared object.

Background

The HDF5_jll binaries distributed through Julia's package manager are currently built without --enable-threadsafe. The shipped configuration is:

$ cat ~/.julia/artifacts/*/lib/libhdf5.settings
Features:
---------
Parallel HDF5: yes
# ...
Threadsafety: no
# ...

This is intentional: the threadsafe build option is incompatible with the MPI-parallel build, which is the default. The consequence is that multiple Julia threads entering libhdf5 concurrently can corrupt the library's internal skip-list and property-list data structures.

NCDatasets.jl and HDF5.jl each have their own ReentrantLock (NETCDF_LOCK and HDF5.API.liblock) that serialises calls within their respective wrappers. But the two locks are independent, so concurrent calls through different wrappers can still race.
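The independence of the two locks can be illustrated without either package. The following is a minimal sketch using only Base: the two locks are hypothetical stand-ins for NETCDF_LOCK and HDF5.API.liblock, and `max_overlap` is an illustrative helper (not part of any library) that records how many tasks are inside the "library" at once.

```julia
# Sketch only: `lock_a` and `lock_b` stand in for the two wrapper locks.
# Nothing here calls libhdf5; the "library" is just a guarded region.
function max_overlap(lock_a::ReentrantLock, lock_b::ReentrantLock; iterations::Int = 200)
    inside = Threads.Atomic{Int}(0)  # tasks currently inside the guarded region
    peak = Threads.Atomic{Int}(0)    # highest concurrency observed

    function enter_library(lk)
        for _ in 1:iterations
            lock(lk) do
                # atomic_add! returns the previous value, so add 1 for the current count
                now = Threads.atomic_add!(inside, 1) + 1
                Threads.atomic_max!(peak, now)
                yield()  # widen the window in which another task could enter
                Threads.atomic_sub!(inside, 1)
            end
        end
    end

    task_a = Threads.@spawn enter_library(lock_a)
    task_b = Threads.@spawn enter_library(lock_b)
    wait(task_a)
    wait(task_b)
    return peak[]
end

shared = ReentrantLock()
overlap_shared = max_overlap(shared, shared)                          # one barrier for both callers
overlap_independent = max_overlap(ReentrantLock(), ReentrantLock())   # two unrelated barriers
```

With one shared lock the peak can never exceed 1. With two independent locks nothing prevents both tasks from being inside at the same time, which is the situation the two wrappers are in today.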

ClimaAtmos runs hit this:

  • ClimaDiagnostics.NetCDFWriter writes NetCDF diagnostics every few seconds of simulated time (via NCDatasets).
  • ClimaAtmos.nan_checking_callback writes HDF5 checkpoints every 10 min of simulated time (via ClimaCore.InputOutput.HDF5Writer, which uses HDF5.jl).
  • Multi-threaded Julia (default --threads=auto) schedules these on different threads.

After 20–40 NC write cycles, we SIGSEGV with:

double free or corruption (fasttop)
[pid] signal 11 (1): Segmentation fault

The stack trace points into libhdf5.so: H5SL_insert / H5P_create_id.

Reproducer

The PR adds test/libhdf5_thread_safety.jl, which is both the regression test for this fix and a standalone reproducer. It drives ClimaDiagnostics.NetCDFWriter NC writes concurrently with ClimaCore.InputOutput.HDF5Writer checkpoint writes from Threads.nthreads() worker threads — exactly the cross-wrapper combination the production nan_checking_callback path uses.

From the root of a local ClimaDiagnostics clone:

# Check out this branch, then:
julia --project=. --threads=8 test/libhdf5_thread_safety.jl

On main (pre-fix) this SIGSEGVs inside libhdf5.so (H5SL_insert / H5P_create_id) well before the 40-cycle budget completes. On this branch it passes:

[ Info: Survived 40 cycles × 8 threads. NC writes=320, HDF5.jl writes=…
Test Summary:                                                   | Pass  Total   Time
Concurrent NetCDFWriter + HDF5.jl writes do not corrupt libhdf5 |    1      1   28s

The test also asserts Writers.LIBHDF5_LOCK === HDF5.API.liblock, so a future refactor that silently points LIBHDF5_LOCK at a private ReentrantLock() fails this test deterministically even on a single-threaded CI box.
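Why the identity assertion is deterministic can be seen in a small sketch. It assumes nothing about HDF5.jl: `upstream_lock` is a hypothetical stand-in for HDF5.API.liblock, and `is_same_lock` is an illustrative helper.

```julia
# Sketch only: `===` compares object identity, so it distinguishes an alias
# of the upstream lock from a freshly constructed private lock.
is_same_lock(candidate, reference) = candidate === reference

upstream_lock = ReentrantLock()   # stand-in for HDF5.API.liblock
aliased_lock = upstream_lock      # what aliasing the upstream lock produces
private_lock = ReentrantLock()    # what a careless refactor might introduce

alias_ok = is_same_lock(aliased_lock, upstream_lock)
private_ok = is_same_lock(private_lock, upstream_lock)
```

Because the check involves no threads and no timing, it gives the same answer on a single-threaded CI machine as on a multi-threaded one.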

Fix

Serialise every libhdf5 entry from NetCDFWriter through the same ReentrantLock that HDF5.jl already uses, HDF5.API.liblock. It is a public-enough API (used by HDF5.jl's own generated wrappers), and aliasing it in ClimaDiagnostics lets both wrappers share one barrier:

const LIBHDF5_LOCK = HDF5.API.liblock

function write_field!(writer::NetCDFWriter, field, diagnostic, u, p, t)
    ClimaComms.iamroot(ClimaComms.context(field)) || return nothing
    return lock(LIBHDF5_LOCK) do
        _write_field_impl!(writer, field, diagnostic, u, p, t)
    end
end

function _write_field_impl!(writer, field, diagnostic, u, p, t)
    # (original body unchanged)
end

function sync(writer::NetCDFWriter)
    lock(LIBHDF5_LOCK) do
        foreach(NCDatasets.sync, writer.unsynced_datasets)
        empty!(writer.unsynced_datasets)
    end
    return nothing
end

Verified with the patched ClimaDiagnostics: 8-thread reproducer runs cleanly (no crash, no corruption).
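One property of the shared lock worth noting is that a ReentrantLock can be re-acquired by the task that already holds it. The sketch below, using only Base, shows the nesting that would occur if code running under LIBHDF5_LOCK were itself to call a function that takes the same lock; `nested_acquire` is an illustrative name, not part of this PR.

```julia
# Sketch only: the same task takes the same ReentrantLock twice.
# A non-reentrant lock would deadlock here; a ReentrantLock does not.
function nested_acquire(lk::ReentrantLock)
    lock(lk) do
        # Inner acquisition, as a wrapped call that locks internally would do.
        lock(lk) do
            islocked(lk)
        end
    end
end

lk = ReentrantLock()
held_inside = nested_acquire(lk)
released_after = !islocked(lk)
```

Each nested `lock` is matched by an `unlock` when its `do` block exits, so the lock is fully released once the outermost block returns.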

Alternatives considered

  1. Build HDF5_jll with --enable-threadsafe via Yggdrasil (incompatible with MPI parallel build; would require a new JLL variant). Worth pursuing independently; does not replace this PR.
  2. Make NCDatasets take an opt-in shared lock kwarg. Requires a PR to NCDatasets. Orthogonal to this PR; if accepted there, ClimaDiagnostics could pass HDF5.API.liblock through and drop this shim. Filing a follow-up issue.
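To make the second alternative concrete, here is a rough sketch of what an opt-in shared-lock keyword could look like. Everything in it is hypothetical: `ToyWriter`, `write_record!`, and the `lock` keyword are invented for illustration and are not NCDatasets API.

```julia
# Sketch only: a toy writer whose lock can be injected by the caller.
struct ToyWriter{L<:Base.AbstractLock}
    lock::L
    records::Vector{String}
end

# By default each writer gets a private lock; callers may pass a shared one.
ToyWriter(; lock::Base.AbstractLock = ReentrantLock()) = ToyWriter(lock, String[])

function write_record!(writer::ToyWriter, record::AbstractString)
    # Every write goes through whichever lock the writer was constructed with.
    lock(writer.lock) do
        push!(writer.records, String(record))
    end
    return writer
end

shared = ReentrantLock()            # stand-in for a process-wide libhdf5 lock
writer_a = ToyWriter(lock = shared)
writer_b = ToyWriter(lock = shared)
write_record!(writer_a, "diagnostic")
write_record!(writer_b, "checkpoint")
```

Under such a design the caller decides which barrier a writer participates in, so no wrapper would need to know about another wrapper's lock.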

This PR is the minimal correct fix that stays inside ClimaDiagnostics.

Dependencies

Adds HDF5 = "0.17" as a direct dependency. HDF5.jl is already a
transitive dependency via ClimaCore.InputOutput in any ClimaDiagnostics
consumer, so this does not add a new root package to the ecosystem.

Testing

  • test/libhdf5_thread_safety.jl: regression test; exercises the real cross-wrapper path.
  • New Buildkite step cpu_mt_tests runs the test with --threads=8 on a CPU-only agent; existing gpu_tests step was upgraded to --threads=8 + slurm_cpus_per_task: 8 to exercise the same race window on GPU.

Perf impact

One Base.lock/unlock pair per diagnostic write. At a typical hourly diagnostic cadence over a 36 h sim that is 36 lock acquisitions, totalling microseconds of overhead. Negligible.
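The cost of an uncontended lock/unlock pair can be estimated on any machine with a few lines of Base Julia. This is a rough sketch, not a benchmark from this PR; `mean_lock_cost_ns` is an illustrative helper, and the number it returns depends on hardware and Julia version.

```julia
# Sketch only: average wall-clock cost of one uncontended lock/unlock pair.
function mean_lock_cost_ns(lk::ReentrantLock = ReentrantLock(); n::Int = 1_000_000)
    # Warm up once so compilation is not included in the timing.
    lock(lk)
    unlock(lk)
    elapsed = @elapsed for _ in 1:n
        lock(lk)
        unlock(lk)
    end
    return elapsed / n * 1e9  # seconds per iteration converted to nanoseconds
end

cost_ns = mean_lock_cost_ns(n = 100_000)
```

This measures only the uncontended case. Time spent waiting while another thread holds the lock is not captured and depends on how long the other libhdf5 call takes.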

Files changed

  • Project.toml — add HDF5 v0.17 to deps.
  • src/netcdf_writer.jl — import HDF5; define LIBHDF5_LOCK; split write_field! into lock wrapper + _write_field_impl!; wrap sync(writer) body in lock(LIBHDF5_LOCK).
  • NEWS.md — new v0.3.4 entry describing the fix.
  • test/libhdf5_thread_safety.jl (new) — full regression test that drives NetCDFWriter diagnostic writes concurrently with ClimaCore.InputOutput.HDF5Writer checkpoint writes from Threads.nthreads() worker threads for 40 cycles. This is the exact cross-wrapper combination the production sim uses (the nan_checking_callback path). Before the fix, the test's @threads loop SIGSEGVs in libhdf5 well before completion; after the fix, passes cleanly. Also asserts Writers.LIBHDF5_LOCK === HDF5.API.liblock so future regressions that silently point the lock at a private ReentrantLock() are caught.
  • test/runtests.jl — include the new test file.
  • .buildkite/pipeline.yml:
    • new step cpu_mt_tests — runs julia --threads=8 test/libhdf5_thread_safety.jl on an 8-CPU agent. This step SIGSEGVs on main (pre-fix) and passes on this PR. Guards the fix against regressions.
    • existing gpu_tests step — upgraded from default single-thread to --threads=8 + slurm_cpus_per_task: 8, so the GPU CI path also exercises the MT race window. On GPU the shared lock is identical in effect (fix is device-agnostic).

Follow-up upstream work (out of scope for this PR)

  • Issue on HDF5.jl to promote API.liblock to a documented public export.
  • Issue on NCDatasets.jl proposing an optional shared-lock kwarg.
  • Issue on Yggdrasil requesting a threadsafe HDF5_jll variant for non-MPI users.

HDF5_jll is built without `--enable-threadsafe` (libhdf5.settings:
"Threadsafety: no"). NCDatasets and HDF5.jl each have their own
ReentrantLock that serialises calls within their respective wrappers,
but the two locks are independent. When a multi-threaded Julia process
has one thread inside NCDatasets and another inside HDF5.jl — e.g. a
ClimaDiagnostics NC append concurrent with a ClimaCore.InputOutput
checkpoint write — both ccall into the same libhdf5 and corrupt its
internal skip-list / property-list state. The failure mode is a
"double free or corruption (fasttop)" + SIGSEGV in H5SL_insert /
H5P_create_id inside libhdf5, or occasionally a silent deadlock, after
~20-40 NC write cycles.

Fix: serialise every libhdf5 entry from NetCDFWriter through
HDF5.API.liblock — the lock HDF5.jl already uses for its own
thread-safety. One Julia-level barrier covers all libhdf5 calls in the
process regardless of which wrapper made them. No perf impact at
typical diagnostic cadences (microseconds of lock overhead per write).

Adds HDF5 as a direct dependency (already a transitive dep via
ClimaCore.InputOutput). Splits write_field! into an outer
lock-acquiring wrapper and an inner _write_field_impl! so the
existing body is unchanged.

@ph-kev ph-kev left a comment

I did a preliminary review.

I was wondering what was the use case in ClimaAtmos that led to the discovery of this bug?

Comment thread .JuliaFormatter.toml
@@ -1,5 +1,6 @@
-margin = 80
+margin = 92

What was the reason for wanting to update this? I am asking since it seems that the formatter didn't change anything unless it only affected the code that was added in this PR.

@haakon-e haakon-e Apr 24, 2026

Can revert. Just spotted it (different from other CliMA repo settings) and updated, but not needed in this PR.

Let's remove it. It seems that there are inconsistencies with how the formatter is set up for the CliMA repos.

Comment thread .buildkite/pipeline.yml
Comment on lines +32 to +33
- label: "Run thread-safety tests on CPU (8 threads)"
key: "cpu_mt_tests"

I think it would be better to move the test to runtests.jl and start Julia with more threads for the CPU and GPU tests on buildkite and maybe on GHA if that is possible too.

Comment thread src/netcdf_writer.jl
end
end

function _write_field_impl!(writer::NetCDFWriter, field, diagnostic, u, p, t)

Can you add a docstring for this function? It doesn't need to be long, but it would be good to mention that the function is unsafe to call without a lock because of needing to access libhdf5.

Comment thread src/netcdf_writer.jl
Comment on lines +569 to +570
# Serialises every libhdf5 entry from this writer with every other
# libhdf5 user (HDF5.jl, etc.) in this Julia process.

I don't think this is true, since this solution only works for HDF5.jl and NCDatasets.jl. If another package uses libhdf5, this solution won't work. If that is the case, can you update the note to say that this solution won't work if another package uses libhdf5?

I think this can also break if NCDatasets.jl is used elsewhere (e.g. by Oceananigans.jl) at the same time as HDF5.jl. This is more relevant for the coupled model (see CliMA/ClimaCoupler.jl#1700).

Comment on lines +60 to +61
if Threads.nthreads() < 2
@info "Skipping MT race test — Threads.nthreads() < 2"

I think an early return is cleaner here

Suggested change
- if Threads.nthreads() < 2
- @info "Skipping MT race test — Threads.nthreads() < 2"
+ if Threads.nthreads() < 2
+ @info "Skipping MT race test — Threads.nthreads() < 2"
+ return
+ end

and then, you can remove the else conditional.

Comment thread test/writers.jl
end
end

@testset "libhdf5 thread-safety lock" begin

This is duplicated unless this was intended. Can you remove this?

@haakon-e

> I was wondering what was the use case in ClimaAtmos that led to the discovery of this bug?

Thanks. I was running a Cartesian simulation with ~16 columns in atmos where I saved data every 10 minutes (time step ~10 seconds) and reliably observed a segfault after ~40 writes. I have been able to reproduce it with simpler examples (such as the attached test case) when I have a GPU, but not with multi-threaded CPU. This remains a bit of a head-scratcher.

@haakon-e haakon-e marked this pull request as draft April 24, 2026 23:19