Fixes for nightly #803

Open
maleadt wants to merge 7 commits into main from tb/nightly


Member

maleadt commented May 19, 2026

No description provided.

maleadt added 2 commits May 19, 2026 20:58
…21+.

The NVPTX back-end on LLVM 21 dropped its dependence on the legacy
nvvm.annotations metadata for maxntid/reqntid/minctasm/maxnreg; the
asm printer now reads function-level attributes instead, and LLVM
auto-upgrades the annotations into those attributes at IR parse time.
Modules built in-memory don't go through that auto-upgrade, so emit the
attributes ourselves on LLVM 21+. Also move the metadata emission ahead
of optimization so the AnnotationCache lookups done by NVVMIntrRangePass
on older releases don't latch onto a stale empty entry.
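
As a rough illustration of what "emit the attributes ourselves" could look like, here is a minimal sketch that only builds the attribute name/value pairs as plain strings. The helper `launch_bound_attributes` is hypothetical and not taken from this PR. The attribute names (`nvvm.maxntid` and friends) and the comma-separated value format are assumptions based on the four keys named in the commit message. Attaching the pairs to an LLVM function is deliberately left out.

```julia
# Hypothetical helper, not from the PR: turn launch-bound settings into
# (name => value) string pairs for function-level attributes.
# ASSUMPTION: the attribute names are "nvvm.<key>" and multi-dimensional
# bounds are encoded as comma-separated integers.
function launch_bound_attributes(; maxntid=nothing, reqntid=nothing,
                                   minctasm=nothing, maxnreg=nothing)
    attrs = Pair{String,String}[]
    # Thread-count bounds may have up to three dimensions; join them with commas.
    maxntid === nothing || push!(attrs, "nvvm.maxntid" => join(maxntid, ","))
    reqntid === nothing || push!(attrs, "nvvm.reqntid" => join(reqntid, ","))
    # Scalar bounds are stringified as-is.
    minctasm === nothing || push!(attrs, "nvvm.minctasm" => string(minctasm))
    maxnreg === nothing || push!(attrs, "nvvm.maxnreg" => string(maxnreg))
    return attrs
end

attrs = launch_bound_attributes(maxntid=(256, 1, 1), maxnreg=32)
```

Each pair would then be attached to the kernel function as a string attribute when the module targets LLVM 21+, while older releases keep using the nvvm.annotations metadata.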
Member Author

maleadt commented May 19, 2026

Hmm, I can't reproduce these failures locally...

maleadt added 5 commits May 19, 2026 22:26
ParallelTestRunner buffers each worker's stdio into an IOBuffer that's only
printed after the testset completes; an abrupt crash (the metal testset on
Julia nightly) loses the libjulia stderr that names the signal. Drop to a
single worker and enable --verbose so the crash lands on the parent process's
stderr and the 'started at' line identifies the testset in flight. Revert
once the underlying crash is diagnosed.
…nly.

The previous --jobs=1 debug commit didn't help: PTR still constructs the
Malt worker with monitor_stdout/stderr=false, drains the pipes into an
IOBuffer asynchronously, and prints the buffer only when the testset
completes. If the worker is killed abruptly, the libjulia signal/stack
trace lands on the worker's stderr while no one is reading, then the
pipe closes and the message is gone before PTR's @async reader is
scheduled.

Reopen the constructor and pass monitor_stdout=monitor_stderr=true so
Malt forwards directly to the parent. Also narrow the testsuite to the
metal testset so the failing run reaches the crash quickly. Revert all
of this once we have a signal line.
Previous attempts (forcing --jobs=1, then live-forwarding the worker's
stdio via monitor_stdout=monitor_stderr=true) still produced zero output
on the crash. Drop ParallelTestRunner/Malt altogether for this triage
run and include test/metal.jl directly in the parent process. Whatever
kills the worker (signal, libjulia abort, allocation failure, ulimit)
now hits the test driver itself and lands on the CI log's stderr with
nothing in between.
CI surfaced 'received signal: 11' from Pkg.test on the inline-runner
commit, but with no Julia-side stack trace. Preprocess metal.jl on
include and inject an 'entering testset: <name>' line on stderr before
each @testset so the last line in the CI log before the SIGSEGV pins
the crash to a single testset.
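
The preprocessing step described above could be sketched roughly as follows. This is not the code from the commit: `announce_testsets`, `is_testset` and `testset_name` are hypothetical names, and the sketch only handles unqualified `@testset` calls with a literal string name.

```julia
# Sketch only: all function names here are hypothetical.

# True when `ex` is an unqualified `@testset ...` macro call.
is_testset(ex) = ex isa Expr && ex.head === :macrocall &&
                 ex.args[1] === Symbol("@testset")

# Return the literal string name of a testset, or a placeholder.
function testset_name(ex::Expr)
    # args[1] is the macro name and args[2] a LineNumberNode; the rest are user arguments.
    for arg in ex.args[3:end]
        arg isa AbstractString && return arg
    end
    return "<unnamed>"
end

# Recursively rewrite an expression so that every `@testset` is preceded by
# a marker line on stderr, flushed so it survives an abrupt crash.
function announce_testsets(ex)
    ex isa Expr || return ex
    rewritten = Expr(ex.head, Any[announce_testsets(a) for a in ex.args]...)
    is_testset(rewritten) || return rewritten
    name = testset_name(rewritten)
    return quote
        println(stderr, "entering testset: ", $name)
        flush(stderr)
        $rewritten
    end
end
```

Assuming Julia 1.5 or later, the transformation can be applied while loading the file with the two-argument form `include(announce_testsets, "test/metal.jl")`, which passes each top-level expression through the function before evaluating it.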
Previous CI run pinned the SIGSEGV to the 'byref primitives' testset
(last 'entering testset' marker; SIGSEGV 4 s later). Open up the two
byref testsets and announce each individual Metal.code_llvm call on
stderr so we can see whether the crash is the @eval, the non-kernel
compile, or the kernel=true compile.