Make rxode2 more thread safe by billdenney · Pull Request #992 · nlmixr2/rxode2

billdenney · 2026-03-02T22:52:32Z

This is part of an effort I'm trying to make focei estimation in nlmixr2 more thread safe.

…or thread safety In parallel ODE solving, multiple threads reading/writing the shared op->badSolve flag creates a race condition where one thread's failed solve can incorrectly prevent other threads from continuing. This introduces a localBadSolve variable in ind_indLin0, ind_liblsoda0, and ind_lsoda0 that tracks bad-solve state per-individual, and removes the shared op->badSolve check from iniSubject in par_solve.h. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

These shared fields are written from multiple threads during parallel ODE solving. Without synchronization this is undefined behavior in C/C++. Use #pragma omp atomic write to ensure thread-safe stores. Also refactor push*Dose allocation failure handling to use local flags instead of writing op->badSolve inside critical sections then reading it outside. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rf_errorcall is not thread-safe and will crash or corrupt R state when called from OpenMP worker threads. Replace with the same NA_REAL return pattern already used for backward==2 case, plus setting the badSolve flag atomically to signal the error after the parallel region. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The abort flag is read by all threads and written by the master thread without synchronization. Using volatile prevents the compiler from caching the value in a register, ensuring threads see updates promptly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rx_get_thread(mx) used <= which allowed index==mx (one past the end of per-thread arrays sized to mx). Changed to < for correct bounds. Also changed the overflow fallback from returning 0 (which collides with thread 0's slot) to clamping to mx-1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

linCmtB used a single global stan::math::linCmtStan object and associated Eigen matrices, making it unsafe for parallel execution. Refactored to use the same per-thread vector pattern as linCmtA: - Created linB_t struct holding all per-thread state - Changed __linCmtB from a single object to std::vector<linB_t> - Added ensureLinCmtB(nCores) called during solve setup - linCmtB() now indexes __linCmtB by omp_get_thread_num() - R-facing utility functions use element 0 (main thread) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move all 16 file-scope static variables in dop853.c into a dop853_ctx_t struct defined in dop853.h. Each call to dop853() now allocates the context on the stack, making the solver reentrant and safe for concurrent use from multiple threads. Changes: - Add dop853_ctx_t struct to dop853.h containing nfcn/nstep/naccpt/nrejct, hout/xold/xout, nrds/indir, yy1/k1-k10, and rcont1-rcont8 - dopcor() takes dop853_ctx_t* as first parameter, uses local aliases for array pointers to minimize diff in the computation body - contd8() becomes static and takes dop853_ctx_t* (never called externally since all callers use iout=0) - Remove accessor functions (nfcnRead, nstepRead, etc.) which were never called from outside dop853.c - Public dop853() API is unchanged - callers need no modifications This enables future parallelization of par_dop() with OpenMP, which was previously blocked by the comment "This part CAN be parallelized, if dop is thread safe..." Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Resolve conflicts in src/handle_evid.cpp by keeping our thread-safe version (atomic badSolve flag + NA_REAL return) over main's Rf_errorcall calls, which are not safe from OpenMP parallel regions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The macro `#define J lcb.J` causes `lcb.J` on line 683 to expand to `lcb.lcb.J`, which fails because linB_t has no member named lcb. Use the macro name `J` directly instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mattfidler · 2026-03-03T02:07:55Z

Some of these are more dangerous like the dop optimizations.

Claude (at least for me), isn't exactly always doing the right things 😄

mattfidler · 2026-03-03T02:10:15Z

But they don't hurt. It just takes a review and testing I guess.

billdenney · 2026-03-03T02:19:44Z

Yeah, I looked at each and they seemed reasonable to me, but I'm not sure that they are. I have several projects now that need faster estimation, so I thought, what faster way than making nlmixr2 faster! 😄

mattfidler · 2026-03-03T02:41:06Z

Well all the pooled estimation methods are now multi-threaded in the development version.

This issue with focei may be in n1qn1, lbfgsb3, nlmixr2est, rxode2 or a combination of all the packages.

The fixes here may make the threading more stable, but rxode2 is already thread safe for saem, nlm (and relevant methods), and simple simulation.

john-harrold · 2026-03-03T03:10:59Z

I think the key is to make tests such that you feel you're adequately plumbing the depths. That's what I've been doing. I'll take a function that I want Claude to work on an just build a battery of tests for it -- I have claude make the tests :). Then I have it do what I want with those tests run at the end to make sure it didn't break anything. I think the key is it may be difficult to do this with all those. Though, like I said, you can have Claude make them.

mattfidler · 2026-03-03T03:51:45Z

For #991 I have re-prompted and it still doesn't get the code right yet... It has tests but it isn't producing the right behavior.

john-harrold · 2026-03-03T04:01:34Z

Are you using copilot? If you can, try pulling it to your personal computer and use claude code it's indescribably better. Copilot at Pfizer is still telling me to use R functions that dont exist.

mattfidler · 2026-03-03T04:27:26Z

I am using claude in the terminal. I will have to wait unitl tomorrow since I have maxed the usage again.

mattfidler · 2026-03-03T04:30:00Z

(You can see that Bill is also using claude. In github it shows Bill and I are co-authoring with Claude).

john-harrold · 2026-03-03T06:07:26Z

Sorry. You mentioned copilot in the meeting the other day.

I'll try and see if I can help tomorrow after work. Can you point out the tests you're using?

billdenney · 2026-03-03T13:24:19Z

And FYI, the reason I came here is that a crash with some of the work I'm having Claude do in nlmixr2est noted a crash in rxode2 (mainly around the unsuccessful integration flag). So, I thought that I'd do a general check rather than just the targeted check for the one issue already identified.

mattfidler · 2026-03-04T00:41:49Z

It seems more reasonable now, but it doens't pass the test suite. I am asking for it to fix it. We will see what happens.

mattfidler · 2026-03-05T02:56:16Z

This passes, asking for copilot review.

Copilot

Pull request overview

This PR is part of a broader effort to make rxode2 (and by extension nlmixr2 FOCEI estimation) more thread-safe for parallel ODE solving under OpenMP.

Changes:

Refactored dop853 from file-scope static variables to a per-call context struct (dop853_ctx_t), making it fully reentrant.
Refactored linCmtB from a single global object to a per-thread std::vector<linB_t>, matching the existing linCmtA threading pattern.
Replaced reads of the shared op->badSolve flag mid-solve with per-individual localBadSolve variables, and protected writes to op->badSolve/op->naTime with #pragma omp atomic write; replaced non-thread-safe Rf_errorcall calls in hot-path functions.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`src/rxomp.h`	Fixed off-by-one in thread index check (`<=` → `<`) and changed out-of-range fallback from slot 0 to last valid slot
`src/rxData.cpp`	Added call to `ensureLinCmtB(nCores)` before parallel solve to pre-allocate per-thread linCmtB state
`src/par_solve.h`	Removed early `op->badSolve` exit from `iniSubject` to prevent one thread's failure from blocking other threads' initialization
`src/par_solve.cpp`	Added `localBadSolve` variables, `volatile int abort`, and `_Pragma("omp atomic write")` in `badSolveExit` macro
`src/linCmt.cpp`	Replaced global `__linCmtB` singleton with a per-thread `std::vector<linB_t>`; added `ensureLinCmtB`
`src/handle_evid.cpp`	Replaced `Rf_errorcall` (not thread-safe) with atomic write to `op->badSolve` + return NA
`src/dop853.h`	Added `dop853_ctx_t` struct; removed declarations of deleted accessor functions
`src/dop853.c`	Moved all static solver state into a local `dop853_ctx_t` on the stack; deleted global accessor functions
`inst/include/rxode2parseHandleEvid.h`	Used `allocFailed` flag + `#pragma omp atomic write` instead of direct `op->badSolve` assignment inside/outside critical sections
`inst/include/rxode2parseGetTime.h`	Replaced direct writes to `op->badSolve`/`op->naTime` with atomic writes via temporary variables

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copilot · 2026-03-05T03:02:15Z

src/linCmt.cpp

-#define yp        __linCmtByp
-#define g         __linCmtBg
+  // Per-thread linCmtB state (matching linCmtA pattern for thread safety)
+  linB_t &lcb = __linCmtB[omp_get_thread_num()];


The linCmtB function indexes into __linCmtB using omp_get_thread_num() directly, without any bounds check. If omp_get_thread_num() returns a value >= __linCmtB.size() (e.g., if the function is called from a context where more threads exist than nCores was set to, or if ensureLinCmtB was not called), this produces an out-of-bounds vector access and undefined behavior. The similar linCmtA function has the same pattern, but the safer rx_get_thread(op->cores) is used elsewhere in the codebase (e.g., in _setIndPointersByThread). Consider using rx_get_thread(__linCmtB.size()) or adding a bounds assertion here.

Suggested change

linB_t &lcb = __linCmtB[omp_get_thread_num()];

linB_t &lcb = __linCmtB[rx_get_thread(__linCmtB.size())];

Copilot · 2026-03-05T03:02:15Z

inst/include/rxode2parseGetTime.h

  if (ISNA(ret)) {
-    op->badSolve=1;
-    if (op->naTime == 0) {
-      op->naTime = 1 + 10*cmt;
-    }
+    int newBadSolve = 1;
+    int newNaTime = 1 + 10*cmt;
+#pragma omp atomic write
+    op->badSolve = newBadSolve;
+#pragma omp atomic write
+    op->naTime = newNaTime;


The old code had if (op->naTime == 0) { op->naTime = 1 + 10*cmt; } to preserve the first error's diagnostic code. The new code unconditionally overwrites op->naTime with every error occurrence. In parallel mode this means the error code used for the final error message (in rxode2_df.cpp:212-239) is non-deterministic — it may reflect a later/different error rather than the first one that triggered the bad solve. While op->badSolve just needs to be set once, the op->naTime diagnostic value ideally should preserve the first non-zero value to give a consistent error message. The same behavior change applies to getRate, getDur, and handleInfusionItem in this file.

Copilot · 2026-03-05T03:02:16Z

src/dop853.h

 Several functions provide access to different values :

 xRead   x value for which the solution has been computed (x=xend after
 	successful return).

 hRead   Predicted step size of the last accepted step (useful for a subsequent
 	call to dop853).

 nstepRead   Number of used steps.
 naccptRead  Number of accepted steps.
 nrejctRead  Number of rejected steps.
 nfcnRead    Number of function calls.


 */


 #include <stdio.h>
 #include <limits.h>

 typedef void (*FcnEqDiff)(int *nptr, double x, double *y, double *f);
 typedef void (*SolTrait)(long int nr, double xold, double x, double* y, int *nptr, int* irtrn);

+/* Context struct holding all solver state, enabling thread-safe reentrant calls.
+   Previously these were file-scope static variables in dop853.c. */
+typedef struct {
+  long int     nfcn, nstep, naccpt, nrejct;
+  double       hout, xold, xout;
+  int          nrds, *indir;
+  double       *yy1, *k1, *k2, *k3, *k4, *k5, *k6, *k7, *k8, *k9, *k10;
+  double       *rcont1, *rcont2, *rcont3, *rcont4;
+  double       *rcont5, *rcont6, *rcont7, *rcont8;
+} dop853_ctx_t;


The dop853.h documentation comment block (lines 160-171) still references the now-deleted accessor functions (xRead, hRead, nstepRead, naccptRead, nrejctRead, nfcnRead) and the comment at line 77 documents contd8(i,s) with the old two-argument signature, whereas the refactored contd8 is now static and takes three arguments (dop853_ctx_t *ctx, int ii, double x). This documentation section should be updated or removed to avoid confusion for future maintainers.

billdenney and others added 9 commits March 2, 2026 17:47

Merge main into parallel-updates

a7e3241

Resolve conflicts in src/handle_evid.cpp by keeping our thread-safe version (atomic badSolve flag + NA_REAL return) over main's Rf_errorcall calls, which are not safe from OpenMP parallel regions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mattfidler requested a review from Copilot March 5, 2026 02:54

Copilot started reviewing on behalf of mattfidler March 5, 2026 02:55 View session

Copilot AI reviewed Mar 5, 2026

View reviewed changes

	linB_t &lcb = __linCmtB[omp_get_thread_num()];
	linB_t &lcb = __linCmtB[rx_get_thread(__linCmtB.size())];

Conversation

billdenney commented Mar 2, 2026

Uh oh!

mattfidler commented Mar 3, 2026

Uh oh!

mattfidler commented Mar 3, 2026

Uh oh!

billdenney commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mattfidler commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

john-harrold commented Mar 3, 2026

Uh oh!

mattfidler commented Mar 3, 2026

Uh oh!

john-harrold commented Mar 3, 2026

Uh oh!

mattfidler commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mattfidler commented Mar 3, 2026

Uh oh!

john-harrold commented Mar 3, 2026

Uh oh!

billdenney commented Mar 3, 2026

Uh oh!

mattfidler commented Mar 4, 2026

Uh oh!

mattfidler commented Mar 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

billdenney commented Mar 3, 2026 •

edited

Loading

mattfidler commented Mar 3, 2026 •

edited

Loading

mattfidler commented Mar 3, 2026 •

edited

Loading