Conversation
…or thread safety In parallel ODE solving, multiple threads reading/writing the shared op->badSolve flag creates a race condition where one thread's failed solve can incorrectly prevent other threads from continuing. This introduces a localBadSolve variable in ind_indLin0, ind_liblsoda0, and ind_lsoda0 that tracks bad-solve state per-individual, and removes the shared op->badSolve check from iniSubject in par_solve.h. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These shared fields are written from multiple threads during parallel ODE solving. Without synchronization this is undefined behavior in C/C++. Use #pragma omp atomic write to ensure thread-safe stores. Also refactor push*Dose allocation failure handling to use local flags instead of writing op->badSolve inside critical sections then reading it outside. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rf_errorcall is not thread-safe and will crash or corrupt R state when called from OpenMP worker threads. Replace with the same NA_REAL return pattern already used for backward==2 case, plus setting the badSolve flag atomically to signal the error after the parallel region. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The abort flag is read by all threads and written by the master thread without synchronization. Using volatile prevents the compiler from caching the value in a register, ensuring threads see updates promptly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rx_get_thread(mx) used <= which allowed index==mx (one past the end of per-thread arrays sized to mx). Changed to < for correct bounds. Also changed the overflow fallback from returning 0 (which collides with thread 0's slot) to clamping to mx-1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
linCmtB used a single global stan::math::linCmtStan object and associated Eigen matrices, making it unsafe for parallel execution. Refactored to use the same per-thread vector pattern as linCmtA: - Created linB_t struct holding all per-thread state - Changed __linCmtB from a single object to std::vector<linB_t> - Added ensureLinCmtB(nCores) called during solve setup - linCmtB() now indexes __linCmtB by omp_get_thread_num() - R-facing utility functions use element 0 (main thread) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move all 16 file-scope static variables in dop853.c into a dop853_ctx_t struct defined in dop853.h. Each call to dop853() now allocates the context on the stack, making the solver reentrant and safe for concurrent use from multiple threads. Changes: - Add dop853_ctx_t struct to dop853.h containing nfcn/nstep/naccpt/nrejct, hout/xold/xout, nrds/indir, yy1/k1-k10, and rcont1-rcont8 - dopcor() takes dop853_ctx_t* as first parameter, uses local aliases for array pointers to minimize diff in the computation body - contd8() becomes static and takes dop853_ctx_t* (never called externally since all callers use iout=0) - Remove accessor functions (nfcnRead, nstepRead, etc.) which were never called from outside dop853.c - Public dop853() API is unchanged - callers need no modifications This enables future parallelization of par_dop() with OpenMP, which was previously blocked by the comment "This part CAN be parallelized, if dop is thread safe..." Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in src/handle_evid.cpp by keeping our thread-safe version (atomic badSolve flag + NA_REAL return) over main's Rf_errorcall calls, which are not safe from OpenMP parallel regions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The macro `#define J lcb.J` causes `lcb.J` on line 683 to expand to `lcb.lcb.J`, which fails because linB_t has no member named lcb. Use the macro name `J` directly instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Some of these are more dangerous like the dop optimizations. Claude (at least for me), isn't exactly always doing the right things 😄 |
|
But they don't hurt. It just takes a review and testing I guess. |
|
Yeah, I looked at each and they seemed reasonable to me, but I'm not sure that they are. I have several projects now that need faster estimation, so I thought, what faster way than making nlmixr2 faster! 😄 |
|
Well all the pooled estimation methods are now multi-threaded in the development version. This issue with focei may be in The fixes here may make the threading more stable, but |
|
I think the key is to make tests such that you feel you're adequately plumbing the depths. That's what I've been doing. I'll take a function that I want Claude to work on an just build a battery of tests for it -- I have claude make the tests :). Then I have it do what I want with those tests run at the end to make sure it didn't break anything. I think the key is it may be difficult to do this with all those. Though, like I said, you can have Claude make them. |
|
For #991 I have re-prompted and it still doesn't get the code right yet... It has tests but it isn't producing the right behavior. |
|
Are you using copilot? If you can, try pulling it to your personal computer and use claude code it's indescribably better. Copilot at Pfizer is still telling me to use R functions that dont exist. |
|
I am using claude in the terminal. I will have to wait unitl tomorrow since I have maxed the usage again. |
|
(You can see that Bill is also using claude. In github it shows Bill and I are co-authoring with Claude). |
|
Sorry. You mentioned copilot in the meeting the other day. I'll try and see if I can help tomorrow after work. Can you point out the tests you're using? |
|
And FYI, the reason I came here is that a crash with some of the work I'm having Claude do in nlmixr2est noted a crash in rxode2 (mainly around the unsuccessful integration flag). So, I thought that I'd do a general check rather than just the targeted check for the one issue already identified. |
|
It seems more reasonable now, but it doens't pass the test suite. I am asking for it to fix it. We will see what happens. |
|
This passes, asking for copilot review. |
There was a problem hiding this comment.
Pull request overview
This PR is part of a broader effort to make rxode2 (and by extension nlmixr2 FOCEI estimation) more thread-safe for parallel ODE solving under OpenMP.
Changes:
- Refactored
dop853from file-scope static variables to a per-call context struct (dop853_ctx_t), making it fully reentrant. - Refactored
linCmtBfrom a single global object to a per-threadstd::vector<linB_t>, matching the existinglinCmtAthreading pattern. - Replaced reads of the shared
op->badSolveflag mid-solve with per-individuallocalBadSolvevariables, and protected writes toop->badSolve/op->naTimewith#pragma omp atomic write; replaced non-thread-safeRf_errorcallcalls in hot-path functions.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
src/rxomp.h |
Fixed off-by-one in thread index check (<= → <) and changed out-of-range fallback from slot 0 to last valid slot |
src/rxData.cpp |
Added call to ensureLinCmtB(nCores) before parallel solve to pre-allocate per-thread linCmtB state |
src/par_solve.h |
Removed early op->badSolve exit from iniSubject to prevent one thread's failure from blocking other threads' initialization |
src/par_solve.cpp |
Added localBadSolve variables, volatile int abort, and _Pragma("omp atomic write") in badSolveExit macro |
src/linCmt.cpp |
Replaced global __linCmtB singleton with a per-thread std::vector<linB_t>; added ensureLinCmtB |
src/handle_evid.cpp |
Replaced Rf_errorcall (not thread-safe) with atomic write to op->badSolve + return NA |
src/dop853.h |
Added dop853_ctx_t struct; removed declarations of deleted accessor functions |
src/dop853.c |
Moved all static solver state into a local dop853_ctx_t on the stack; deleted global accessor functions |
inst/include/rxode2parseHandleEvid.h |
Used allocFailed flag + #pragma omp atomic write instead of direct op->badSolve assignment inside/outside critical sections |
inst/include/rxode2parseGetTime.h |
Replaced direct writes to op->badSolve/op->naTime with atomic writes via temporary variables |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| #define yp __linCmtByp | ||
| #define g __linCmtBg | ||
| // Per-thread linCmtB state (matching linCmtA pattern for thread safety) | ||
| linB_t &lcb = __linCmtB[omp_get_thread_num()]; |
There was a problem hiding this comment.
The linCmtB function indexes into __linCmtB using omp_get_thread_num() directly, without any bounds check. If omp_get_thread_num() returns a value >= __linCmtB.size() (e.g., if the function is called from a context where more threads exist than nCores was set to, or if ensureLinCmtB was not called), this produces an out-of-bounds vector access and undefined behavior. The similar linCmtA function has the same pattern, but the safer rx_get_thread(op->cores) is used elsewhere in the codebase (e.g., in _setIndPointersByThread). Consider using rx_get_thread(__linCmtB.size()) or adding a bounds assertion here.
| linB_t &lcb = __linCmtB[omp_get_thread_num()]; | |
| linB_t &lcb = __linCmtB[rx_get_thread(__linCmtB.size())]; |
| if (ISNA(ret)) { | ||
| op->badSolve=1; | ||
| if (op->naTime == 0) { | ||
| op->naTime = 1 + 10*cmt; | ||
| } | ||
| int newBadSolve = 1; | ||
| int newNaTime = 1 + 10*cmt; | ||
| #pragma omp atomic write | ||
| op->badSolve = newBadSolve; | ||
| #pragma omp atomic write | ||
| op->naTime = newNaTime; |
There was a problem hiding this comment.
The old code had if (op->naTime == 0) { op->naTime = 1 + 10*cmt; } to preserve the first error's diagnostic code. The new code unconditionally overwrites op->naTime with every error occurrence. In parallel mode this means the error code used for the final error message (in rxode2_df.cpp:212-239) is non-deterministic — it may reflect a later/different error rather than the first one that triggered the bad solve. While op->badSolve just needs to be set once, the op->naTime diagnostic value ideally should preserve the first non-zero value to give a consistent error message. The same behavior change applies to getRate, getDur, and handleInfusionItem in this file.
| Several functions provide access to different values : | ||
|
|
||
| xRead x value for which the solution has been computed (x=xend after | ||
| successful return). | ||
|
|
||
| hRead Predicted step size of the last accepted step (useful for a subsequent | ||
| call to dop853). | ||
|
|
||
| nstepRead Number of used steps. | ||
| naccptRead Number of accepted steps. | ||
| nrejctRead Number of rejected steps. | ||
| nfcnRead Number of function calls. | ||
|
|
||
|
|
||
| */ | ||
|
|
||
|
|
||
| #include <stdio.h> | ||
| #include <limits.h> | ||
|
|
||
| typedef void (*FcnEqDiff)(int *nptr, double x, double *y, double *f); | ||
| typedef void (*SolTrait)(long int nr, double xold, double x, double* y, int *nptr, int* irtrn); | ||
|
|
||
| /* Context struct holding all solver state, enabling thread-safe reentrant calls. | ||
| Previously these were file-scope static variables in dop853.c. */ | ||
| typedef struct { | ||
| long int nfcn, nstep, naccpt, nrejct; | ||
| double hout, xold, xout; | ||
| int nrds, *indir; | ||
| double *yy1, *k1, *k2, *k3, *k4, *k5, *k6, *k7, *k8, *k9, *k10; | ||
| double *rcont1, *rcont2, *rcont3, *rcont4; | ||
| double *rcont5, *rcont6, *rcont7, *rcont8; | ||
| } dop853_ctx_t; |
There was a problem hiding this comment.
The dop853.h documentation comment block (lines 160-171) still references the now-deleted accessor functions (xRead, hRead, nstepRead, naccptRead, nrejctRead, nfcnRead) and the comment at line 77 documents contd8(i,s) with the old two-argument signature, whereas the refactored contd8 is now static and takes three arguments (dop853_ctx_t *ctx, int ii, double x). This documentation section should be updated or removed to avoid confusion for future maintainers.
This is part of an effort I'm trying to make focei estimation in nlmixr2 more thread safe.