
Extending the orbit average phase and pseudo orbit averaging to 2x, multispecies simulations, and tandem mirrors #934

Open
Maxwell-Rosen wants to merge 40 commits into main from gk-oap-2x-tandem-multispecies

Conversation


@Maxwell-Rosen Maxwell-Rosen commented Jan 26, 2026

Feature

Summary

Purpose: The orbit average phase (OAP) and pseudo orbit averaging (POA) schemes were first implemented for single-species simulations. They must be generalized to kinetic-electron simulations, 2x, and tandem mirrors. This PR adds these capabilities.

fixes #793

Implementation Details

Key changes:
gk_species_fdot_multiplier.c and gk_species_damping.c are modified to use an array_dg_find_peaks object. This is needed to identify the locations of the walls, the tandem mirror peak, and the inner mirror peak. Critically, the maximum bmag and phi were previously defined at a single point, the mirror throat. Now bmag and phi are arrays: two-dimensional in 2x, or single-component in 1x. These must be global arrays because peak finding is a global operation: each process must see the global bmag and phi to evaluate phi at the peaks of bmag.

array_dg_find_peaks.c is introduced as a new method in core. It takes any DG array and finds its peaks. Currently it is limited to 1D and 2D arrays; extending it to 3D would be considerably more complicated. Given a 2D array, it constructs N 1D arrays for the peaks, and a second 2D array can then be projected onto the peaks of the first. Comprehensive unit tests, including GPU unit tests, are added for this core method. The GPU implementation is somewhat redundant: peak finding is recursive (and done on the CPU), and the projection would be faster on the CPU since it would not have to launch a CUDA kernel. However, Gkeyll is constructed so that memory must live on the GPU when running GPU code, so it is difficult to keep this object on the host and copy over only the necessary information. I tried that, and it interfaced poorly with the loss cone mask module.

loss_cone_mask_gyrokinetic is adjusted for 2x mirrors, tandem mirrors, and kinetic electrons. This required a substantial overhaul of the updater so that it works with arrays, plus extra arguments for the tandem and kinetic-electron cases.

Dependencies: core/zero/array_dg_find_peaks.c

Automated testing: four regression tests are added

  1. gk_mirror_boltz_elc_poa_2x2v_p1
  2. gk_mirror_kinetic_elc_poa_1x2v_p1
  3. gk_mirror_tandem_boltz_elc_poa_1x2v
  4. gk_wham_kinetic_poa_1x2v_p1

LLM details: I relied heavily on LLMs for this PR. The find_peaks method is almost entirely written by LLMs, given a few examples. It's thoroughly tested to machine precision in the unit test and performs well.

Community Standards

  • Documentation has been updated.
  • My code follows the project's coding guidelines.
  • Changes to layer/zero should have a unit test, e.g., core/zero.

Testing:

  • I added a regression test to test this feature.
  • I added this feature to an existing regression test.
  • I added a unit test to test this feature.
  • Ran make check and unit tests all pass.
  • I ran the code with make valcheck, and it is clean.
  • I ran the code through computer-sanitizer on GPU, and it is clean.
  • I ran a few regression tests to ensure no apparent errors.
  • Tested and works on CPU.
  • Tested and works on multi-CPU.
  • Tested and works on GPU.
  • Tested and works on multi-GPU.

Additional Notes

Here is some evidence that the mask behaves correctly in the CFL rate and differs between the electrons and ions.

pgkyl gk_mirror_kinetic_elc_poa_1x2v_p1-ion_cflrate_10.gkyl sel --z0 0.0 pl --title "Ion Z=0" --xlabel "vpar" --ylabel "mu" &
pgkyl gk_mirror_kinetic_elc_poa_1x2v_p1-ion_cflrate_10.gkyl sel --z0 1.5 pl --title "Ion Z=1.5" --xlabel "vpar" --ylabel "mu" &
pgkyl gk_mirror_kinetic_elc_poa_1x2v_p1-elc_cflrate_10.gkyl sel --z0 0.0 pl --title "Elc Z=0.0" --xlabel "vpar" --ylabel "mu" &
pgkyl gk_mirror_kinetic_elc_poa_1x2v_p1-elc_cflrate_10.gkyl sel --z0 1.5 pl --title "Elc Z=1.5" --xlabel "vpar" --ylabel "mu" &
image

The 1x2v test gk_wham_poa_1x2v_p1 reproduces exactly the same mask as main, even in parallel on GPU. All tests here ran on 4 GPUs on Perlmutter.

pgkyl tests-main/gk_wham_poa_1x2v_p1-ion_fdot_multiplier_10.gkyl tests-branch/gk_wham_poa_1x2v_p1-ion_fdot_multiplier_10.gkyl ev "f[0] f[1] -" info
f[0] f[1] - (default#2)
├─ Time: 6.002000e-06
├─ Frame: 10
├─ Number of components: 1
├─ Number of dimensions: 3
├─ Grid: (uniform)
│  ├─ Dim 0: Num. cells: 288; Lower: -2.000000e+00; Upper: 2.000000e+00
│  ├─ Dim 1: Num. cells: 32; Lower: -1.000000e+00; Upper: 1.000000e+00
│  └─ Dim 2: Num. cells: 32; Lower: 0.000000e+00; Upper: 1.000000e+00
├─ Maximum: 0.000000e+00 at (0, 0, 0)
├─ Minimum: 0.000000e+00 at (0, 0, 0)
├─ geometry_type: 2
├─ geqdsk_sign_convention: 0


The tandem mirror loss mask is reproduced well.

$\mu_{\rm left} = \max_{i<k}\left(\frac{\frac{1}{2}m_s v_\parallel^2 + q_s(\phi(z) - \phi_{mi})}{B_{mi} - B(z)}\right)$
$\mu_{\rm right} = \max_{i>k}\left(\frac{\frac{1}{2}m_s v_\parallel^2 + q_s(\phi(z) - \phi_{mi})}{B_{mi} - B(z)}\right)$
$\mu = \min\left(\mu_{\rm left} ,\mu_{\rm right} \right)$

pgkyl gk_mirror_tandem_boltz_elc_poa_1x2v-bmag.gkyl interp pl --title "|B| for tandem mirror" --xlabel "Z"
image

pgkyl gk_mirror_tandem_boltz_elc_poa_1x2v-ion_fdot_multiplier_10.gkyl sel --z2 12 pl --xlabel "Z" --ylabel "computational vpar" --title "Loss cone mask of tandem mirror"
--c2p-vel does not work for this figure because it is c0.
image

compute-sanitizer clean on Perlmutter:
compute-sanitizer --tool memcheck --leak-check full ./cuda-build/gyrokinetic/creg/rt_gk_mirror_boltz_elc_poa_2x2v_p1 -g -s1
========= COMPUTE-SANITIZER

Running phase 0 @ t = 0.000000000e+00 ...
Taking time-step 1 at t = 0 ... dt = 0.000132681

Number of update calls 1
Number of forward-Euler calls 3
Number of RK stage-2 failures 0
Number of RK stage-3 failures 0
Number of write calls 9
Timing: [removed]
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 0 errors

rt_gk_mirror_boltz_elc_damped_1x2v_p1 runs as usual on 4 GPU.

This produces reasonable results for the 2x2v mirror:
pgkyl gk_mirror_boltz_elc_poa_2x2v_p1-ion_fdot_multiplier_2.gkyl sel --z1 0.0 --z3 14 pl --xlabel "psi" --ylabel "vpar"
image
pgkyl gk_mirror_boltz_elc_poa_2x2v_p1-ion_fdot_multiplier_2.gkyl sel --z0 0 --z3 14 pl --xlabel "Z" --ylabel "vpar"
image

…e maximum magnetic field is determined. Bmag max is stored as a gkyl_array. Right now we only do this for bmag, but we need to store phi as a 1D maximum array as well. I haven't decided on the final design for how 2x OAP simulations should be accommodated. Perhaps we need some general gkyl_dg_array_reduce methods that take a 2D array and turn it into a 1D array instead of a 0D number. This is a kind of reduction, but not a total reduction. I tested the included regression test for a 2x2v Boltzmann mirror, and the output of the magnetic field looks correct. The current implementation evaluates bmag at cell corners, but ideally we should use the corners in Z and the quadrature nodes in psi.
- Introduced a new header file `gkyl_array_dg_find_peaks.h` that defines a structure and functions for finding peaks (local maxima, minima, and boundary values) in DG fields.
- Implemented an internal structure in `gkyl_array_dg_find_peaks_priv.h` to manage peak finding operations, including storage for peak values and coordinates.
- Removed unused initialization and writing of `bmag_max` arrays in `gyrokinetic.c` to streamline the geometry setup process.
- Deleted the `gkyl_gk_geometry_bmag_max_init` and `gkyl_gk_geometry_bmag_max_release` functions from `gk_geometry.c` as they are no longer needed, simplifying the geometry management.
…es another DG array at the peaks of the initialized array. Add appropriate unit tests, which pass, to ctest_array_dg_find_peaks. Update gk_species_fdot_multiplier to use the project_on_peaks function with phi. Now everything passed to loss_cone_mask_gyrokinetic is a gkyl_array, and the loss_cone_mask is updated accordingly.
…ount of computation we need for evaluating phi at its peak in the app. Unit tests pass. Regression tests look fine as well; they're all valgrind clean. I think the right way to do the parallelism is to do the peak finding on a global bmag, just like it is done for the position_map. Then, when we evaluate phi, all processes evaluate it at this peak, but only one will return a true value. That process will broadcast the array to the rest of the processes.
… method, which is just like find_peaks, but it computes the global maximum or minimum.
…nd regression tests are brought over from another branch. The unit tests for the array mask and loss cone mask, and the regression tests for the kinetic-electron POA mirror, are valgrind clean.
…grind clean. Regression test is added and produces reasonable results.
…formance

- Introduced a helper function `mkarr` to streamline array allocation for GPU and CPU.
- Removed the `gkyl_loss_cone_mask_gyrokinetic_Dbmag_quad_wall` function and integrated its logic into the main processing flow.
- Updated the GPU kernel `gkyl_loss_cone_mask_gyrokinetic_Dbmag_quad_cu_ker` to compute `Dbmag_quad` directly from `bmag_peak` instead of `bmag_max`.
- Enhanced tandem mirror support by adding handling for `bmag_peak` and `phi_m` in the GPU kernels.
- Simplified the logic for determining trapped particles in the `gkyl_loss_cone_mask_gyrokinetic_ker` and `gkyl_loss_cone_mask_gyrokinetic_quad_ker` functions.
- Improved readability and maintainability by restructuring conditional checks and variable assignments.
…plier. Add the possibility of kinetic electrons and tandem mirrors. The damping regression test is failing both here and on main, for different reasons: main fails because the loss_cone updater has an issue, while here it fails because it uses scale_by_cell with a multi-component array. I'm not sure of the right way to fix this.
…ove the aspects of the cellwise evaluation and quadrature points, because they break the array_scale_by_cell method that is used.
… refactor this in the future, but it is just proof of concept for now to make sure it works correctly.
… arrays need to be passed to objects like the loss_cone_mask, which expects them to be GPU arrays. It's easier to have this module fit the architecture of the rest of the code than to do something different and copy between device and host; it wouldn't interface well. Claude generated most of the CUDA code, with strong guidance from Maxwell.
…h compute-sanitizer with the array_find_peaks, which was causing crashes in the loss cone mask. These issues are fixed. There was some funny business regarding the basis being on the host vs. the device.

Refactor the allocations in the GPU kernels to not be inside the kernels. Instead, it's allocated at init time. The GPU code pulses, which is odd, but it runs. The 2x2v POA regression test runs and is compute sanitizer clean. The other POA tests do not error either on GPU and are compute sanitizer clean.
…itting code to main and broke a unit test with the geometry enum changes
…hi_smooth_global array for improved performance and consistency across computations.
…ion of doing the kinetic electron and tandem mirror. The code is built again to make sure nothing is affected.
@Maxwell-Rosen Maxwell-Rosen added the enhancement New feature or request label Feb 24, 2026
@Maxwell-Rosen Maxwell-Rosen marked this pull request as ready for review February 24, 2026 18:57
@Maxwell-Rosen

@JunoRavin and I have expressed some concerns over the GPU implementation of the find_peaks methods. It's a bit silly to compute here on the GPU, since we're nowhere close to saturating it. However, from a memory management standpoint, having everything on the GPU is simpler. The memory management organization is why I decided on the current implementation.

To ensure that we are not wasting time, I have timed the gkyl_array_dg_find_peaks_project_on_peak_idx function in the test rt_gk_mirror_boltz_elc_poa_1x2v_p1 on 1 GPU on Perlmutter, summing all the time spent in each function call over the entire regression test. Of the find_peaks methods, only project_on_peak is called inside the time loop; the advance function runs during initialization of the fdot multiplier to find where the peaks are.

Even inside this gk_species_fdot_multiplier_advance_loss_cone_mult function, the cost of the peak projection is small compared to the gkyl_loss_cone_mask_gyrokinetic_advance, so we should not be worried about inefficient use of the GPU. Accelerating this module will not accelerate the code in a meaningful way.

advance_loss_cone_mult accumulated timings over 621 calls (s):
  allgather        : 0.00580832
  find_peaks (max) : 0.00985657
  loss_cone_multiplier_advance      : 0.0727482
  scale_by_cell    : 0.00640832
  total            : 0.104605

In total, this simulation took 18 seconds and 0.02 seconds were spent on the gkyl_array_dg_find_peaks_project_on_peak_idx function. I would say this is a win.

…egression test (and in my production simulations)
…heme. There is no POA scheme in the run methods.
…nto the loss_cone_mask_advance, but the iterators are local, so they were iterating over the local ranges, which was offsetting phi when it is evaluated. It should instead use the local phi. I also cleaned up the unit tests a little.

I also realize that we should not be calling up->c2p_pos(xmu, xmu, up->c2p_pos_ctx);. This is because the spatial coordinates are only used to find if we are beyond the mirror throat and in the expander. The mirror throat is evaluated in computational coordinates, so we should compare computational to computational coordinates. I'm not sure if this is a mistake on main because in this branch, I re-did how the mirror throat is found using the find_peaks operator.
…erything is done in computational coordinates
@Maxwell-Rosen

Maxwell-Rosen commented Mar 17, 2026

I have my first example of a 2x mirror of ions with a Boltzmann field, using the POA scheme, run on 4 GPUs on Perlmutter. The simulation used 8 cells in psi, the sqrt_psi grid, $R_m = 5$, $\tau_{\rm OAP} = 0.1\,{\rm s}$, $\tau_{\rm FDP} = 5 \times 10^{-6}\,{\rm s}$, 4 POA cycles, no ion-electron collisions, and no time dilation. I loaded the initial condition from a 1x simulation for this geometry, projected along all field lines. This simulation is brief, but it shows promise: it does not crash.

image

Here is the CFL rate on the interior field line at the midplane. The --c2p-vel option gives an error when making this plot.
image
Here is the distribution function at this point
image



Development

Successfully merging this pull request may close these issues.

[DR] Pseudo-orbit average (POA) multiscale algorithm for magnetic mirrors
