
Quda gauge flow#84

Open
leonhostetler wants to merge 19 commits into milc-qcd:develop from leonhostetler:quda_gauge_flow

Conversation

@leonhostetler
Collaborator

This PR adds an interface to QUDA's gauge flow routine. It is an intermediate update for MILC gauge flow; a planned future update will refactor the gauge flow so that it can be called as a function from other applications rather than running only as a standalone application.

Recent updates to QUDA (lattice/quda#1555) added a fourth-order integrator, fixed a bug in the topological charge calculation, added the rectangle observable, and added support for anisotropic flow and smearing. With those updates, MILC can now usefully offload gauge flow to QUDA.

Caveats

  • Zeuthen flow is not yet supported by QUDA.
  • Due to different normalization conventions, MILC's plaquette and rectangle values differ from QUDA's by a factor of 3.
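When comparing observables between the two codes, the factor of 3 must be accounted for. A minimal conversion sketch is below; it assumes the factor comes from QUDA normalizing the real trace by the number of colors (N_c = 3) while MILC does not, so that MILC's value is 3 times QUDA's. The direction of the factor should be verified on a known configuration before relying on it.

```python
N_C = 3  # number of colors; assumed source of the normalization factor

def quda_to_milc(value):
    """Convert a QUDA plaquette/rectangle value to MILC's normalization.

    Assumption: QUDA divides the real trace by N_c while MILC does not,
    so MILC's value is N_c times QUDA's. Verify the direction of the
    factor against a known configuration.
    """
    return N_C * value

def milc_to_quda(value):
    """Inverse conversion: MILC's normalization to QUDA's."""
    return value / N_C
```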

Usage

The following MILC targets are supported:

  • wilson_flow: Wilson or Symanzik flow using a third-order Runge-Kutta integrator
  • wilson_flow_bbb: Wilson or Symanzik flow using a fourth-order Runge-Kutta integrator
  • wilson_flow_a: Anisotropic Wilson or Symanzik flow using a third-order Runge-Kutta integrator
  • wilson_flow_bbb_a: Anisotropic Wilson or Symanzik flow using a fourth-order Runge-Kutta integrator

Simply compile with WANTQUDA=true and provide QUDA, QIO, and QMP paths in the usual way.
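For concreteness, a hedged build sketch follows. The installation paths are placeholders, and the exact Makefile variable names (QUDA_HOME, QIOPAR, QMPPAR) are assumptions based on the usual MILC build conventions; check your site's Makefile before copying.

```shell
# Hedged sketch of a QUDA-enabled build of the wilson_flow targets.
# Paths are placeholders; variable names may differ in your Makefile.
cd wilson_flow
make clean
make wilson_flow \
    WANTQUDA=true \
    QUDA_HOME=/path/to/quda \
    QIOPAR=/path/to/qio \
    QMPPAR=/path/to/qmp
```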

Performance

0.09 fm ($64^3 \times 96$):

On 1 Big Red 200 CPU node (128 cores), 40 steps of Wilson flow took 1234s and 40 steps of Symanzik flow took 3993s during a single benchmarking run. On one DeltaAI GH200 GPU (1 GPU not 1 node!) the same took 15.6s and 45.9s respectively. This means (in some sense) that 1 GH200 $\sim$ 10125 CPU cores for Wilson flow and 1 GH200 $\sim$ 11135 CPU cores for Symanzik flow. This is an ideal case for the GPU running because the 0.09 fm lattice fits on a single GH200 (no inter-GPU or inter-node communication is needed) and almost completely fills the GPU. The QUDA test of course took advantage of a previously-generated tunecache.
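The quoted CPU-core equivalences follow directly from scaling the 128 CPU cores by the measured speedup; the arithmetic can be checked as:

```python
cpu_cores = 128  # one Big Red 200 CPU node

# Wilson flow: 1234 s on 128 CPU cores vs 15.6 s on one GH200
wilson_equiv = cpu_cores * 1234 / 15.6

# Symanzik flow: 3993 s on 128 CPU cores vs 45.9 s on one GH200
symanzik_equiv = cpu_cores * 3993 / 45.9

print(round(wilson_equiv), round(symanzik_equiv))  # 10125 11135
```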

0.04 fm ($144^3 \times 288$):

This is now a much larger lattice, so even the GPU case cannot escape the need for inter-node communication. GPU flow benchmarks on Perlmutter are compared to Perlmutter CPU production runs done in 2022. In 2022, 720 steps of Wilson flow performed on 64 CPU nodes took 2.97 hours, whereas today the same performed on 32 GPU nodes takes 10.6 minutes. In this comparison, 1 A100 $\sim$ 1080 CPU cores. For Symanzik flow, the CPU running was done on 128 nodes and 720 flow steps took 4.69 hours for one particular config, but only takes 15.2 minutes when run on 32 GPU nodes. In this comparison, 1 A100 $\sim$ 2360 CPU cores.
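The A100 equivalences can be reproduced from the node counts and timings, assuming the standard Perlmutter configuration of 128 cores per CPU node and 4 A100s per GPU node (an assumption not stated in the benchmark figures above); the computed values round to roughly the quoted ~1080 and ~2360:

```python
cores_per_cpu_node = 128  # assumed: Perlmutter CPU node (2x 64-core EPYC)
gpus_per_gpu_node = 4     # assumed: Perlmutter GPU node (4x A100)

# Wilson flow: 64 CPU nodes, 2.97 h vs 32 GPU nodes, 10.6 min
wilson = (64 * cores_per_cpu_node) * (2.97 * 60 / 10.6) / (32 * gpus_per_gpu_node)

# Symanzik flow: 128 CPU nodes, 4.69 h vs 32 GPU nodes, 15.2 min
symanzik = (128 * cores_per_cpu_node) * (4.69 * 60 / 15.2) / (32 * gpus_per_gpu_node)

print(round(wilson), round(symanzik))  # 1076 2370
```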

@leonhostetler leonhostetler requested a review from bazalesha April 22, 2025 12:48