
Quda gauge flow#84

Open
leonhostetler wants to merge 19 commits into milc-qcd:develop from leonhostetler:quda_gauge_flow

Conversation

@leonhostetler
Collaborator

This PR adds an interface to QUDA's gauge flow routine. It is an intermediate update for MILC gauge flow; a planned future update will refactor the gauge flow so that it can be called as a function from other applications rather than running only as a standalone application.

Recent updates to QUDA (lattice/quda#1555) added a fourth-order integrator, fixed a bug in the topological charge calculation, added the rectangle observable, and added support for anisotropic flow and smearing. With those updates, MILC can now usefully offload gauge flow to QUDA.

Caveats

  • Zeuthen flow is not yet supported by QUDA.
  • Due to different normalization conventions, MILC's plaquette and rectangle values differ from QUDA's by a factor of 3.
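When comparing observables between the two codes, the factor of 3 must be accounted for. A minimal conversion sketch is below; it assumes the factor comes from QUDA normalizing the real trace by the number of colors (N_c = 3) while MILC does not, so that MILC's value is 3 times QUDA's. The direction of the factor should be verified on a known configuration before relying on it.

```python
N_C = 3  # number of colors; assumed source of the normalization factor

def quda_to_milc(value):
    """Convert a QUDA plaquette/rectangle value to MILC's normalization.

    Assumption: QUDA divides the real trace by N_c while MILC does not,
    so MILC's value is N_c times QUDA's. Verify the direction of the
    factor against a known configuration.
    """
    return N_C * value

def milc_to_quda(value):
    """Inverse conversion: MILC's normalization to QUDA's."""
    return value / N_C
```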

Usage

The following MILC targets are supported:

  • wilson_flow: Wilson or Symanzik flow using a third-order Runge-Kutta integrator
  • wilson_flow_bbb: Wilson or Symanzik flow using a fourth-order Runge-Kutta integrator
  • wilson_flow_a: Anisotropic Wilson or Symanzik flow using a third-order Runge-Kutta integrator
  • wilson_flow_bbb_a: Anisotropic Wilson or Symanzik flow using a fourth-order Runge-Kutta integrator

Simply compile with WANTQUDA=true and provide QUDA, QIO, and QMP paths in the usual way.
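For concreteness, a hedged build sketch follows. The installation paths are placeholders, and the exact Makefile variable names (QUDA_HOME, QIOPAR, QMPPAR) are assumptions based on the usual MILC build conventions; check your site's Makefile before copying.

```shell
# Hedged sketch of a QUDA-enabled build of the wilson_flow targets.
# Paths are placeholders; variable names may differ in your Makefile.
cd wilson_flow
make clean
make wilson_flow \
    WANTQUDA=true \
    QUDA_HOME=/path/to/quda \
    QIOPAR=/path/to/qio \
    QMPPAR=/path/to/qmp
```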

Performance

0.09 fm ($64^3 \times 96$):

On 1 Big Red 200 CPU node (128 cores), 40 steps of Wilson flow took 1234s and 40 steps of Symanzik flow took 3993s during a single benchmarking run. On one DeltaAI GH200 GPU (1 GPU not 1 node!) the same took 15.6s and 45.9s respectively. This means (in some sense) that 1 GH200 $\sim$ 10125 CPU cores for Wilson flow and 1 GH200 $\sim$ 11135 CPU cores for Symanzik flow. This is an ideal case for the GPU running because the 0.09 fm lattice fits on a single GH200 (no inter-GPU or inter-node communication is needed) and almost completely fills the GPU. The QUDA test of course took advantage of a previously-generated tunecache.
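The quoted CPU-core equivalences follow directly from scaling the 128 CPU cores by the measured speedup; the arithmetic can be checked as:

```python
cpu_cores = 128  # one Big Red 200 CPU node

# Wilson flow: 1234 s on 128 CPU cores vs 15.6 s on one GH200
wilson_equiv = cpu_cores * 1234 / 15.6

# Symanzik flow: 3993 s on 128 CPU cores vs 45.9 s on one GH200
symanzik_equiv = cpu_cores * 3993 / 45.9

print(round(wilson_equiv), round(symanzik_equiv))  # 10125 11135
```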

0.04 fm ($144^3 \times 288$):

This is now a much larger lattice, so even the GPU case cannot escape the need for inter-node communication. GPU flow benchmarks on Perlmutter are compared to Perlmutter CPU production runs done in 2022. In 2022, 720 steps of Wilson flow performed on 64 CPU nodes took 2.97 hours, whereas today the same performed on 32 GPU nodes takes 10.6 minutes. In this comparison, 1 A100 $\sim$ 1080 CPU cores. For Symanzik flow, the CPU running was done on 128 nodes and 720 flow steps took 4.69 hours for one particular config, but only takes 15.2 minutes when run on 32 GPU nodes. In this comparison, 1 A100 $\sim$ 2360 CPU cores.
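The A100 equivalences can be reproduced from the node counts and timings, assuming the standard Perlmutter configuration of 128 cores per CPU node and 4 A100s per GPU node (an assumption not stated in the benchmark figures above); the computed values round to roughly the quoted ~1080 and ~2360:

```python
cores_per_cpu_node = 128  # assumed: Perlmutter CPU node (2x 64-core EPYC)
gpus_per_gpu_node = 4     # assumed: Perlmutter GPU node (4x A100)

# Wilson flow: 64 CPU nodes, 2.97 h vs 32 GPU nodes, 10.6 min
wilson = (64 * cores_per_cpu_node) * (2.97 * 60 / 10.6) / (32 * gpus_per_gpu_node)

# Symanzik flow: 128 CPU nodes, 4.69 h vs 32 GPU nodes, 15.2 min
symanzik = (128 * cores_per_cpu_node) * (4.69 * 60 / 15.2) / (32 * gpus_per_gpu_node)

print(round(wilson), round(symanzik))  # 1076 2370
```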

@leonhostetler leonhostetler requested a review from bazalesha April 22, 2025 12:48