
Replace nested static_for lambdas with compile-time search helper #3600

Closed

tenpercent wants to merge 6 commits into develop from mpodkory/find-transform-optimization
Conversation


@tenpercent tenpercent commented Jan 16, 2026

Summary

  • Add find_in_tuple_of_sequences compile-time search helper with O(1) template depth
  • Replace nested static_for lambdas in TensorDescriptor::GetTransformAndItsUpperDimension
  • Replace generate_tuple lambda in TensorDescriptor::InitializeElementSize with pack expansion
  • Apply same optimizations to TensorAdaptor

Motivation

The TensorDescriptor and TensorAdaptor classes suffered from excessive template instantiation due to:

  1. Nested static_for loops with lambdas (918 applier::operator() instantiations)
  2. generate_tuple with lambdas (78+ instantiations per class)

Why It Works

Each lambda creates a unique closure type, causing separate instantiations at every call site. The find_in_tuple_of_sequences helper uses O(1) template depth via pack expansion instead of O(N) nested static_for recursion, and named functors share a single type across all uses.
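As a rough illustration of the pack-expansion idea: the sketch below searches a tuple of integer sequences for a value in constant template depth. All names here (`seq`, `find_in_seq`, `find_in_tuple_of_seqs`) are invented for the sketch and do not match the actual `find_in_tuple_of_sequences` signature in the library.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

template <std::size_t... Is>
struct seq
{
};

template <std::size_t... Is>
constexpr std::size_t seq_size(seq<Is...>)
{
    return sizeof...(Is);
}

// Scan one sequence with a constexpr loop: a single instantiation per
// (Value, sequence) pair, constant template depth regardless of length.
template <std::size_t Value, std::size_t... Is>
constexpr std::size_t find_in_seq(seq<Is...>)
{
    constexpr std::size_t vals[] = {Is...};
    for(std::size_t i = 0; i < sizeof...(Is); ++i)
        if(vals[i] == Value)
            return i;
    return sizeof...(Is); // one-past-the-end means "not found"
}

// Search a tuple of sequences: pack expansion evaluates every per-sequence
// scan into one constexpr array, so no nested loops or closures are needed.
// Returns {sequence index, position within that sequence}.
template <std::size_t Value, typename... Seqs>
constexpr std::pair<std::size_t, std::size_t> find_in_tuple_of_seqs(std::tuple<Seqs...>)
{
    constexpr std::size_t pos[]   = {find_in_seq<Value>(Seqs{})...};
    constexpr std::size_t sizes[] = {seq_size(Seqs{})...};
    for(std::size_t s = 0; s < sizeof...(Seqs); ++s)
        if(pos[s] < sizes[s])
            return {s, pos[s]};
    return {sizeof...(Seqs), 0}; // no sequence contains Value
}

// 5 lives in the second sequence at position 0.
static_assert(find_in_tuple_of_seqs<5>(std::tuple<seq<1, 2>, seq<5, 7>>{}) ==
                  std::make_pair(std::size_t{1}, std::size_t{0}),
              "");
```

Because the searched types are plain templates rather than closures, every call site with the same sequences reuses the same instantiations.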

Results (example_grouped_conv_fwd_xdl_fp16)

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Template instantiation time | 23.4s | 19.1s | 18% reduction |
| applier instantiations | 1132 | 127 | 89% reduction |
| generate_tuple lambdas | 178 | 96 | 46% reduction |

Test Plan

  • Added 11 unit tests:
    • 5 tests for sequence_find_value
    • 6 tests for find_in_tuple_of_sequences
  • Waiting for full CI

PR Stack

This PR is part of the build time optimization effort (issue #3575). All PRs now target develop independently:

| # | PR | Description | Status |
| --- | --- | --- | --- |
| 1 | #3585 | sequence_gen with __make_integer_seq | Independent |
| 2 | #3628 | generate_identity_sequences + named functors | New (replaces #3588, #3589) |
| 3 | #3590 | container_concat optimization | Independent |
| 4 | #3596 | O(1) pack expansion rewrites | Independent |
| 5 | #3600 | TensorDescriptor/TensorAdaptor lambda elimination | This PR |

Tracking issue: #3575

@tenpercent tenpercent marked this pull request as ready for review January 17, 2026 03:41
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from f5ada17 to 9942fd6 on January 17, 2026 03:51
@tenpercent tenpercent force-pushed the mpodkory/find-transform-optimization branch 2 times, most recently from e6040e1 to a565d87 on January 17, 2026 05:39
@shumway shumway linked an issue Jan 19, 2026 that may be closed by this pull request
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from c4d95f7 to 631df4f on January 21, 2026 23:57
@tenpercent tenpercent force-pushed the mpodkory/find-transform-optimization branch from 1d351ad to ec8e794 on January 21, 2026 23:57
@tenpercent tenpercent changed the base branch from mpodkory/recursive-to-pack-expansion to develop January 22, 2026 01:05
The GetTransformAndItsUpperDimension function used nested static_for
loops with lambdas to search for a hidden dimension in UpperDimensionIdss.
This caused 918 applier::operator() instantiations (81% of all applier
instantiations).

Replace with find_in_tuple_of_sequences helper that uses constexpr
array lookup and if-constexpr recursion, eliminating the lambda
instantiation overhead.

Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)
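
The contrast between the replaced recursion and a constexpr scan can be sketched as follows. The commit describes the real helper as using constexpr array lookup plus if-constexpr recursion; this standalone sketch uses a plain constexpr loop to the same effect, and `sequence_find_value_sketch` is an invented name rather than the library's `sequence_find_value`.

```cpp
#include <cstddef>

template <std::size_t... Is>
struct seq
{
};

// Recursive search (the pattern being replaced): depth and instantiation
// count grow with the index of the match, one specialization per tail.
template <std::size_t V>
constexpr std::size_t find_recursive(seq<>)
{
    return 0; // not found: result equals the sequence length
}

template <std::size_t V, std::size_t Head, std::size_t... Tail>
constexpr std::size_t find_recursive(seq<Head, Tail...>)
{
    return Head == V ? 0 : 1 + find_recursive<V>(seq<Tail...>{});
}

// Constexpr-loop search: the whole scan happens inside one instantiation,
// so template depth stays O(1) no matter how long the sequence is.
template <std::size_t V, std::size_t... Is>
constexpr std::size_t sequence_find_value_sketch(seq<Is...>)
{
    constexpr std::size_t vals[] = {Is...};
    for(std::size_t i = 0; i < sizeof...(Is); ++i)
        if(vals[i] == V)
            return i;
    return sizeof...(Is);
}

// Both agree on the answer; only the instantiation cost differs.
static_assert(find_recursive<9>(seq<5, 7, 9>{}) == 2, "");
static_assert(sequence_find_value_sketch<9>(seq<5, 7, 9>{}) == 2, "");
```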
…tSize

The InitializeElementSize function used generate_tuple with a lambda to
compute visible dimension lengths. Each TensorDescriptor type created
a unique lambda type, causing 78 instantiations (385ms).

Replace with direct pack expansion using helper functions, eliminating
the lambda instantiation overhead entirely.

Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s
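
A minimal sketch of the lambda-to-pack-expansion rewrite, assuming a hypothetical `ToyDesc` with a per-dimension `GetLength<I>()` query; the real `InitializeElementSize` and `generate_tuple` signatures are not reproduced here.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Toy stand-in for TensorDescriptor; GetLength<I>() is a placeholder for
// whatever per-dimension query the real class exposes.
struct ToyDesc
{
    template <std::size_t I>
    constexpr std::size_t GetLength() const
    {
        return 10 + I;
    }
};

// Before (sketch): generate_tuple([&](auto i){ return desc.GetLength(i); }, ...)
// creates a unique closure type per enclosing class instantiation.
// After: expand an index pack directly into a shared free helper, so all
// descriptor types reuse the same two templates.
template <typename Desc, std::size_t... Is>
constexpr auto make_lengths_impl(const Desc& desc, std::index_sequence<Is...>)
{
    return std::make_tuple(desc.template GetLength<Is>()...);
}

template <std::size_t N, typename Desc>
constexpr auto make_lengths(const Desc& desc)
{
    return make_lengths_impl(desc, std::make_index_sequence<N>{});
}
```

Usage: `make_lengths<3>(ToyDesc{})` yields a tuple of the first three lengths, with no closure type minted anywhere.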
TensorAdaptor contains the same InitializeElementSize and
GetTransformAndItsUpperDimension patterns as TensorDescriptor.
Apply the same optimizations:
- Replace nested static_for lambdas with find_in_tuple_of_sequences
- Replace generate_tuple lambda with pack expansion

Results: generate_tuple lambdas 100 -> 96 (4 events, 17ms eliminated)
@tenpercent tenpercent force-pushed the mpodkory/find-transform-optimization branch from ec8e794 to 83a76d7 on January 22, 2026 01:13
Detailed comments explain:
- sequence_find_value: Constexpr loop with O(1) template depth vs O(N) recursive
- find_in_tuple_of_sequences: Pack expansion instead of nested static_for loops
- Why constexpr search reduces template instantiations dramatically
- When to apply constexpr search patterns for compile-time operations
- Implementation details for each optimization approach

This documentation helps maintainers understand the compile-time search optimization
strategy without relying on specific benchmark numbers that may vary by use case.

ammallya commented Feb 3, 2026

Imported to ROCm/rocm-libraries

@ammallya ammallya closed this Feb 3, 2026
