
Replace nested static_for lambdas with compile-time search helper #3600

Closed

tenpercent wants to merge 6 commits into develop from mpodkory/find-transform-optimization
Conversation


@tenpercent tenpercent commented Jan 16, 2026

Summary

  • Add find_in_tuple_of_sequences compile-time search helper with O(1) template depth
  • Replace nested static_for lambdas in TensorDescriptor::GetTransformAndItsUpperDimension
  • Replace generate_tuple lambda in TensorDescriptor::InitializeElementSize with pack expansion
  • Apply same optimizations to TensorAdaptor

Motivation

The TensorDescriptor and TensorAdaptor classes suffered from excessive template instantiation due to:

  1. Nested static_for loops with lambdas (918 applier::operator() instantiations)
  2. generate_tuple with lambdas (78+ instantiations per class)

Why It Works

Each lambda creates a unique closure type, causing separate instantiations at every call site. The find_in_tuple_of_sequences helper uses O(1) template depth via pack expansion instead of O(N) nested static_for recursion, and named functors share a single type across all uses.
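As a rough illustration of the pack-expansion idea: the sketch below searches a tuple of integer sequences for a value in constant template depth. All names here (`seq`, `find_in_seq`, `find_in_tuple_of_seqs`) are invented for the sketch and do not match the actual `find_in_tuple_of_sequences` signature in the library.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

template <std::size_t... Is>
struct seq
{
};

template <std::size_t... Is>
constexpr std::size_t seq_size(seq<Is...>)
{
    return sizeof...(Is);
}

// Scan one sequence with a constexpr loop: a single instantiation per
// (Value, sequence) pair, constant template depth regardless of length.
template <std::size_t Value, std::size_t... Is>
constexpr std::size_t find_in_seq(seq<Is...>)
{
    constexpr std::size_t vals[] = {Is...};
    for(std::size_t i = 0; i < sizeof...(Is); ++i)
        if(vals[i] == Value)
            return i;
    return sizeof...(Is); // one-past-the-end means "not found"
}

// Search a tuple of sequences: pack expansion evaluates every per-sequence
// scan into one constexpr array, so no nested loops or closures are needed.
// Returns {sequence index, position within that sequence}.
template <std::size_t Value, typename... Seqs>
constexpr std::pair<std::size_t, std::size_t> find_in_tuple_of_seqs(std::tuple<Seqs...>)
{
    constexpr std::size_t pos[]   = {find_in_seq<Value>(Seqs{})...};
    constexpr std::size_t sizes[] = {seq_size(Seqs{})...};
    for(std::size_t s = 0; s < sizeof...(Seqs); ++s)
        if(pos[s] < sizes[s])
            return {s, pos[s]};
    return {sizeof...(Seqs), 0}; // no sequence contains Value
}

// 5 lives in the second sequence at position 0.
static_assert(find_in_tuple_of_seqs<5>(std::tuple<seq<1, 2>, seq<5, 7>>{}) ==
                  std::make_pair(std::size_t{1}, std::size_t{0}),
              "");
```

Because the searched types are plain templates rather than closures, every call site with the same sequences reuses the same instantiations.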

Results (example_grouped_conv_fwd_xdl_fp16)

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Template instantiation time | 23.4s | 19.1s | 18% reduction |
| applier instantiations | 1132 | 127 | 89% reduction |
| generate_tuple lambdas | 178 | 96 | 46% reduction |

Test Plan

  • Added 11 unit tests:
    • 5 tests for sequence_find_value
    • 6 tests for find_in_tuple_of_sequences
  • Waiting for full CI

PR Stack

This PR is part of the build time optimization effort (issue #3575). All PRs now target develop independently:

| # | PR | Description | Status |
| --- | --- | --- | --- |
| 1 | #3585 | sequence_gen with __make_integer_seq | Independent |
| 2 | #3628 | generate_identity_sequences + named functors | New (replaces #3588, #3589) |
| 3 | #3590 | container_concat optimization | Independent |
| 4 | #3596 | O(1) pack expansion rewrites | Independent |
| 5 | #3600 | TensorDescriptor/TensorAdaptor lambda elimination | This PR |

Tracking issue: #3575

@tenpercent tenpercent marked this pull request as ready for review January 17, 2026 03:41
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from f5ada17 to 9942fd6 on January 17, 2026 03:51
@tenpercent tenpercent force-pushed the mpodkory/find-transform-optimization branch 2 times, most recently from e6040e1 to a565d87 on January 17, 2026 05:39
@shumway shumway linked an issue Jan 19, 2026 that may be closed by this pull request
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from c4d95f7 to 631df4f on January 21, 2026 23:57
@tenpercent tenpercent force-pushed the mpodkory/find-transform-optimization branch from 1d351ad to ec8e794 on January 21, 2026 23:57
@tenpercent tenpercent changed the base branch from mpodkory/recursive-to-pack-expansion to develop January 22, 2026 01:05
The GetTransformAndItsUpperDimension function used nested static_for
loops with lambdas to search for a hidden dimension in UpperDimensionIdss.
This caused 918 applier::operator() instantiations (81% of all applier
instantiations).

Replace with find_in_tuple_of_sequences helper that uses constexpr
array lookup and if-constexpr recursion, eliminating the lambda
instantiation overhead.

Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)
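
The contrast between the replaced recursion and a constexpr scan can be sketched as follows. The commit describes the real helper as using constexpr array lookup plus if-constexpr recursion; this standalone sketch uses a plain constexpr loop to the same effect, and `sequence_find_value_sketch` is an invented name rather than the library's `sequence_find_value`.

```cpp
#include <cstddef>

template <std::size_t... Is>
struct seq
{
};

// Recursive search (the pattern being replaced): depth and instantiation
// count grow with the index of the match, one specialization per tail.
template <std::size_t V>
constexpr std::size_t find_recursive(seq<>)
{
    return 0; // not found: result equals the sequence length
}

template <std::size_t V, std::size_t Head, std::size_t... Tail>
constexpr std::size_t find_recursive(seq<Head, Tail...>)
{
    return Head == V ? 0 : 1 + find_recursive<V>(seq<Tail...>{});
}

// Constexpr-loop search: the whole scan happens inside one instantiation,
// so template depth stays O(1) no matter how long the sequence is.
template <std::size_t V, std::size_t... Is>
constexpr std::size_t sequence_find_value_sketch(seq<Is...>)
{
    constexpr std::size_t vals[] = {Is...};
    for(std::size_t i = 0; i < sizeof...(Is); ++i)
        if(vals[i] == V)
            return i;
    return sizeof...(Is);
}

// Both agree on the answer; only the instantiation cost differs.
static_assert(find_recursive<9>(seq<5, 7, 9>{}) == 2, "");
static_assert(sequence_find_value_sketch<9>(seq<5, 7, 9>{}) == 2, "");
```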
…tSize

The InitializeElementSize function used generate_tuple with a lambda to
compute visible dimension lengths. Each TensorDescriptor type created
a unique lambda type, causing 78 instantiations (385ms).

Replace with direct pack expansion using helper functions, eliminating
the lambda instantiation overhead entirely.

Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s
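
A minimal sketch of the lambda-to-pack-expansion rewrite, assuming a hypothetical `ToyDesc` with a per-dimension `GetLength<I>()` query; the real `InitializeElementSize` and `generate_tuple` signatures are not reproduced here.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Toy stand-in for TensorDescriptor; GetLength<I>() is a placeholder for
// whatever per-dimension query the real class exposes.
struct ToyDesc
{
    template <std::size_t I>
    constexpr std::size_t GetLength() const
    {
        return 10 + I;
    }
};

// Before (sketch): generate_tuple([&](auto i){ return desc.GetLength(i); }, ...)
// creates a unique closure type per enclosing class instantiation.
// After: expand an index pack directly into a shared free helper, so all
// descriptor types reuse the same two templates.
template <typename Desc, std::size_t... Is>
constexpr auto make_lengths_impl(const Desc& desc, std::index_sequence<Is...>)
{
    return std::make_tuple(desc.template GetLength<Is>()...);
}

template <std::size_t N, typename Desc>
constexpr auto make_lengths(const Desc& desc)
{
    return make_lengths_impl(desc, std::make_index_sequence<N>{});
}
```

Usage: `make_lengths<3>(ToyDesc{})` yields a tuple of the first three lengths, with no closure type minted anywhere.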
TensorAdaptor contains the same InitializeElementSize and
GetTransformAndItsUpperDimension patterns as TensorDescriptor.
Apply the same optimizations:
- Replace nested static_for lambdas with find_in_tuple_of_sequences
- Replace generate_tuple lambda with pack expansion

Results: generate_tuple lambdas 100 -> 96 (4 events, 17ms eliminated)
@tenpercent tenpercent force-pushed the mpodkory/find-transform-optimization branch from ec8e794 to 83a76d7 on January 22, 2026 01:13
Detailed comments explain:
- sequence_find_value: Constexpr loop with O(1) template depth vs O(N) recursive
- find_in_tuple_of_sequences: Pack expansion instead of nested static_for loops
- Why constexpr search reduces template instantiations dramatically
- When to apply constexpr search patterns for compile-time operations
- Implementation details for each optimization approach

This documentation helps maintainers understand the compile-time search optimization
strategy without relying on specific benchmark numbers that may vary by use case.

ammallya commented Feb 3, 2026

Imported to ROCm/rocm-libraries

@ammallya ammallya closed this Feb 3, 2026
