Skip to content

[CK_TILE] Add pooling to ckTileEngine#3416

Open
aledudek wants to merge 9 commits intodevelopfrom
ckTileEnginePooling
Open

[CK_TILE] Add pooling to ckTileEngine#3416
aledudek wants to merge 9 commits intodevelopfrom
ckTileEnginePooling

Conversation

@aledudek
Copy link
Contributor

Proposed changes

Add pooling to ckTileEngine

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

@aledudek aledudek force-pushed the ckTileEnginePooling branch from b7bbf03 to 1bccd37 Compare December 16, 2025 10:40
@aosewski aosewski requested a review from Copilot December 18, 2025 10:43
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds comprehensive pooling operations support to the ckTileEngine, implementing 2D and 3D pooling with configurable reduction operations (max, min, average). The implementation follows the existing GEMM module architecture with individual kernel compilation for improved build parallelism and testing capabilities.

Key Changes:

  • Added pooling kernel generation system with Python-based instance builder
  • Implemented benchmark infrastructure for pooling operations
  • Created configuration system for kernel parameters and GPU target selection

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tile_engine/ops/pooling/pool_profiler.hpp Profiler class for benchmarking pooling kernels with verification and performance metrics
tile_engine/ops/pooling/pool_instance_builder.py Python script to generate C++ kernel instances from JSON configurations
tile_engine/ops/pooling/pool_common.hpp Common type traits and data type definitions for pooling operations
tile_engine/ops/pooling/pool_benchmark_single.cpp Single kernel benchmark executable with argument parsing and verification
tile_engine/ops/pooling/pool_benchmark.py Python orchestration script for running and analyzing benchmark sweeps
tile_engine/ops/pooling/pool_benchmark.hpp Core data structures and enums for pooling benchmarks
tile_engine/ops/pooling/configs/default_config.json Default configuration for block sizes, warp configs, and traits
tile_engine/ops/pooling/README.md Comprehensive documentation for building, running, and configuring pooling benchmarks
tile_engine/ops/pooling/CMakeLists.txt CMake build system for generating and compiling individual pooling kernels
tile_engine/ops/gemm/README.md Added missing gpu_target argument to example command
tile_engine/ops/commons/test_benchmark.sh Extended test script to support both GEMM and pooling benchmarks
tile_engine/ops/CMakeLists.txt Added pooling subdirectory to build system

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

"max": "ck_tile::ReduceOp::Max",
"min": "ck_tile::ReduceOp::Min",
"add": "ck_tile::ReduceOp::Add",
"avg": "ck_tile::ReduceOp::Add", # Average uses Add and divides later
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states 'Average uses Add and divides later' but doesn't explain where the division occurs. Consider clarifying that the division is performed elsewhere in the kernel or reference implementation to avoid confusion.

Suggested change
"avg": "ck_tile::ReduceOp::Add", # Average uses Add and divides later
"avg": "ck_tile::ReduceOp::Add", # Average pooling uses Add here; division by the element count is performed elsewhere in the pooling kernel/reference implementation.

Copilot uses AI. Check for mistakes.
input_right_pads};

// Run reference if verification is enabled
// (Reference computation would be added here based on pool dimension)
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete implementation: reference computation is mentioned in a comment but not implemented. If verification is supported via setting_.verify_, the reference computation code should be present. Either implement it or document why it's deferred.

Suggested change
// (Reference computation would be added here based on pool dimension)
//
// Note:
// A host-side reference implementation is not invoked here. Although
// `ck_tile/host/reference/reference_pool.hpp` is available, this
// profiler operates directly on device buffers with generic
// TensorShape/WindowShape types, and a matching reference path has
// not yet been wired up for this configuration. Verification of
// pooling results is therefore expected to be performed by the
// caller or by a higher-level test harness.
//
// To make this limitation visible, emit a one-time warning if
// verification was requested for this profiler.
if(setting_.verify_)
{
static bool warned = false;
if(!warned)
{
warned = true;
std::cerr << "Warning: PoolProfiler verification requested, but no "
"host reference pooling implementation is invoked in "
"this configuration. Results are not being verified "
"by PoolProfiler."
<< std::endl;
}
}

Copilot uses AI. Check for mistakes.
"Sz": stride_z,
"Sy": stride_y,
"Sx": stride_x,
"pool_dim": pool_dim,
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter 'pool_dim' is included in the params dictionary but is not a valid command-line parameter for the benchmark executables according to pool_benchmark_single.cpp (which doesn't define a 'pool_dim' argument). This will be silently ignored or could cause errors. The pool dimension is determined by the kernel itself (POOL_DIM constant), not passed as an argument.

Suggested change
"pool_dim": pool_dim,

Copilot uses AI. Check for mistakes.
--reduce_op ${reduce_op}
--config_json ${config_json}
--gen_single
--kernel_name "pool${pool_dim}d_${datatype}_${reduce_op}_${trait_for_filename}_${block_config}"
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent naming: the kernel_name constructed here includes pool_dim in the name (e.g., 'pool3d_...'), but the trait_for_filename explicitly excludes pool_dim. This creates confusion about which components are part of the filename versus the kernel identifier.

Copilot uses AI. Check for mistakes.
print(f"Running: {' '.join(cmd)}")

try:
result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout value of 120 seconds is hardcoded. Consider making this configurable via a parameter or constant, especially for larger problem sizes or slower GPUs where benchmarks might legitimately take longer.

Copilot uses AI. Check for mistakes.
Comment on lines +49 to +50
# Replace [Arch] with your GPU architecture (e.g., gfx90a, gfx942)
../script/cmake-ck-dev.sh ../ [Arch] -DPOOL_DATATYPE="fp16;fp32" -DPOOL_REDUCE_OP="max;avg"
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The placeholder '[Arch]' should specify valid architecture examples in the comment on line 49, but those examples (e.g., gfx90a, gfx942) are only mentioned in the comment and not repeated in the actual command example. Consider showing a concrete example like 'gfx942' in place of '[Arch]' or making the placeholder more explicit.

Suggested change
# Replace [Arch] with your GPU architecture (e.g., gfx90a, gfx942)
../script/cmake-ck-dev.sh ../ [Arch] -DPOOL_DATATYPE="fp16;fp32" -DPOOL_REDUCE_OP="max;avg"
# Replace gfx942 with your GPU architecture (e.g., gfx90a, gfx942)
../script/cmake-ck-dev.sh ../ gfx942 -DPOOL_DATATYPE="fp16;fp32" -DPOOL_REDUCE_OP="max;avg"

Copilot uses AI. Check for mistakes.
--config_json configs/user_provided_config.json \
--gen_all_individual
--gen_all_individual \
--gpu_target gfx942
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gemm TileEngine target is just gfx942?
And this change is related anyway to the adding pooling to TileEngine?

Comment on lines +6 to +7
"block_n": {
"values": [1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If block_n is always 1 for pooling, does it make sense to hard code it in instance_builder or can theoretically select other values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a general question about the instance builder. In the GEMM tile engine, there’s a gemm_validation_utils.py module that is used by gemm_instance_builder to generate only valid instances that can later be compiled. However, it seems that for pooling, there isn’t a similar validation step. and, some checks are performed directly within pool_instance_builder. Would it make sense to adopt a similar structure for these tile engine operations to keep things same?

@ammallya
Copy link
Contributor

ammallya commented Feb 3, 2026

Error importing due to merge conflicts – please reopen the PR on ROCm/rocm-libraries

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants