Conversation
b7bbf03 to
1bccd37
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds comprehensive pooling operations support to the ckTileEngine, implementing 2D and 3D pooling with configurable reduction operations (max, min, average). The implementation follows the existing GEMM module architecture with individual kernel compilation for improved build parallelism and testing capabilities.
Key Changes:
- Added pooling kernel generation system with Python-based instance builder
- Implemented benchmark infrastructure for pooling operations
- Created configuration system for kernel parameters and GPU target selection
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tile_engine/ops/pooling/pool_profiler.hpp | Profiler class for benchmarking pooling kernels with verification and performance metrics |
| tile_engine/ops/pooling/pool_instance_builder.py | Python script to generate C++ kernel instances from JSON configurations |
| tile_engine/ops/pooling/pool_common.hpp | Common type traits and data type definitions for pooling operations |
| tile_engine/ops/pooling/pool_benchmark_single.cpp | Single kernel benchmark executable with argument parsing and verification |
| tile_engine/ops/pooling/pool_benchmark.py | Python orchestration script for running and analyzing benchmark sweeps |
| tile_engine/ops/pooling/pool_benchmark.hpp | Core data structures and enums for pooling benchmarks |
| tile_engine/ops/pooling/configs/default_config.json | Default configuration for block sizes, warp configs, and traits |
| tile_engine/ops/pooling/README.md | Comprehensive documentation for building, running, and configuring pooling benchmarks |
| tile_engine/ops/pooling/CMakeLists.txt | CMake build system for generating and compiling individual pooling kernels |
| tile_engine/ops/gemm/README.md | Added missing gpu_target argument to example command |
| tile_engine/ops/commons/test_benchmark.sh | Extended test script to support both GEMM and pooling benchmarks |
| tile_engine/ops/CMakeLists.txt | Added pooling subdirectory to build system |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "max": "ck_tile::ReduceOp::Max", | ||
| "min": "ck_tile::ReduceOp::Min", | ||
| "add": "ck_tile::ReduceOp::Add", | ||
| "avg": "ck_tile::ReduceOp::Add", # Average uses Add and divides later |
There was a problem hiding this comment.
The comment states 'Average uses Add and divides later' but doesn't explain where the division occurs. Consider clarifying that the division is performed elsewhere in the kernel or reference implementation to avoid confusion.
| "avg": "ck_tile::ReduceOp::Add", # Average uses Add and divides later | |
| "avg": "ck_tile::ReduceOp::Add", # Average pooling uses Add here; division by the element count is performed elsewhere in the pooling kernel/reference implementation. |
| input_right_pads}; | ||
|
|
||
| // Run reference if verification is enabled | ||
| // (Reference computation would be added here based on pool dimension) |
There was a problem hiding this comment.
Incomplete implementation: reference computation is mentioned in a comment but not implemented. If verification is supported via setting_.verify_, the reference computation code should be present. Either implement it or document why it's deferred.
| // (Reference computation would be added here based on pool dimension) | |
| // | |
| // Note: | |
| // A host-side reference implementation is not invoked here. Although | |
| // `ck_tile/host/reference/reference_pool.hpp` is available, this | |
| // profiler operates directly on device buffers with generic | |
| // TensorShape/WindowShape types, and a matching reference path has | |
| // not yet been wired up for this configuration. Verification of | |
| // pooling results is therefore expected to be performed by the | |
| // caller or by a higher-level test harness. | |
| // | |
| // To make this limitation visible, emit a one-time warning if | |
| // verification was requested for this profiler. | |
| if(setting_.verify_) | |
| { | |
| static bool warned = false; | |
| if(!warned) | |
| { | |
| warned = true; | |
| std::cerr << "Warning: PoolProfiler verification requested, but no " | |
| "host reference pooling implementation is invoked in " | |
| "this configuration. Results are not being verified " | |
| "by PoolProfiler." | |
| << std::endl; | |
| } | |
| } |
| "Sz": stride_z, | ||
| "Sy": stride_y, | ||
| "Sx": stride_x, | ||
| "pool_dim": pool_dim, |
There was a problem hiding this comment.
The parameter 'pool_dim' is included in the params dictionary but is not a valid command-line parameter for the benchmark executables according to pool_benchmark_single.cpp (which doesn't define a 'pool_dim' argument). This will be silently ignored or could cause errors. The pool dimension is determined by the kernel itself (POOL_DIM constant), not passed as an argument.
| "pool_dim": pool_dim, |
| --reduce_op ${reduce_op} | ||
| --config_json ${config_json} | ||
| --gen_single | ||
| --kernel_name "pool${pool_dim}d_${datatype}_${reduce_op}_${trait_for_filename}_${block_config}" |
There was a problem hiding this comment.
Inconsistent naming: the kernel_name constructed here includes pool_dim in the name (e.g., 'pool3d_...'), but the trait_for_filename explicitly excludes pool_dim. This creates confusion about which components are part of the filename versus the kernel identifier.
| print(f"Running: {' '.join(cmd)}") | ||
|
|
||
| try: | ||
| result = subprocess.run(cmd, capture_output=True, text=True, timeout=120) |
There was a problem hiding this comment.
The timeout value of 120 seconds is hardcoded. Consider making this configurable via a parameter or constant, especially for larger problem sizes or slower GPUs where benchmarks might legitimately take longer.
| # Replace [Arch] with your GPU architecture (e.g., gfx90a, gfx942) | ||
| ../script/cmake-ck-dev.sh ../ [Arch] -DPOOL_DATATYPE="fp16;fp32" -DPOOL_REDUCE_OP="max;avg" |
There was a problem hiding this comment.
The placeholder '[Arch]' should specify valid architecture examples in the comment on line 49, but those examples (e.g., gfx90a, gfx942) are only mentioned in the comment and not repeated in the actual command example. Consider showing a concrete example like 'gfx942' in place of '[Arch]' or making the placeholder more explicit.
| # Replace [Arch] with your GPU architecture (e.g., gfx90a, gfx942) | |
| ../script/cmake-ck-dev.sh ../ [Arch] -DPOOL_DATATYPE="fp16;fp32" -DPOOL_REDUCE_OP="max;avg" | |
| # Replace gfx942 with your GPU architecture (e.g., gfx90a, gfx942) | |
| ../script/cmake-ck-dev.sh ../ gfx942 -DPOOL_DATATYPE="fp16;fp32" -DPOOL_REDUCE_OP="max;avg" |
| --config_json configs/user_provided_config.json \ | ||
| --gen_all_individual | ||
| --gen_all_individual \ | ||
| --gpu_target gfx942 |
There was a problem hiding this comment.
Gemm TileEngine target is just gfx942?
And this change is related anyway to the adding pooling to TileEngine?
| "block_n": { | ||
| "values": [1] |
There was a problem hiding this comment.
If block_n is always 1 for pooling, does it make sense to hard code it in instance_builder or can theoretically select other values?
There was a problem hiding this comment.
I have a general question about the instance builder. In the GEMM tile engine, there’s a gemm_validation_utils.py module that is used by gemm_instance_builder to generate only valid instances that can later be compiled. However, it seems that for pooling, there isn’t a similar validation step. and, some checks are performed directly within pool_instance_builder. Would it make sense to adopt a similar structure for these tile engine operations to keep things same?
|
Error importing due to merge conflicts – please reopen the PR on ROCm/rocm-libraries |
Proposed changes
Add pooling to ckTileEngine
Checklist
Please put an
xinto the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.clang-formaton all changed files