[Feature Request] Improve layout when small fragments cause excessive thread replication

### Required prerequisites

- [x] I have searched the [Issue Tracker](https://github.com/tile-ai/tilelang/issues) that this hasn't already been reported. (comment there if it has.)

### Motivation

TileLang code:
```python
import tilelang
import tilelang.language as T


@tilelang.jit
def get_qwq():
    @T.prim_func
    def main(
        A: T.Tensor[(2, 2560), T.float32],
        B: T.Tensor[(2, 2560), T.float32],
        C: T.Tensor[(2, ), T.float32]
    ):
        with T.Kernel(1, threads=256):
            tx = T.get_thread_binding(0)

            C_local = T.alloc_fragment((2, ), T.float32)
            T.copy(C, C_local)

            for i, j in T.Parallel(2, 2560):
                if C_local[i] >= 0:
                    B[i, j] = A[i, j]

    return main


kernel = get_qwq()
print(kernel.get_kernel_source())
```

generated code:
```cuda
#include <tl_templates/cuda/gemm.h>
#include <tl_templates/cuda/copy.h>
#include <tl_templates/cuda/reduce.h>
#include <tl_templates/cuda/ldsm.h>
#include <tl_templates/cuda/threadblock_swizzle.h>
#include <tl_templates/cuda/debug.h>
#ifdef ENABLE_BF16
#include <tl_templates/cuda/cuda_bf16_fallbacks.cuh>
#endif

extern "C" __global__ void main_kernel(const float* __restrict__ A, float* __restrict__ B, const float* __restrict__ C);
extern "C" __global__ void __launch_bounds__(256, 1) main_kernel(const float* __restrict__ A, float* __restrict__ B, const float* __restrict__ C) {
  float C_local[1];
  C_local[0] = C[(((int)threadIdx.x) & 1)];
  if ((((int)threadIdx.x) >> 1) == 0) {
    #pragma unroll
    for (int i = 0; i < 640; ++i) {
      float4 v_ = *(float4*)(A + (((((int)threadIdx.x) & 1) * 2560) + (i * 4)));
      tl::store_global_128_conditional((&(B[(((((int)threadIdx.x) & 1) * 2560) + (i * 4))])), (*(uint4 *)(&(v_))), (0x0p+0f/*0.000000e+00*/ <= C_local[0]));
    }
  }
}
```

In the generated kernel, only threads 0 and 1 perform the global memory read/write, leaving most threads idle. This indicates a suboptimal layout and thread mapping, and there should be room to improve the layout.

### Solution

_No response_

### Alternatives

_No response_

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] Improve layout when small fragments cause excessive thread replication #1729

Required prerequisites

Motivation

Solution

Alternatives

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Feature Request] Improve layout when small fragments cause excessive thread replication #1729

Description

Required prerequisites

Motivation

Solution

Alternatives

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions