Checklist
Describe the bug
Description
Hi, thanks for the great work on DFlash — the block-wise parallel training design is very elegant and well implemented.
While integrating DFlash into our training pipeline, we noticed that the current implementation of create_dflash_block_mask is tightly coupled with Flex Attention (torch.nn.attention.flex_attention) and relies on create_block_mask for constructing the sparse block mask.
At the moment, there does not appear to be an alternative implementation for more general attention backends such as:
- torch.nn.functional.scaled_dot_product_attention (SDPA)
- Eager attention (manual masked matmul + softmax)
- FlashAttention (when flex is unavailable)
Problem
Flex Attention is not supported on all hardware platforms. For example:
- Ascend (Huawei NPU) does not support Flex Attention.
- Some custom accelerators and older CUDA environments also lack support.
- Flex Attention is still relatively new and backend availability is limited.
Because DFlash currently depends exclusively on Flex Attention for block mask construction, this prevents the model from being used on such platforms, even though the masking logic itself is backend-agnostic.
Suggestion
It would be very helpful if the project could provide:
1. A backend-agnostic attention option
For example, allowing:
attention_backend = "flex" | "sdpa" | "eager"
2. A dense SDPA-compatible block mask implementation
Something like:
create_dflash_sdpa_mask(...)
which returns a dense boolean attention mask usable as SDPA's attn_mask argument.
This would:
- Improve portability
- Enable DFlash training on Ascend and other non-CUDA platforms
- Make experimentation easier
- Remove hard dependency on Flex Attention
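To make the request concrete, here is a minimal sketch of what both suggestions could look like together. The function name create_dflash_sdpa_mask and the block-diagonal-causal masking rule are assumptions for illustration only; DFlash's real masking logic would replace the placeholder rule:

```python
import torch
import torch.nn.functional as F

def create_dflash_sdpa_mask(seq_len: int, block_size: int, device="cpu") -> torch.Tensor:
    """Hypothetical dense analogue of create_dflash_block_mask.

    The block-diagonal causal rule below is a placeholder, not DFlash's
    actual logic. Returns a [seq_len, seq_len] boolean mask (True = attend)
    that SDPA broadcasts over the batch and head dimensions.
    """
    idx = torch.arange(seq_len, device=device)
    same_block = idx.unsqueeze(0) // block_size == idx.unsqueeze(1) // block_size
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)  # query i attends to kv j iff j <= i
    return same_block & causal

def attention(q, k, v, mask, backend: str = "sdpa"):
    # q, k, v: [B, H, S, D]; mask: [S, S] bool, True = keep.
    if backend == "sdpa":
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    elif backend == "eager":
        # Manual masked matmul + softmax, for platforms without SDPA kernels.
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v
    raise ValueError(f"unknown backend: {backend}")
```

With this shape of API, a "flex" branch could keep the existing create_dflash_block_mask path, while "sdpa" and "eager" share the dense mask, so the same training code runs on Ascend and other platforms without Flex Attention.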
Thanks again for the excellent work!
Reproduction
/
Environment
/