Checklist
Describe the bug
Description
Hi, thanks for the great work on DFlash — the block-wise parallel training design is very elegant and well implemented.
While integrating DFlash into our training pipeline, we noticed that the current implementation of create_dflash_block_mask is tightly coupled with Flex Attention (torch.nn.attention.flex_attention) and relies on create_block_mask for constructing the sparse block mask.
At the moment, there does not appear to be an alternative implementation for more general attention backends such as:
- torch.nn.functional.scaled_dot_product_attention (SDPA)
- Eager attention (manual masked matmul + softmax)
- FlashAttention (when flex is unavailable)
Problem
Flex Attention is not supported on all hardware platforms. For example:
- Ascend (Huawei NPU) does not support Flex Attention.
- Some custom accelerators and older CUDA environments also lack support.
- Flex Attention is still relatively new and backend availability is limited.
Because DFlash currently depends exclusively on Flex Attention for block mask construction, this prevents the model from being used on such platforms, even though the masking logic itself is backend-agnostic.
Suggestion
It would be very helpful if the project could provide:
1. A backend-agnostic attention option
For example, allowing:
attention_backend = "flex" | "sdpa" | "eager"
2. A dense SDPA-compatible block mask implementation
Something like:
create_dflash_sdpa_mask(...)
which returns a dense boolean attention mask usable as SDPA's attn_mask argument.
This would:
- Improve portability
- Enable DFlash training on Ascend and other non-CUDA platforms
- Make experimentation easier
- Remove hard dependency on Flex Attention
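To make the request concrete, here is a minimal sketch of what both suggestions could look like together. The function name create_dflash_sdpa_mask and the block-diagonal-causal masking rule are assumptions for illustration only; DFlash's real masking logic would replace the placeholder rule:

```python
import torch
import torch.nn.functional as F

def create_dflash_sdpa_mask(seq_len: int, block_size: int, device="cpu") -> torch.Tensor:
    """Hypothetical dense analogue of create_dflash_block_mask.

    The block-diagonal causal rule below is a placeholder, not DFlash's
    actual logic. Returns a [seq_len, seq_len] boolean mask (True = attend)
    that SDPA broadcasts over the batch and head dimensions.
    """
    idx = torch.arange(seq_len, device=device)
    same_block = idx.unsqueeze(0) // block_size == idx.unsqueeze(1) // block_size
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)  # query i attends to kv j iff j <= i
    return same_block & causal

def attention(q, k, v, mask, backend: str = "sdpa"):
    # q, k, v: [B, H, S, D]; mask: [S, S] bool, True = keep.
    if backend == "sdpa":
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    elif backend == "eager":
        # Manual masked matmul + softmax, for platforms without SDPA kernels.
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v
    raise ValueError(f"unknown backend: {backend}")
```

With this shape of API, a "flex" branch could keep the existing create_dflash_block_mask path, while "sdpa" and "eager" share the dense mask, so the same training code runs on Ascend and other platforms without Flex Attention.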
Thanks again for the excellent work!
Reproduction
/
Environment
/