The max text sequence length for this model is 512. The image sequence length is usually between 1024 and 4096, depending on resolution. This means that a significant fraction of the processed tokens are text tokens - but due to Chroma's design, most of them are masked unless you have very long prompts.
Masked tokens can be removed:
# Number of unmasked tokens per prompt, then the longest in the batch
seq_lengths = bool_attention_mask.sum(dim=1)
max_seq_length = seq_lengths.max().item()
# Truncate both the embeddings and the mask to that length
text_encoder_output = text_encoder_output[:, :max_seq_length, :]
bool_attention_mask = bool_attention_mask[:, :max_seq_length]
(apply this after the mask has been expanded; otherwise max_seq_length is off by one)
Training and inference take about 25% less time at 512 px. The time saving shrinks at higher resolutions, where image tokens dominate the sequence.
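To make the truncation concrete, here is a self-contained sketch with dummy tensors. The shapes (batch of 2, padded text length 8, hidden size 4) and per-prompt token counts are made up for illustration; only the truncation logic mirrors the snippet above.

```python
import torch

# Hypothetical text encoder output: batch of 2 prompts, padded to length 8, hidden size 4
text_encoder_output = torch.randn(2, 8, 4)

# True = real (unmasked) token, False = padding.
# Prompt 0 has 3 unmasked tokens, prompt 1 has 5.
bool_attention_mask = torch.zeros(2, 8, dtype=torch.bool)
bool_attention_mask[0, :3] = True
bool_attention_mask[1, :5] = True

# Longest unmasked sequence in the batch decides how much padding can be dropped
seq_lengths = bool_attention_mask.sum(dim=1)   # tensor([3, 5])
max_seq_length = seq_lengths.max().item()      # 5

# Truncate embeddings and mask together so they stay aligned
text_encoder_output = text_encoder_output[:, :max_seq_length, :]
bool_attention_mask = bool_attention_mask[:, :max_seq_length]

print(text_encoder_output.shape)   # torch.Size([2, 5, 4])
print(bool_attention_mask.shape)   # torch.Size([2, 5])
```

Note that truncation is per batch, not per sample: prompt 0 still carries two padding positions, which its mask continues to flag, so attention results are unchanged.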