[Bug] --max_seq_len ignored for sliding window models with --use_custom_kv_cache --use_custom_sdpa (cache capped at sliding_window size) #218

@rhn19

Description

Environment

  • optimum-executorch: main
  • Model: google/gemma-3-1b-it
  • Python: 3.12
  • Ubuntu 24.04

Setup

git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
bash install_requirements.sh
popd

git clone https://github.com/huggingface/optimum-executorch.git
pushd optimum-executorch
python install_dev.py --skip_override_torch
popd

pip install triton

Export (succeeds)

optimum-cli export executorch \
    --model "google/gemma-3-1b-it" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --use_custom_sdpa \
    --use_custom_kv_cache \
    --qlinear 8da4w \
    --qembedding 8w \
    --max_seq_len 1024 \
    --dtype "float32" \
    --device "cpu" \
    --output_dir "gemma3_1b_export"

Reproduction (validate_export.py)

from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

model = ExecuTorchModelForCausalLM.from_pretrained("gemma3_1b_export")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Simulate a prompt exceeding 512 tokens
long_prompt = " ".join(["hello"] * 750)

generated_text = model.text_generation(
    tokenizer=tokenizer,
    prompt=long_prompt,
    max_seq_len=1024,
)
print(generated_text)

tokens = tokenizer.encode(long_prompt)
print("Number of tokens:", len(tokens))

python validate_export.py

Error

[tensor_impl.cpp:129] Attempted to resize a bounded tensor with a maximum capacity of 511 elements to 751 elements.
[method.cpp:1136] Error resizing tensor at input 0
Exception: Failed to execute method forward, error: 0x10.
MethodMeta(name='forward', num_inputs=2, input_tensor_meta=['TensorInfo(sizes=[1, 511], dtype=Long, ...)' 'TensorInfo(sizes=[511], dtype=Long, ...)'], num_outputs=1)
arg shapes: {'input_ids': torch.Size([1, 751]), 'cache_position': torch.Size([751])}
RuntimeError: Failed to execute method forward, error: 0x10

The export succeeds, and short prompts (< 512 tokens) work fine. The error only surfaces at runtime, once the prompt exceeds 512 tokens (the model's sliding_window size), despite --max_seq_len 1024.
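The arithmetic behind the cap can be checked directly. A minimal sketch (values taken from the error log and the gemma-3-1b-it config; pure arithmetic, not the exporter's actual code):

```python
# Hypothetical sanity check: relate the error log's numbers to the config.
max_seq_len = 1024     # requested via --max_seq_len
sliding_window = 512   # gemma-3-1b-it's sliding_window
prompt_tokens = 751    # 750 words plus BOS, as printed by the repro script

# The export bakes in min(max_seq_len, sliding_window) - 1 as the max capacity,
# which matches the "maximum capacity of 511 elements" from tensor_impl.cpp.
baked_capacity = min(max_seq_len, sliding_window) - 1
print(baked_capacity)                  # 511
print(prompt_tokens > baked_capacity)  # True -> resize fails at runtime
```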

Root cause

Two issues in CausalLMExportableModule in optimum/exporters/executorch/integrations.py:

_prepare_export_inputs: max_dim is computed as min(max_seq_len, sliding_window) - 1 = 511, so torch.export bakes a <= 511 guard on the sequence length dimension regardless of --max_seq_len.

export: TorchExportableModuleWithHybridCache.__init__ calls StaticCache, which internally creates StaticSlidingWindowLayer(max_cache_len=max_seq_len, sliding_window=512). That class sets effective_max_cache_len = min(sliding_window, max_cache_len) = 512, and its get_mask_sizes() returns kv_length = 512 as a constant during tracing, baking a second <= 512 guard into the exported graph.
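The second cap can be sketched in isolation (a simplified stand-in for the transformers cache layer; class and method names approximate the real ones, which also manage key/value tensors and masks):

```python
# Simplified sketch of the capping behavior described above; illustrative only.
class SlidingWindowLayerSketch:
    def __init__(self, max_cache_len, sliding_window):
        # The layer clamps its capacity to the sliding window, so any
        # max_cache_len above sliding_window is silently ignored.
        self.effective_max_cache_len = min(sliding_window, max_cache_len)

    def get_mask_sizes(self):
        # Returned as a constant during tracing, baking the bound into the graph.
        return self.effective_max_cache_len


layer = SlidingWindowLayerSketch(max_cache_len=1024, sliding_window=512)
print(layer.effective_max_cache_len)  # 512, not the requested 1024
print(layer.get_mask_sizes())         # 512
```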

Fix

See linked PR.
