[Bug]: [NCC_INLA001] neuronx-cc fails to compile Llama-3.3-70B context_encoding_model with NxDI 0.7 — "type must be boolean, but is null" #16

@EzioEzi0

Your current environment


The output of `python collect_env.py`:

```text
Python: 3.12.3
PyTorch: 2.9.0+cu128
OS: Linux-6.17.0-1007-aws-x86_64-with-glibc2.39
```

Instance Type

trn1.32xlarge

Python Environment (`pip list | grep -E "torch|neuron|nki|vllm|nxdi|nixl"`):

```text
libneuronxla                 2.2.14584.0+06ac23d1
neuronx-cc                   2.22.12471.0+b4a00d10
neuronx-distributed          0.16.25997+f431c02e
neuronx-distributed-inference 0.7.15063+bafa28d5
optimum-neuron               0.4.3
torch                        2.9.0
torch-neuronx                2.9.0.2.11.19912+e48cd891
torch-xla                    2.9.0
torchaudio                   2.9.0
torchvision                  0.24.0
vllm                         0.13.0
vllm-neuron                  0.3.0
```

🐛 Describe the bug

neuronx-cc fails to compile context_encoding_model HLO for Llama-3.3-70B-Instruct using vllm-neuron 0.3.0 + NxDI 0.7 on trn1.32xlarge. The token_generation_model compiles fine — all buckets pass. But context_encoding_model consistently fails on several buckets with:

```text
[INTERNAL_ERROR] [NCC_INLA001] Unhandled exception with message:
[json.exception.type_error.302] type must be boolean, but is null
```

Tried TP=32 with max_model_len=8192 and 4096, and TP=16 with max_model_len=4096; the same error every time. The bug is in the context_encoding HLO compilation path and is not sensitive to TP size or sequence length.

```python
# Minimal reproduction
from vllm import LLM, SamplingParams

# Run with: NEURON_RT_VISIBLE_CORES=0-31 VLLM_PLUGINS=neuron python test.py
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    max_num_seqs=4,
    max_model_len=4096,
    block_size=32,
    num_gpu_blocks_override=1024,
    tensor_parallel_size=32,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8))
```

Failed compiler invocation from logs:

```shell
neuronx-cc compile --framework=XLA \
  /tmp/nxd_model/context_encoding_model/_tp0_bk18/model.MODULE_*.hlo_module.pb \
  --target=trn1 --auto-cast=none --model-type=transformer \
  --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma \
  --lnc=1 -O1 \
  --internal-hlo2tensorizer-options= --modular-flow-mac-threshold=10 --verify-hlo=true
```
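To rule vLLM out and replay the failure in isolation, the logged invocation can be rebuilt and run directly against a saved HLO protobuf. A minimal sketch: the flags are copied from the log above, but how the tensorizer sub-options are grouped into argv elements is an assumption from the log formatting, and the `/tmp/nxd_model` glob assumes the artifact layout shown in the failing path.

```python
import glob
import subprocess


def build_neuronx_cc_cmd(hlo_path: str) -> list[str]:
    """Reconstruct the failing neuronx-cc invocation from the log above."""
    return [
        "neuronx-cc", "compile", "--framework=XLA",
        hlo_path,
        "--target=trn1", "--auto-cast=none", "--model-type=transformer",
        # Assumption: the tensorizer sub-options travel as one argv element.
        "--tensorizer-options=--enable-ccop-compute-overlap "
        "--cc-pipeline-tiling-factor=2 --vectorize-strided-dma",
        "--lnc=1", "-O1",
        "--internal-hlo2tensorizer-options=",
        "--modular-flow-mac-threshold=10", "--verify-hlo=true",
    ]


if __name__ == "__main__":
    # Replay the compile for every saved context_encoding_model bucket.
    pattern = ("/tmp/nxd_model/context_encoding_model/"
               "_tp0_bk*/model.MODULE_*.hlo_module.pb")
    for hlo in sorted(glob.glob(pattern)):
        print("compiling", hlo)
        subprocess.run(build_neuronx_cc_cmd(hlo), check=False)
```

If the standalone replay reproduces the `NCC_INLA001` error, that confirms the failure is entirely inside the compiler, independent of the vLLM/NxDI runtime.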

Failed buckets: _tp0_bk16–_tp0_bk19 and _tp0_bk25–_tp0_bk29, depending on configuration.
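The failed buckets above can be enumerated from the artifact directory rather than scraped from logs. A sketch, assuming the `/tmp/nxd_model` layout from the failing path and that a successful compile leaves a `*.neff` file next to the bucket's HLO (both assumptions; the exact cache layout may differ per NxDI version):

```python
import glob
import os


def failed_buckets(model_dir: str) -> list[str]:
    """Return bucket dirs that have a saved HLO but no compiled NEFF."""
    failed = []
    for bucket in sorted(glob.glob(os.path.join(model_dir, "_tp0_bk*"))):
        has_hlo = glob.glob(os.path.join(bucket, "*.hlo_module.pb"))
        has_neff = glob.glob(os.path.join(bucket, "*.neff"))
        if has_hlo and not has_neff:
            failed.append(os.path.basename(bucket))
    return failed


if __name__ == "__main__":
    print(failed_buckets("/tmp/nxd_model/context_encoding_model"))
```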


