
Fix TypeError when disabling on-device sampling#28

Open
tmuttaki wants to merge 1 commit into vllm-project:release-0.4.1 from tmuttaki:fix/cpu-sampling-import-error

Conversation

@tmuttaki tmuttaki commented Apr 3, 2026

Summary

Fixed an incorrect import of the Sampler class that caused a TypeError when on-device sampling was disabled.

Fixes #29

Problem

When the NEURON_ON_DEVICE_SAMPLING_DISABLED=1 environment variable is set, or when on_device_sampling_config is explicitly set to None in the Neuron config overrides, the server fails to start with:

  File "vllm_neuron/worker/neuronx_distributed_model_loader.py", line 81, in __init__
    self.sampler = Sampler()
TypeError: 'module' object is not callable

Root Cause

Line 54 of vllm_neuron/worker/neuronx_distributed_model_loader.py imported the sampler module instead of the Sampler class:

# Before (buggy):
from vllm.v1.sample import sampler as Sampler

This bound the name Sampler to the module object itself, which is not callable as a constructor.
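The bug class is easy to reproduce with the standard library alone. The snippet below uses json.decoder as a stand-in for vllm.v1.sample.sampler, since both packages expose a module and a class whose names are easy to confuse:

```python
# Buggy pattern: this binds the json.decoder *module* under a
# class-like alias, just as the buggy vLLM import bound a module
# under the name Sampler.
from json import decoder as JSONDecoder

try:
    JSONDecoder()       # a module object is not callable
except TypeError as exc:
    print(exc)          # reports: 'module' object is not callable

# Fixed pattern: import the class from *inside* the module.
from json.decoder import JSONDecoder

decoder = JSONDecoder()  # now a real class, instantiable as expected
print(type(decoder).__name__)
```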

Solution

Changed the import to correctly import the Sampler class:

# After (fixed):
from vllm.v1.sample.sampler import Sampler
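For context, here is a hypothetical sketch of the constructor path the fix touches. The real class lives in vllm_neuron/worker/neuronx_distributed_model_loader.py; the class name and constructor signature below are stand-ins based on this PR's description, not the actual source:

```python
# Stand-in for the real class; the fixed import in the actual code is:
#   from vllm.v1.sample.sampler import Sampler
class Sampler:
    pass


class NeuronModelLoader:
    """Hypothetical sketch of the loader's sampler selection."""

    def __init__(self, on_device_sampling_config=None):
        if on_device_sampling_config is None:
            # CPU sampling path: instantiate vLLM's Sampler. With the
            # buggy module import, this line raised
            # TypeError: 'module' object is not callable.
            self.sampler = Sampler()
        else:
            # On-device sampling path: the Neuron device performs
            # sampling, so no CPU-side Sampler is constructed.
            self.sampler = None
```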

Testing

Tested on AWS Trainium (trn1.2xlarge) with TinyLlama-1.1B-Chat-v1.0:

  1. Server startup: Successfully starts with NEURON_ON_DEVICE_SAMPLING_DISABLED=1
  2. CPU sampling: Confirmed via logs: "CPU sampling enabled: on_device_sampling_config is None"
  3. Structured outputs: Successfully tested with response_format parameter using JSON schemas
  4. Generation quality: Validated JSON output conforms to provided schemas

Example structured output test:

curl http://localhost:8009/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt": "Generate a person profile with name, age, and city",
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "person_profile",
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "city": {"type": "string"}
          },
          "required": ["name", "age", "city"]
        }
      }
    }
  }'

Result:

{
  "name": "John Doe",
  "age": 30,
  "city": "New York"
}
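The same structured-output test can be sketched in Python with only the standard library. The endpoint URL and model name are the ones used in the curl test above; adjust them for your deployment. The conforms helper is a minimal hand-rolled check for this flat schema, not a full JSON Schema validator:

```python
import json
import urllib.request

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
    },
    "required": ["name", "age", "city"],
}


def conforms(obj, schema):
    """Check required keys and primitive types for a flat object schema."""
    types = {"string": str, "integer": int}
    if not isinstance(obj, dict):
        return False
    if any(key not in obj for key in schema["required"]):
        return False
    return all(
        isinstance(obj.get(key), types[spec["type"]])
        for key, spec in schema["properties"].items()
    )


def request_profile(base_url="http://localhost:8009"):
    """POST the structured-output request and parse the generated JSON."""
    payload = {
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Generate a person profile with name, age, and city",
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "person_profile", "schema": SCHEMA},
        },
    }
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The generated JSON string is in the first completion choice's text.
    return json.loads(body["choices"][0]["text"])


# The result shown above passes the schema check:
assert conforms({"name": "John Doe", "age": 30, "city": "New York"}, SCHEMA)
```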

Impact

  • Enables CPU sampling mode: Required for structured outputs and grammar-constrained generation
  • No breaking changes: Only affects code path when on-device sampling is explicitly disabled
  • Backward compatible: Default behavior (on-device sampling enabled) is unchanged

Related Issues

This fix enables users to:

  • Use structured outputs via response_format parameter
  • Apply grammar constraints that aren't supported by on-device sampling
  • Access all vLLM sampling features beyond temperature, top_k, and top_p

Fixed incorrect import of Sampler class that caused a TypeError when
NEURON_ON_DEVICE_SAMPLING_DISABLED=1 was set or when on_device_sampling_config
was explicitly set to None.

The bug occurred because the code was importing the sampler module instead of
the Sampler class:
  - Before: from vllm.v1.sample import sampler as Sampler
  - After: from vllm.v1.sample.sampler import Sampler

This caused a "TypeError: 'module' object is not callable" error when trying
to instantiate the sampler at line 81.

This fix enables CPU sampling mode, which is required for structured outputs
and grammar-constrained generation that are not supported by on-device sampling.

Tested on AWS Trainium (trn1.2xlarge) with TinyLlama-1.1B-Chat-v1.0 using
structured output via response_format parameter.

Signed-off-by: Tahmid Muttaki <tmuttaki@redhat.com>

tmuttaki commented Apr 3, 2026

Fixes #29

