
[Bug] OpenAI-compatible API hangs on chat/completions (model loaded, no tokens returned) #29

@dm3rdnv

Description

macMLX Version

API reports 0.3.6

Apple Silicon Chip

M2 Pro / Max / Ultra

macOS Version

26.4.1 (25E253)

Bug Description

The OpenAI-compatible API is reachable and reports a model as loaded, but POST /v1/chat/completions hangs until client timeout with no response body and no generated tokens.

This blocks orchestration/reviewer usage where API generation must be reliable.

Environment

  • macMLX version: API reports 0.3.6
  • macOS version: 26.4.1 (25E253)
  • Mac model and chip: Mac14,6 (Apple Silicon)
  • Model being used: Qwen3-32B-4bit

Additional Context

This issue is separate from CLI crash paths.
Here, server mode stays alive and the model remains loaded, but generation produces no output for structured prompts.

Steps to Reproduce

  1. Start macMLX API server (http://localhost:8000/v1).

  2. Confirm API health:

    • GET /v1 -> 200
    • GET /v1/health -> {"status":"ok"}
    • GET /v1/models -> contains Qwen3-32B-4bit
    • GET /v1/status -> loaded_model: Qwen3-32B-4bit
  3. Control check (small prompt succeeds):

```bash
curl -m 30 -H "Content-Type: application/json" \
  -X POST http://localhost:8000/v1/chat/completions \
  -d '{"model":"Qwen3-32B-4bit","messages":[{"role":"user","content":"OK"}],"temperature":0,"max_tokens":4}'
```

→ returns normally

  4. Large/structured prompt:

```bash
curl -m 120 -H "Content-Type: application/json" \
  -X POST http://localhost:8000/v1/chat/completions \
  -d '{"model":"Qwen3-32B-4bit","messages":[{"role":"user","content":"Reply with exactly OK"}],"temperature":0,"max_tokens":8,"stream":false}'
```

→ times out with 0 bytes received
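For anyone scripting this reproduction, the step-2 health checks and the step-3/step-4 request bodies can be driven from a short standard-library-only Python sketch. This is a sketch under assumptions: the base URL and model name are taken from the report above, and the helper names are mine, not part of macMLX.

```python
import json
import urllib.error
import urllib.request

BASE = "http://localhost:8000/v1"  # assumed local server address from step 1

def check_health() -> list[str]:
    """Run the step-2 GET checks; reports 'unreachable' if the server is down."""
    results = []
    for path in ("", "/health", "/models", "/status"):
        try:
            with urllib.request.urlopen(BASE + path, timeout=5) as resp:
                results.append(f"GET /v1{path} -> {resp.status}")
        except (urllib.error.URLError, OSError):
            results.append(f"GET /v1{path} -> unreachable")
    return results

def chat_body(content: str, max_tokens: int, stream: bool = False) -> str:
    """Build the JSON body used by the step-3/step-4 curl calls."""
    return json.dumps({
        "model": "Qwen3-32B-4bit",
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,
        "max_tokens": max_tokens,
        "stream": stream,
    })

control = chat_body("OK", max_tokens=4)                     # step 3: returns normally
failing = chat_body("Reply with exactly OK", max_tokens=8)  # step 4: hangs
```

POSTing `control` and then `failing` to /chat/completions with a client-side timeout shows the same success-then-hang split as the curl calls above.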

Expected Behavior

/v1/chat/completions should return a valid response (or stream tokens) within a reasonable time once the model is loaded.

Actual Behavior

Request hangs until client timeout.
No response body is returned.
No tokens are generated.

Logs (optional)

- /v1/status shows model loaded
- GPU utilization is high (~80–90%) during the request
- curl result: Operation timed out ... with 0 bytes received

Labels: bug (Something isn't working)