## macMLX Version

API reports 0.3.6

## Apple Silicon Chip

M2 Pro / Max / Ultra

## macOS Version

26.4.1 (25E253)
## Bug Description
The OpenAI-compatible API is reachable and reports a model as loaded, but POST /v1/chat/completions hangs until client timeout with no response body and no generated tokens.
This blocks orchestration/reviewer usage where API generation must be reliable.
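For reference, the hang can be reproduced without curl as well. A minimal sketch using only the Python standard library; the endpoint, model name, and JSON body mirror the curl repro in the steps below, and `build_chat_payload` / `post_chat` are hypothetical helper names, not part of macMLX:

```python
import json
import urllib.error
import urllib.request

def build_chat_payload(content: str, max_tokens: int = 8) -> dict:
    # Same JSON body as the failing curl request in the repro steps.
    return {
        "model": "Qwen3-32B-4bit",
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,
        "max_tokens": max_tokens,
        "stream": False,
    }

def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    # POST /v1/chat/completions; against the affected server this call
    # blocks and eventually times out with no bytes received.
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `post_chat("http://localhost:8000/v1", build_chat_payload("Reply with exactly OK"))` should return a completion; on the affected setup it instead raises a timeout (`TimeoutError` or a `urllib.error.URLError` wrapping one) with nothing read.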
## Environment
- macMLX version: API reports 0.3.6
- macOS version: 26.4.1 (25E253)
- Mac model and chip: Mac14,6 (Apple Silicon)
- Model being used: Qwen3-32B-4bit
## Additional Context
This issue is separate from the CLI crash paths: in server mode the process stays alive and the model remains loaded, but generation produces no output for structured prompts.
## Steps to Reproduce
1. Start the macMLX API server (`http://localhost:8000/v1`).
2. Confirm API health:
   - `GET /v1` → 200
   - `GET /v1/health` → `{"status":"ok"}`
   - `GET /v1/models` → contains `Qwen3-32B-4bit`
   - `GET /v1/status` → `loaded_model: Qwen3-32B-4bit`
3. Control check (small prompt succeeds):

   ```bash
   curl -m 30 -H "Content-Type: application/json" \
     -X POST http://localhost:8000/v1/chat/completions \
     -d '{"model":"Qwen3-32B-4bit","messages":[{"role":"user","content":"OK"}],"temperature":0,"max_tokens":4}'
   ```

   → returns normally
4. Large/structured prompt:

   ```bash
   curl -m 120 -H "Content-Type: application/json" \
     -X POST http://localhost:8000/v1/chat/completions \
     -d '{"model":"Qwen3-32B-4bit","messages":[{"role":"user","content":"Reply with exactly OK"}],"temperature":0,"max_tokens":8,"stream":false}'
   ```

   → times out with 0 bytes received
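As an additional diagnostic (not part of the repro above), a streaming variant of the same request could show whether the stall happens before the first token is emitted. A sketch, assuming the server follows the OpenAI-style `"stream": true` SSE convention; `make_stream_payload` and `stream_chat` are hypothetical helper names:

```python
import json
import urllib.request

def make_stream_payload() -> dict:
    # Same request as the failing step, but with streaming enabled
    # (assumption: the server honors the OpenAI-style "stream" flag).
    return {
        "model": "Qwen3-32B-4bit",
        "messages": [{"role": "user", "content": "Reply with exactly OK"}],
        "temperature": 0,
        "max_tokens": 8,
        "stream": True,
    }

def stream_chat(base_url: str, timeout: float = 120.0):
    # Yields raw SSE lines as they arrive. If generation stalls before
    # the first token, the read times out before any "data:" line.
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(make_stream_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        for raw in resp:
            yield raw.decode("utf-8").rstrip("\n")
```

If no `data:` line ever arrives before the timeout, that would point at a pre-first-token stall in generation rather than a response-serialization problem.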
## Expected Behavior

`/v1/chat/completions` should return a valid response (or stream tokens) within a reasonable time once the model is loaded.
## Actual Behavior
The request hangs until the client timeout: no response body is returned and no tokens are generated.
## Logs
- `/v1/status` shows the model loaded
- GPU utilization is high (~80–90%) during the request
- curl result: `Operation timed out ... with 0 bytes received`