
[Bug] OpenAI-compatible API hangs on chat/completions (model loaded, no tokens returned) #29

@dm3rdnv

Description

macMLX Version

API reports 0.3.6

Apple Silicon Chip

M2 Pro / Max / Ultra

macOS Version

26.4.1 (25E253)

Bug Description

The OpenAI-compatible API is reachable and reports a model as loaded, but POST /v1/chat/completions hangs until client timeout with no response body and no generated tokens.

This blocks orchestration/reviewer usage where API generation must be reliable.

Environment

  • macMLX version: API reports 0.3.6
  • macOS version: 26.4.1 (25E253)
  • Mac model and chip: Mac14,6 (Apple Silicon)
  • Model being used: Qwen3-32B-4bit

Additional Context

This issue is separate from CLI crash paths.
Here, server mode stays alive and the model remains loaded, but generation produces no output for structured prompts.

Steps to Reproduce

  1. Start macMLX API server (http://localhost:8000/v1).

  2. Confirm API health:

    • GET /v1 -> 200
    • GET /v1/health -> {"status":"ok"}
    • GET /v1/models -> contains Qwen3-32B-4bit
    • GET /v1/status -> loaded_model: Qwen3-32B-4bit
  3. Control check (small prompt succeeds):

```bash
curl -m 30 -H "Content-Type: application/json" \
  -X POST http://localhost:8000/v1/chat/completions \
  -d '{"model":"Qwen3-32B-4bit","messages":[{"role":"user","content":"OK"}],"temperature":0,"max_tokens":4}'
```

→ returns normally

  4. Large/structured prompt:

```bash
curl -m 120 -H "Content-Type: application/json" \
  -X POST http://localhost:8000/v1/chat/completions \
  -d '{"model":"Qwen3-32B-4bit","messages":[{"role":"user","content":"Reply with exactly OK"}],"temperature":0,"max_tokens":8,"stream":false}'
```

→ times out with 0 bytes received
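For anyone scripting this reproduction, the step-2 health checks and the step-3/step-4 request bodies can be driven from a short standard-library-only Python sketch. This is a sketch under assumptions: the base URL and model name are taken from the report above, and the helper names are mine, not part of macMLX.

```python
import json
import urllib.error
import urllib.request

BASE = "http://localhost:8000/v1"  # assumed local server address from step 1

def check_health() -> list[str]:
    """Run the step-2 GET checks; reports 'unreachable' if the server is down."""
    results = []
    for path in ("", "/health", "/models", "/status"):
        try:
            with urllib.request.urlopen(BASE + path, timeout=5) as resp:
                results.append(f"GET /v1{path} -> {resp.status}")
        except (urllib.error.URLError, OSError):
            results.append(f"GET /v1{path} -> unreachable")
    return results

def chat_body(content: str, max_tokens: int, stream: bool = False) -> str:
    """Build the JSON body used by the step-3/step-4 curl calls."""
    return json.dumps({
        "model": "Qwen3-32B-4bit",
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,
        "max_tokens": max_tokens,
        "stream": stream,
    })

control = chat_body("OK", max_tokens=4)                     # step 3: returns normally
failing = chat_body("Reply with exactly OK", max_tokens=8)  # step 4: hangs
```

POSTing `control` and then `failing` to /chat/completions with a client-side timeout shows the same success-then-hang split as the curl calls above.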

Expected Behavior

/v1/chat/completions should return a valid response (or stream tokens) within a reasonable time once the model is loaded.

Actual Behavior

Request hangs until client timeout.
No response body is returned.
No tokens are generated.

Logs (optional)

- /v1/status shows model loaded
- GPU utilization is high (~80–90%) during the request
- curl result: Operation timed out ... with 0 bytes received

Labels: bug (Something isn't working)