OCR: GLM-OCR model produces garbage output via MLXVLM

## Problem

`macdoc ocr` pipeline runs end-to-end without crashes, but the OCR output is meaningless token repetitions instead of recognized text.

## Type
bug

## Expected

Running `macdoc ocr /tmp/ocr-test.png` on an image containing "Hello OCR Test 你好世界" should produce readable Markdown text.

## Actual

Output is repeated garbage tokens:
```
刷刷刷刷刷刷一闪ount刷刷刷刷刷一闪ount刷刷刷刷刷刷...
```

Or with longer max-tokens:
```
encyencyencyencyency...指指指指指...Span Span Span...
```

## Context

- Model: `EZCon/GLM-OCR-8bit-mlx` (8-bit quantized GLM-OCR for MLX)
- Pipeline: `OCRPipeline` → `VLMModelFactory.shared.loadContainer()` → `MLXLMCommon.generate()`
- The MLXVLM library has a `GlmOcr.swift` implementation, so model architecture should be correct
- Possible causes:
  1. The quantized model (`EZCon/GLM-OCR-8bit-mlx`) itself may be broken
  2. Image preprocessing (CIImage conversion) may not match what the model expects
  3. Chat template / prompt format may be wrong for this model
  4. Need to test with a known-working model (e.g., `mlx-community/Qwen2.5-VL-3B-Instruct-4bit`)

## Impact

`macdoc ocr` is unusable until this is resolved — the entire OCR feature depends on correct VLM output.

## Next Steps

- [ ] Test with `mlx-community/Qwen2.5-VL-3B-Instruct-4bit` to verify the pipeline works with a different model
- [ ] If Qwen works, the issue is GLM-OCR specific (model or config)
- [ ] If Qwen also fails, the issue is in `OCRPipeline`'s image/prompt handling
---

## Current Status

**Phase**: diagnosed (round 2)
**Last updated**: 2026-04-21 by idd-diagnose

### Key Decisions
- **(2026-04-21) Round 2 refreshed diagnosis** posted ([issuecomment-4288001978](https://github.com/PsychQuant/macdoc/issues/66#issuecomment-4288001978)): original 2026-03-28 diagnosis predates #84's backend refactor; issue scope has widened to include general MLX VLM instability (`mlx-swift-lm#191` broadcast_shapes upstream bug per MLXBackend.swift docstring).
- **(2026-04-21) Strategy restructured into Track A/B/C/D**: Track A = investigation (3 reproductions), Track B = GLM-OCR-specific fix, Track C = MLX upstream tracking, Track D = close-path decision depending on A.
- **(2026-04-21) Key finding in code**: `MacDoc+OCR.swift:60-65` already silently routes `config.ocrDefaultModel == "glm-ocr"` to Qwen3-VL — compatibility shim already in place, effectively side-stepping #66 for the default path.
- **(2026-04-21) Complexity = SDD-warranted** (6 open questions: Track A ordering, GLM-OCR deprecation vs fix, MLX-vs-Ollama default, test coverage, pdf-ocr-vs-top-ocr sharing, issue scope). User chose `/spectra-discuss`.
- **(2026-03-28 R1) Original diagnosis** recommended Step 1-4: test Qwen2.5-VL first, then isolate GLM-OCR specific vs pipeline. Never actioned.

### Scope Changes
- **(2026-04-21 R2) Scope widened** from "GLM-OCR specific garbage output" to "OCR backend strategy post-#84" — includes general MLX VLM instability risk.
- **(2026-04-21 R2) Investigation gate added** — Track A 3 reproductions must complete before Track B/C/D can proceed.

### Blocking
- Waiting for `/spectra-discuss` — 6 open questions, primary being "is scope narrow (GLM-OCR only) or broad (MLX VLM path)". Cannot proceed to `/spectra-propose` without this alignment + Track A evidence.
- Track A requires local Ollama server + multi-GB model downloads; may need separate session with that setup.

### Commits (relevant — since issue opened 2026-03-28)
- `1a95969` (#84) wip: swap default OCR model to Qwen3-VL + OCRCore backend abstraction — side-stepped #66 at default level
- `51e8a07` feat(config): add `config ocr` subcommand with host profile support — Ollama host management
- `f1b975f` refactor: simplify PDF pipeline — replace block segmentation with page-level GLM-OCR — `pdf ocr` still defaults to GLM-OCR
- `692dc17` (#79) ocr-swift package extracted from macdoc — code now lives in PsychQuant/ocr-swift

### Related Issues
- `#68` (CLOSED) — hf xet download produced corrupted safetensors (referenced in original diagnosis Step 4)
- `#78` — note packages extraction (unrelated, same era)
- `#79` — ocr-swift extraction (relocated the code)
- Upstream: `ml-explore/mlx-swift-lm#191` — broadcast_shapes crash in VLM inference (actively affects #66)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCR: GLM-OCR model produces garbage output via MLXVLM #66

Problem

Type

Expected

Actual

Context

Impact

Next Steps

Current Status

Key Decisions

Scope Changes

Blocking

Commits (relevant — since issue opened 2026-03-28)

Related Issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

OCR: GLM-OCR model produces garbage output via MLXVLM #66

Description

Problem

Type

Expected

Actual

Context

Impact

Next Steps

Current Status

Key Decisions

Scope Changes

Blocking

Commits (relevant — since issue opened 2026-03-28)

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions