
OCR: GLM-OCR model produces garbage output via MLXVLM #66

@kiki830621

Description


Problem

The macdoc ocr pipeline runs end-to-end without crashing, but the OCR output is meaningless token repetition instead of recognized text.

Type

bug

Expected

Running macdoc ocr /tmp/ocr-test.png on an image containing "Hello OCR Test 你好世界" should produce readable Markdown text.

Actual

Output is repeated garbage tokens:

刷刷刷刷刷刷一闪ount刷刷刷刷刷一闪ount刷刷刷刷刷刷...

Or with longer max-tokens:

encyencyencyencyency...指指指指指...Span Span Span...

Context

  • Model: EZCon/GLM-OCR-8bit-mlx (8-bit quantized GLM-OCR for MLX)
  • Pipeline: OCRPipeline → VLMModelFactory.shared.loadContainer() → MLXLMCommon.generate()
  • The MLXVLM library has a GlmOcr.swift implementation, so model architecture should be correct
  • Possible causes:
    1. The quantized model (EZCon/GLM-OCR-8bit-mlx) itself may be broken
    2. Image preprocessing (CIImage conversion) may not match what the model expects
    3. Chat template / prompt format may be wrong for this model
    4. Unclear; testing with a known-working model (e.g., mlx-community/Qwen2.5-VL-3B-Instruct-4bit) is needed to isolate which of the above applies
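For reproducing and regression-testing this failure mode, a cheap heuristic can flag degenerate output like the samples above. This is a hypothetical helper, not part of macdoc (the function name and threshold are made up): it treats output as garbage when a handful of distinct tokens account for most of the text.

```swift
// Hypothetical sanity check for degenerate VLM output: if the few most
// frequent units (whitespace-split words, falling back to individual
// characters for CJK-style runs with no whitespace) dominate the string,
// it is almost certainly repetition garbage rather than recognized text.
func looksLikeRepetitionGarbage(_ output: String, threshold: Double = 0.6) -> Bool {
    // Split on whitespace; if that yields too few units, fall back to characters.
    var units = output.split(whereSeparator: { $0.isWhitespace }).map(String.init)
    if units.count < 8 {
        units = output.map(String.init)
    }
    guard units.count >= 8 else { return false }  // too short to judge

    var counts: [String: Int] = [:]
    for u in units { counts[u, default: 0] += 1 }

    // Fraction of the output covered by the 3 most common units.
    let topCoverage = counts.values.sorted(by: >).prefix(3).reduce(0, +)
    return Double(topCoverage) / Double(units.count) >= threshold
}
```

On the samples from this issue, the repeated-token outputs trip the check while the expected "Hello OCR Test 你好世界" does not.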

Impact

macdoc ocr is unusable until this is resolved — the entire OCR feature depends on correct VLM output.

Next Steps

  • Test with mlx-community/Qwen2.5-VL-3B-Instruct-4bit to verify the pipeline works with a different model
  • If Qwen works, the issue is GLM-OCR specific (model or config)
  • If Qwen also fails, the issue is in OCRPipeline's image/prompt handling
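The isolation logic in these steps can be stated mechanically. A hypothetical sketch, assuming only the two outcomes described above (names are illustrative, not macdoc API):

```swift
// Hypothetical encoding of the A/B isolation plan: run the same image
// through both models and attribute the fault from the two outcomes.
enum OCRFault {
    case glmOCRSpecific   // GLM-OCR model or its config is broken
    case pipeline         // OCRPipeline's image/prompt handling is broken
    case none             // both models produced readable text
    case inconclusive     // GLM-OCR worked but the control model failed
}

func attributeFault(glmOCRReadable: Bool, qwenReadable: Bool) -> OCRFault {
    switch (glmOCRReadable, qwenReadable) {
    case (false, true):  return .glmOCRSpecific  // control works: fault is model-side
    case (false, false): return .pipeline        // control also fails: fault is pipeline-side
    case (true, true):   return .none
    case (true, false):  return .inconclusive    // unexpected: re-check the control setup
    }
}
```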

Current Status

Phase: diagnosed (round 2)
Last updated: 2026-04-21 by idd-diagnose

Key Decisions

  • (2026-04-21) Round 2 refreshed diagnosis posted (issuecomment-4288001978): the original 2026-03-28 diagnosis predates the backend refactor from #84 ("fix: PageOCRRunner uses OCRCore.backend API (unblocks main build)"); issue scope has widened to include general MLX VLM instability (the mlx-swift-lm#191 broadcast_shapes upstream bug, per the MLXBackend.swift docstring).
  • (2026-04-21) Strategy restructured into Track A/B/C/D: Track A = investigation (3 reproductions), Track B = GLM-OCR-specific fix, Track C = MLX upstream tracking, Track D = close-path decision depending on A.
  • (2026-04-21) Key finding in code: MacDoc+OCR.swift:60-65 already silently routes config.ocrDefaultModel == "glm-ocr" to Qwen3-VL — a compatibility shim is already in place, effectively side-stepping #66 for the default path.
  • (2026-04-21) Complexity = SDD-warranted (6 open questions: Track A ordering, GLM-OCR deprecation vs fix, MLX-vs-Ollama default, test coverage, pdf-ocr-vs-top-ocr sharing, issue scope). User chose /spectra-discuss.
  • (2026-03-28 R1) Original diagnosis recommended Step 1-4: test Qwen2.5-VL first, then isolate GLM-OCR specific vs pipeline. Never actioned.
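The shim noted in the key finding above is a silent model substitution. A minimal sketch of that shape, assuming nothing beyond what is stated here (identifiers and the fallback ID are placeholders; the real code and exact Qwen3-VL model ID live in MacDoc+OCR.swift:60-65):

```swift
// Sketch of the compatibility shim described above: when the configured
// default model is the broken "glm-ocr", silently substitute a working
// Qwen3-VL model. Identifiers are illustrative, not the actual macdoc code.
let qwen3VLFallbackID = "Qwen3-VL"  // placeholder; see MacDoc+OCR.swift for the real ID

func resolveOCRModel(_ configured: String) -> String {
    // #66: GLM-OCR produces repetition garbage via MLXVLM, so the default
    // path is rerouted until a real fix lands (Track B).
    if configured == "glm-ocr" {
        return qwen3VLFallbackID
    }
    return configured
}
```

Note this only covers the default path: explicitly requesting the GLM-OCR model still hits the bug.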

Scope Changes

Blocking

  • Waiting for /spectra-discuss — 6 open questions, primary being "is scope narrow (GLM-OCR only) or broad (MLX VLM path)". Cannot proceed to /spectra-propose without this alignment + Track A evidence.
  • Track A requires local Ollama server + multi-GB model downloads; may need separate session with that setup.

Commits (relevant — since issue opened 2026-03-28)

Related Issues

  • #68 (CLOSED) — hf xet download produced corrupted safetensors (referenced in original diagnosis Step 4)
  • #78 — note packages extraction (unrelated, same era)
  • #79 — ocr-swift extraction (relocated the code)
  • Upstream: ml-explore/mlx-swift-lm#191 — broadcast_shapes crash in VLM inference (actively affects #66)
