Validate image bytes + steer LLM to company domains#24
Merged
Two fixes for the LLM-fallback class of failures observed in prod.

## Magic-byte image format validation

After grounding (#16) and the wikidata layer (#18), the remaining LLM misses were failing in a particularly bad way: the model returned a URL that downloaded fine, but the bytes weren't an image, usually HTML from a Wikipedia file *page* or a generic 404 page. libvips spent ~6s per size trying to process, hit our (now 60s) write timeout, and returned an opaque "Unsupported image format" without marking the row failed for retry.

New `validateImageFormat` checks magic bytes for PNG, JPEG, WebP, GIF before handing to bimg, with specific errors for the common bad cases (SVG → libvips can't handle on Alpine; HTML → wrong URL). Rejection is fast (microseconds) and surfaces a clear reason in the logs.

## Gemini prompt rewrite

The prior prompt encouraged Wikipedia/Wikimedia URLs, exactly the hosts the model can't reliably target because of the MD5-derived hash prefix in Commons paths. Prod hallucinations included three different invented hash prefixes for Repsol's logo (1/12, f/f9, 3/30) plus a malformed /thumb/ URL.

Pivot the prompt to the company's own domain (and CDNs they own) and explicitly enumerate the anti-patterns we've observed:

- upload.wikimedia.org URLs (use Wikidata path instead)
- en.wikipedia.org/wiki/File:* (HTML pages, not files)
- /thumb/ paths
- stock-exchange "logo" endpoints
- pattern-constructed URLs vs. URLs from real search results

Also tightens the required-format list to PNG/JPEG/WebP (drops SVG), matching what our libvips build actually supports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for the LLM-fallback class of failures
1. Magic-byte image format validation
After grounding (#16) and the wikidata layer (#18), remaining LLM misses fail in a particularly bad way: the model returns a URL that downloads fine, but the bytes aren't an image (HTML 404 page, Wikipedia file page served as HTML, etc.). libvips spends ~6s per size trying to process, hits our 60s WriteTimeout, and returns "Unsupported image format" without marking the row failed for retry.
New `validateImageFormat` checks magic bytes for PNG / JPEG / WebP / GIF before handing to bimg, with specific errors for the common bad cases (SVG → not supported on Alpine vips, HTML → wrong URL). Rejection takes microseconds, surfaces a clear reason in the logs, and marks the row failed cleanly.
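For readers who have not opened the diff, here is a minimal sketch of what a magic-byte check along these lines might look like. It is not the code from this PR: the function name is taken from the description, but the signature, the error values (`errSVG`, `errHTML`, `errUnknown`), and the 512-byte sniff window are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// Hypothetical error values; the real PR may name or word these differently.
var (
	errSVG     = errors.New("SVG is not supported by this libvips build")
	errHTML    = errors.New("response is HTML, not an image (wrong URL?)")
	errUnknown = errors.New("unrecognized image format")
)

// validateImageFormat accepts PNG, JPEG, GIF and WebP by their leading
// bytes and returns a specific error for SVG and HTML payloads.
func validateImageFormat(b []byte) error {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return nil // PNG signature
	case bytes.HasPrefix(b, []byte{0xFF, 0xD8, 0xFF}):
		return nil // JPEG start-of-image marker
	case bytes.HasPrefix(b, []byte("GIF87a")), bytes.HasPrefix(b, []byte("GIF89a")):
		return nil // GIF header
	case len(b) >= 12 && bytes.Equal(b[0:4], []byte("RIFF")) && bytes.Equal(b[8:12], []byte("WEBP")):
		return nil // WebP: RIFF container with a WEBP form type
	}

	// Not a known binary format: sniff the first bytes as text so the
	// log line can say what was actually downloaded.
	head := b
	if len(head) > 512 {
		head = head[:512]
	}
	head = bytes.TrimPrefix(head, []byte("\xef\xbb\xbf")) // drop a UTF-8 BOM
	head = bytes.ToLower(bytes.TrimSpace(head))

	switch {
	case bytes.HasPrefix(head, []byte("<svg")),
		bytes.HasPrefix(head, []byte("<?xml")) && bytes.Contains(head, []byte("<svg")):
		return errSVG
	case bytes.HasPrefix(head, []byte("<!doctype html")), bytes.HasPrefix(head, []byte("<html")):
		return errHTML
	}
	return errUnknown
}

func main() {
	samples := map[string][]byte{
		"png":  []byte("\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"),
		"html": []byte("<!DOCTYPE html><html><body>Not Found</body></html>"),
		"svg":  []byte("<?xml version=\"1.0\"?><svg xmlns=\"http://www.w3.org/2000/svg\"/>"),
	}
	for name, data := range samples {
		fmt.Println(name, "->", validateImageFormat(data))
	}
}
```

Because the check only inspects a fixed-size prefix of the downloaded bytes, it returns before libvips is ever invoked, which is what makes the rejection cheap.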
2. Gemini prompt rewrite
Production hallucinations included three different invented Wikimedia hash prefixes for Repsol's logo (`/1/12/`, `/f/f9/`, `/3/30/`) plus a malformed `/thumb/` URL — the prior prompt actively encouraged Wikipedia/Wikimedia URLs, which are exactly the hosts the model can't reliably target because of the MD5-derived hash prefix in Commons paths.
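To make the hash-prefix problem concrete, the sketch below derives the two directory levels the way hashed Commons upload paths are laid out: the first hex character, then the first two hex characters, of the MD5 of the file name with spaces replaced by underscores. The helper name `commonsHashPrefix` and the file name in `main` are hypothetical, and the sketch assumes the name is already in its canonical form (for example, first letter capitalized).

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// commonsHashPrefix returns the "/x/xy/" directory prefix for a file name,
// where "xy" are the first two hex digits of the MD5 of the name with
// spaces replaced by underscores.
func commonsHashPrefix(filename string) string {
	name := strings.ReplaceAll(filename, " ", "_")
	sum := md5.Sum([]byte(name))
	h := hex.EncodeToString(sum[:])
	return "/" + h[:1] + "/" + h[:2] + "/"
}

func main() {
	// Hypothetical file name, used only to show the shape of the result.
	fmt.Println(commonsHashPrefix("Example company logo.png"))
}
```

The prefix depends on every byte of the exact file name, so a model that does not know the precise name and cannot compute an MD5 has no way to produce it other than guessing, which matches the three different invented prefixes seen in production.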
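Pivot the prompt to the company's own domain (and CDNs they own). Explicitly enumerate the anti-patterns we've observed in prod:
- `upload.wikimedia.org` URLs (use the Wikidata path instead)
- `en.wikipedia.org/wiki/File:*` (HTML pages, not files)
- `/thumb/` paths
- stock-exchange "logo" endpoints
- pattern-constructed URLs vs. URLs from real search results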
Tightens the required-format list to PNG/JPEG/WebP (drops SVG), matching what our libvips build actually supports.
Test plan
🤖 Generated with Claude Code