Hi, thanks for maintaining textract; it has been very useful for document-heavy data pipelines.
I am the maintainer of WFGY Problem Map, an MIT-licensed checklist of 16 concrete failure patterns that frequently show up in LLM and RAG-style systems built on top of tools like textract, OCR engines, and vector stores.
Link: https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
Many teams use textract as the first step in a pipeline that later becomes a QA bot, RAG service, or analytics assistant. The failures we see in practice are often not in textract itself but in how the extracted text is chunked, indexed, cached, and evaluated. The Problem Map collects these patterns in one place, with short diagnostic questions and suggested fixes, so that people can quickly figure out whether they are hitting an ingestion, chunking, vector-store, or evaluation problem.
WFGY Problem Map has already been referenced by a few research and tooling ecosystems that work on LLM robustness and RAG evaluation, so I thought it might also be a useful external link for textract users who are building downstream AI systems.
If you think it makes sense, I would like to propose adding a small link in your documentation, for example in a “Resources” or “Related projects” section, such as:
External reading: WFGY Problem Map – 16 common failure modes for RAG and LLM pipelines built on top of document extraction.
Totally fine if this does not fit your scope; I just wanted to share it in case it helps your users debug the downstream systems that depend on textract.