PDF to markdown #24
cobaltautomationdev
started this conversation in
Ideas
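There are already plenty of tools for this. We built a PDF converter ourselves a while back: RapidDoc. It is shelved for now, since we don't have the bandwidth at the moment.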
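@SWHL, could you help look into converting PDF to Markdown? The library I currently use is marker. It is quite capable, but it occasionally drops some of the text at the top of a PDF page. If you could build a PDF-to-Markdown tool on top of rapidocrpdf, that would be perfect.
The reason I want Markdown is that I have found current AI models understand it very accurately. Markdown preserves basic table formatting, and the AI can infer the layout well from it, so data extraction comes out close to perfect. Right now I mainly process invoices, and the accuracy is roughly on par with the commercial ABBYY OCR software. No training is needed; the AI extracts everything directly.
Here is my system prompt: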
Output a valid JSON object only. No extra text allowed.
IMPORTANT: Extract the following details from the provided invoice content:
If you encounter a potentially incorrect currency unit code,
attempt to correct it to the most likely standard ISO 4217 currency code.
Output in this JSON format:
{
"Version": "3.0",
"Fields": {
"InvoiceNumber": "string",
"InvoiceDate": "YYYY-MM-DD",
"InvoiceAmount": "number",
"CurrencyUnit": "string",
"Entity": "string",
"AccountNumber": "string",
"VendorName": "string"
}
}
Replace placeholders with actual values. Use null for missing fields.
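
Since the prompt asks for a fixed JSON shape, the model's reply can be checked before it is used downstream. Below is a minimal sketch in Python, not part of my actual pipeline: the function name `validate_invoice_json` is hypothetical, it assumes `InvoiceAmount` comes back as a JSON number, and the currency check only tests the three-uppercase-letter shape of an ISO 4217 code rather than looking it up in the official list.

```python
import json
import re
from datetime import date

# Field names taken from the system prompt above.
EXPECTED_FIELDS = (
    "InvoiceNumber", "InvoiceDate", "InvoiceAmount", "CurrencyUnit",
    "Entity", "AccountNumber", "VendorName",
)


def validate_invoice_json(raw: str) -> list[str]:
    """Return a list of problems found in the model's reply (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict) or not isinstance(data.get("Fields"), dict):
        return ["missing 'Fields' object"]

    problems = []
    if data.get("Version") != "3.0":
        problems.append("unexpected Version")

    fields = data["Fields"]
    for name in EXPECTED_FIELDS:
        if name not in fields:
            problems.append(f"missing field: {name}")

    # null is allowed for any field, so only check values that are present.
    inv_date = fields.get("InvoiceDate")
    if inv_date is not None:
        ok = isinstance(inv_date, str) and re.fullmatch(r"\d{4}-\d{2}-\d{2}", inv_date)
        if ok:
            try:
                date.fromisoformat(inv_date)  # rejects e.g. 2024-13-40
            except ValueError:
                ok = False
        if not ok:
            problems.append("InvoiceDate is not a valid YYYY-MM-DD date")

    amount = fields.get("InvoiceAmount")
    # Assumption: the amount is a JSON number; bool is excluded explicitly
    # because it is a subclass of int in Python.
    if amount is not None and (isinstance(amount, bool) or not isinstance(amount, (int, float))):
        problems.append("InvoiceAmount is not a number")

    currency = fields.get("CurrencyUnit")
    # Shape check only: three uppercase letters, not a lookup in ISO 4217.
    if currency is not None and not (isinstance(currency, str) and re.fullmatch(r"[A-Z]{3}", currency)):
        problems.append("CurrencyUnit is not a three-letter uppercase code")

    return problems
```

An empty list means the reply matches the expected shape; anything else can be logged or used to retry the request.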