Parse files & links into clean, structured, LLM-ready text.
Reparsed is a self-hosted API that takes a URL or a file, throws out the noise (ads, trackers, navigation, cookie banners, UI chrome, "you might also like"), figures out what kind of content it is, and returns a clean, structured JSON payload your downstream LLM can actually use.
It exists because engines like Ollama and LocalAI can't fetch a URL or read a PDF. Reparsed is the authenticated pre-processor you put in front of them.
┌─ messy input ──────────────┐ ┌─ clean output ─────────────────────┐
│ nav · ads · cookie banner │ │ { "content_type": "video", │
│ TITLE · 142K views │ ──▶ │ "title": "Building a Rust server",│
│ Subscribe! Share! tracking │ reparsed│ "structured": { "channel": … }, │
│ description · 1.2k comments │ │ "markdown": "# Building a Rust…" }│
└─────────────────────────────┘ └─────────────────────────────────────┘
| Concern | Decision | Why |
|---|---|---|
| Similar tech | Conceptually Jina Reader / Firecrawl, plus a semantic layer | Those tools do HTML→markdown. Reparsed additionally classifies the content type and keeps only the important fields, which is what makes downstream prompting cheap and reliable. |
| HTML main-content extraction | Trafilatura (MIT) | Highest F1 (~0.958) among open-source extractors, beating Mozilla Readability; fast, deterministic, no GPU. (benchmark) |
| JS-heavy pages (e.g. video) | Playwright headless Chromium fallback | Static fetch misses JS-rendered content; we render only when static extraction is thin. |
| Binary files (PDF/DOCX/…) | MarkItDown (MIT) default, Docling (MIT) optional | MarkItDown is fast, light, and emits LLM-friendly markdown across PDF/DOCX/PPTX/XLSX/images. Docling wins on complex tables (~98% accuracy) but is heavy (~1GB), so it's opt-in. |
| Which model | A single general instruction model (default qwen3:8b), configurable | The Qwen3 line is the most reliable for JSON-schema-constrained output in Ollama. One model handles classify + restructure. Swap via REPARSED_MODEL. |
| Output format | Markdown body inside a JSON envelope | Research is consistent that markdown is the best LLM-ingestion format (10–20% fewer tokens than HTML, cleaner structure). The JSON envelope gives code the typed fields; the markdown field gives the next model clean prose. (why) |
| Structured output reliability | Ollama's format = JSON Schema | Constrains decoding so the response is always valid JSON. (docs) |
A purpose-built model worth knowing about: Jina's ReaderLM-v2 (1.5B, available on Ollama) does HTML→markdown/JSON conversion specifically. We use Trafilatura for that step instead because it's deterministic, free, and needs no GPU, but ReaderLM-v2 is a natural drop-in if you want a model-based extractor later.
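The "Structured output reliability" row can be illustrated with a minimal sketch of a request body for Ollama's /api/chat endpoint, where the format field carries a JSON Schema. The schema fields, the prompt wording, and the helper name build_chat_payload are assumptions for illustration; Reparsed's actual schema and prompt may differ.

```python
import json

# Hypothetical envelope schema, modelled on the fields shown in this README.
ENVELOPE_SCHEMA = {
    "type": "object",
    "properties": {
        "content_type": {"type": "string"},
        "title": {"type": "string"},
        "structured": {"type": "object"},
        "markdown": {"type": "string"},
    },
    "required": ["content_type", "title", "markdown"],
}


def build_chat_payload(extracted_text: str, model: str = "qwen3:8b") -> dict:
    """Build a JSON body for POST <OLLAMA_BASE_URL>/api/chat."""
    return {
        "model": model,
        "stream": False,  # ask for one complete response instead of chunks
        "format": ENVELOPE_SCHEMA,  # constrains decoding to this schema
        "messages": [
            {
                "role": "user",
                # Assumed prompt wording, not the project's real prompt.
                "content": "Classify and restructure this content:\n\n"
                + extracted_text,
            }
        ],
    }


payload = build_chat_payload("Senior Backend Engineer at Acme ...")
print(json.dumps(payload["format"]["required"]))
```

Because the schema travels with every request, changing the envelope is a matter of editing one dictionary rather than re-prompting the model.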
client / resubmit.ai ──(Bearer rp_live_…)──▶ Caddy (TLS, reparsed.app)
│
FastAPI :17177
├─ /v1/parse (API-key auth)
├─ dashboard + auth (session cookies)
│
│ Stage 1 — deterministic extract
│ url → httpx → Trafilatura
│ └ Playwright fallback (JS)
│ file → MarkItDown (→ Docling opt.)
│ Stage 2 — LLM classify + restructure
│ → Ollama, external GPU host (JSON-schema)
▼
Ollama (your GPU server) Postgres (users, keys, usage)
Two stages keep it cheap and robust: deterministic extraction removes ~90% of the junk for free, so the model only does the semantic part (detect type, keep the important fields, respect the output budget — e.g. "top 10 comments if they fit").
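The Stage 1 ordering for URLs (static extraction first, browser rendering only when the result is thin) could be sketched as follows. The 500-character threshold, the function name extract_url, and the "playwright" extractor label are assumptions; the two extractors are passed in as plain callables so the sketch stays independent of httpx, Trafilatura, and Playwright.

```python
from typing import Callable

# Hypothetical threshold: below this many characters the static result is "thin".
MIN_STATIC_CHARS = 500


def extract_url(
    url: str,
    static_extract: Callable[[str], str],
    rendered_extract: Callable[[str], str],
    render: str = "auto",
) -> tuple[str, str]:
    """Return (text, extractor_name), trying the cheap static path first."""
    if render not in {"auto", "always", "never"}:
        raise ValueError(f"unknown render mode: {render}")
    if render == "always":
        return rendered_extract(url), "playwright"
    text = static_extract(url)
    if render == "auto" and len(text.strip()) < MIN_STATIC_CHARS:
        # Static fetch looks thin, so pay for a headless-browser render.
        rendered = rendered_extract(url)
        if len(rendered.strip()) > len(text.strip()):
            return rendered, "playwright"
    return text, "trafilatura"


# Stub extractors standing in for the real static and rendered paths.
thin_page = lambda url: "Loading..."
full_page = lambda url: "x" * 2000

text, extractor = extract_url("https://example.com/watch", thin_page, full_page)
print(extractor, len(text))
```

Keeping the render decision in one function makes the cost trade-off explicit: the browser is launched only when the cheap path has demonstrably failed.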
Prefer not to run it? Use the hosted version at reparsed.app — sign up, grab a key, and skip the infrastructure. This repo is for self-hosting.
Requirements: Docker + Docker Compose, and an Ollama server reachable on your network — Reparsed talks to it over HTTP (the GPU lives there, not in this stack).
git clone <this repo> reparsed && cd reparsed
cp .env.example .env
# → edit .env: set SESSION_SECRET, POSTGRES_PASSWORD, and OLLAMA_BASE_URL
# Make sure your model is pulled on the Ollama server:
# ollama pull qwen3:8b
docker compose up -d
Set OLLAMA_BASE_URL to your Ollama host (e.g. http://192.168.1.10:11434). Once the
stack is up, the API is available at http://localhost:17177.
curl http://localhost:17177/v1/health
# {"status":"ok","model":"qwen3:8b","model_ready":true}
Open http://localhost:17177, register an account, and generate your API key in the dashboard.
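In scripts, the health check can be polled until the model reports ready. This is a minimal sketch: the helper name wait_until_ready, the attempt count, and the delay are assumptions, and the health fetcher is injected so the loop can be exercised without a running server.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_health: Callable[[], dict],
    attempts: int = 10,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a health payload until status is ok and the model is ready."""
    for attempt in range(attempts):
        try:
            health = fetch_health()
        except OSError:
            health = {}  # API not reachable yet; treat as not ready
        if health.get("status") == "ok" and health.get("model_ready") is True:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # wait before the next poll
    return False


# Simulated sequence: the model is still loading on the first poll.
responses = iter([
    {"status": "ok", "model": "qwen3:8b", "model_ready": False},
    {"status": "ok", "model": "qwen3:8b", "model_ready": True},
])
print(wait_until_ready(lambda: next(responses), sleep=lambda s: None))
```

Against a live instance, fetch_health would typically wrap a GET to /v1/health with the HTTP client of your choice and return the decoded JSON.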
Point DNS at the host, then start with the TLS proxy profile:
REPARSED_DOMAIN=reparsed.app PUBLIC_BASE_URL=https://reparsed.app SESSION_COOKIE_SECURE=true \
docker compose --profile proxy up -d
Caddy fetches a Let's Encrypt certificate automatically and proxies :443 → api:17177.
Set SESSION_COOKIE_SECURE=true only when every client reaches the app over HTTPS;
leave it false (the default) if you also hit it over plain HTTP on the LAN, or the
session cookie is dropped and logins silently fail.
Auth: Authorization: Bearer rp_live_… (or X-API-Key: rp_live_…).
Provide exactly one source. JSON body for url/text/html, or multipart/form-data for a file:
| Field | Where | Description |
|---|---|---|
| url | JSON | Page or file URL to fetch and parse. |
| text | JSON | Raw text to clean and structure. |
| html | JSON | Raw HTML to extract from. |
| file | multipart | Uploaded file (PDF/DOCX/PPTX/XLSX/image/…). |
| max_output_chars | both | Cap on the clean text (default 8000, max 32000). |
| format | both | both (default), markdown, or json. |
| render | both | auto (default), always, or never (browser rendering). |
| content_hint | both | Optional type hint, e.g. "job_posting". |
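A client can check the documented field rules before sending anything. The sketch below covers the JSON sources only (file uploads go through multipart); the helper name build_parse_body and the lower bound of 1 on max_output_chars are assumptions, and the server remains the authority on validation.

```python
from typing import Optional

FORMATS = {"both", "markdown", "json"}
RENDER_MODES = {"auto", "always", "never"}
MAX_OUTPUT_CHARS_LIMIT = 32000


def build_parse_body(
    *,
    url: Optional[str] = None,
    text: Optional[str] = None,
    html: Optional[str] = None,
    max_output_chars: int = 8000,
    output_format: str = "both",
    render: str = "auto",
    content_hint: Optional[str] = None,
) -> dict:
    """Validate the documented fields client-side and return a JSON body."""
    given = {k: v for k, v in (("url", url), ("text", text), ("html", html)) if v}
    if len(given) != 1:
        raise ValueError("provide exactly one of url, text, or html")
    if not 1 <= max_output_chars <= MAX_OUTPUT_CHARS_LIMIT:
        raise ValueError("max_output_chars must be between 1 and 32000")
    if output_format not in FORMATS:
        raise ValueError(f"format must be one of {sorted(FORMATS)}")
    if render not in RENDER_MODES:
        raise ValueError(f"render must be one of {sorted(RENDER_MODES)}")
    body = {
        **given,
        "max_output_chars": max_output_chars,
        "format": output_format,  # sent under the API's field name
        "render": render,
    }
    if content_hint:
        body["content_hint"] = content_hint
    return body


body = build_parse_body(url="https://example.com/job", content_hint="job_posting")
print(body)
```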
Examples
# Parse a link
curl -X POST http://localhost:17177/v1/parse \
-H "Authorization: Bearer rp_live_…" -H "Content-Type: application/json" \
-d '{"url": "https://news.ycombinator.com/item?id=1"}'
# Upload a file
curl -X POST http://localhost:17177/v1/parse \
-H "Authorization: Bearer rp_live_…" \
-F "file=@resume.pdf" -F "content_hint=resume"
# Clean raw text
curl -X POST http://localhost:17177/v1/parse \
-H "Authorization: Bearer rp_live_…" -H "Content-Type: application/json" \
-d '{"text": "....", "max_output_chars": 4000}'
Response
{
"content_type": "job_posting",
"title": "Senior Backend Engineer",
"language": "en",
"structured": {
"company": "Acme",
"location": "Remote (EU)",
"salary_range": "€70k–€90k",
"requirements": ["5+ years Python", "Postgres", "Kubernetes"],
"benefits": ["Equity", "30 days PTO"]
},
"markdown": "# Senior Backend Engineer\n**Company:** Acme …",
"meta": {
"source": "url", "url": "https://…", "filename": null,
"language": "en", "extractor": "trafilatura", "model": "qwen3:8b",
"input_chars": 18342, "output_chars": 1204,
"input_truncated": false, "output_truncated": false, "latency_ms": 2310
}
}
- GET /v1/content-types: the catalog of types Reparsed detects.
- GET /v1/health: backend + model readiness.
- GET /api-docs: interactive OpenAPI (Swagger) docs.
400 bad input · 401 missing/invalid key · 413 file too large · 422 nothing
extractable / unsupported · 429 rate limited · 502/504 fetch or model failure.
If the model is unreachable, Reparsed degrades gracefully and returns the
deterministic extraction as content_type: "generic" rather than failing.
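The status codes above fall into two groups from a client's point of view: errors worth retrying and errors that need a changed request. A minimal sketch of that split follows; treating 429, 502, and 504 as retryable and the exponential backoff values are assumptions, not behaviour the API prescribes.

```python
# Codes taken from the list above; the grouping is this sketch's assumption.
RETRYABLE = {429, 502, 504}
CLIENT_ERRORS = {
    400: "bad input",
    401: "missing/invalid key",
    413: "file too large",
    422: "nothing extractable / unsupported",
}


def next_action(status_code: int) -> str:
    """Map an HTTP status to 'ok', 'retry', 'fix_request', or 'fail'."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in RETRYABLE:
        return "retry"
    if status_code in CLIENT_ERRORS:
        return "fix_request"
    return "fail"


def backoff_delays(retries: int = 3, base_s: float = 1.0) -> list:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base_s * 2**i for i in range(retries)]


for code in (200, 422, 429, 504):
    print(code, next_action(code))
print(backoff_delays())
```

Note that a content_type of "generic" on a 200 response may mean either unrecognized content or the degraded path described above, so callers that depend on typed fields should check for them explicitly.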
| Limit | Default |
|---|---|
| Max upload | 25 MB |
| Max fetched HTML | 10 MB |
| Text sent to model | ~96k chars (~24k tokens) |
| Default / max output | 8k / 32k chars |
| Rate limit | 30/min · 2000/day per key |
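The per-key rate limit (30/min, 2000/day) can be pictured as a sliding-window counter held in memory. This is a sketch under that assumption, not the code in app/ratelimit.py; the class name and the sliding-window choice are illustrative.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class InMemoryRateLimiter:
    """Sliding-window limiter with per-key caps for the last minute and day."""

    def __init__(self, per_minute: int = 30, per_day: int = 2000):
        self.per_minute = per_minute
        self.per_day = per_day
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps older than one day; they no longer count.
        while hits and now - hits[0] >= 86400:
            hits.popleft()
        last_minute = sum(1 for t in hits if now - t < 60)
        if last_minute >= self.per_minute or len(hits) >= self.per_day:
            return False  # caller would answer with HTTP 429
        hits.append(now)
        return True


limiter = InMemoryRateLimiter(per_minute=2, per_day=3)
print([limiter.allow("rp_live_demo", now=t) for t in (0.0, 1.0, 2.0, 61.5, 200.0)])
```

State lives in one process, which is why a shared store becomes necessary once more than one API replica is running.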
article · video · job_posting · product · resume · forum_thread ·
documentation · recipe · event · search_results · social_post · generic
Each type defines what "important" means (see app/content_types.py).
A video keeps the description, recommended videos and top comments; a job posting
keeps company/salary/requirements; a recipe drops the life story. Unknown content
falls back to generic.
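One way to picture the registry is a mapping from content type to the fields worth keeping, with generic as the fallback. The field names and helper names below are hypothetical and only cover three types; the real definitions live in app/content_types.py and may be shaped differently.

```python
# Hypothetical registry; field names are illustrative, not the project's.
IMPORTANT_FIELDS = {
    "video": ["description", "recommended_videos", "top_comments"],
    "job_posting": ["company", "salary_range", "requirements"],
    "generic": [],  # assumption: generic keeps whatever was extracted
}


def resolve_type(name: str) -> str:
    """Return the type itself if known, otherwise fall back to 'generic'."""
    return name if name in IMPORTANT_FIELDS else "generic"


def keep_important(content_type: str, extracted: dict) -> dict:
    """Filter extracted fields down to the ones the type cares about."""
    wanted = IMPORTANT_FIELDS[resolve_type(content_type)]
    if not wanted:
        return dict(extracted)
    return {k: extracted[k] for k in wanted if k in extracted}


raw = {"company": "Acme", "salary_range": "€70k–€90k", "tracking_id": "abc123"}
print(resolve_type("podcast"))
print(keep_important("job_posting", raw))
```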
Reparsed is designed to be chained. The pattern:
import ollama
import requests

KEY = "rp_live_…"
job_url = "https://…"  # the job posting to parse
# 1 — Reparsed turns a messy job-post URL into clean, typed fields
clean = requests.post(
"https://reparsed.app/v1/parse",
headers={"Authorization": f"Bearer {KEY}"},
json={"url": job_url, "content_hint": "job_posting"},
).json()
# 2 — feed the clean text to your own model
ollama.chat(model="llama3", messages=[
{"role": "user",
"content": f"Tailor my resume to this role:\n\n{clean['markdown']}"}
])
For resubmit.ai specifically, the job_posting and resume content types map
directly onto resume-tailoring: parse the job URL and the user's uploaded resume
PDF through the same endpoint, then hand both clean markdown blocks (and the typed
structured fields) to the tailoring model. One authenticated POST per artifact, no
scraping or PDF parsing to maintain on resubmit.ai's side.
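Combining the two parsed artifacts into one prompt could look like the sketch below. It assumes both responses follow the envelope shown in the Response example; the function name build_tailoring_prompt and the prompt layout are illustrative, and the sample dictionaries stand in for real API responses.

```python
def build_tailoring_prompt(job: dict, resume: dict) -> str:
    """Merge a parsed job posting and a parsed resume into one prompt."""
    # Typed fields let the prompt call out requirements explicitly.
    requirements = job.get("structured", {}).get("requirements", [])
    parts = [
        "Tailor the resume below to the job posting.",
        "## Job posting\n" + job["markdown"],
        "## Key requirements\n" + "\n".join(f"- {r}" for r in requirements),
        "## Resume\n" + resume["markdown"],
    ]
    return "\n\n".join(parts)


# Stand-ins for two /v1/parse responses (one URL, one uploaded PDF).
job = {
    "content_type": "job_posting",
    "structured": {"requirements": ["5+ years Python", "Postgres"]},
    "markdown": "# Senior Backend Engineer\n**Company:** Acme",
}
resume = {"content_type": "resume", "markdown": "# Jane Doe\nBackend developer"}

prompt = build_tailoring_prompt(job, resume)
print(prompt)
```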
All via environment variables (see .env.example). Highlights:
| Var | Default | Notes |
|---|---|---|
| REPARSED_MODEL | qwen3:8b | Ollama model. CPU host? Use qwen3:4b / gemma3:4b. |
| OLLAMA_BASE_URL | http://192.168.1.10:11434 | Your external Ollama server. Set this. |
| SESSION_SECRET | (none) | Set this. Signs dashboard cookies. |
| SESSION_COOKIE_SECURE | false | true adds the cookie Secure flag (HTTPS-only). Keep false if reached over HTTP. |
| PUBLIC_BASE_URL | http://localhost:17177 | Shown in dashboard samples. |
| ALLOW_REGISTRATION | true | Set false to lock signups. |
| PLAYWRIGHT_ENABLED | true | JS-render fallback. |
| ENABLE_DOCLING | false | High-fidelity PDF (needs the optional install). |
| RATE_LIMIT_PER_MINUTE / _PER_DAY | 30 / 2000 | Per key. |
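For readers wiring these variables into their own tooling, the defaults in the table can be read with a small settings loader. This is a sketch, not the project's app/config.py: the Settings class, the accepted boolean spellings, and the expansion of "_PER_DAY" to RATE_LIMIT_PER_DAY are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    # Assumed truthy spellings; the real parser may accept others.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    model: str
    ollama_base_url: str
    session_cookie_secure: bool
    playwright_enabled: bool
    rate_limit_per_minute: int
    rate_limit_per_day: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the documented defaults."""
    return Settings(
        model=env.get("REPARSED_MODEL", "qwen3:8b"),
        ollama_base_url=env.get(
            "OLLAMA_BASE_URL", "http://192.168.1.10:11434"
        ).rstrip("/"),
        session_cookie_secure=_as_bool(env.get("SESSION_COOKIE_SECURE", "false")),
        playwright_enabled=_as_bool(env.get("PLAYWRIGHT_ENABLED", "true")),
        rate_limit_per_minute=int(env.get("RATE_LIMIT_PER_MINUTE", "30")),
        rate_limit_per_day=int(env.get("RATE_LIMIT_PER_DAY", "2000")),
    )


settings = load_settings({"REPARSED_MODEL": "qwen3:4b", "SESSION_COOKIE_SECURE": "true"})
print(settings)
```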
- API keys are random 256-bit secrets, shown once and stored only as a SHA-256 hash. One key per user; deleting and regenerating is the rotation path.
- Passwords are bcrypt-hashed. Dashboard auth uses signed, SameSite=Lax session cookies; state-changing dashboard calls require a CSRF token (double-submit). The cookie Secure flag is opt-in via SESSION_COOKIE_SECURE (enable it in HTTPS-only deployments).
- The rate limiter is in-memory (single instance). For multiple replicas, back it with Redis behind the same interface in app/ratelimit.py.
- Treat keys as server-side secrets: call the API server-to-server, not from a browser.
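The key-handling scheme in the first bullet (256-bit random secret, stored only as a SHA-256 hash) can be sketched with the standard library. The function names, the URL-safe encoding, and hashing the key together with its rp_live_ prefix are assumptions; app/security.py may do this differently.

```python
import hashlib
import hmac
import secrets

KEY_PREFIX = "rp_live_"


def hash_key(key: str) -> str:
    """SHA-256 hex digest of the full key; this is what gets stored."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def generate_api_key() -> tuple[str, str]:
    """Return (plaintext_key, stored_hash); show the plaintext once only."""
    secret = secrets.token_urlsafe(32)  # 32 random bytes = 256 bits
    key = KEY_PREFIX + secret
    return key, hash_key(key)


def verify_key(presented: str, stored_hash: str) -> bool:
    """Compare digests in constant time to avoid timing side channels."""
    return hmac.compare_digest(hash_key(presented), stored_hash)


key, stored = generate_api_key()
print(key[:8], len(stored), verify_key(key, stored))
```

A plain SHA-256 is reasonable here because the input is a high-entropy random secret; passwords, which are guessable, need a slow hash such as bcrypt instead.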
reparsed/
├── docker-compose.yml # db · api · caddy(profile) — Ollama is external
├── Caddyfile # TLS reverse proxy for reparsed.app
├── .env.example
└── api/
├── Dockerfile # python + playwright chromium
├── requirements.txt
├── app/
│ ├── main.py # app wiring, middleware, lifespan
│ ├── config.py # env settings
│ ├── db.py / models.py # SQLAlchemy (users, api_keys, usage_logs)
│ ├── security.py deps.py ratelimit.py
│ ├── content_types.py # the semantic registry
│ ├── schemas.py
│ ├── parsing/ # web.py · files.py · extract.py (Stage 1)
│ ├── llm/ # ollama_client.py · restructure.py (Stage 2)
│ └── routers/ # parse · auth · keys · pages
├── templates/ # landing · login · register · dashboard
└── static/ # css · js · logo
- Deeply dynamic content (e.g. YouTube comments that lazy-load on scroll) is extracted from the rendered DOM only; per-site adapters or official APIs do better. A YouTube oEmbed adapter ships as an example in parsing/web.py.
- Single-instance rate limiting and table auto-create (swap to Redis + Alembic for scale).
- Optional model routing (small model for classify, larger for long docs) and a ReaderLM-v2 extraction mode are natural next steps.
MIT.