Reparsed

Parse files & links into clean, structured, LLM-ready text.

Reparsed is a self-hosted API that takes a URL or a file, throws out the noise (ads, trackers, navigation, cookie banners, UI chrome, "you might also like"), figures out what kind of content it is, and returns a clean, structured JSON payload your downstream LLM can actually use.

It exists because local inference engines like Ollama and LocalAI can't fetch a URL or read a PDF on their own. Reparsed is the authenticated pre-processor you put in front of them.

┌─ messy input ──────────────┐         ┌─ clean output ─────────────────────┐
│ nav · ads · cookie banner   │         │ { "content_type": "video",          │
│ TITLE · 142K views          │ ──▶     │   "title": "Building a Rust server",│
│ Subscribe! Share! tracking  │ reparsed│   "structured": { "channel": … },   │
│ description · 1.2k comments │         │   "markdown": "# Building a Rust…" }│
└─────────────────────────────┘         └─────────────────────────────────────┘

Why these design choices (the research)

Concern	Decision	Why
Similar tech	Conceptually Jina Reader / Firecrawl, plus a semantic layer	Those tools do `HTML→markdown`. Reparsed additionally classifies the content type and keeps only the important fields, which is what makes downstream prompting cheap and reliable.
HTML main-content extraction	Trafilatura (Apache 2.0)	Highest F1 (~0.958) among open-source extractors, beating Mozilla Readability; fast, deterministic, no GPU. (benchmark)
JS-heavy pages (e.g. video)	Playwright headless Chromium fallback	Static fetch misses JS-rendered content; we render only when static extraction is thin.
Binary files (PDF/DOCX/…)	MarkItDown (MIT) default, Docling (MIT) optional	MarkItDown is fast, light, and emits LLM-friendly markdown across PDF/DOCX/PPTX/XLSX/images. Docling wins on complex tables (~98% accuracy) but is heavy (~1GB), so it's opt-in.
Which model	A single general instruction model (default `qwen3:8b`), configurable	The Qwen3 line is the most reliable for JSON-schema-constrained output in Ollama. One model handles classify + restructure. Swap via `REPARSED_MODEL`.
Output format	Markdown body inside a JSON envelope	Research is consistent that markdown is the best LLM-ingestion format (10–20% fewer tokens than HTML, cleaner structure). The JSON envelope gives code the typed fields; the `markdown` field gives the next model clean prose. (why)
Structured output reliability	Ollama's `format` = JSON Schema	Constrains decoding so the response is always valid JSON. (docs)
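
As a rough illustration of the last row, a request to Ollama's `/api/chat` endpoint can carry a JSON Schema in its `format` field. The schema and the `build_chat_payload` helper below are assumptions made up for this example; the schema Reparsed actually sends in Stage 2 may look different.

```python
import json

# Hypothetical envelope schema (an assumption, not Reparsed's real one).
ENVELOPE_SCHEMA = {
    "type": "object",
    "properties": {
        "content_type": {"type": "string"},
        "title": {"type": "string"},
        "structured": {"type": "object"},
        "markdown": {"type": "string"},
    },
    "required": ["content_type", "title", "markdown"],
}


def build_chat_payload(model: str, extracted_text: str) -> dict:
    """Build the JSON body for POST {OLLAMA_BASE_URL}/api/chat."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": f"Classify and restructure:\n\n{extracted_text}",
            }
        ],
        "format": ENVELOPE_SCHEMA,  # JSON Schema that constrains decoding
        "stream": False,  # ask for one complete response instead of chunks
    }


payload = build_chat_payload("qwen3:8b", "Senior Backend Engineer at Acme ...")
print(json.dumps(payload, indent=2))
```

The body would be POSTed to the Ollama host with any HTTP client; the prompt wording here is a placeholder.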

A purpose-built model worth knowing about: Jina's ReaderLM-v2 (1.5B, available on Ollama) does HTML→markdown/JSON conversion specifically. We use Trafilatura for that step instead because it's deterministic, free, and needs no GPU — but ReaderLM-v2 is a natural drop-in if you want a model-based extractor later.

Architecture

client / resubmit.ai ──(Bearer rp_live_…)──▶ Caddy (TLS, reparsed.app)
                                                  │
                                          FastAPI :17177
                                          ├─ /v1/parse        (API-key auth)
                                          ├─ dashboard + auth  (session cookies)
                                          │
                                          │  Stage 1 — deterministic extract
                                          │    url  → httpx → Trafilatura
                                          │           └ Playwright fallback (JS)
                                          │    file → MarkItDown (→ Docling opt.)
                                          │  Stage 2 — LLM classify + restructure
                                          │    → Ollama, external GPU host (JSON-schema)
                                          ▼
                       Ollama (your GPU server)      Postgres (users, keys, usage)

Two stages keep it cheap and robust: deterministic extraction removes ~90% of the junk for free, so the model only does the semantic part (detect type, keep the important fields, respect the output budget — e.g. "top 10 comments if they fit").
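
The control flow of the two stages can be sketched as follows. This is a minimal sketch, not the code in `parsing/` or `llm/`: the function names, the `MIN_STATIC_CHARS` threshold for "thin" extraction, and the injected callables are all assumptions chosen for illustration.

```python
from typing import Callable

MIN_STATIC_CHARS = 500  # hypothetical cutoff for "thin" static extraction


def extract_url(
    url: str,
    static_extract: Callable[[str], str],
    rendered_extract: Callable[[str], str],
    min_chars: int = MIN_STATIC_CHARS,
) -> tuple[str, str]:
    """Stage 1: try the cheap static path, render only if the result is thin.

    Returns (text, extractor_name).
    """
    text = static_extract(url).strip()
    if len(text) >= min_chars:
        return text, "static"
    rendered = rendered_extract(url).strip()
    # Keep whichever path recovered more text.
    if len(rendered) > len(text):
        return rendered, "rendered"
    return text, "static"


def parse(
    url: str,
    static_extract: Callable[[str], str],
    rendered_extract: Callable[[str], str],
    restructure: Callable[[str], dict],
) -> dict:
    """Stage 1 followed by Stage 2; `restructure` stands in for the LLM call."""
    text, extractor = extract_url(url, static_extract, rendered_extract)
    result = restructure(text)
    result["meta"] = {"extractor": extractor, "input_chars": len(text)}
    return result
```

Passing the extractors in as callables keeps the sketch testable without a network or a browser; in a real deployment they would wrap httpx plus Trafilatura and the Playwright fallback.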

Prefer not to run it? Use the hosted version at reparsed.app — sign up, grab a key, and skip the infrastructure. This repo is for self-hosting.

Quick start

Requirements: Docker + Docker Compose, and an Ollama server reachable on your network — Reparsed talks to it over HTTP (the GPU lives there, not in this stack).

git clone <this repo> reparsed && cd reparsed
cp .env.example .env
#  → edit .env: set SESSION_SECRET, POSTGRES_PASSWORD, and OLLAMA_BASE_URL

# Make sure your model is pulled on the Ollama server:
#   ollama pull qwen3:8b

docker compose up -d

Set OLLAMA_BASE_URL to your Ollama host (e.g. http://192.168.1.10:11434). Once the stack is running, the API is available at http://localhost:17177.

curl http://localhost:17177/v1/health
# {"status":"ok","model":"qwen3:8b","model_ready":true}

Open http://localhost:17177, register an account, and generate your API key in the dashboard.

Production (reparsed.app)

Point DNS at the host, then start with the TLS proxy profile:

REPARSED_DOMAIN=reparsed.app PUBLIC_BASE_URL=https://reparsed.app SESSION_COOKIE_SECURE=true \
  docker compose --profile proxy up -d

Caddy fetches a Let's Encrypt certificate automatically and proxies :443 → api:17177. Set SESSION_COOKIE_SECURE=true only when every client reaches the app over HTTPS; leave it false (the default) if you also hit it over plain HTTP on the LAN, or the session cookie is dropped and logins silently fail.

API reference

`POST /v1/parse`

Auth: Authorization: Bearer rp_live_… (or X-API-Key: rp_live_…).

Provide exactly one source. JSON body for url/text/html, or multipart/form-data for a file:

Field	Where	Description
`url`	JSON	Page or file URL to fetch and parse.
`text`	JSON	Raw text to clean and structure.
`html`	JSON	Raw HTML to extract from.
`file`	multipart	Uploaded file (PDF/DOCX/PPTX/XLSX/image/…).
`max_output_chars`	both	Cap on the clean text (default `8000`, max `32000`).
`format`	both	`both` (default), `markdown`, or `json`.
`render`	both	`auto` (default), `always`, or `never` (browser rendering).
`content_hint`	both	Optional type hint, e.g. `"job_posting"`.
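
A client can check the rules in this table before sending anything. The helper below is a hypothetical client-side sketch for the JSON variant only; whether the server rejects or clamps an out-of-range `max_output_chars` is not stated here, so the sketch simply refuses it.

```python
from typing import Optional

DEFAULT_MAX_OUTPUT_CHARS = 8000
MAX_OUTPUT_CHARS = 32000


def build_parse_body(
    url: Optional[str] = None,
    text: Optional[str] = None,
    html: Optional[str] = None,
    max_output_chars: int = DEFAULT_MAX_OUTPUT_CHARS,
    fmt: str = "both",
    render: str = "auto",
    content_hint: Optional[str] = None,
) -> dict:
    """Validate the documented fields and return a JSON body for /v1/parse."""
    sources = {"url": url, "text": text, "html": html}
    given = {name: value for name, value in sources.items() if value is not None}
    if len(given) != 1:
        raise ValueError("provide exactly one of url, text, html")
    if fmt not in {"both", "markdown", "json"}:
        raise ValueError(f"unknown format: {fmt!r}")
    if render not in {"auto", "always", "never"}:
        raise ValueError(f"unknown render mode: {render!r}")
    if not 1 <= max_output_chars <= MAX_OUTPUT_CHARS:
        raise ValueError(f"max_output_chars must be 1..{MAX_OUTPUT_CHARS}")

    body = {
        **given,
        "max_output_chars": max_output_chars,
        "format": fmt,
        "render": render,
    }
    if content_hint:
        body["content_hint"] = content_hint
    return body
```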

Examples

# Parse a link
curl -X POST http://localhost:17177/v1/parse \
  -H "Authorization: Bearer rp_live_…" -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com/item?id=1"}'

# Upload a file
curl -X POST http://localhost:17177/v1/parse \
  -H "Authorization: Bearer rp_live_…" \
  -F "file=@resume.pdf" -F "content_hint=resume"

# Clean raw text
curl -X POST http://localhost:17177/v1/parse \
  -H "Authorization: Bearer rp_live_…" -H "Content-Type: application/json" \
  -d '{"text": "....", "max_output_chars": 4000}'

Response

{
  "content_type": "job_posting",
  "title": "Senior Backend Engineer",
  "language": "en",
  "structured": {
    "company": "Acme",
    "location": "Remote (EU)",
    "salary_range": "€70k–€90k",
    "requirements": ["5+ years Python", "Postgres", "Kubernetes"],
    "benefits": ["Equity", "30 days PTO"]
  },
  "markdown": "# Senior Backend Engineer\n**Company:** Acme  …",
  "meta": {
    "source": "url", "url": "https://…", "filename": null,
    "language": "en", "extractor": "trafilatura", "model": "qwen3:8b",
    "input_chars": 18342, "output_chars": 1204,
    "input_truncated": false, "output_truncated": false, "latency_ms": 2310
  }
}
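
On the consuming side, the envelope can be loaded into a small typed wrapper. The `ParseResult` class below is a hypothetical client helper, not something shipped in this repo; it assumes the field names shown in the response above and falls back to empty values for fields that are absent.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParseResult:
    content_type: str
    title: str
    markdown: str
    structured: dict[str, Any] = field(default_factory=dict)
    meta: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_json(cls, raw: str) -> "ParseResult":
        """Build a result from the raw response body."""
        data = json.loads(raw)
        return cls(
            content_type=data.get("content_type", "generic"),
            title=data.get("title", ""),
            markdown=data.get("markdown", ""),
            structured=data.get("structured") or {},
            meta=data.get("meta") or {},
        )

    @property
    def truncated(self) -> bool:
        """True if either the input or the output was cut to fit a limit."""
        return bool(
            self.meta.get("input_truncated") or self.meta.get("output_truncated")
        )
```

Checking `truncated` before prompting is a cheap way to notice that `max_output_chars` was too small for the page.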

Other endpoints

GET /v1/content-types — the catalog of types Reparsed detects.
GET /v1/health — backend + model readiness.
GET /api-docs — interactive OpenAPI (Swagger) docs.

Errors

400 bad input · 401 missing/invalid key · 413 file too large · 422 nothing extractable / unsupported · 429 rate limited · 502/504 fetch or model failure. If the model is unreachable, Reparsed degrades gracefully and returns the deterministic extraction as content_type: "generic" rather than failing.
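
One way a client might act on these codes is to retry only the transient ones (429, 502, 504) with exponential backoff and fail fast on the rest. This is a sketch under that assumption; the helper names, the attempt count, and the delays are illustrative and not part of the API.

```python
import time
from typing import Callable

RETRYABLE = {429, 502, 504}  # rate limited, or fetch/model failure


def should_retry(status: int) -> bool:
    return status in RETRYABLE


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * 2**attempt)


def call_with_retry(
    send: Callable[[], tuple[int, dict]],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `send` until it succeeds, a non-retryable status arrives,
    or attempts run out. `send` returns (http_status, json_body)."""
    for attempt in range(max_attempts):
        status, body = send()
        if status < 400:
            return body
        if not should_retry(status) or attempt == max_attempts - 1:
            raise RuntimeError(f"parse failed with HTTP {status}")
        sleep(backoff_delay(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```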

Limits (defaults, all configurable)

Limit	Default
Max upload	25 MB
Max fetched HTML	10 MB
Text sent to model	~96k chars (~24k tokens)
Default / max output	8k / 32k chars
Rate limit	30/min · 2000/day per key
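
The per-key rate limit in the last row can be pictured as two sliding windows checked together. The class below is a simplified in-memory sketch written for this README; the real limiter lives in `app/ratelimit.py` and its interface may differ.

```python
import time
from collections import defaultdict, deque
from typing import Callable

DAY = 86400.0
MINUTE = 60.0


class SlidingWindowLimiter:
    """Allow at most `per_minute` and `per_day` accepted requests per key."""

    def __init__(
        self,
        per_minute: int = 30,
        per_day: int = 2000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limits = ((MINUTE, per_minute), (DAY, per_day))
        self.clock = clock
        self.hits: dict[str, deque] = defaultdict(deque)  # key -> timestamps

    def allow(self, key: str) -> bool:
        now = self.clock()
        stamps = self.hits[key]
        # Forget requests older than the longest window.
        while stamps and now - stamps[0] >= DAY:
            stamps.popleft()
        for window, limit in self.limits:
            recent = sum(1 for t in stamps if now - t < window)
            if recent >= limit:
                return False  # rejected requests are not recorded
        stamps.append(now)
        return True
```

Because the state is a plain dictionary, this only works inside a single process, which is the same limitation noted under Security notes.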

Detected content types

article · video · job_posting · product · resume · forum_thread · documentation · recipe · event · search_results · social_post · generic

Each type defines what "important" means (see app/content_types.py). A video keeps the description, recommended videos and top comments; a job posting keeps company/salary/requirements; a recipe drops the life story. Unknown content falls back to generic.
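
To make the idea of a per-type definition of "important" concrete, here is a toy registry. It is not the contents of `app/content_types.py`: the `ContentType` class and the field names are assumptions derived from the examples in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentType:
    name: str
    keep: tuple[str, ...]  # fields considered important for this type


# Illustrative entries only; the real registry covers every detected type.
REGISTRY = {
    ct.name: ct
    for ct in (
        ContentType("video", ("description", "recommended_videos", "top_comments")),
        ContentType("job_posting", ("company", "salary_range", "requirements")),
        ContentType("generic", ()),
    )
}


def lookup(name: str) -> ContentType:
    """Unknown type names fall back to the generic entry."""
    return REGISTRY.get(name, REGISTRY["generic"])


def keep_important(name: str, fields: dict) -> dict:
    """Drop every field the content type does not list as important."""
    content_type = lookup(name)
    if not content_type.keep:
        return dict(fields)  # generic: nothing to filter on
    return {k: v for k, v in fields.items() if k in content_type.keep}
```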

Integrating with resubmit.ai (and any LLM stack)

Reparsed is designed to be chained. The pattern:

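import ollama
import requests

KEY = "rp_live_…"
job_url = "https://example.com/jobs/123"  # placeholder: the posting to parse

# 1. Reparsed turns a messy job-post URL into clean, typed fields
resp = requests.post(
    "https://reparsed.app/v1/parse",
    headers={"Authorization": f"Bearer {KEY}"},
    json={"url": job_url, "content_hint": "job_posting"},
    timeout=60,
)
resp.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
clean = resp.json()

# 2. Feed the clean text to your own model
reply = ollama.chat(model="llama3", messages=[
    {"role": "user",
     "content": f"Tailor my resume to this role:\n\n{clean['markdown']}"}
])
print(reply["message"]["content"])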

For resubmit.ai specifically, the job_posting and resume content types map directly onto resume-tailoring: parse the job URL and the user's uploaded resume PDF through the same endpoint, then hand both clean markdown blocks (and the typed structured fields) to the tailoring model. One authenticated POST per artifact, no scraping or PDF parsing to maintain on resubmit.ai's side.
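
Combining the two parse results might look like the sketch below. `build_tailoring_prompt` and the prompt wording are hypothetical; the only things taken from the API are the `markdown` field and the `structured.requirements` list shown in the sample response.

```python
def build_tailoring_prompt(job: dict, resume: dict) -> str:
    """Merge two /v1/parse results into a single prompt for the tailoring model."""
    requirements = job.get("structured", {}).get("requirements", [])
    requirement_lines = "\n".join(f"- {item}" for item in requirements)
    return (
        "Tailor the resume to the job posting.\n\n"
        f"## Key requirements\n{requirement_lines}\n\n"
        f"## Job posting\n{job['markdown']}\n\n"
        f"## Resume\n{resume['markdown']}"
    )


# Stand-ins for two real responses (one URL parse, one file upload).
job = {
    "markdown": "# Senior Backend Engineer",
    "structured": {"requirements": ["5+ years Python", "Postgres"]},
}
resume = {"markdown": "# Jane Doe\nBackend developer", "structured": {}}
print(build_tailoring_prompt(job, resume))
```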

Configuration

All via environment variables (see .env.example). Highlights:

Var	Default	Notes
`REPARSED_MODEL`	`qwen3:8b`	Ollama model. CPU host? Use `qwen3:4b` / `gemma3:4b`.
`OLLAMA_BASE_URL`	`http://192.168.1.10:11434`	Your external Ollama server. Set this.
`SESSION_SECRET`	—	Set this. Signs dashboard cookies.
`SESSION_COOKIE_SECURE`	`false`	`true` adds the cookie `Secure` flag (HTTPS-only). Keep `false` if reached over HTTP.
`PUBLIC_BASE_URL`	`http://localhost:17177`	Shown in dashboard samples.
`ALLOW_REGISTRATION`	`true`	Set `false` to lock signups.
`PLAYWRIGHT_ENABLED`	`true`	JS-render fallback.
`ENABLE_DOCLING`	`false`	High-fidelity PDF (needs the optional install).
`RATE_LIMIT_PER_MINUTE` / `_PER_DAY`	`30` / `2000`	Per key.
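
For readers wiring this into their own tooling, the table maps naturally onto a small settings object. This is a hypothetical sketch and not the contents of `app/config.py`; it assumes the abbreviated `_PER_DAY` entry expands to `RATE_LIMIT_PER_DAY` and covers only a few of the variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    model: str
    ollama_base_url: str
    session_cookie_secure: bool
    rate_limit_per_minute: int
    rate_limit_per_day: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Read settings from `env`, using the defaults from the table."""
        return cls(
            model=env.get("REPARSED_MODEL", "qwen3:8b"),
            ollama_base_url=env.get(
                "OLLAMA_BASE_URL", "http://192.168.1.10:11434"
            ).rstrip("/"),
            session_cookie_secure=_as_bool(env.get("SESSION_COOKIE_SECURE", "false")),
            rate_limit_per_minute=int(env.get("RATE_LIMIT_PER_MINUTE", "30")),
            rate_limit_per_day=int(env.get("RATE_LIMIT_PER_DAY", "2000")),
        )
```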

Security notes

API keys are random 256-bit secrets, shown once and stored only as a SHA-256 hash. One key per user; deleting and regenerating is the rotation path.
Passwords are bcrypt-hashed. Dashboard auth uses signed, SameSite=Lax session cookies; state-changing dashboard calls require a CSRF token (double-submit). The cookie Secure flag is opt-in via SESSION_COOKIE_SECURE (enable it in HTTPS-only deployments).
The rate limiter is in-memory (single instance). For multiple replicas, back it with Redis behind the same interface in app/ratelimit.py.
Treat keys as server-side secrets — call the API server-to-server, not from a browser.
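
The key-handling scheme described in the first note (256-bit random secret, shown once, stored as a SHA-256 hash) can be sketched with the standard library alone. The function names are hypothetical, and hashing the full key including the `rp_live_` prefix is an assumption; the actual code in `app/security.py` may differ.

```python
import hashlib
import hmac
import secrets

KEY_PREFIX = "rp_live_"


def hash_key(key: str) -> str:
    """SHA-256 hex digest of the key; this is the only value stored."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def generate_api_key() -> tuple[str, str]:
    """Return (plaintext_key, stored_hash). Show the plaintext exactly once."""
    key = KEY_PREFIX + secrets.token_urlsafe(32)  # 32 random bytes = 256 bits
    return key, hash_key(key)


def verify_key(presented: str, stored_hash: str) -> bool:
    """Constant-time comparison of the presented key's hash with the stored one."""
    return hmac.compare_digest(hash_key(presented), stored_hash)
```

A fast hash is reasonable here because the key itself is high-entropy; passwords, by contrast, go through bcrypt as noted above.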

Project layout

reparsed/
├── docker-compose.yml          # db · api · caddy(profile) — Ollama is external
├── Caddyfile                   # TLS reverse proxy for reparsed.app
├── .env.example
└── api/
    ├── Dockerfile              # python + playwright chromium
    ├── requirements.txt
    ├── app/
    │   ├── main.py             # app wiring, middleware, lifespan
    │   ├── config.py           # env settings
    │   ├── db.py / models.py   # SQLAlchemy (users, api_keys, usage_logs)
    │   ├── security.py deps.py ratelimit.py
    │   ├── content_types.py    # the semantic registry
    │   ├── schemas.py
    │   ├── parsing/            # web.py · files.py · extract.py  (Stage 1)
    │   ├── llm/                # ollama_client.py · restructure.py (Stage 2)
    │   └── routers/            # parse · auth · keys · pages
    ├── templates/              # landing · login · register · dashboard
    └── static/                 # css · js · logo

Roadmap / known limits

Deeply dynamic content (e.g. YouTube comments that lazy-load on scroll) is extracted from the rendered DOM only; per-site adapters or official APIs do better. A YouTube oEmbed adapter ships as an example in parsing/web.py.
Rate limiting is single-instance and database tables are auto-created rather than migrated (swap to Redis + Alembic for scale).
Optional model routing (small model for classify, larger for long docs) and a ReaderLM-v2 extraction mode are natural next steps.

License

MIT.
