aavramch/reparsed

Reparsed

Parse files & links into clean, structured, LLM-ready text.

Reparsed is a self-hosted API that takes a URL or a file, throws out the noise (ads, trackers, navigation, cookie banners, UI chrome, "you might also like"), figures out what kind of content it is, and returns a clean, structured JSON payload your downstream LLM can actually use.

It exists because local inference engines like Ollama and LocalAI can't fetch a URL or read a PDF on their own. Reparsed is the authenticated pre-processor you put in front of them.

┌─ messy input ──────────────┐         ┌─ clean output ─────────────────────┐
│ nav · ads · cookie banner   │         │ { "content_type": "video",          │
│ TITLE · 142K views          │ ──▶     │   "title": "Building a Rust server",│
│ Subscribe! Share! tracking  │ reparsed│   "structured": { "channel": … },   │
│ description · 1.2k comments │         │   "markdown": "# Building a Rust…" }│
└─────────────────────────────┘         └─────────────────────────────────────┘

Why these design choices (the research)

| Concern | Decision | Why |
| --- | --- | --- |
| Similar tech | Conceptually Jina Reader / Firecrawl, plus a semantic layer | Those tools do HTML→markdown. Reparsed additionally classifies the content type and keeps only the important fields, which is what makes downstream prompting cheap and reliable. |
| HTML main-content extraction | Trafilatura (Apache 2.0) | Highest F1 (~0.958) among open-source extractors, beating Mozilla Readability; fast, deterministic, no GPU. (benchmark) |
| JS-heavy pages (e.g. video) | Playwright headless Chromium fallback | Static fetch misses JS-rendered content; we render only when static extraction is thin. |
| Binary files (PDF/DOCX/…) | MarkItDown (MIT) default, Docling (MIT) optional | MarkItDown is fast, light, and emits LLM-friendly markdown across PDF/DOCX/PPTX/XLSX/images. Docling wins on complex tables (~98% accuracy) but is heavy (~1GB), so it's opt-in. |
| Which model | A single general instruction model (default qwen3:8b), configurable | The Qwen3 line is the most reliable for JSON-schema-constrained output in Ollama. One model handles classify + restructure. Swap via REPARSED_MODEL. |
| Output format | Markdown body inside a JSON envelope | Research is consistent that markdown is the best LLM-ingestion format (10–20% fewer tokens than HTML, cleaner structure). The JSON envelope gives code the typed fields; the markdown field gives the next model clean prose. (why) |
| Structured output reliability | Ollama's format = JSON Schema | Constrains decoding so the response is always valid JSON. (docs) |
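
To make the schema-constrained decoding decision concrete, here is a minimal sketch of a request body for Ollama's /api/chat endpoint with a JSON Schema in the `format` field. The schema mirrors the top-level fields of the /v1/parse response shown in this README; the prompt wording and the names `ENVELOPE_SCHEMA` and `build_chat_payload` are illustrative assumptions, not Reparsed's actual code.

```python
import json

# Hypothetical schema: mirrors the top-level fields of the /v1/parse response.
ENVELOPE_SCHEMA = {
    "type": "object",
    "properties": {
        "content_type": {"type": "string"},
        "title": {"type": "string"},
        "structured": {"type": "object"},
        "markdown": {"type": "string"},
    },
    "required": ["content_type", "title", "structured", "markdown"],
}


def build_chat_payload(extracted_text: str, model: str = "qwen3:8b") -> dict:
    """Build a body for POST {OLLAMA_BASE_URL}/api/chat with a schema in `format`."""
    return {
        "model": model,
        "stream": False,            # ask for one complete JSON response
        "format": ENVELOPE_SCHEMA,  # Ollama constrains decoding to this schema
        "messages": [
            # Assumed prompt wording, for illustration only.
            {"role": "system", "content": "Classify the content and restructure it."},
            {"role": "user", "content": extracted_text},
        ],
    }


payload = build_chat_payload("Senior Backend Engineer at Acme ...")
print(json.dumps(payload, indent=2))
```

The model's reply then arrives as a JSON string in the response's message content, which the caller still has to parse with `json.loads`.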

A purpose-built model worth knowing about: Jina's ReaderLM-v2 (1.5B, available on Ollama) does HTML→markdown/JSON conversion specifically. We use Trafilatura for that step instead because it's deterministic, free, and needs no GPU — but ReaderLM-v2 is a natural drop-in if you want a model-based extractor later.


Architecture

client / resubmit.ai ──(Bearer rp_live_…)──▶ Caddy (TLS, reparsed.app)
                                                  │
                                          FastAPI :17177
                                          ├─ /v1/parse        (API-key auth)
                                          ├─ dashboard + auth  (session cookies)
                                          │
                                          │  Stage 1 — deterministic extract
                                          │    url  → httpx → Trafilatura
                                          │           └ Playwright fallback (JS)
                                          │    file → MarkItDown (→ Docling opt.)
                                          │  Stage 2 — LLM classify + restructure
                                          │    → Ollama, external GPU host (JSON-schema)
                                          ▼
                       Ollama (your GPU server)      Postgres (users, keys, usage)

Two stages keep it cheap and robust: deterministic extraction removes ~90% of the junk for free, so the model only does the semantic part (detect type, keep the important fields, respect the output budget — e.g. "top 10 comments if they fit").
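
The "render only when static extraction is thin" rule in Stage 1 can be sketched as below. This is a simplified illustration, not the code in parsing/: the 500-character threshold, the function names, and the "playwright" extractor label are assumptions, and the two extractors are passed in as plain callables so the sketch runs without a network or a browser.

```python
from typing import Callable, Optional, Tuple

MIN_STATIC_CHARS = 500  # assumed cut-off for "thin" static extraction


def is_thin(text: Optional[str], min_chars: int = MIN_STATIC_CHARS) -> bool:
    """True when static extraction produced too little text to trust."""
    return text is None or len(text.strip()) < min_chars


def stage1_extract(
    url: str,
    static_extract: Callable[[str], Optional[str]],
    rendered_extract: Callable[[str], str],
    render: str = "auto",
) -> Tuple[str, str]:
    """Return (text, extractor_name) following the auto/always/never render modes."""
    if render == "always":
        return rendered_extract(url), "playwright"
    text = static_extract(url)
    if render == "auto" and is_thin(text):
        # Static fetch came back thin, so fall back to the headless browser.
        return rendered_extract(url), "playwright"
    return text or "", "trafilatura"


# Stub extractors stand in for httpx + Trafilatura and for Playwright.
article = stage1_extract("https://example.com/a", lambda u: "word " * 400, lambda u: "rendered")
spa = stage1_extract("https://example.com/b", lambda u: "Loading...", lambda u: "rendered body")
print(article[1], spa[1])
```

Keeping the browser behind this check means most pages never pay the rendering cost.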


Prefer not to run it? Use the hosted version at reparsed.app — sign up, grab a key, and skip the infrastructure. This repo is for self-hosting.

Quick start

Requirements: Docker + Docker Compose, and an Ollama server reachable on your network — Reparsed talks to it over HTTP (the GPU lives there, not in this stack).

git clone <this repo> reparsed && cd reparsed
cp .env.example .env
#  → edit .env: set SESSION_SECRET, POSTGRES_PASSWORD, and OLLAMA_BASE_URL

# Make sure your model is pulled on the Ollama server:
#   ollama pull qwen3:8b

docker compose up -d

Set OLLAMA_BASE_URL to your Ollama host (e.g. http://192.168.1.10:11434). Once the stack is running, the API listens on http://localhost:17177.

curl http://localhost:17177/v1/health
# {"status":"ok","model":"qwen3:8b","model_ready":true}

Open http://localhost:17177, register an account, and generate your API key in the dashboard.

Production (reparsed.app)

Point DNS at the host, then start with the TLS proxy profile:

REPARSED_DOMAIN=reparsed.app PUBLIC_BASE_URL=https://reparsed.app SESSION_COOKIE_SECURE=true \
  docker compose --profile proxy up -d

Caddy fetches a Let's Encrypt certificate automatically and proxies :443 → api:17177. Set SESSION_COOKIE_SECURE=true only when every client reaches the app over HTTPS; leave it false (the default) if you also hit it over plain HTTP on the LAN, or the session cookie is dropped and logins silently fail.


API reference

POST /v1/parse

Auth: Authorization: Bearer rp_live_… (or X-API-Key: rp_live_…).

Provide exactly one source. JSON body for url/text/html, or multipart/form-data for a file:

| Field | Where | Description |
| --- | --- | --- |
| url | JSON | Page or file URL to fetch and parse. |
| text | JSON | Raw text to clean and structure. |
| html | JSON | Raw HTML to extract from. |
| file | multipart | Uploaded file (PDF/DOCX/PPTX/XLSX/image/…). |
| max_output_chars | both | Cap on the clean text (default 8000, max 32000). |
| format | both | both (default), markdown, or json. |
| render | both | auto (default), always, or never (browser rendering). |
| content_hint | both | Optional type hint, e.g. "job_posting". |
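
A client can check the "exactly one source" rule and the documented bounds before sending anything. The helper below is a sketch for the JSON-body case only (url/text/html); `build_parse_body` and its keyword names are hypothetical, and the server remains the authority on what it accepts.

```python
from typing import Any, Dict, Optional

MAX_OUTPUT_CHARS = 32000  # documented maximum
FORMATS = ("both", "markdown", "json")
RENDER_MODES = ("auto", "always", "never")


def build_parse_body(
    *,
    url: Optional[str] = None,
    text: Optional[str] = None,
    html: Optional[str] = None,
    max_output_chars: int = 8000,
    output_format: str = "both",
    render: str = "auto",
    content_hint: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate arguments and return a JSON body for POST /v1/parse."""
    sources = {k: v for k, v in (("url", url), ("text", text), ("html", html)) if v is not None}
    if len(sources) != 1:
        raise ValueError("provide exactly one of url, text, or html")
    if not 1 <= max_output_chars <= MAX_OUTPUT_CHARS:
        raise ValueError(f"max_output_chars must be between 1 and {MAX_OUTPUT_CHARS}")
    if output_format not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}")
    if render not in RENDER_MODES:
        raise ValueError(f"render must be one of {RENDER_MODES}")

    body: Dict[str, Any] = dict(sources)
    body.update(max_output_chars=max_output_chars, format=output_format, render=render)
    if content_hint:
        body["content_hint"] = content_hint
    return body


body = build_parse_body(url="https://example.com/jobs/123", content_hint="job_posting")
print(body)
```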

Examples

# Parse a link
curl -X POST http://localhost:17177/v1/parse \
  -H "Authorization: Bearer rp_live_…" -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com/item?id=1"}'

# Upload a file
curl -X POST http://localhost:17177/v1/parse \
  -H "Authorization: Bearer rp_live_…" \
  -F "file=@resume.pdf" -F "content_hint=resume"

# Clean raw text
curl -X POST http://localhost:17177/v1/parse \
  -H "Authorization: Bearer rp_live_…" -H "Content-Type: application/json" \
  -d '{"text": "....", "max_output_chars": 4000}'

Response

{
  "content_type": "job_posting",
  "title": "Senior Backend Engineer",
  "language": "en",
  "structured": {
    "company": "Acme",
    "location": "Remote (EU)",
    "salary_range": "€70k–€90k",
    "requirements": ["5+ years Python", "Postgres", "Kubernetes"],
    "benefits": ["Equity", "30 days PTO"]
  },
  "markdown": "# Senior Backend Engineer\n**Company:** Acme  …",
  "meta": {
    "source": "url", "url": "https://…", "filename": null,
    "language": "en", "extractor": "trafilatura", "model": "qwen3:8b",
    "input_chars": 18342, "output_chars": 1204,
    "input_truncated": false, "output_truncated": false, "latency_ms": 2310
  }
}
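
On the consuming side, the envelope can be loaded into a small typed wrapper so downstream code does not index raw dictionaries. This is a sketch: `ParseResult` is a hypothetical client-side class, and the defensive defaults assume that some fields may be absent (for example when format is markdown or json), which this README does not spell out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ParseResult:
    content_type: str
    title: str
    markdown: str
    structured: Dict[str, Any] = field(default_factory=dict)
    meta: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ParseResult":
        """Build a result from the decoded JSON body, tolerating missing fields."""
        return cls(
            content_type=data.get("content_type", "generic"),
            title=data.get("title", ""),
            markdown=data.get("markdown", ""),
            structured=data.get("structured") or {},
            meta=data.get("meta") or {},
        )

    @property
    def was_truncated(self) -> bool:
        """True if either the input or the output hit a size cap."""
        return bool(self.meta.get("input_truncated") or self.meta.get("output_truncated"))


sample = {
    "content_type": "job_posting",
    "title": "Senior Backend Engineer",
    "structured": {"company": "Acme", "requirements": ["5+ years Python", "Postgres"]},
    "markdown": "# Senior Backend Engineer",
    "meta": {"input_truncated": False, "output_truncated": False, "latency_ms": 2310},
}
result = ParseResult.from_dict(sample)
print(result.content_type, result.structured["company"], result.was_truncated)
```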

Other endpoints

  • GET /v1/content-types — the catalog of types Reparsed detects.
  • GET /v1/health — backend + model readiness.
  • GET /api-docs — interactive OpenAPI (Swagger) docs.

Errors

400 bad input · 401 missing/invalid key · 413 file too large · 422 nothing extractable / unsupported · 429 rate limited · 502/504 fetch or model failure. If the model is unreachable, Reparsed degrades gracefully and returns the deterministic extraction as content_type: "generic" rather than failing.
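
One way a client might act on these status codes is sketched below. The retry policy (retry 429/502/504 with capped exponential backoff, never retry other 4xx codes) is an assumption on the client's side, not behaviour this README prescribes, and all names are hypothetical.

```python
from typing import List

# Meanings taken from the error list above.
STATUS_MEANING = {
    400: "bad input",
    401: "missing/invalid key",
    413: "file too large",
    422: "nothing extractable / unsupported",
    429: "rate limited",
    502: "fetch or model failure",
    504: "fetch or model failure",
}

# Assumed policy: only transient conditions are worth retrying.
RETRYABLE = frozenset({429, 502, 504})


def should_retry(status: int) -> bool:
    """True for statuses that may succeed if the same request is sent again later."""
    return status in RETRYABLE


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


for status in (401, 429, 504):
    print(status, STATUS_MEANING[status], "retry" if should_retry(status) else "fix request")
print(backoff_delays(4))
```

Because an unreachable model yields a normal response with content_type "generic", a caller that needs the structured fields may also want to check the content type rather than rely on the status code alone.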

Limits (defaults, all configurable)

| Limit | Default |
| --- | --- |
| Max upload | 25 MB |
| Max fetched HTML | 10 MB |
| Text sent to model | ~96k chars (~24k tokens) |
| Default / max output | 8k / 32k chars |
| Rate limit | 30/min · 2000/day per key |
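
The per-key rate limit can be pictured as two sliding windows checked together. The class below is a minimal in-memory sketch under that assumption; it is not the code in app/ratelimit.py, and the class name and injectable clock are illustrative choices.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict

DAY_SECONDS = 86400.0


class SlidingWindowLimiter:
    """Allow at most `per_minute` and `per_day` requests per API key."""

    def __init__(self, per_minute: int = 30, per_day: int = 2000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.windows = ((60.0, per_minute), (DAY_SECONDS, per_day))
        self.clock = clock
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self.hits[key]
        # Forget requests older than the longest window.
        while hits and now - hits[0] >= DAY_SECONDS:
            hits.popleft()
        for span, limit in self.windows:
            recent = sum(1 for t in hits if now - t < span)
            if recent >= limit:
                return False  # the caller would answer with HTTP 429
        hits.append(now)
        return True


# Demonstration with a fake clock and a tiny per-minute limit.
fake_now = [0.0]
limiter = SlidingWindowLimiter(per_minute=2, clock=lambda: fake_now[0])
first_three = [limiter.allow("key-a") for _ in range(3)]
fake_now[0] = 61.0
after_wait = limiter.allow("key-a")
print(first_three, after_wait)
```

State lives in one process's memory, so each replica would count separately; a shared store is needed once there is more than one instance.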

Detected content types

article · video · job_posting · product · resume · forum_thread · documentation · recipe · event · search_results · social_post · generic

Each type defines what "important" means (see app/content_types.py). A video keeps the description, recommended videos and top comments; a job posting keeps company/salary/requirements; a recipe drops the life story. Unknown content falls back to generic.
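
As a purely illustrative picture of such a registry (the real one lives in app/content_types.py and may look quite different), each type can be modelled as a name plus the fields worth keeping. The job_posting field names come from the response example in this README; the video field names and the `prune` helper are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ContentType:
    name: str
    keep_fields: Tuple[str, ...]  # an empty tuple means "keep everything"


# Hypothetical registry covering three of the documented types.
REGISTRY: Dict[str, ContentType] = {
    "video": ContentType("video", ("channel", "description", "recommended_videos", "top_comments")),
    "job_posting": ContentType(
        "job_posting", ("company", "location", "salary_range", "requirements", "benefits")
    ),
    "generic": ContentType("generic", ()),
}


def resolve(name: str) -> ContentType:
    """Unknown content falls back to generic."""
    return REGISTRY.get(name, REGISTRY["generic"])


def prune(structured: Dict[str, Any], ctype: ContentType) -> Dict[str, Any]:
    """Keep only the fields the content type considers important."""
    if not ctype.keep_fields:
        return dict(structured)
    return {k: structured[k] for k in ctype.keep_fields if k in structured}


raw = {"company": "Acme", "salary_range": "€70k–€90k", "cookie_notice": "We use cookies"}
print(prune(raw, resolve("job_posting")))
print(resolve("podcast").name)
```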


Integrating with resubmit.ai (and any LLM stack)

Reparsed is designed to be chained. The pattern:

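import requests
import ollama

KEY = "rp_live_…"
job_url = "https://example.com/jobs/123"  # placeholder: the job posting to parse

# 1. Reparsed turns a messy job-post URL into clean, typed fields
resp = requests.post(
    "https://reparsed.app/v1/parse",
    headers={"Authorization": f"Bearer {KEY}"},
    json={"url": job_url, "content_hint": "job_posting"},
    timeout=60,
)
resp.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
clean = resp.json()

# 2. Feed the clean text to your own model
reply = ollama.chat(model="llama3", messages=[
    {"role": "user",
     "content": f"Tailor my resume to this role:\n\n{clean['markdown']}"}
])
print(reply["message"]["content"])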

For resubmit.ai specifically, the job_posting and resume content types map directly onto resume-tailoring: parse the job URL and the user's uploaded resume PDF through the same endpoint, then hand both clean markdown blocks (and the typed structured fields) to the tailoring model. One authenticated POST per artifact, no scraping or PDF parsing to maintain on resubmit.ai's side.
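
Combining the two parsed artifacts into one prompt might look like the sketch below. It assumes both inputs are decoded /v1/parse responses shaped like the example in the API reference; the function name and the prompt wording are hypothetical.

```python
from typing import Any, Dict


def build_tailoring_prompt(job: Dict[str, Any], resume: Dict[str, Any]) -> str:
    """Merge a parsed job posting and a parsed resume into one tailoring prompt."""
    requirements = job.get("structured", {}).get("requirements", [])
    bullet_list = "\n".join(f"- {item}" for item in requirements)
    return (
        "Tailor the resume below to the job posting.\n\n"
        f"## Key requirements\n{bullet_list}\n\n"
        f"## Job posting\n{job['markdown']}\n\n"
        f"## Resume\n{resume['markdown']}"
    )


# Stand-ins for two /v1/parse responses (one URL, one uploaded PDF).
job = {
    "content_type": "job_posting",
    "structured": {"requirements": ["5+ years Python", "Postgres", "Kubernetes"]},
    "markdown": "# Senior Backend Engineer\n**Company:** Acme",
}
resume = {"content_type": "resume", "markdown": "# Jane Doe\nBackend engineer, 6 years"}

prompt = build_tailoring_prompt(job, resume)
print(prompt)
```

Pulling the typed requirements out of `structured` lets the prompt state them explicitly instead of hoping the model finds them in the prose.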


Configuration

All via environment variables (see .env.example). Highlights:

| Var | Default | Notes |
| --- | --- | --- |
| REPARSED_MODEL | qwen3:8b | Ollama model. CPU host? Use qwen3:4b / gemma3:4b. |
| OLLAMA_BASE_URL | http://192.168.1.10:11434 | Your external Ollama server. Set this. |
| SESSION_SECRET | (none) | Set this. Signs dashboard cookies. |
| SESSION_COOKIE_SECURE | false | true adds the cookie Secure flag (HTTPS-only). Keep false if reached over HTTP. |
| PUBLIC_BASE_URL | http://localhost:17177 | Shown in dashboard samples. |
| ALLOW_REGISTRATION | true | Set false to lock signups. |
| PLAYWRIGHT_ENABLED | true | JS-render fallback. |
| ENABLE_DOCLING | false | High-fidelity PDF (needs the optional install). |
| RATE_LIMIT_PER_MINUTE / _PER_DAY | 30 / 2000 | Per key. |

Security notes

  • API keys are random 256-bit secrets, shown once and stored only as a SHA-256 hash. One key per user; deleting and regenerating is the rotation path.
  • Passwords are bcrypt-hashed. Dashboard auth uses signed, SameSite=Lax session cookies; state-changing dashboard calls require a CSRF token (double-submit). The cookie Secure flag is opt-in via SESSION_COOKIE_SECURE (enable it in HTTPS-only deployments).
  • The rate limiter is in-memory (single instance). For multiple replicas, back it with Redis behind the same interface in app/ratelimit.py.
  • Treat keys as server-side secrets — call the API server-to-server, not from a browser.
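
The "shown once, stored only as a SHA-256 hash" scheme for API keys can be sketched with the standard library alone. This illustrates the general technique, not the code in app/security.py: the function names are hypothetical, and the exact key format after the rp_live_ prefix is an assumption.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

KEY_PREFIX = "rp_live_"


def hash_key(raw_key: str) -> str:
    """SHA-256 hex digest; this is the only form that would be stored."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def generate_api_key() -> Tuple[str, str]:
    """Return (raw_key, stored_hash). The raw key is shown to the user once."""
    raw_key = KEY_PREFIX + secrets.token_urlsafe(32)  # 32 random bytes = 256 bits
    return raw_key, hash_key(raw_key)


def verify_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare in constant time."""
    return hmac.compare_digest(hash_key(presented), stored_hash)


raw, stored = generate_api_key()
print(raw[:12] + "...", len(stored))
print(verify_key(raw, stored), verify_key(raw + "x", stored))
```

A fast hash is reasonable here because the key is already a high-entropy random value, unlike a user-chosen password, which is why passwords get bcrypt instead.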

Project layout

reparsed/
├── docker-compose.yml          # db · api · caddy(profile) — Ollama is external
├── Caddyfile                   # TLS reverse proxy for reparsed.app
├── .env.example
└── api/
    ├── Dockerfile              # python + playwright chromium
    ├── requirements.txt
    ├── app/
    │   ├── main.py             # app wiring, middleware, lifespan
    │   ├── config.py           # env settings
    │   ├── db.py / models.py   # SQLAlchemy (users, api_keys, usage_logs)
    │   ├── security.py deps.py ratelimit.py
    │   ├── content_types.py    # the semantic registry
    │   ├── schemas.py
    │   ├── parsing/            # web.py · files.py · extract.py  (Stage 1)
    │   ├── llm/                # ollama_client.py · restructure.py (Stage 2)
    │   └── routers/            # parse · auth · keys · pages
    ├── templates/              # landing · login · register · dashboard
    └── static/                 # css · js · logo

Roadmap / known limits

  • Deeply dynamic content (e.g. YouTube comments that lazy-load on scroll) is extracted from the rendered DOM only; per-site adapters or official APIs do better. A YouTube oEmbed adapter ships as an example in parsing/web.py.
  • Rate limiting is single-instance and database tables are auto-created; swap in Redis and Alembic migrations for scale.
  • Optional model routing (small model for classify, larger for long docs) and a ReaderLM-v2 extraction mode are natural next steps.

License

MIT.
