PDF Structure Extractor using Tiny LLM
Project Overview
This project extracts and restructures the content of a PDF file by identifying headings, subheadings, and plain text using a lightweight language model (Tiny LLM, ~60MB). Because the model accepts at most 1024 input tokens, large PDFs are split into manageable chunks before processing. The tool outputs structured JSON and a final PDF.
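To make the "structured JSON" output more concrete, here is a minimal Python sketch of how labeled lines could be grouped into nested sections. The label names ("heading", "subheading", "text"), the function name build_structure, and the JSON shape are all assumptions for illustration; the project's actual schema may differ.

```python
import json


def build_structure(labeled_lines):
    """Group (label, text) pairs into nested sections.

    Labels are assumed to be "heading", "subheading", or "text".
    This schema is hypothetical, not taken from the project.
    """
    sections = []
    for label, text in labeled_lines:
        if label == "heading":
            # A heading always opens a new top-level section.
            sections.append({"heading": text, "text": [], "subsections": []})
            continue
        if not sections:
            # Content before any heading goes into an untitled section.
            sections.append({"heading": None, "text": [], "subsections": []})
        current = sections[-1]
        if label == "subheading":
            current["subsections"].append({"subheading": text, "text": []})
        else:
            # Plain text attaches to the latest subsection if one exists,
            # otherwise to the section itself.
            subsections = current["subsections"]
            target = subsections[-1] if subsections else current
            target["text"].append(text)
    return sections


example = build_structure([
    ("heading", "Introduction"),
    ("text", "This report covers the first quarter."),
    ("subheading", "Scope"),
    ("text", "Only internal projects are included."),
])
print(json.dumps(example, indent=2))
```

Keeping plain text in lists preserves paragraph order, which matters when results from several chunks are concatenated before the final PDF is produced.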
Project Structure
- uploads/: Stores uploaded PDF files from the frontend.
- index.js: Main Node.js backend file that handles file upload and processing.
- processpdf.js: Contains the logic to split PDFs into chunks and manage token limits for the LLM.
- frontend/ (React + Vite app)
  - public/: Static assets
  - src/: React components and frontend logic
  - package.json: Frontend dependencies and scripts
- llm-server/ (Python backend with FastAPI)
  - main.py: API endpoints for processing text using Tiny LLM
  - download_model.py: Downloads and prepares the LLM model
  - requirements.txt: Python dependencies for the server
- .gitignore: Lists files and folders ignored by Git (like node_modules and venv).
- README.md: Documentation of the project.
Features
- Upload a PDF via UI
- Automatically splits the PDF text into ~500-word chunks
- Sends each chunk to Tiny LLM to extract its structure
- Displays JSON result
- Outputs a structured final PDF
Setup Instructions
- Clone the Repository:
  git clone https://github.com/your-username/pdf-structure-extractor.git
  cd pdf-structure-extractor
- Set Up the Python Environment:
  cd llm-server
  python -m venv venv
  source venv/bin/activate (Windows: venv\Scripts\activate)
  pip install -r requirements.txt
- Install Node Dependencies:
  Run 'npm install' in the root folder and in the 'frontend/' folder.
Running the App
- Start Node Backend:
  node index.js
- Start LLM Server:
  cd llm-server
  uvicorn main:app --host 127.0.0.1 --port 8000
- Start Frontend:
  cd frontend
  npm run dev
How to Use
- Open the app at http://localhost:5173
- Upload a PDF
- Wait for JSON and PDF outputs
- JSON is displayed; PDF is downloadable
Important Notes
- Tiny LLM has a 1024-token input limit
- We split large PDFs into ~500-word chunks
- Python, pip, and virtualenv are required
- Run 'npm install' in each folder with a package.json
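
The notes on the token limit and the ~500-word chunk size can be illustrated with a short sketch. In this project the chunking logic lives in processpdf.js; the Python version below only illustrates the idea, and the function name split_into_chunks and its max_words parameter are assumptions rather than the project's actual code.

```python
def split_into_chunks(text, max_words=500):
    """Split text into consecutive chunks of at most max_words words.

    Words are separated on whitespace, so original line breaks and
    spacing are not preserved inside a chunk.
    """
    if max_words <= 0:
        raise ValueError("max_words must be a positive integer")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Usage: a 1200-word document is split into three chunks.
sample_text = " ".join(f"word{i}" for i in range(1200))
chunks = split_into_chunks(sample_text)
for index, chunk in enumerate(chunks, start=1):
    print(f"chunk {index}: {len(chunk.split())} words")
```

Word count is only a proxy for token count. How many tokens a chunk actually uses depends on the model's tokenizer, so a chunk size should be checked against the real tokenizer before relying on it to stay under the 1024-token limit.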