1-click
docker compose upsystem that scrapes ArXiv + Semantic Scholar across 9 scientific domains, extracts text & figures via separate pipelines, produces structured "What / How / To Whom" summaries with figure analysis, and delivers daily Telegram digests with 👍/👎 approval. Approved papers feed into Google Sheets + a ChromaDB vector store. Weekly cross-domain meta-synthesis runs automatically.
Runs on any free-tier cloud VM — GCP, AWS, Azure, or any 1 GB+ VPS.
┌─────────────┐ scrape ┌──────────┐ download ┌──────────┐
│ Scheduler │──────────▶│ ArXiv │───────────▶│ PDFs │
│ (daily cron)│ │ + S2 │ │ on disk │
└──────┬──────┘ └──────────┘ └─────┬────┘
│ │
▼ ▼
┌──────────────────────────────────────────────────────────────┐
│ RQ Task Queue (Redis) │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ Text Pipeline │ │ Figure Pipeline │ │
│ │ PyMuPDF → text │ │ PyMuPDF → images │ │
│ │ Groq API → JSON │ │ Groq Vision → JSON │ │
│ └────────┬────────┘ └──────────┬──────────┘ │
│ └────────────┬─────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Combine Task │ │
│ │ → Markdown Card │ │
│ └────────┬────────┘ │
└───────────────────────┼──────────────────────────────────────┘
▼
┌─────────────────┐ 👍 ──▶ Google Sheets
│ Telegram Bot │──────────▶ ChromaDB
│ daily digest │ 👎 ──▶ Discard
└────────┬────────┘
│ weekly
▼
┌─────────────────┐
│ Meta-Synthesis │
│ RAG + LLM API │
└─────────────────┘
Since no free-tier VM has enough RAM for local LLM inference, this pipeline offloads all AI work to free cloud inference APIs:
| Provider | Free Tier | Best Model Available | Vision | Sign-up |
|---|---|---|---|---|
| Groq (recommended) | 30 req/min, unlimited | Llama 3.3 70B | ✅ Llama 3.2 90B Vision | console.groq.com |
| Together | $1 free credit (~500 calls) | Llama 3.1 8B Turbo | ✅ Llama 3.2 11B Vision | api.together.xyz |
| Cerebras | Free tier, fast | Llama 3.3 70B | ❌ | cloud.cerebras.ai |
| OpenRouter | Free credits on signup | Many free models | ✅ | openrouter.ai |
Groq is the default — no credit card required, fastest inference, and you get Llama 3.3 70B which is better than any model that fits in 24 GB locally.
| Provider | Tier | Specs | Duration | Credit Card |
|---|---|---|---|---|
| GCP | Always Free | e2-micro: 2 vCPU (shared), 1 GB RAM, 30 GB disk | Forever | Required |
| AWS | Free Tier | t2.micro: 1 vCPU, 1 GB RAM, 30 GB EBS | 12 months | Required |
| Azure | Free Tier | B1s: 1 vCPU, 1 GB RAM, 64 GB disk | 12 months | Required |
| Any cheap VPS | — | 1+ GB RAM | Varies | Varies |
All three work. GCP is the only true "always free" option.
- Docker 24+ with
docker composev2 - A free LLM API key (Groq recommended)
- A Telegram bot token (from @BotFather)
# Groq (2 minutes, no credit card):
# 1. Go to https://console.groq.com
# 2. Sign in with Google/GitHub
# 3. Create API Key → copy it
# Telegram Bot:
# 1. Message @BotFather on Telegram
# 2. /newbot → follow prompts → copy token
# 3. Send /start to your new bot
# 4. Get your chat ID:
curl "https://api.telegram.org/bot<TOKEN>/getUpdates" | python3 -m json.tool
# Look for "chat":{"id": YOUR_CHAT_ID}# Create e2-micro instance in GCP Console:
# Compute Engine → Create Instance
# Machine type: e2-micro (2 vCPU, 1 GB)
# Boot disk: Ubuntu 24.04, 30 GB Standard persistent disk
# Region: us-west1, us-central1, or us-east1 (free tier eligible)
# Firewall: Allow HTTP + HTTPS
# SSH in and install Docker
sudo apt-get update && sudo apt-get install -y docker.io docker-compose-v2
sudo usermod -aG docker $USER && newgrp docker
# Clone and configure
git clone https://github.com/YOUR_USERNAME/research-pipeline.git
cd research-pipeline
cp .env.example .env
nano .env # Set GROQ_API_KEY, TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID
# Launch (takes ~2 min to build)
docker compose up -d# Launch t2.micro EC2 instance:
# AMI: Ubuntu 24.04 LTS
# Instance type: t2.micro
# Storage: 30 GB gp3
# Security group: open ports 22, 8000
# SSH in
ssh -i your-key.pem ubuntu@<ip>
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER && newgrp docker
# Same clone + configure + launch as GCP above# Create B1s VM:
# Azure Portal → Virtual Machines → Create
# Size: Standard_B1s (1 vCPU, 1 GB)
# Image: Ubuntu 24.04
# Disk: 64 GB Standard SSD
# SSH in, install Docker, same as abovedocker compose ps # All services healthy?
curl http://localhost:8000/health # {"status":"ok"}
curl http://localhost:8000/status # Pipeline stats
# Trigger first ingestion manually
curl -X POST http://localhost:8000/trigger/ingest
# Or from Telegram: send /ingest to your bot| Component | Memory |
|---|---|
| OS + Docker overhead | ~300 MB |
| Redis (capped) | 64 MB |
| API + Telegram bot + scheduler | 250 MB |
| RQ Worker | 250 MB |
| RQ Dashboard | 80 MB |
| Total | ~950 MB |
PDF processing is the only memory spike — PyMuPDF processes one paper at a time and peaks at ~100 MB per paper. The worker memory limit handles this.
| Service | Port | Purpose |
|---|---|---|
api |
8000 | FastAPI + Telegram bot + scheduler |
worker |
— | RQ worker (text, figure, combine queues) |
redis |
6379 | Task queue (localhost only) |
rq-dashboard |
9181 | Task monitoring UI (localhost only) |
| Command | Description |
|---|---|
/start |
Welcome message |
/digest |
Send all ready papers now |
/status |
Show queue statistics |
/ingest |
Trigger manual scraping |
Each paper card has inline buttons: 👍 Track this Trend / 👎 Discard
| Method | Path | Description |
|---|---|---|
GET |
/health |
Liveness probe |
GET |
/status |
Pipeline statistics |
POST |
/trigger/ingest |
Manual ingestion |
POST |
/trigger/synthesis |
Run weekly synthesis |
GET |
/papers?status=approved |
List papers |
GET |
/papers/{id} |
Paper detail |
Just change one line in .env:
# Option A: Groq (recommended — fastest, 70B model, free forever)
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
# Option B: Together (good vision support)
LLM_PROVIDER=together
TOGETHER_API_KEY=...
# Option C: Cerebras (very fast, no vision)
LLM_PROVIDER=cerebras
CEREBRAS_API_KEY=...
# Option D: OpenRouter (many free models)
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=...
# Option E: Self-hosted Ollama (needs separate 8+ GB server)
LLM_PROVIDER=ollama
OLLAMA_HOST=http://your-big-server:11434
# Also: pip install ollama inside containersThen docker compose restart api worker.
research-pipeline/
├── docker-compose.yml # 4 services (no Ollama container)
├── Dockerfile # Slim image (~450 MB)
├── .env.example # All config with provider options
├── requirements.txt # Lightweight deps (no torch)
├── scripts/
│ └── manage.sh # CLI helper
├── app/
│ ├── main.py # FastAPI application
│ ├── config.py # Pydantic settings
│ ├── schemas.py # Data models
│ ├── llm_client.py # ★ Unified LLM client (Groq/Together/etc)
│ ├── logging_cfg.py # Structured logging
│ ├── ingestion/
│ │ ├── arxiv_scraper.py # ArXiv API client
│ │ ├── semantic_scholar.py # S2 API client
│ │ └── downloader.py # PDF downloader
│ ├── processing/
│ │ ├── prompts.py # Few-shot prompt templates
│ │ ├── text_pipeline.py # Text extraction + LLM API
│ │ ├── figure_pipeline.py # Figure extraction + Vision API
│ │ └── combiner.py # Merge into Markdown cards
│ ├── tasks/
│ │ ├── worker.py # RQ task definitions
│ │ ├── scheduler.py # Cron scheduler
│ │ ├── ingest_job.py # Daily ingestion orchestrator
│ │ └── model_pull.py # Provider connectivity check
│ ├── delivery/
│ │ └── telegram_bot.py # Telegram bot + approval flow
│ ├── storage/
│ │ ├── paper_db.py # SQLite paper state DB
│ │ ├── google_sheets.py # gspread integration
│ │ └── vector_store.py # ChromaDB + API embeddings
│ └── weekly/
│ └── meta_synthesis.py # Cross-domain RAG synthesis
└── data/
├── pdfs/ # Downloaded papers
├── figures/ # Extracted figure images
└── db/ # SQLite + ChromaDB
Rate limited by Groq?
The pipeline has automatic retry with exponential backoff. If you're hitting limits often, reduce MAX_PAPERS_PER_DAY to 4 or switch to Together AI.
Out of memory on e2-micro?
# Add 1 GB swap (free, uses disk)
sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstabDocker build fails on 30 GB disk?
# Clean up Docker cache
docker system prune -afVision model not available on Cerebras? Cerebras doesn't support vision — switch to Groq or Together for figure analysis:
LLM_PROVIDER=groq # Has vision supportMIT