Pixurebyte is an open-source, self-hostable website capture and analysis platform — inspired by URLScan but built for speed, control, and privacy.
Unlike traditional web scanners that struggle with Cloudflare or bot-protection challenges, Pixurebyte allows you to bypass challenge pages entirely by hosting your own compute infrastructure.
You get screenshots, metadata, and raw HTML of any site in minutes — all while maintaining full control of your data.
- 🧠 Self-Hostable Architecture
  Run your web app, database, and Redis locally while leveraging AWS for scalable compute and object storage.
- ⚡ Bypass Cloudflare & Bot Challenges
  Pixurebyte uses several libraries, including Scrapling, to help bypass Cloudflare-related bot challenges.
- 🖼️ Full Page Screenshots
  Automatically capture high-resolution screenshots of pages.
- 🌐 Rich Site Metadata Collection
  Collect HTML, request/response data, headers, and more.
- 🧩 Modular & Extensible
  Designed for easy integration with your research workflows. New data collectors will be added soon.
- 🔒 Privacy-Conscious Design
  Everything you scan and store stays on your own hardware and in your own AWS account; nothing is sent to other external services.
Pixurebyte uses AWS ECS Fargate to launch short-lived scan containers on demand.
- Primary capacity provider: FARGATE_SPOT
- Fallback: Standard FARGATE (if Spot is unavailable)
- Task size: 0.5 vCPU / 2 GB RAM
- Typical runtime: ~1–3 minutes per scan
This configuration balances performance, reliability, and cost efficiency — giving you full browser capabilities at minimal expense.
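The Spot-first, on-demand-fallback order can be pictured as a small retry loop. The sketch below is illustrative only and is not taken from Pixurebyte's code: `launch_with_fallback` and `fake_launcher` are hypothetical names, and the launcher is assumed to take the capacity provider as its last argument and to exit non-zero when the task cannot be placed.

```shell
# Try each capacity provider in order and print the first one that works.
# Usage: launch_with_fallback <launcher command...>
# The (hypothetical) launcher receives the provider name as its final argument.
launch_with_fallback() {
  for provider in FARGATE_SPOT FARGATE; do
    if "$@" "$provider" >/dev/null 2>&1; then
      echo "$provider"
      return 0
    fi
  done
  echo "no capacity provider accepted the task" >&2
  return 1
}

# Stand-in launcher for demonstration: pretends Spot capacity is unavailable.
fake_launcher() {
  [ "$1" != "FARGATE_SPOT" ]
}

launch_with_fallback fake_launcher   # prints FARGATE
```

In a real deployment the launcher would likely wrap `aws ecs run-task` with `--capacity-provider-strategy capacityProvider=<provider>,weight=1`. Because `run-task` can report placement failures in its JSON output, such a wrapper would need to inspect the response and not rely on the exit code alone.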
Pixurebyte was designed with a minimal AWS footprint in mind.
All heavy-lifting services (database, web, Redis, API) run locally using Docker Compose — meaning you only pay for ephemeral AWS tasks and S3 storage.
Below is an approximate cost estimate for a modest personal/research deployment:
| Resource | Description | Est. Cost (USD) |
|---|---|---|
| ECS Fargate Spot Tasks | 0.5 vCPU / 2 GB RAM per scan, ~2 min average. Spot pricing ≈ $0.0008/min. | ≈ $0.001 – $0.0025 per scan |
| Fallback Fargate (on-demand) | Used only if Spot capacity is unavailable (~2× cost). | ≈ $0.002 – $0.005 per scan |
| S3 Storage | Screenshots, HTML, and JSON metadata (≈ 10–20 MB per scan). | $0.02 – $0.10 / month for hundreds of scans |
| CloudFront (optional) | CDN delivery for public image access. | Free (under free tier) or ~$0.01/GB |
| AWS Data Transfer | Negligible due to CDN usage. | <$0.10 / month |
💡 Total cost for light usage: under $1 per month for ~100 scans.
Even at moderate scale (hundreds of scans per week), expect costs of roughly $5–10/month or less.
The bulk of your infrastructure — API, DB, Redis, and frontend — runs free on your own hardware.
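As a rough cross-check of the compute portion, the arithmetic is simply scans × minutes × price per task-minute. The helper below is a hypothetical sketch (`estimate_compute_cost` is not part of Pixurebyte); it plugs in the ≈ $0.0008/min Spot figure quoted in the table and ignores S3 and CloudFront. The 1300 scans/month line assumes about 300 scans per week.

```shell
# Estimate monthly Fargate compute cost in USD.
# Arguments: scans per month, average minutes per scan, price per task-minute.
estimate_compute_cost() {
  awk -v scans="$1" -v minutes="$2" -v rate="$3" \
    'BEGIN { printf "%.2f\n", scans * minutes * rate }'
}

# 100 scans/month at ~2 minutes each on Spot.
estimate_compute_cost 100 2 0.0008     # prints 0.16
# Same workload if every task fell back to on-demand (assumed ~2x the Spot rate).
estimate_compute_cost 100 2 0.0016     # prints 0.32
# Assumed moderate scale: ~300 scans/week, about 1300 scans/month, on Spot.
estimate_compute_cost 1300 2 0.0008    # prints 2.08
```

Actual Fargate and Spot prices vary by region and over time, so treat the rate as an input and not a constant.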
Pixurebyte is composed of two parts:
| Component | Description |
|---|---|
| Local Stack | Django backend, Redis, PostgreSQL, and Next.js frontend, served locally via Docker Compose |
| AWS Infrastructure | S3 for media storage + ECS Fargate for ephemeral scan workers |
This hybrid model allows fast local management with elastic remote compute.
Before installing, make sure you have:
- Terraform ≥ 1.6.0
- AWS CLI (configured with credentials)
- Docker
- OpenSSL (used for generating Django secret keys)
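A quick pre-flight check for the tools listed above might look like the following. This is a minimal sketch, not a script shipped with Pixurebyte: `missing_tools` and `version_ge` are hypothetical helper names, and the version comparison assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
# Print the name of every listed command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Succeed if version $1 is greater than or equal to version $2 (needs sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

missing=$(missing_tools terraform aws docker openssl)
if [ -n "$missing" ]; then
  echo "Missing prerequisites:" $missing
fi

# Compare the installed Terraform version against the 1.6.0 minimum, if present.
if command -v terraform >/dev/null 2>&1; then
  installed=$(terraform version | head -n 1 | sed 's/^Terraform v//')
  version_ge "$installed" "1.6.0" || echo "Terraform $installed is older than 1.6.0"
fi
```

The check only confirms that the binaries exist; it does not verify that the AWS CLI has working credentials.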
To provision Pixurebyte's infrastructure components, you’ll need Terraform. Here's how to install it on a Debian-based Linux system (e.g. Ubuntu):
1. Update and install prerequisites
sudo apt-get update -y && sudo apt-get install -y gnupg software-properties-common
2. Install the HashiCorp GPG key
wget -O- https://apt.releases.hashicorp.com/gpg | \
gpg --dearmor | \
sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null
3. Add the official HashiCorp repository to your Linux system
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \
https://apt.releases.hashicorp.com $(grep -oP '(?<=UBUNTU_CODENAME=).*' /etc/os-release || lsb_release -cs) main" | \
sudo tee /etc/apt/sources.list.d/hashicorp.list
4. Download the package information
sudo apt-get update -y
5. Install Terraform
sudo apt-get install -y terraform
To allow Terraform to authenticate with AWS, you need to provide your Access Key ID and Secret Access Key. Here's how to obtain them from your AWS root account (an IAM user is also sufficient):
- Go to https://aws.amazon.com/console/ and log in as the root user (email + password), or as an IAM user with the appropriate permissions.
- Navigate to My Security Credentials (top-right dropdown → “My Security Credentials”).
- Scroll down to the Access keys section.
- Click Create access key.
- Download or copy the credentials and store them safely:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
⚠️ You will only see the secret key once. Store it securely.
You can pass the credentials via environment variables:
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
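export AWS_DEFAULT_REGION="us-east-2"
Next, clone the repository:
git clone https://github.com/fish-not-phish/pixurebyte.git
cd pixurebyte
Then provision the AWS infrastructure by running the setup script from the terraform directory:
cd terraform
./setup.sh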
You’ll be prompted for your custom domain and other configuration options. The script automatically provisions:
- S3 bucket for screenshots and raw HTML
- ECS task definition for scan workers
- CloudFront CDN (so your S3 bucket remains non-public)
Once finished, your AWS resources are ready for use.
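If setup.sh stops with an authentication error, one thing worth ruling out is that the credentials were exported in a different shell session. The sketch below reports which of the three AWS variables are unset or empty in the current shell; `unset_vars` is a hypothetical helper and not part of the repository.

```shell
# Print the name of each listed environment variable that is unset or empty.
unset_vars() {
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

missing=$(unset_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION)
if [ -n "$missing" ]; then
  echo "Not set in this shell:" $missing
else
  echo "AWS credential variables are set"
fi
```

This only checks that the variables exist. To confirm that the keys are actually valid, `aws sts get-caller-identity` asks AWS which identity they belong to.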
Return to the repository root and start the local stack:
cd ..
docker compose up -d
This brings up:
- Django backend (localhost:8000)
- Next.js frontend (localhost:3000)
- Redis + Postgres containers
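Containers can take a few seconds to start accepting connections after `docker compose up -d` returns. A small polling helper, sketched here under the assumption that `curl` is installed, can wait for the backend and frontend ports listed above; `wait_for_http` is a hypothetical name, not a Pixurebyte command.

```shell
# Poll a URL until it answers or the attempts run out.
# Arguments: URL, number of attempts (default 30), seconds between attempts (default 2).
wait_for_http() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Any HTTP response counts as "up"; only connection failures are retried.
    if curl --silent --output /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example:
#   wait_for_http http://localhost:8000/ && echo "backend is up"
#   wait_for_http http://localhost:3000/ && echo "frontend is up"
```

Because the helper treats any HTTP status as success, it shows that the port is serving requests, not that the application is healthy.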
To stop the local stack:
docker compose down
To also remove all stored data (this deletes the Docker volumes):
docker compose down -v
To destroy the AWS infrastructure:
cd terraform
./destroy.sh
This cleanly tears down all provisioned AWS resources (ECS tasks, S3 bucket, etc.).
Caution
This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.
