A Python-based web scraper for collecting manga/manhwa data from KomikCast, including metadata, chapter information, and image links. The scraper supports automatic uploads to Supabase storage and includes an auto-update mode for tracking new chapters.
- Scrape manga/manhwa listings with metadata (title, rating, cover image)
- Extract chapter links and image URLs from individual manga pages
- Automatic Supabase upload for images and metadata
- Auto-update mode to check for new chapters
- Progress tracking with resume capability
- Configurable limits for pages, comics, and chapters
- Adjustable delays to control scraping speed
```
Scrape/
├── scrape_links_only.py            # Main scraper for chapter images
├── generate-manifest.py            # Generates manifest files from Supabase
├── merger_link.json                # Input: List of manga/manhwa to scrape
├── requirements.txt                # Python dependencies
├── .env.example                    # Environment variables template
├── .env                            # Local environment configuration (not committed)
├── manga_local_image_links.json    # Output: Chapter image links (local copy)
├── scrape_links_progress.json      # Progress tracking file
├── README_ENV.md                   # Environment setup documentation
├── GITHUB_SECRETS_SETUP.md         # GitHub Actions secrets guide
├── .github/
│   └── workflows/
│       ├── Update-Chapter.yml      # GitHub Actions: Update chapters
│       └── manifest-komik.yaml     # GitHub Actions: Generate manifest
└── .gitignore                      # Git ignore patterns
```
- Python 3.8 or higher
- pip (Python package manager)

- Clone or download this repository
- Create a virtual environment (recommended): `python -m venv .venv`
- Activate the virtual environment:
  - Windows: `.venv\Scripts\activate`
  - Linux/Mac: `source .venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
- Configure environment variables:

  For Local Development:

  - Copy `.env.example` to `.env`: `cp .env.example .env`
  - Edit `.env` and fill in your configuration:

    ```bash
    SUPABASE_URL=https://your-project.supabase.co
    SUPABASE_KEY=your-service-role-key-here
    BUCKET_NAME=manga-data
    # ... other settings
    ```

  For GitHub Actions:

  - Go to Repository Settings > Secrets and variables > Actions
  - Add the required secrets (see `GITHUB_SECRETS_SETUP.md` for details)

Complete setup guide: `README_ENV.md`
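Once `.env` is filled in, a quick way to confirm the required values are visible to Python before running the scraper (a standalone check for illustration, not a script shipped with this repository):

```python
import os
import sys

from dotenv import load_dotenv  # provided by python-dotenv

load_dotenv()  # picks up .env from the current directory, if present

# Hypothetical sanity check: abort early when required settings are missing.
REQUIRED = ("SUPABASE_URL", "SUPABASE_KEY", "BUCKET_NAME")
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    sys.exit("Missing required environment variables: " + ", ".join(missing))
print("Environment looks good.")
```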
Scrape Chapter Images:

```bash
python scrape_links_only.py
```

Generate Manifest (from Supabase):

```bash
python generate-manifest.py
```

The project includes two automated workflows:
Update Chapter Manhwa/Manhua:

- Trigger: Manual or scheduled (daily at 14:00 UTC)
- Action: Scrapes new chapters and uploads to Supabase
- File: `.github/workflows/Update-Chapter.yml`

Generate Manifest:

- Trigger: Manual or scheduled (daily at 14:00 UTC)
- Action: Generates manifest files from the Supabase bucket
- File: `.github/workflows/manifest-komik.yaml`
You can manually trigger workflows from the GitHub Actions tab:
- Go to the Actions tab in your repository
- Select the desired workflow
- Click Run workflow
- Choose parameters if available
The scraper supports two configuration methods:
For local development, edit the .env file:
```bash
# Supabase Configuration
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-service-role-key
BUCKET_NAME=manga-data
ENABLE_SUPABASE_UPLOAD=True

# Scraping Configuration
JSON_FILE=merger_link.json
MAX_COMICS_TO_PROCESS=50
AUTO_UPDATE_MODE=True
AUTO_UPDATE_MAX_COMICS=255

# Speed Configuration
DELAY_BETWEEN_CHAPTERS=0.5
DELAY_BETWEEN_COMICS=1
REQUEST_TIMEOUT=10

# Parallel Processing
MAX_CHAPTER_WORKERS=5
MAX_COMIC_WORKERS=2
ENABLE_PARALLEL=True
```

For GitHub Actions, add secrets in Repository Settings:

- `SUPABASE_URL`
- `SUPABASE_KEY`
- `BUCKET_NAME`
- And other configuration variables

See `README_ENV.md` for the complete configuration guide.
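These values are read at startup through `python-dotenv` (listed in `requirements.txt`). As a rough illustration of how such settings are typically loaded and typed, using the variable names above (the exact handling in `scrape_links_only.py` may differ):

```python
import os

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env locally; in GitHub Actions, secrets arrive as plain env vars


def env_bool(name, default=False):
    """Interpret True/False-style strings such as ENABLE_SUPABASE_UPLOAD=True."""
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")


SUPABASE_URL = os.getenv("SUPABASE_URL", "")
SUPABASE_KEY = os.getenv("SUPABASE_KEY", "")
BUCKET_NAME = os.getenv("BUCKET_NAME", "manga-data")
ENABLE_SUPABASE_UPLOAD = env_bool("ENABLE_SUPABASE_UPLOAD", True)

MAX_COMICS_TO_PROCESS = int(os.getenv("MAX_COMICS_TO_PROCESS", "50"))
DELAY_BETWEEN_CHAPTERS = float(os.getenv("DELAY_BETWEEN_CHAPTERS", "0.5"))
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "10"))
MAX_CHAPTER_WORKERS = int(os.getenv("MAX_CHAPTER_WORKERS", "5"))
```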
The scraper can automatically check existing comics for new chapters:
- Set `AUTO_UPDATE_MODE=True` in the environment variables
- The scraper will check all comics in `merger_link.json`
- Only new chapters will be processed and uploaded
- Progress is saved to resume interrupted runs
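Conceptually, auto-update means diffing the chapter list scraped from the site against what has already been recorded and keeping only the difference. A simplified sketch of that idea, assuming a local JSON file that maps each comic to its known chapter URLs (the real data layout and function names in `scrape_links_only.py` may differ):

```python
import json


def find_new_chapters(comic_title, scraped_chapter_urls,
                      saved_file="manga_local_image_links.json"):
    """Return only the chapter URLs that are not recorded yet for this comic."""
    try:
        with open(saved_file, encoding="utf-8") as fh:
            saved = json.load(fh)  # assumed shape: {comic_title: [chapter_url, ...]}
    except FileNotFoundError:
        saved = {}

    known = set(saved.get(comic_title, []))
    return [url for url in scraped_chapter_urls if url not in known]
```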
The scraper saves progress after each comic:
- Resume interrupted scraping sessions
- Skip already processed comics
- Track upload status to Supabase
- Progress is saved in `scrape_links_progress.json`
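The resume behaviour amounts to persisting a small JSON state after each comic and skipping anything already listed on the next run. A minimal sketch, assuming a simple list of completed comics inside `scrape_links_progress.json` (the script's actual keys may differ):

```python
import json
import os

PROGRESS_FILE = "scrape_links_progress.json"


def load_progress():
    """Load previous progress, or start fresh if no file exists yet."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE, encoding="utf-8") as fh:
            return json.load(fh)
    return {"completed": []}


def mark_completed(progress, comic_url):
    """Record a finished comic and flush to disk so an interrupted run can resume."""
    if comic_url not in progress["completed"]:
        progress["completed"].append(comic_url)
    with open(PROGRESS_FILE, "w", encoding="utf-8") as fh:
        json.dump(progress, fh, indent=2)
```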
The scraper supports multi-threaded processing:
- Chapter-level parallelism: Process multiple chapters simultaneously
- Comic-level parallelism: Process multiple comics simultaneously (normal mode only)
- Configurable thread counts via environment variables
- Thread-safe operations to prevent race conditions
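Chapter-level parallelism of this kind maps naturally onto `concurrent.futures`. The sketch below is illustrative only: `scrape_chapter` is a stand-in for the real per-chapter routine, and the lock shows the thread-safe write pattern needed when workers share a results dictionary.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CHAPTER_WORKERS = int(os.getenv("MAX_CHAPTER_WORKERS", "5"))

results = {}                      # chapter URL -> list of image URLs
results_lock = threading.Lock()   # guards concurrent writes to `results`


def scrape_chapter(chapter_url):
    """Stand-in for the real routine that fetches a chapter's image links."""
    return []


def process_chapter(chapter_url):
    images = scrape_chapter(chapter_url)
    with results_lock:            # multiple worker threads write to the shared dict
        results[chapter_url] = images


def scrape_chapters_parallel(chapter_urls):
    with ThreadPoolExecutor(max_workers=MAX_CHAPTER_WORKERS) as pool:
        list(pool.map(process_chapter, chapter_urls))  # consume so exceptions surface
    return results
```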
To scrape without uploading to Supabase:
```bash
ENABLE_SUPABASE_UPLOAD=False
```

All data will be saved to `manga_local_image_links.json`.
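In other words, the flag only gates the upload step; a simplified sketch of the control flow, assuming the local JSON copy is written in either mode (not the script's exact code):

```python
import json
import os

ENABLE_SUPABASE_UPLOAD = os.getenv("ENABLE_SUPABASE_UPLOAD", "True").lower() == "true"

collected = {}  # comic/chapter image links gathered during the run

# Results end up in the local JSON file either way.
with open("manga_local_image_links.json", "w", encoding="utf-8") as fh:
    json.dump(collected, fh, ensure_ascii=False, indent=2)

if ENABLE_SUPABASE_UPLOAD:
    # Upload images and metadata to the Supabase bucket here.
    pass
```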
This project includes automated workflows for continuous scraping:
- `Update-Chapter.yml`: Scrapes new chapters and uploads to Supabase
  - Scheduled: Daily at 14:00 UTC
  - Manual trigger available
  - Uses GitHub repository secrets for configuration
- `manifest-komik.yaml`: Generates manifest files from Supabase
  - Scheduled: Daily at 14:00 UTC
  - Manual trigger available
  - Updates the comics listing for the frontend
Both workflows can be triggered manually:
- Go to the Actions tab in the GitHub repository
- Select the desired workflow
- Click Run workflow
- Monitor execution in real-time
Complete GitHub Actions guide: `GITHUB_SECRETS_SETUP.md`
- requests: HTTP requests
- beautifulsoup4: HTML parsing
- supabase: Supabase client
- lxml: XML/HTML parser
- Pillow: Image processing
- websockets: WebSocket support for Supabase realtime
- python-dotenv: Environment variables management
- storage3: Supabase storage client
- realtime: Supabase realtime client
See requirements.txt for specific versions.
- The scraper includes delays between requests to avoid overloading servers
- Adjust `DELAY_BETWEEN_CHAPTERS` and `DELAY_BETWEEN_COMICS` as needed
- Be respectful of the target website's resources
- This scraper is for educational purposes only
- Ensure you have permission to scrape the target website
- Respect the website's `robots.txt` and terms of service
- Do not use scraped data for commercial purposes without permission
- The scraper includes error handling for network issues
- Failed requests are logged and skipped
- Progress is saved regularly to prevent data loss
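Taken together, the delay and error-handling behaviour boils down to a small wrapper around `requests`; a hedged sketch of the pattern (the real script's logging and retry details may differ):

```python
import os
import time

import requests

REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "10"))
DELAY_BETWEEN_CHAPTERS = float(os.getenv("DELAY_BETWEEN_CHAPTERS", "0.5"))


def polite_get(url):
    """Fetch a page with a timeout, log and skip on failure, then pause briefly."""
    try:
        response = requests.get(url, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"[skip] {url}: {exc}")
        response = None
    time.sleep(DELAY_BETWEEN_CHAPTERS)  # stay gentle with the target server
    return response
```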
- Environment variables not loading:
  - Ensure the `.env` file exists in the project root
  - Check that `python-dotenv` is installed: `pip install python-dotenv`
  - Verify the `.env` file format (no quotes around values)
- Supabase connection fails:
  - Verify `SUPABASE_URL` and `SUPABASE_KEY` in `.env`
  - Ensure the Supabase project is active and accessible
  - Check that the service role key is used (not the anon key)
- Missing dependencies:
  - Run `pip install -r requirements.txt`
  - Check for any installation errors
- Workflow fails to start:
  - Verify all required secrets are added to the repository settings
  - Check that workflow files are properly formatted
  - Ensure the repository has proper permissions
- Script can't read environment variables:
  - Confirm secrets are correctly named in the workflow files
  - Check workflow logs for detailed error messages
  - Verify secret values are properly set
- Supabase upload fails in Actions:
  - Verify `SUPABASE_KEY` has proper permissions
  - Check the bucket name and that the bucket exists
  - Ensure the service role key is used (not the anon key)
- Connection errors:
  - Check your internet connection
  - Verify the target website is accessible
  - Increase the `REQUEST_TIMEOUT` value in the environment variables
- Missing data in output:
  - The website structure may have changed
  - Check console output for error messages
  - Verify the CSS selectors in the code
- Scraper stops unexpectedly:
  - Check `scrape_links_progress.json` for the last processed item
  - Run again to resume from the last position
  - Review error logs in the console or GitHub Actions
For local development, create a `.env` file from `.env.example`:

```bash
# Required
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-service-role-key
BUCKET_NAME=manga-data

# Optional (with defaults)
ENABLE_SUPABASE_UPLOAD=True
JSON_FILE=merger_link.json
MAX_COMICS_TO_PROCESS=50
AUTO_UPDATE_MODE=True
# ... see .env.example for all options
```

For GitHub Actions, add these secrets in the repository settings:

Required:

- `SUPABASE_URL`
- `SUPABASE_KEY`
- `BUCKET_NAME`

Optional:

- `ENABLE_SUPABASE_UPLOAD`
- `JSON_FILE`
- `MAX_COMICS_TO_PROCESS`
- `AUTO_UPDATE_MODE`
- And other configuration variables

Complete configuration guide: `README_ENV.md`
Never commit `.env` files or expose secrets in code!
This project has been updated to support modern development practices:
- Environment-based configuration (.env files and GitHub secrets)
- Automated workflows via GitHub Actions
- Parallel processing for improved performance
- Progress tracking with resume capability
- Auto-update mode for new chapters detection
- Supabase integration for cloud storage
- `scrape_links_only.py` - Main scraper script
- `generate-manifest.py` - Manifest generation script
- `merger_link.json` - Input data (manga list)
- `.env.example` - Environment variables template
- `requirements.txt` - Python dependencies
- `Update-Chapter.yml` - Automated chapter scraping
- `manifest-komik.yaml` - Automated manifest generation
- `README_ENV.md` - Environment setup guide
- `GITHUB_SECRETS_SETUP.md` - GitHub Actions secrets guide
Local:

```bash
# 1. Set up the environment
cp .env.example .env
# Edit .env with your credentials

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the scraper
python scrape_links_only.py
```

GitHub Actions:

```bash
# 1. Add repository secrets (see GITHUB_SECRETS_SETUP.md)
# 2. Push code to the repository
# 3. Workflows will run automatically or can be triggered manually
```

Feel free to submit issues or pull requests to improve the scraper.
This project is provided as-is for educational purposes.
This tool is intended for personal use and educational purposes only. Users are responsible for ensuring their use complies with applicable laws and the terms of service of the websites they scrape.
- Never commit `.env` files or secrets to version control
- Use service role keys for Supabase (not anon keys)
- Respect website rate limits and terms of service
- Only scrape content you have permission to access