A Python-based web scraper for extracting post data from Hacker News (https://news.ycombinator.com/) into a CSV.
- Python 3.10+
- pip (Python package manager)
- Clone or download the project:
  git clone https://github.com/azamos/harvester
  cd harvester
- Create a virtual environment:
  python -m venv venv
- Activate the virtual environment:
  On Windows:
  venv\Scripts\activate
  On macOS/Linux:
  source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
Note: Currently tested only on Windows Command Prompt, but it should work on macOS and Linux as well. The application supports both online mode (fetching from the web) and offline mode (using locally stored pages for development and testing).
Development Note: Use the --offline flag to test with locally stored HTML pages instead of fetching from the web. This is useful for development and testing purposes.
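The online/offline switch described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name `load_page`, the `pages` directory, the `page_N.html` file naming, and the listing URL pattern are all assumptions.

```python
from pathlib import Path

# Assumed listing URL pattern; the project may build its URLs differently.
BASE_URL = "https://news.ycombinator.com/news"


def load_page(page_number: int, offline: bool, pages_dir: Path = Path("pages")) -> str:
    """Return the HTML of one listing page, read from disk or fetched from the web."""
    if offline:
        # Hypothetical naming scheme for locally stored pages.
        return (pages_dir / f"page_{page_number}.html").read_text(encoding="utf-8")

    # Imported lazily so that offline runs never touch the network stack.
    import requests

    response = requests.get(BASE_URL, params={"p": page_number}, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text
```

Keeping both modes behind one function means the parsing code receives the same kind of HTML string either way, which is what makes offline testing representative.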
python -m src.main --num_post NUM_POST --min_score MIN_SCORE --max_score MAX_SCORE --list_string LIST_STRING [--offline] [--debug]
- --num_post: Number of posts to process (positive integer, max 900)
- --min_score: Minimum score threshold (non-negative integer)
- --max_score: Maximum score threshold (non-negative integer, must be >= min_score)
- --list_string: Comma-separated list of positive integers (duplicates allowed)
  - No spaces: 1,2,3
  - With spaces: "1, 2, 3" (quotes required)
- --offline: Use locally stored pages instead of fetching from the web (optional)
- --debug: Enable debug output (optional)
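The constraints listed above could be enforced with `argparse` along these lines. This is a minimal sketch that follows only the rules stated in the list; the helper names (`positive_int`, `capped_num_post`, `int_list`, `parse_args`) are hypothetical, and the real `src.main` may validate its arguments differently.

```python
import argparse

MAX_POSTS = 900  # upper bound stated for --num_post


def positive_int(text: str) -> int:
    value = int(text)  # a ValueError here is reported by argparse as a usage error
    if value <= 0:
        raise argparse.ArgumentTypeError(f"{text!r} is not a positive integer")
    return value


def non_negative_int(text: str) -> int:
    value = int(text)
    if value < 0:
        raise argparse.ArgumentTypeError(f"{text!r} is not a non-negative integer")
    return value


def capped_num_post(text: str) -> int:
    value = positive_int(text)
    if value > MAX_POSTS:
        raise argparse.ArgumentTypeError(f"--num_post must be at most {MAX_POSTS}")
    return value


def int_list(text: str) -> list[int]:
    # Accepts both "1,2,3" and "1, 2, 3"; duplicates are kept.
    return [positive_int(part.strip()) for part in text.split(",")]


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="python -m src.main")
    parser.add_argument("--num_post", type=capped_num_post, required=True)
    parser.add_argument("--min_score", type=non_negative_int, required=True)
    parser.add_argument("--max_score", type=non_negative_int, required=True)
    parser.add_argument("--list_string", type=int_list, required=True)
    parser.add_argument("--offline", action="store_true")
    parser.add_argument("--debug", action="store_true")
    args = parser.parse_args(argv)
    # Cross-argument rule: max_score must not be below min_score.
    if args.max_score < args.min_score:
        parser.error("--max_score must be >= --min_score")
    return args
```

The quotes are needed for the spaced form because the shell would otherwise split `1, 2, 3` into three separate arguments before Python ever sees them.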
# Online mode (default) - fetches from web
python -m src.main --num_post 50 --min_score 0 --max_score 1000 --list_string 1,2,3
# Offline mode - uses locally stored pages
python -m src.main --num_post 50 --min_score 0 --max_score 1000 --list_string 1,2,3 --offline
# With spaces in list_string (quotes required)
python -m src.main --num_post 50 --min_score 0 --max_score 1000 --list_string "1, 2, 3"
To view debug output, add the --debug flag:
# Online mode with debug
python -m src.main --num_post 50 --min_score 0 --max_score 1000 --list_string 1,2,3 --debug
# Offline mode with debug
python -m src.main --num_post 50 --min_score 0 --max_score 1000 --list_string 1,2,3 --offline --debug
- requests - For making HTTP requests
- beautifulsoup4 - For HTML parsing
- pytest - For unit testing
To run the test suite:
pytest -v
A sample CSV output file is available at sampleOutput/result.csv.
This sample was generated using the following command:
python -m src.main --num_post 100 --min_score 0 --max_score 200000 --list_string 5,10-4,1-3 --debug
Below is a screenshot comparing the original Hacker News page and the extracted CSV output:
- Adding a --force flag to override the num_post built-in limit of 900
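One way the proposed flag might work is sketched below. The `--force` flag does not exist in the current CLI; the function name `parse_num_post` and the error messages are illustrative assumptions only.

```python
import argparse

MAX_POSTS = 900  # built-in limit stated in this README


def parse_num_post(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_post", type=int, required=True)
    # Proposed flag, not part of the current CLI.
    parser.add_argument("--force", action="store_true",
                        help=f"allow --num_post above {MAX_POSTS}")
    args = parser.parse_args(argv)

    if args.num_post <= 0:
        parser.error("--num_post must be a positive integer")
    # The cap applies only when --force is absent.
    if args.num_post > MAX_POSTS and not args.force:
        parser.error(f"--num_post exceeds {MAX_POSTS}; pass --force to override")
    return args.num_post
```

Checking the cap after parsing, rather than inside a `type=` callback, lets the check see both arguments at once.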