This is a simple scraper for the French Yelp website. It is written in Python and uses the Scrapling
library to bypass Cloudflare protection.
Search parameters can be changed in the
inputs/yelp_config.json file.
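As a rough illustration of how the search parameters in that file could end up in a request, the sketch below turns a "params" mapping into a query string. The base URL and the function name build_search_url are assumptions made for this example; the scraper itself may build its requests differently.

```python
import json
from urllib.parse import urlencode

# Assumed base URL, for illustration only.
SEARCH_URL = "https://www.yelp.fr/search"


def build_search_url(config: dict) -> str:
    """Append the "params" mapping of the config as a URL query string."""
    return f"{SEARCH_URL}?{urlencode(config['params'])}"


# Same shape as inputs/yelp_config.json, inlined here to keep the sketch self-contained.
config = json.loads('{"params": {"find_desc": "Restaurants", "find_loc": "Lyon"}}')
print(build_search_url(config))
```

Changing find_desc or find_loc in the JSON file would then change the generated query without touching the code.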
This project is licensed under the GNU Affero General Public License v3.0.
- You are free to use, modify, and distribute the code, but commercial use is prohibited without explicit permission.
- Any significant modifications or derived works must credit the original author,
Yassine EL IDRISSI, and must also be distributed under the same license.
Before the first step, make sure pip, the conda package manager, and git are installed on your machine:
- Pip (should be installed by default with Python)
- Miniconda (for package management)
- Git (for version control)
    git clone https://github.com/jnoundu89/YelpScraper.git
    cd YelpScraper
    conda create --name *env_name* python=3.12
    conda activate *env_name*

- Be sure to replace env_name with the name you want to give to the environment. Then install the required libraries in the new conda environment:

    pip install -r requirements.txt

- Then launch the installation of Scrapling's internal dependencies:

    scrapling install

- First, rename the file inputs/yelp_config[DON'T FORGET TO RENAME].json to inputs/yelp_config.json and fill it with the search parameters you want to use, for example:

    {
        "params": {
            "find_desc": "Restaurants",
            "find_loc": "Lyon"
        }
    }

- Then, rename the file inputs/setup_database[DON'T FORGET TO RENAME].json to inputs/setup_database.json and fill it with the database credentials you want to use, for example:

    {
        "engine": "mysql",
        "hostname": "127.0.0.1",
        "username": "root",
        "password": "root",
        "port": "3306",
        "schema": "yelp"
    }

In progress: if you want to use PostgreSQL, you can change the engine value to postgresql.

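To show how these credentials fit together, here is a minimal sketch that assembles them into a generic connection URL of the form engine://user:password@host:port/schema. The function name build_database_url is hypothetical, and the project may connect to the database in a different way (for example with a driver-specific prefix).

```python
from urllib.parse import quote_plus


def build_database_url(settings: dict) -> str:
    """Assemble a connection URL from the keys used in inputs/setup_database.json."""
    # Percent-encode the credentials so characters such as "@" or ":" stay unambiguous.
    user = quote_plus(settings["username"])
    password = quote_plus(settings["password"])
    return (
        f"{settings['engine']}://{user}:{password}"
        f"@{settings['hostname']}:{settings['port']}/{settings['schema']}"
    )


settings = {
    "engine": "mysql",
    "hostname": "127.0.0.1",
    "username": "root",
    "password": "root",
    "port": "3306",
    "schema": "yelp",
}
print(build_database_url(settings))
```

With this layout, switching the engine value to postgresql only changes the prefix of the URL.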
    python main.py

/!\ There are two optional arguments that you can use:
- --no-database: do not save the data in the database you have set up in the inputs/setup_database.json file
- --no-csv: do not save the data in a CSV file in the outputs/ directory

    python main.py --no-database --no-csv

Results are saved in a CSV file in the newly created outputs directory. The file name contains the search parameters plus the current formatted date, e.g. restaurants_lyon_DD_MM_YYYY.csv.
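The naming scheme described above can be sketched as follows. The function name build_csv_name and the exact normalisation (lowercasing, spaces replaced by underscores) are assumptions for illustration, not necessarily what the scraper does.

```python
from datetime import date


def build_csv_name(params: dict, day: date) -> str:
    """Join the search parameter values and a DD_MM_YYYY date into a CSV file name."""
    # Normalise each value: trim, lowercase, and replace spaces with underscores.
    parts = [str(value).strip().lower().replace(" ", "_") for value in params.values()]
    return "_".join(parts + [day.strftime("%d_%m_%Y")]) + ".csv"


params = {"find_desc": "Restaurants", "find_loc": "Lyon"}
print(build_csv_name(params, date.today()))
```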
Each row is also saved, line by line, in the database that you have set up with the credentials in the
inputs/setup_database.json file.
Don't forget to check the logs in your terminal or in the
logs/ directory to see whether the scraper has run successfully!
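For readers curious how the two optional flags, --no-database and --no-csv, might be handled, here is a minimal sketch using argparse from the standard library. It is an assumption about the approach, not a copy of main.py, and the parse_args helper is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the two optional switches; both default to False (saving enabled)."""
    parser = argparse.ArgumentParser(description="Yelp scraper options (illustrative sketch)")
    parser.add_argument("--no-database", action="store_true",
                        help="do not save the data in the database")
    parser.add_argument("--no-csv", action="store_true",
                        help="do not save the data in a CSV file")
    return parser.parse_args(argv)


# An explicit argument list keeps the example independent of the real command line.
args = parse_args(["--no-csv"])
save_to_database = not args.no_database
save_to_csv = not args.no_csv
print(save_to_database, save_to_csv)
```

Using store_true switches means that running the script without arguments keeps both outputs enabled, which matches the default behaviour described above.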
- If you want to update all the pip packages (for example after an update of the scraper), run the following command after activating the conda environment:

    pip install --upgrade -r requirements.txt

- If you want the latest version of this repository, run the following command:

    git pull

└── jnoundu89-yelpscraper/
    ├── README.md
    ├── LICENSE
    ├── main.py
    ├── requirements.txt
    ├── scraper.py
    ├── data_processing/
    │   ├── data_processing.py
    │   └── models/
    │       └── business_model.py
    ├── database/
    │   ├── database_engine.py
    │   ├── generate_orm_tables.py
    │   └── sql_requests.py
    ├── inputs/
    │   ├── setup_database.json
    │   └── yelp_config.json
    ├── pages/
    │   └── yelp.py
    └── utilities/
        ├── config_loader.py
        ├── helper.py
        ├── logging_utils.py
        └── request_utils.py

graph LR
A[Clone repository + navigate to root directory] --> B[Create + activate new conda environment]
B --> C[Install required libraries + internal dependencies]
C --> D[Edit input files]
D --> E[Connect database with inputs/setup_database.json]
D --> F[Run main script]
D --> G[Modify query parameters with inputs/yelp_config.json]
F --> H[Check progress in terminal]
F --> I[Check rows in database]
F --> J[Check results in CSV file]