Crawler for Business Analysis

This Jupyter Notebook uses the Google Custom Search API (Programmable Search Engine) to discover Instagram accounts related to nail artists in different cities.
It sends Google search queries such as:

site:instagram.com HAIRSTYLE KOREA

…and extracts username, followers (when available), and bio (from snippet text) from the search results.
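As a rough illustration of how such a query could be turned into a request, the sketch below builds the query string and the GET URL for the Custom Search JSON API. The helper names `build_query` and `build_request_url` are hypothetical and are not taken from the notebook; the credentials are placeholders.

```python
from urllib.parse import urlencode

# Public endpoint of the Custom Search JSON API.
API_URL = "https://www.googleapis.com/customsearch/v1"


def build_query(keyword: str, location: str, site: str = "instagram.com") -> str:
    """Compose a query such as 'site:instagram.com HAIRSTYLE KOREA'."""
    return f"site:{site} {keyword} {location}"


def build_request_url(api_key: str, cx: str, query: str, start: int = 1) -> str:
    """Return the GET URL for one page of results ('start' is 1-based)."""
    params = {"key": api_key, "cx": cx, "q": query, "start": start, "num": 10}
    return f"{API_URL}?{urlencode(params)}"


query = build_query("HAIRSTYLE", "KOREA")
url = build_request_url("your_api_key", "your_search_engine_id", query)
```

The URL can then be fetched with any HTTP client; the response is JSON whose `items` list holds the individual results.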

Features

Query Google for Instagram profiles by location.
Extract basic profile info from SERP snippets (no profile clicks needed).
Filter accounts by minimum follower count (configurable).
Load lists of cities from a JSON file, so non-technical collaborators can edit cities easily.
Export results to CSV/JSON.
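The snippet-based extraction listed above might look roughly like the following. This is a minimal sketch, not the notebook's code: the regex is a hypothetical stand-in for the patterns kept in queries/patterns.json, and it assumes snippets that contain text like "12.5K Followers". The `link` and `snippet` keys are the fields the Custom Search JSON API returns for each result item.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical pattern; the project's real patterns live in queries/patterns.json.
FOLLOWERS_RE = re.compile(
    r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*([KM])?\s+Followers", re.IGNORECASE
)
MULTIPLIERS = {"k": 1_000, "m": 1_000_000}

# First path segments that are not usernames (posts, reels, and so on).
NON_PROFILE_SEGMENTS = {"p", "reel", "reels", "explore", "tv", "stories"}


def parse_followers(snippet: str) -> Optional[int]:
    """Return the follower count found in a snippet, or None if absent."""
    match = FOLLOWERS_RE.search(snippet)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    return round(number * MULTIPLIERS.get(suffix, 1))


def parse_username(link: str) -> Optional[str]:
    """Return the first path segment of a profile URL, or None for non-profiles."""
    segments = [s for s in urlparse(link).path.split("/") if s]
    if not segments or segments[0] in NON_PROFILE_SEGMENTS:
        return None
    return segments[0]


def parse_item(item: dict) -> dict:
    """Map one search result item to username, followers, and bio."""
    snippet = item.get("snippet", "")
    return {
        "username": parse_username(item.get("link", "")),
        "followers": parse_followers(snippet),
        "bio": snippet,
    }
```

Returning None instead of raising keeps rows with missing follower counts in the result set, so they can be handled at the filtering step.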

Project Structure

├── .env                        # Environment variables (API keys, configs)
├── .gitignore                  # Git ignore rules
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
├── artist_list.ipynb           # Main Jupyter notebook
│
├── data/                       # Data folder (for raw/processed data)
│   └── models/                 # Trained/stored models
│
├── queries/                    # JSON configuration files for building queries
│   ├── cities.json             # Cities (by continent) + top cities
│   ├── keywords.json           # Keywords
│   ├── patterns.json           # Regex patterns for parsing snippets
│   ├── sites.json              # Sites/domains
│   └── views.json              # Other filter options
│
└── Out/                        # Output folder (results, exports)

Setup

1. Python environment

Open the project in PyCharm Pro.
Create a new virtual environment (Python 3.11+).
Install dependencies:
```
pip install -r requirements.txt
```

2. Configure Google API

Create a Programmable Search Engine.
Enable "Search the entire web" and bias results toward instagram.com/*.
Copy the Search engine ID (CX_ID).
In Google Cloud Console:
- Create an API key.
- Enable the Custom Search API.
- (Optional but recommended) Restrict the key to the Custom Search API.
Copy .env.example to .env and fill in:

GOOGLE_API_KEY=your_api_key
GOOGLE_CX=your_search_engine_id
PAGES=num_of_pages
FOLLOWER_MIN=follower_restrictions
# add more filter criteria here
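One way to read these settings in Python is sketched below. It only reads variables that are already in the environment; how the notebook actually loads the .env file (for example with a dotenv-style loader) is not shown here. The function name `load_settings` and the defaults of 1 page and 0 minimum followers are assumptions.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read API credentials and numeric options from environment variables."""
    api_key = env.get("GOOGLE_API_KEY")
    cx = env.get("GOOGLE_CX")
    if not api_key or not cx:
        raise RuntimeError("GOOGLE_API_KEY and GOOGLE_CX must be set")
    return {
        "api_key": api_key,
        "cx": cx,
        # Assumed defaults: one page of results, no follower threshold.
        "pages": int(env.get("PAGES", "1")),
        "follower_min": int(env.get("FOLLOWER_MIN", "0")),
    }
```

Calling `load_settings()` with no arguments reads from the process environment; passing a plain dict is convenient for testing.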

3. City list

All cities are stored in queries/cities.json.
It has two sections:
- all_cities: grouped by continent.
- top_cities: smaller set of priority cities.
Example:

{
  "all_cities": {
    "Asia": ["Tokyo", "Seoul", "Shanghai"],
    "Europe": ["London", "Paris"]
  },
  "top_cities": ["Tokyo", "London"]
}

Non-technical collaborators can safely edit this file without touching Python code.
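Because the file is edited by hand, a quick structural check before running queries can catch mistakes early. The following is only a sketch of such a check; `validate_city_data` is a hypothetical helper, not part of the notebook, and it assumes the two-section layout shown in the example above.

```python
from typing import Any, List


def _is_city_list(value: Any) -> bool:
    """True if value is a list of non-empty strings."""
    return isinstance(value, list) and all(
        isinstance(c, str) and c.strip() for c in value
    )


def validate_city_data(city_data: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    if not isinstance(city_data, dict):
        return ["top level must be a JSON object"]
    problems = []
    all_cities = city_data.get("all_cities")
    if not isinstance(all_cities, dict):
        problems.append("'all_cities' must map continent names to city lists")
    else:
        for continent, cities in all_cities.items():
            if not _is_city_list(cities):
                problems.append(
                    f"'all_cities.{continent}' must be a list of non-empty strings"
                )
    if not _is_city_list(city_data.get("top_cities")):
        problems.append("'top_cities' must be a list of non-empty strings")
    return problems
```

Printing the returned messages gives collaborators a readable hint about which section to fix.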

Usage

Open the notebook: artist_list.ipynb
Run through the cells:
- Load .env keys.
- Fetch search results from Google.
- Parse usernames, bios, and followers.
- (Optional) Loop through cities.json for multiple queries.
Filter results (e.g., accounts with ≥2000 followers).
Export to Out/filtered.csv and Out/filtered.json.
Example code (loading cities):

import json

with open("queries/cities.json", "r", encoding="utf-8") as f:
    city_data = json.load(f)

# Flatten all continents into one list
all_cities = [city for cities in city_data["all_cities"].values() for city in cities]

# Or just use the priority list
top_cities = city_data["top_cities"]
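The filter and export steps could be sketched in a similar way. This is an illustration using only the standard library, assuming each row is a dict with `username`, `followers`, and `bio` keys; the function names are hypothetical. Dropping rows whose follower count is unknown is a design choice made here, not something the notebook is confirmed to do.

```python
import csv
import json
from pathlib import Path

FIELDS = ["username", "followers", "bio"]


def filter_by_followers(rows, follower_min):
    """Keep rows with a known follower count of at least follower_min."""
    return [
        row
        for row in rows
        if row.get("followers") is not None and row["followers"] >= follower_min
    ]


def export_results(rows, out_dir="Out"):
    """Write rows to filtered.csv and filtered.json inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "filtered.csv", "w", newline="", encoding="utf-8") as f:
        # Ignore any extra keys so rows with additional fields still export.
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    with open(out / "filtered.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
```

For example, `export_results(filter_by_followers(rows, 2000))` would write only accounts with at least 2000 followers.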

Notes

Be mindful of API quotas: free tier allows 100 queries/day.
Each query returns up to 10 results. To fetch multiple pages, set PAGES=2 or higher in .env or code.
Google snippets don’t always contain follower counts. Some rows may have None.
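To make the quota arithmetic concrete, the sketch below computes the 1-based `start` offsets for consecutive pages and checks whether a run fits within the daily free quota. It assumes one request per city per page (a single keyword); the helper names are hypothetical.

```python
RESULTS_PER_PAGE = 10   # each request returns up to 10 results
DAILY_FREE_QUOTA = 100  # free-tier requests per day, as noted above


def page_starts(pages):
    """Return the 'start' values for consecutive pages: 1, 11, 21, ..."""
    return [1 + i * RESULTS_PER_PAGE for i in range(pages)]


def requests_needed(num_cities, pages):
    """One request per city per page, assuming a single keyword."""
    return num_cities * pages


def fits_quota(num_cities, pages, quota=DAILY_FREE_QUOTA):
    """True if the whole run stays within the daily request quota."""
    return requests_needed(num_cities, pages) <= quota
```

With PAGES=2, for instance, 50 cities use exactly 100 requests, so a longer city list would need to be split across days or trimmed to the priority list.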
