A simple search engine built in Python to understand how search engines work internally.
At a high level, a search engine does three things:
- Crawls pages (collects content)
- Indexes pages (builds data structures for fast lookup)
- Searches (answers user queries with ranked results)
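The crawl step, collecting page content and discovering outgoing links, can be sketched with the standard library alone. This is an illustrative sketch, not the project's actual `crawler.py`; the `LinkExtractor` name and the sample HTML are made up for the example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so the crawler can fetch them later.
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/about">About</a> <a href="https://example.org/">Ext</a>'
parser = LinkExtractor("https://example.com/index.html")
parser.feed(html)
print(parser.links)
# → ['https://example.com/about', 'https://example.org/']
```

A real crawler would fetch each discovered link, deduplicate visited URLs, and hand the page text to the indexer.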
The project is organized into four components:

- Crawler: Fetches web pages and extracts outgoing links to discover more pages.
- Indexer: Converts crawled content into an inverted index for fast search.
- Query Engine: Tokenizes the query, finds matching documents, ranks them, and returns results.
- Web UI: A minimal Flask UI to enter queries and view results.
```
Search-Engine/
├── .github/workflows/black.yml
├── src/
│   └── search_engine/
│       ├── models/
│       │   ├── PageModel.py
│       │   └── TokenType.py
│       ├── utils/
│       │   ├── loggers.py
│       │   ├── parse_html.py
│       │   ├── requests.py
│       │   ├── string_utils.py
│       │   └── variables.py
│       ├── crawler.py
│       ├── indexer.py
│       └── query_response.py
├── templates/
│   └── index.html
├── app.py
├── pyproject.toml
├── uv.lock
└── README.md
```
You do not need to modify the project structure to run the search engine.
This project uses uv as the Python package manager.
All dependencies are declared in pyproject.toml and locked in uv.lock.
From the project root, run:
```
uv sync
```

After installing dependencies, start the backend server with:
```
uv run python app.py
```

The app is served at http://127.0.0.1:5000. On startup:

- A background asyncio event loop is created
- The crawler starts discovering web pages
- The indexer builds an inverted index
- Flask serves HTTP requests
- Queries are executed against the in-memory index
Crawling, indexing, and searching run concurrently.
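The concurrency model described above, a background asyncio loop running the crawler and indexer while Flask blocks the main thread, can be sketched like this. The `crawl_forever` coroutine and the wiring are illustrative placeholders, not the code in `app.py`:

```python
import asyncio
import threading

# Run an asyncio event loop in a daemon thread so long-running crawl/index
# coroutines keep working while Flask's blocking server owns the main thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def crawl_forever():
    # Placeholder for the real work: fetch a page, parse it, update the
    # inverted index, then yield control back to the event loop.
    while True:
        await asyncio.sleep(1)

# Schedule the coroutine on the background loop from the main thread.
asyncio.run_coroutine_threadsafe(crawl_forever(), loop)

# In the real app, Flask's app.run(...) would now block the main thread
# while crawling and indexing continue on the background loop, and request
# handlers read from the shared in-memory index.
```

`run_coroutine_threadsafe` is the thread-safe bridge here: it is the supported way to submit a coroutine to an event loop running in another thread.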
Configuration (CommonVariables)
Configuration values are defined in `src/search_engine/utils/variables.py`.

GitHub Actions enforces code formatting using Black.
Workflow location:
```
.github/workflows/black.yml
```

To run formatting locally:

```
uv run black .
```