Commit 5eb9fa8

Update README with Disk-Based Indexing and Aggregate Functions
1 parent 279c329 commit 5eb9fa8

1 file changed

Lines changed: 28 additions & 60 deletions

README.md

@@ -1,12 +1,9 @@
11
# MiniDB: A Custom RDBMS from First Principles
22

3-
**MiniDB** is a lightweight, relational database engine built from scratch in Python. It was designed to demonstrate core database internals—including B-Tree indexing concepts, Hash Joins, and Atomic Persistence—without relying on external database libraries like SQLite.
3+
**MiniDB** is a lightweight, relational database engine built from scratch in Python. It was designed to demonstrate core database internals—including Disk-Based Binary Indexing (O(log N)), Hash Joins, and Atomic Persistence—without relying on external database libraries like SQLite.
44

55
> **Note:** This project was built for the **Pesapal Junior Dev Challenge '26**.
66
7-
## 🎥 Demo Video
8-
[Click here to watch the 2-minute System Demo]
9-
107
## 🏗️ Architecture Overview
118

129
The system is organized into four modular layers, designed to mimic a production RDBMS:
@@ -17,100 +14,71 @@ graph TD
1714
API -->|SQL String| P[SQL Parser]
1815
P -->|Command Object| E[Execution Engine]
1916
E -->|Read/Write| S[Storage Layer]
20-
S -->|JSONL & Fsync| D[(Data Files)]
21-
E -.->|O(1) Lookups| H{Hash Index}
17+
S -->|JSONL & Binary| D[(Data & .idx Files)]
18+
E -.->|O(log N) Search| BI{Disk Binary Index}
2219
```
2320

24-
- **UI Layer**: A Flask-based Admin Dashboard (`app.py`) for visual schema management, data entry, and SQL execution.
25-
- **SQL Parser**: A regex-based engine (`parser.py`) that translates SQL into command objects. Supports `CREATE`, `INSERT`, `SELECT` (with specific columns & nested subqueries), `UPDATE`, `DELETE`, and `JOIN`.
26-
- **Database Engine**: The query coordinator (`database.py`). It replaces naive $O(N^2)$ loops with $O(N)$ Hash Joins and supports recursive subquery resolution.
27-
- **Storage Layer**: Handles data persistence (`table.py`). Uses **JSON Lines (.jsonl)** for streaming I/O and implements Atomic Writes & File Locking.
21+
- **UI Layer**: A Flask-based Admin Dashboard (`app.py`) with a premium, independent-scrolling layout for schema management, data entry, and SQL execution.
22+
- **SQL Parser**: A regex-based engine (`parser.py`) supporting `CREATE`, `INSERT`, `SELECT` (with aggregates & subqueries), `UPDATE`, `DELETE`, and `JOIN`.
23+
- **Database Engine**: The query coordinator (`database.py`). Implements $O(N)$ Hash Joins, SQL Aggregate Functions, and recursive subquery resolution.
24+
- **Storage Layer**: Handles persistence (`table.py`). Uses **JSON Lines (.jsonl)** for streaming I/O and **Binary Search Indexes (.idx)** for memory-efficient lookups.
2825

2926
## 🧠 Key Engineering Decisions
3027

3128
### 1. Scalability: JSON Lines (.jsonl) Storage
3229
Unlike standard JSON arrays which require loading the entire file into memory, MiniDB uses JSON Lines:
3330
- **Streaming Scans**: Rows are yielded one-by-one using Python generators, keeping memory usage constant even for million-row tables.
3431
- **O(1) Persistence**: New records are appended to the end of the file instead of rewriting the entire dataset.
35-
- **Fast Lookups**: The engine uses the file stream to validate unique constraints and perform subquery filters without bulk loading.
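The append-and-stream pattern described above can be sketched as follows. This is a minimal illustration, not the code in `table.py`: the function names `append_row` and `scan_rows` and the demo file path are assumptions.

```python
import json
import os
import tempfile
from typing import Iterator


def append_row(path: str, row: dict) -> None:
    """Append one record as a single JSON line; existing rows are not rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def scan_rows(path: str) -> Iterator[dict]:
    """Yield rows one at a time instead of loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)


# Demo on a throwaway file (hypothetical path).
demo_path = os.path.join(tempfile.mkdtemp(), "users.jsonl")
append_row(demo_path, {"id": 1, "name": "Ada"})
append_row(demo_path, {"id": 2, "name": "Linus"})
names = [row["name"] for row in scan_rows(demo_path)]
```

Because `scan_rows` is a generator, a caller that stops iterating early also stops reading from disk.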
3632

37-
### 2. Performance: Hash Joins over Nested Loops
33+
### 2. Efficiency: Disk-Based Binary Indexing (O(log N))
34+
To solve the "Memory Residency" limitation, MiniDB implements a custom disk-persistent binary index:
35+
- **Binary Search on Disk**: Primary keys and file offsets are stored in `.idx` files as fixed-size binary records.
36+
- **O(1) Memory Footprint**: Instead of loading a massive hash map into RAM, the engine performs a **Binary Search** directly on the disk file to locate rows.
37+
- **Ordered Maintenance**: The Indexer maintains sort order during insertions, enabling efficient $O(\log N)$ point lookups without expensive memory overhead.
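A rough sketch of a binary search over fixed-size records on disk is shown below. The record layout (two signed 64-bit integers for key and byte offset) and the names `write_index` and `lookup` are assumptions for illustration; the actual `.idx` format may differ.

```python
import os
import struct
import tempfile
from typing import Iterable, Optional, Tuple

# Assumed layout: (primary key, byte offset in the .jsonl file), 16 bytes per record.
RECORD = struct.Struct("<qq")


def write_index(path: str, entries: Iterable[Tuple[int, int]]) -> None:
    """Write (key, offset) pairs sorted by key as fixed-size binary records."""
    with open(path, "wb") as f:
        for key, offset in sorted(entries):
            f.write(RECORD.pack(key, offset))


def lookup(path: str, key: int) -> Optional[int]:
    """Binary-search the file on disk; only one record is held in memory at a time."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)  # fixed-size records make the seek position computable
            k, offset = RECORD.unpack(f.read(RECORD.size))
            if k == key:
                return offset
            if k < key:
                lo = mid + 1
            else:
                hi = mid
    return None


# Demo on a throwaway index file (hypothetical path and values).
idx_path = os.path.join(tempfile.mkdtemp(), "users.idx")
write_index(idx_path, [(30, 200), (10, 0), (20, 100)])
found = lookup(idx_path, 20)
missing = lookup(idx_path, 25)
```

Each lookup performs at most about $\log_2 N$ seeks. Note that this sketch rebuilds the file in sorted order; keeping it sorted across individual inserts requires shifting records and is not shown here.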
38+
39+
### 3. Performance: Hash Joins over Nested Loops
3840
Naive database implementations use Nested Loop Joins ($O(N \times M)$). MiniDB implements a Hash Join algorithm:
3941
- **Build Phase**: Constructs an in-memory Hash Map of the smaller table.
4042
- **Probe Phase**: Scans the larger table and performs $O(1)$ lookups against the map.
41-
- **Result**: Reduces query time from linear growth to near-constant time for lookups.
43+
- **Result**: Reduces join cost from $O(N \times M)$ to roughly $O(N + M)$, because each probe is an expected $O(1)$ hash lookup.
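The build and probe phases can be sketched in a few lines. This is an illustrative version, not the code in `database.py`; the function name `hash_join` and the sample tables are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator


def hash_join(small: Iterable[dict], large: Iterable[dict],
              small_key: str, large_key: str) -> Iterator[dict]:
    """Equi-join two row streams using an in-memory hash map of the smaller one."""
    # Build phase: bucket the smaller table by its join key.
    buckets: Dict[object, list] = defaultdict(list)
    for row in small:
        buckets[row[small_key]].append(row)
    # Probe phase: stream the larger table and look up matches.
    for row in large:
        for match in buckets.get(row[large_key], ()):
            yield {**match, **row}


# Hypothetical sample data.
users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}]
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 3},
    {"order_id": 12, "user_id": 1},
]
joined = list(hash_join(users, orders, "id", "user_id"))
```

Each table is read once, so the work grows with the combined size of the two inputs plus the number of matches, rather than with their product.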
4244

43-
### 3. Intelligence: Nested Subqueries & Projection
45+
### 4. Intelligence: SQL Aggregates & Subqueries
4446
MiniDB supports advanced SQL features usually found in mature engines:
45-
- **Recursive Execution**: Subqueries in `WHERE col IN (...)` clauses are resolved recursively before the outer query runs.
46-
- **Column Projection**: Reduces data transfer by only returning requested columns (e.g., `SELECT name FROM users`) rather than full records.
47-
- **Depth-Limited Scans**: The `LIMIT` clause short-circuits the storage generator, stopping disk reads as soon as the quota is met.
47+
- **Aggregate Functions**: Supports `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX`, each computed in a single pass over the scanned rows.
48+
- **Recursive Subqueries**: Clauses like `WHERE col IN (...)` are resolved recursively before the outer query runs.
49+
- **Strict Validation**: Enforces numeric types for mathematical aggregates (e.g., preventing `SUM` on `STR` columns).
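A single-pass aggregate with numeric type checking might look like the sketch below. The function name `aggregate`, the returned dictionary shape, and the error message are assumptions, not the engine's actual interface.

```python
from typing import Iterable, Optional


def aggregate(rows: Iterable[dict], column: str) -> dict:
    """Compute COUNT, SUM, AVG, MIN and MAX over one column in a single pass."""
    count = 0
    total = 0
    lo: Optional[float] = None
    hi: Optional[float] = None
    for row in rows:
        value = row[column]
        # Reject non-numeric values (bool is excluded even though it subclasses int).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"column {column!r} is not numeric: {value!r}")
        count += 1
        total += value
        lo = value if lo is None or value < lo else lo
        hi = value if hi is None or value > hi else hi
    avg = total / count if count else None
    return {"COUNT": count, "SUM": total, "AVG": avg, "MIN": lo, "MAX": hi}


# Hypothetical sample data.
stats = aggregate([{"price": 10}, {"price": 20}, {"price": 30}], "price")
```

Because the rows are consumed as an iterable, this works directly on a streaming scan without materializing the table.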
4850

49-
### 4. Reliability: Atomic Writes (Crash Safety)
51+
### 5. Reliability: Atomic Writes (Crash Safety)
5052
To prevent data corruption during power failures, MiniDB uses an atomic save strategy:
5153
1. Writes data to a temporary file (`table.tmp`).
5254
2. Forces a hardware flush using `os.fsync`.
5355
3. Performs an atomic swap using `os.replace`.
54-
- **Result**: The database is never in a "half-written" state.
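The three steps above map onto standard library calls roughly as follows. The function name `atomic_save` and the temporary-file naming (`path + ".tmp"`) are assumptions for this sketch.

```python
import json
import os
import tempfile
from typing import Iterable


def atomic_save(path: str, rows: Iterable[dict]) -> None:
    """Write rows to a temp file, flush to disk, then swap it into place."""
    tmp_path = path + ".tmp"  # assumed naming; kept in the same directory as the target
    with open(tmp_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write its cache to the device
    os.replace(tmp_path, path)  # readers see either the old file or the new one


# Demo on a throwaway file (hypothetical path).
table_path = os.path.join(tempfile.mkdtemp(), "users.jsonl")
atomic_save(table_path, [{"id": 1}, {"id": 2}])
```

Keeping the temporary file next to the target matters: `os.replace` is only atomic when both paths are on the same filesystem.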
5556

56-
### 5. Consistency: ACID Transactions
57+
### 6. Consistency: ACID Transactions
5758
MiniDB implements a robust Transaction Manager within the engine:
5859
- **Staging Area**: Changes during a transaction are kept in a session-specific buffer.
5960
- **Atomicity**: Supports `BEGIN`, `COMMIT`, and `ROLLBACK` for multi-statement workflows.
60-
- **Integrity**: Ensures that if a process crashes mid-transaction, no partial data is committed to disk.
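The staging-buffer idea can be illustrated with a small in-memory sketch. The class name `Transaction` and its methods are hypothetical and do not reflect the engine's actual Transaction Manager; the committed store here is a plain list standing in for the table on disk.

```python
from typing import List, Optional


class Transaction:
    """Buffers inserts between begin() and commit(); rollback() discards them."""

    def __init__(self, committed: List[dict]) -> None:
        self._committed = committed          # stand-in for durable storage
        self._staged: Optional[List[dict]] = None

    def begin(self) -> None:
        self._staged = []

    def insert(self, row: dict) -> None:
        if self._staged is None:
            self._committed.append(row)      # no open transaction: apply immediately
        else:
            self._staged.append(row)         # staged rows are invisible until commit

    def commit(self) -> None:
        if self._staged is None:
            raise RuntimeError("no transaction in progress")
        self._committed.extend(self._staged)
        self._staged = None

    def rollback(self) -> None:
        self._staged = None


# Hypothetical usage.
table: List[dict] = []
tx = Transaction(table)
tx.begin()
tx.insert({"id": 1})
tx.rollback()        # table is still empty
tx.begin()
tx.insert({"id": 2})
tx.commit()          # table now holds the committed row
```

If the process stops before `commit()`, the staged rows exist only in memory and never reach the committed store.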
6161

62-
### 6. Concurrency: Multi-Process File Locking
63-
To support multiple users/processes, MiniDB implements a global `LockManager`:
64-
- **Pessimistic Locking**: Leverages file-based locks (`.lock` files) to prevent race conditions during write operations.
65-
- **Timeout & Retry**: Includes a retry mechanism with configurable timeouts for busy database scenarios.
66-
- **Stale Lock Cleanup**: Includes logic to detect and remove "dead" locks left behind by crashed processes.
62+
### 7. Concurrency: Multi-Process File Locking
63+
Leverages a global `LockManager` with pessimistic file-based locks to prevent race conditions during concurrent write operations across multiple processes.
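A pessimistic file lock with a timeout can be sketched with exclusive file creation. This is an assumption-laden illustration, not the project's `LockManager`: the name `file_lock`, its parameters, and the lock file contents are hypothetical, and stale-lock detection is omitted.

```python
import os
import tempfile
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def file_lock(lock_path: str, timeout: float = 2.0, poll: float = 0.05) -> Iterator[None]:
    """Hold an exclusive lock by creating lock_path; retry until timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another process already holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


# Hypothetical usage around a write operation.
demo_lock = os.path.join(tempfile.mkdtemp(), "users.lock")
with file_lock(demo_lock):
    held = os.path.exists(demo_lock)
released = not os.path.exists(demo_lock)
```

A crashed holder would leave the lock file behind, so a production version needs some way to detect and clear stale locks.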
6764

6865
## 🚀 How to Run
6966

70-
### Prerequisites
71-
- Python 3.10+
72-
- Flask (for the web dashboard)
73-
7467
### Installation
7568
1. Clone the repository:
7669
```bash
7770
git clone https://github.com/collins-odhiambo/minidb.git
7871
cd minidb
7972
```
80-
2. Install dependencies:
81-
```bash
82-
pip install -r requirements.txt
83-
```
73+
2. Install dependencies: `pip install -r requirements.txt`
8474

8575
### Running the App
86-
1. Start the Web Admin Dashboard:
87-
```bash
88-
python app.py
89-
```
90-
Visit `http://127.0.0.1:5000` to access the UI.
91-
92-
2. To use the CLI (REPL) mode:
93-
```bash
94-
python main.py
95-
```
96-
97-
### 🐳 Run with Docker
98-
If you have Docker installed, you can spin up the entire system with a single command:
99-
```bash
100-
docker-compose up --build
101-
```
102-
- **Web UI**: Access at `http://localhost:5000`
103-
- **Persistence**: Database files are automatically mapped to the `./data` folder on your host machine.
104-
105-
## ⚠️ Known Limitations (Prototype Scope)
106-
- **SQL Breadth**: Supports a subset of SQL syntax. Complex multi-level aggregate functions (SUM, AVG) are not yet implemented.
107-
- **Memory Residency**: While storage is streaming, the primary key index is kept in-memory for $O(1)$ speed. Extremely large keyspace may require a B-Tree file-backed index.
76+
1. Start the Web Admin Dashboard: `python app.py` (Visit `http://127.0.0.1:5000`)
77+
2. CLI mode: `python main.py`
10878

109-
## 🙏 Acknowledgements & AI Usage
79+
## 🙏 Acknowledgements
11080
This project was built as part of the Pesapal Junior Dev Challenge '26.
11181
- **Architecture & Logic**: Designed by Collins Odhiambo.
112-
- **Code Generation**: Boilerplate regex parsing and Flask templates generated by AI (Gemini 2.0).
113-
- **Algorithm Optimization**: AI assisted in refactoring the Join algorithm, implementing the concurrency lock manager, and the JSONL storage migration.
114-
- **Verification**: All code was manually reviewed, tested, and integrated by the author.
82+
- **Code Generation & Optimization**: AI (Gemini 2.0) assisted in developing regex patterns, refactoring algorithms, and generating premium UI components.
11583

11684
Built with code, sweat, and Python.
