**MiniDB** is a lightweight, relational database engine built from scratch in Python. It was designed to demonstrate core database internals—including B-Tree indexing concepts, Hash Joins, and Atomic Persistence—without relying on external database libraries like SQLite.
3
+
**MiniDB** is a lightweight, relational database engine built from scratch in Python. It was designed to demonstrate core database internals—including Disk-Based Binary Indexing (O(log N)), Hash Joins, and Atomic Persistence—without relying on external database libraries like SQLite.
4
4
5
5
> **Note:** This project was built for the **Pesapal Junior Dev Challenge '26**.
6
6
7
-
## 🎥 Demo Video
8
-
[Click here to watch the 2-minute System Demo]
9
-
10
7
## 🏗️ Architecture Overview
11
8
12
9
The system is organized into four modular layers, designed to mimic a production RDBMS:
@@ -17,100 +14,71 @@ graph TD
17
14
API -->|SQL String| P[SQL Parser]
18
15
P -->|Command Object| E[Execution Engine]
19
16
E -->|Read/Write| S[Storage Layer]
20
-
S -->|JSONL & Fsync| D[(Data Files)]
21
-
E -.->|O(1) Lookups| H{Hash Index}
17
+
S -->|JSONL & Binary| D[(Data & .idx Files)]
18
+
E -.->|O(log N) Search| BI{Disk Binary Index}
22
19
```
23
20
24
-
-**UI Layer**: A Flask-based Admin Dashboard (`app.py`) for visual schema management, data entry, and SQL execution.
25
-
-**SQL Parser**: A regex-based engine (`parser.py`) that translates SQL into command objects. Supports `CREATE`, `INSERT`, `SELECT` (with specific columns & nested subqueries), `UPDATE`, `DELETE`, and `JOIN`.
26
-
-**Database Engine**: The query coordinator (`database.py`). It replaces naive $O(N^2)$ loops with $O(N)$ Hash Joins and supports recursive subquery resolution.
27
-
-**Storage Layer**: Handles data persistence (`table.py`). Uses **JSON Lines (.jsonl)** for streaming I/O and implements Atomic Writes & File Locking.
21
+
-**UI Layer**: A Flask-based Admin Dashboard (`app.py`) with a premium, independent-scrolling layout for schema management, data entry, and SQL execution.
22
+
-**SQL Parser**: A regex-based engine (`parser.py`) supporting `CREATE`, `INSERT`, `SELECT` (with aggregates & subqueries), `UPDATE`, `DELETE`, and `JOIN`.
23
+
-**Database Engine**: The query coordinator (`database.py`). Implements $O(N)$ Hash Joins, SQL Aggregate Functions, and recursive subquery resolution.
24
+
-**Storage Layer**: Handles persistence (`table.py`). Uses **JSON Lines (.jsonl)** for streaming I/O and **Binary Search Indexes (.idx)** for memory-efficient lookups.
28
25
29
26
## 🧠 Key Engineering Decisions
30
27
31
28
### 1. Scalability: JSON Lines (.jsonl) Storage
32
29
Unlike standard JSON arrays which require loading the entire file into memory, MiniDB uses JSON Lines:
33
30
-**Streaming Scans**: Rows are yielded one-by-one using Python generators, keeping memory usage constant even for million-row tables.
34
31
-**O(1) Persistence**: New records are appended to the end of the file instead of rewriting the entire dataset.
35
-
-**Fast Lookups**: The engine uses the file stream to validate unique constraints and perform subquery filters without bulk loading.
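The append-and-stream behaviour described above can be sketched as follows. This is a minimal illustration, not MiniDB's actual `table.py`; the names `append_row`, `scan`, and `demo_path` are hypothetical.

```python
import json
import os
import tempfile
from typing import Iterator


def append_row(path: str, row: dict) -> None:
    """Append one record as a single JSON line; existing rows are not rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def scan(path: str) -> Iterator[dict]:
    """Yield rows one at a time so only the current line is held in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Usage: write three rows, then stream them back and stop early.
demo_path = os.path.join(tempfile.mkdtemp(), "users.jsonl")
for i in range(3):
    append_row(demo_path, {"id": i, "name": f"user{i}"})

first_two = []
for row in scan(demo_path):
    first_two.append(row)
    if len(first_two) == 2:  # stop reading once enough rows are collected
        break
```

Because `scan` is a generator, breaking out of the loop stops further disk reads, which is the same idea a `LIMIT` clause can rely on.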
To solve the "Memory Residency" limitation, MiniDB implements a custom disk-persistent binary index:
35
+
-**Binary Search on Disk**: Primary keys and file offsets are stored in `.idx` files as fixed-size binary records.
36
+
-**O(1) Memory Footprint**: Instead of loading a massive hash map into RAM, the engine performs a **Binary Search** directly on the disk file to locate rows.
37
+
-**Ordered Maintenance**: The Indexer maintains sort order during insertions, enabling efficient $O(\log N)$ point lookups without expensive memory overhead.
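A disk-resident binary search over fixed-size records might look like the sketch below. The record layout (two little-endian 64-bit integers for key and offset) and the names `RECORD`, `write_index`, and `lookup` are assumptions for illustration; the real `.idx` format may differ, and incremental ordered insertion is omitted here.

```python
import os
import struct
import tempfile
from typing import Optional

# Assumed layout: (primary key, row offset), 16 bytes per record.
RECORD = struct.Struct("<qq")


def write_index(path: str, entries: dict) -> None:
    """Write (key, offset) pairs sorted by key as fixed-size binary records."""
    with open(path, "wb") as f:
        for key in sorted(entries):
            f.write(RECORD.pack(key, entries[key]))


def lookup(path: str, key: int) -> Optional[int]:
    """Binary-search the index file on disk, reading one record per step."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD.size - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)  # jump straight to the mid-th record
            k, offset = RECORD.unpack(f.read(RECORD.size))
            if k == key:
                return offset
            if k < key:
                lo = mid + 1
            else:
                hi = mid - 1
    return None


# Usage: three keys mapped to hypothetical byte offsets in a .jsonl file.
idx_path = os.path.join(tempfile.mkdtemp(), "users.idx")
write_index(idx_path, {42: 300, 7: 0, 19: 150})
```

Fixed-size records are what make the seek arithmetic possible: record `i` always starts at byte `i * RECORD.size`, so no part of the index has to be loaded into RAM.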
38
+
39
+
### 3. Performance: Hash Joins over Nested Loops
38
40
Naive database implementations use Nested Loop Joins ($O(N \times M)$). MiniDB implements a Hash Join algorithm:
39
41
-**Build Phase**: Constructs an in-memory Hash Map of the smaller table.
40
42
-**Probe Phase**: Scans the larger table and performs $O(1)$ lookups against the map.
41
-
-**Result**: Reduces query time from linear growth to near-constant time for lookups.
43
+
-**Result**: Reduces join cost from $O(N \times M)$ to roughly $O(N + M)$, since each probed row needs only a near-constant-time lookup.
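The build and probe phases can be sketched in a few lines. This is an illustrative version under the assumption that rows are plain dicts; `hash_join` and the sample tables are hypothetical names, not MiniDB's API.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def hash_join(small: Iterable[dict], large: Iterable[dict],
              small_key: str, large_key: str) -> Iterator[dict]:
    """Inner equi-join: hash the smaller input, then stream the larger one."""
    # Build phase: group the smaller table's rows by join key.
    buckets = defaultdict(list)
    for row in small:
        buckets[row[small_key]].append(row)
    # Probe phase: one dictionary lookup per row of the larger table.
    for row in large:
        for match in buckets.get(row[large_key], []):
            yield {**match, **row}


users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]
orders = [
    {"order_id": 10, "user_id": 1, "total": 5},
    {"order_id": 11, "user_id": 1, "total": 7},
    {"order_id": 12, "user_id": 3, "total": 9},
]
joined = list(hash_join(users, orders, "user_id", "user_id"))
```

Each input is scanned once, so the work grows with the sum of the table sizes rather than their product. Columns with the same name in both tables would overwrite each other in this sketch; a real engine would typically qualify them with the table name.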
MiniDB supports advanced SQL features usually found in mature engines:
45
-
-**Recursive Execution**: Subqueries in `WHERE col IN (...)` clauses are resolved recursively before the outer query runs.
46
-
-**Column Projection**: Reduces data transfer by only returning requested columns (e.g., `SELECT name FROM users`) rather than full records.
47
-
-**Depth-Limited Scans**: The `LIMIT` clause short-circuits the storage generator, stopping disk reads as soon as the quota is met.
47
+
-**Aggregate Functions**: Supports `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX`, all computed in a single pass over the scanned rows.
48
+
-**Recursive Subqueries**: Clauses like `WHERE col IN (...)` are resolved recursively before the outer query runs.
49
+
-**Strict Validation**: Enforces numeric types for mathematical aggregates (e.g., preventing `SUM` on `STR` columns).
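A single-pass aggregate with type validation could be sketched as below. The function name `aggregate` and the choice to raise `TypeError` are assumptions; how MiniDB reports the error is not stated in the source.

```python
from numbers import Number
from typing import Iterable


def aggregate(rows: Iterable[dict], column: str) -> dict:
    """Compute COUNT, SUM, AVG, MIN and MAX for one column in a single pass."""
    count, total = 0, 0
    lo = hi = None
    for row in rows:
        value = row[column]
        # Reject non-numeric values (bool is excluded even though it subclasses int).
        if isinstance(value, bool) or not isinstance(value, Number):
            raise TypeError(f"column {column!r} holds non-numeric value {value!r}")
        count += 1
        total += value
        lo = value if lo is None or value < lo else lo
        hi = value if hi is None or value > hi else hi
    avg = total / count if count else None
    return {"COUNT": count, "SUM": total, "AVG": avg, "MIN": lo, "MAX": hi}


stats = aggregate([{"amount": 10}, {"amount": 20}, {"amount": 60}], "amount")
```

All five results share one loop, so the table is read only once regardless of how many aggregates the query requests.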
48
50
49
-
### 4. Reliability: Atomic Writes (Crash Safety)
51
+
### 5. Reliability: Atomic Writes (Crash Safety)
50
52
To prevent data corruption during power failures, MiniDB uses an atomic save strategy:
51
53
1. Writes data to a temporary file (`table.tmp`).
52
54
2. Forces a hardware flush using `os.fsync`.
53
55
3. Performs an atomic swap using `os.replace`.
54
-
-**Result**: The database is never in a "half-written" state.
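The three steps above can be sketched with the standard library alone. The helper name `atomic_save` and the `path + ".tmp"` naming are assumptions made for the example.

```python
import json
import os
import tempfile


def atomic_save(path: str, rows: list) -> None:
    """Write rows to a temp file, flush to disk, then swap it into place."""
    tmp_path = path + ".tmp"  # assumed naming; lives next to the target file
    with open(tmp_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its cache to the device
    os.replace(tmp_path, path)  # atomic rename over the old file


table_path = os.path.join(tempfile.mkdtemp(), "table.jsonl")
atomic_save(table_path, [{"id": 1}])
atomic_save(table_path, [{"id": 1}, {"id": 2}])
```

Keeping the temporary file in the same directory as the target matters, because `os.replace` is only atomic when both paths are on the same filesystem.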
55
56
56
-
### 5. Consistency: ACID Transactions
57
+
### 6. Consistency: ACID Transactions
57
58
MiniDB implements a robust Transaction Manager within the engine:
58
59
-**Staging Area**: Changes during a transaction are kept in a session-specific buffer.
59
60
-**Atomicity**: Supports `BEGIN`, `COMMIT`, and `ROLLBACK` for multi-statement workflows.
60
-
-**Integrity**: Ensures that if a process crashes mid-transaction, no partial data is committed to disk.
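The staging-buffer idea can be illustrated with an in-memory sketch. The class `Session` and its methods are hypothetical and model only inserts; the real Transaction Manager would also have to cover updates, deletes, and persistence.

```python
class Session:
    """Buffers inserts between BEGIN and COMMIT; ROLLBACK discards them."""

    def __init__(self, table: list):
        self.table = table   # stands in for the committed on-disk rows
        self.staged = None   # None means no transaction is open

    def begin(self) -> None:
        self.staged = []

    def insert(self, row: dict) -> None:
        if self.staged is None:
            self.table.append(row)   # autocommit outside a transaction
        else:
            self.staged.append(row)  # held back until COMMIT

    def commit(self) -> None:
        self.table.extend(self.staged)
        self.staged = None

    def rollback(self) -> None:
        self.staged = None


committed = []
session = Session(committed)
session.begin()
session.insert({"id": 1})
session.rollback()            # staged row is dropped
session.begin()
session.insert({"id": 2})
session.commit()              # staged row becomes visible
```

Since staged rows never touch the committed data before `commit`, a crash mid-transaction leaves the committed state unchanged.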
61
61
62
-
### 6. Concurrency: Multi-Process File Locking
63
-
To support multiple users/processes, MiniDB implements a global `LockManager`:
64
-
-**Pessimistic Locking**: Leverages file-based locks (`.lock` files) to prevent race conditions during write operations.
65
-
-**Timeout & Retry**: Includes a retry mechanism with configurable timeouts for busy database scenarios.
66
-
-**Stale Lock Cleanup**: Includes logic to detect and remove "dead" locks left behind by crashed processes.
62
+
### 7. Concurrency: Multi-Process File Locking
63
+
Leverages a global `LockManager` with pessimistic file-based locks to prevent race conditions during concurrent write operations across multiple processes.
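One common way to build such a lock is exclusive file creation, sketched below. The function `file_lock`, its parameters, and the `.lock` suffix handling are assumptions for illustration and are not the `LockManager` API; stale-lock detection is left out.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def file_lock(path: str, timeout: float = 2.0, poll: float = 0.05):
    """Hold an exclusive lock by creating `path + '.lock'`; retry until timeout."""
    lock_path = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another process already holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


lock_target = os.path.join(tempfile.mkdtemp(), "users.jsonl")
with file_lock(lock_target):
    held = os.path.exists(lock_target + ".lock")  # True while the block runs
```

Writers would wrap their write path in `with file_lock(...)`, so a second process either waits for the lock file to disappear or gives up after the timeout.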
If you have Docker installed, you can spin up the entire system with a single command:
99
-
```bash
100
-
docker-compose up --build
101
-
```
102
-
-**Web UI**: Access at `http://localhost:5000`
103
-
-**Persistence**: Database files are automatically mapped to the `./data` folder on your host machine.
104
-
105
-
## ⚠️ Known Limitations (Prototype Scope)
106
-
-**SQL Breadth**: Supports a subset of SQL syntax. Complex multi-level aggregate functions (SUM, AVG) are not yet implemented.
107
-
-**Memory Residency**: While storage is streaming, the primary key index is kept in-memory for $O(1)$ speed. An extremely large keyspace may require a B-Tree file-backed index.
76
+
1. Start the Web Admin Dashboard: `python app.py` (Visit `http://127.0.0.1:5000`)
77
+
2. CLI mode: `python main.py`
108
78
109
-
## 🙏 Acknowledgements & AI Usage
79
+
## 🙏 Acknowledgements
110
80
This project was built as part of the Pesapal Junior Dev Challenge '26.
111
81
-**Architecture & Logic**: Designed by Collins Odhiambo.
112
-
-**Code Generation**: Boilerplate regex parsing and Flask templates generated by AI (Gemini 2.0).
113
-
-**Algorithm Optimization**: AI assisted in refactoring the Join algorithm, implementing the concurrency lock manager, and the JSONL storage migration.
114
-
-**Verification**: All code was manually reviewed, tested, and integrated by the author.
82
+
-**Code Generation & Optimization**: AI (Gemini 2.0) assisted in developing regex patterns, refactoring algorithms, and generating premium UI components.