Skip to content

ArnavGarg7/Engineering-student-chatbot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SPEED AI — Engineering Student Database Chatbot

A full-stack conversational assistant over a synthetic Indian engineering-college database. Ask questions in plain English (or Hindi, Spanish, and more); SPEED AI understands intent, generates safe SQL, and replies naturally in your language.

Overview

This project models a fictional Indian engineering college with 1,280 students across 10 departments and 8 semesters of academic history. A chat interface lets you query the database in natural language.

Under the hood it is a conversational Text-to-SQL system, driven end-to-end by the LLM with no hardcoded keyword/language/entity lists:

  1. A single LLM router reads the message in any language and decides whether it is conversational (a greeting, small talk, an on-theme question it can answer directly, or an off-theme request it politely declines) or a database question. For conversational messages it writes the reply in the user's own language.
  2. For a database question the same router returns a complete, self-contained English query — translating from the user's language and resolving follow-up references ("how many of them are female?" → "…female students in Computer Science") using the recent conversation.
  3. Text-to-SQL generates a SELECT grounded in the live schema, guarded by deterministic safety checks and a self-correction loop.
  4. A natural-language answer is written from the query results, in the user's language.

The project was built as an internship exercise covering relational database design, FastAPI backend development, a React + Vite frontend, multilingual LLM integration with provider failover, and reproducible synthetic data generation.


Features

Category Details
Database SQLite with 4 normalised tables, indexes, and foreign-key constraints
Students 1,280 synthetic student records across 10 departments, 4 academic years
Conversational Handles any greeting / small talk / identity / help / off-topic message naturally
Multilingual Understands and replies in the user's language — any language the model knows (Hindi, Spanish, French, Arabic, Japanese, Tamil, …), no hardcoded language list
Text-to-SQL Generates custom SELECT queries for open-ended questions, grounded in the live schema
Multi-turn context Follow-ups ("how many of them are female?", "what about 3rd year?") are resolved by the LLM into self-contained queries using recent conversation
RAG examples 32 curated question→SQL examples retrieved from ChromaDB to steer generation
Safety guardrails SELECT-only, keyword blacklist, forced row cap, EXPLAIN parse check, table/column grounding, read-only DB connection
Self-correction Failed queries are automatically repaired (up to 2 attempts) using the error message
Provider failover Free multi-provider chain (Groq 70B → Groq 8B → Gemini) with health monitoring; rate-limits fail over instantly without degrading the provider
Visualizations Shape-based stat cards, bar charts, and paginated tables with CSV export
Synthetic data Faker (en_IN locale) generates realistic Indian names and cities; Gaussian marks model

Technology Stack

Backend

Tool Purpose
Python 3.13+ Application runtime
FastAPI / Uvicorn REST API + ASGI server
SQLite (stdlib) Embedded relational database + chat memory
ChromaDB Vector store for RAG (SQL examples + business rules)
python-dotenv Environment configuration

AI providers (all free-tier)

Provider Model Role
Groq llama-3.3-70b-versatile Primary — routing, Text-to-SQL, and responses
Groq llama-3.1-8b-instant Fallback (registered as groq_fast)
Google Gemini gemini-2.5-flash Second fallback

Groq is the reliable free workhorse; its daily token budget is per-model, so the 70B and 8B models are registered as two providers with two independent budgets. All models and the provider order are configurable via environment variables (ENABLED_PROVIDERS, etc.), and the app degrades gracefully if no keys are set. OpenAI and OpenRouter are supported in code but disabled by default (their free tiers were unreliable).

Frontend

Tool Purpose
React 18 + Vite 5 Chat UI, visualizations, session history, dark mode

Database Design

departments(department_id PK, department_name UNIQUE, department_code UNIQUE)
students(roll_no PK, student_name, age, gender, home_city,
         department_id FK, current_year 1-4, current_semester 1-8, batch_year)
subjects(subject_id PK, department_id FK, semester 1-8, subject_code UNIQUE, subject_name)
marks(mark_id PK, roll_no FK, subject_id FK, semester 1-8, marks 0-100, result 'Pass'|'Fail')

Chat state is persisted in two additional tables: chat_sessions (with a rolling conversation_summary) and chat_messages.

Academic-year → completed-semesters rule: marks exist only for completed semesters (2 × current_year − 1); the current semester is ongoing and has no marks.


System Architecture

User (browser)
      │  POST /api/chat
      ▼
React Frontend (Vite, port 5173)
      ▼
FastAPI Backend (Uvicorn, port 8000)  ── app_server.py
      │
      ├── conversation_router.py  ONE LLM call: classify (conversational vs database_query),
      │        │                  translate any language → English, resolve follow-ups
      │        └── conversational → natural reply written in the user's language
      │
      └── (database_query, as a self-contained English query) ───────
               ├── query_engine.py       orchestrator
               └── text_to_sql.py
                     ├── vector_store.py      RAG: SQL examples + business rules (ChromaDB)
                     ├── database_introspection.py   full live schema
                     ├── generate → validate → ground → EXPLAIN → execute
                     ├── self-correction repair loop (up to 2 attempts)
                     └── ai_service.py    write the answer from results, in the user's language
      │
      ▼
SQLite (engineering_college.db, opened read-only for query execution)

language_detection.py               ── reply-language instruction (anchored to the user's words)
memory_manager.py                   ── chat session/message persistence + recent-context retrieval
llm_provider.py + provider_health.py ── free multi-provider failover & health monitoring

Safety: the SQL engine only ever runs SELECT statements. Generated SQL must start with SELECT, contain no mutating keywords, reference only real tables/columns (deterministic grounding check), pass an EXPLAIN QUERY PLAN parse, and is executed against a read-only database connection with a hard row cap.

Resilience: if a provider is rate-limited (429) or fails, llm_provider.py fails over instantly to the next provider; a background health monitor marks failing providers degraded and periodically re-checks them for recovery.


Supported Queries

Because questions are answered by Text-to-SQL, the range is open-ended. Examples:

  • "List all students in Computer Science." / "Who studies in CSE?"
  • "Show all 3rd year Mechanical Engineering students."
  • "How many students passed in Computer Science, 2nd year?"
  • "Show the full academic history of roll number 2025-CSE-001."
  • "Show the toppers in Information Technology."
  • "Which department has the best academic performance?"
  • "How many students are from Delhi?" / "Which city produces the most toppers?"
  • Follow-ups: "…now only the female ones", "what about 3rd year?"

Recognised abbreviations

CSE/CS → Computer Science · ME → Mechanical Engineering · CE → Civil Engineering · EEE → Electrical Engineering · ECE → Electronics & Communication · CHE → Chemical · BT → Biotechnology · IT → Information Technology · AE → Automobile · ASE → Aerospace


Installation

Prerequisites

  • Python 3.13+, Node.js 18+ and npm
  • At least one working LLM API key — Groq is recommended (free and reliable). The system is fully LLM-driven, so without a working key it will report that the AI service is unavailable rather than answering.

1. Install Python dependencies

py -3.13 -m pip install -r requirements.txt

2. Configure environment variables

copy .env.example .env
# edit .env and paste your key(s)

3. Generate the database

py -3.13 generate_db.py

4. Build the vector index (RAG examples + rules)

py -3.13 build_vector_index.py

5. Install frontend dependencies

cd frontend && npm install && cd ..

Environment Variables

Variable Description
GROQ_API_KEY Groq key (primary provider, free tier). Get one at console.groq.com
GEMINI_API_KEY Google Gemini key (fallback). aistudio.google.com
ENABLED_PROVIDERS Comma list of active providers (default groq,groq_fast,gemini)
TEXT_TO_SQL_PROVIDER / CONVERSATION_PROVIDER Active provider per task (default groq)
FALLBACK_PROVIDER / SECOND_FALLBACK_PROVIDER Ordered fallbacks (default groq_fast, gemini)
GROQ_MODEL / GROQ_FAST_MODEL / GEMINI_MODEL Model id overrides for the two Groq models and Gemini

OPENAI_API_KEY / OPENROUTER_API_KEY are also read if you re-enable those providers via ENABLED_PROVIDERS. The .env file is git-ignored and never committed.


Running

Backend:

py -3.13 -m uvicorn app_server:app --reload --port 8000

Docs at http://127.0.0.1:8000/docs.

Frontend:

cd frontend && npm run dev

Open http://localhost:5173.

On Windows, start.bat launches both together.

Key endpoints

Method Path Description
POST /api/chat Send a message; returns a natural-language reply, structured data, and metadata
GET /api/sessions List recent chat sessions
GET /api/chat/{session_id}/history Full message history for a session
GET /api/sample_questions Example questions for the UI
GET /api/health Health check: database, ChromaDB, and per-provider status

Testing

Unit tests cover the deterministic surface — the SQL safety guardrails, the router's JSON parsing, and the language reply instruction — with no API keys or network required (classification/translation/routing are the LLM's job and aren't unit-tested):

py -3.13 -m pytest

Synthetic Data

Demographics are generated with Faker (en_IN) under a fixed seed for reproducibility. Marks use a Gaussian model: each student has a latent ability score ~ N(68 + year·1.5, 9); subject marks ~ N(ability, 11) clamped to [18, 100], with an 8% chance of a difficulty event. Marks ≥ 40 → Pass, else Fail.


Project Structure

Speed Project/
├── app_server.py            FastAPI app — endpoints, chat orchestration, startup checks
├── conversation_router.py   ONE-call LLM router: classify + translate + resolve follow-ups
├── language_detection.py    Reply-language instruction (anchored to the user's words)
├── query_engine.py          Query orchestrator + QueryResult
├── text_to_sql.py           SQL generation, guardrails, self-correction, execution
├── ai_service.py            Natural-language answer generation
├── vector_store.py          ChromaDB RAG retrieval
├── build_vector_index.py    Builds the RAG index (schema, rules, SQL examples)
├── database_introspection.py Live schema introspection
├── memory_manager.py        Chat session/message persistence + summarization
├── assistant_personality.py Persona/system-prompt text
├── llm_provider.py          Multi-provider client + retry/failover
├── provider_health.py       Provider health monitoring & recovery
├── generate_db.py           Synthetic database generator (Faker)
├── schema.sql               SQLite schema
├── pytest.ini               Test configuration
├── requirements.txt
└── frontend/                React + Vite chat UI
    └── src/  (App.jsx, api.js, main.jsx, styles.css)

Limitations

  • Synthetic data — marks are statistically modelled, not real.
  • Read-only — the API only answers questions; there is no create/update/delete.
  • No authentication — intended for local development and demos.
  • Text-to-SQL is capped to SELECT and a maximum row count for safety.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors