Linguist AI is a machine learning application that identifies the language of a given text. Built with Multinomial Naive Bayes and Natural Language Processing (NLP) techniques, it detects 22 languages and reaches over 91% accuracy on unseen test data.
- Multilingual Support: Detects 22 languages including English, Hindi, Spanish, French, Chinese, and more.
- Micro-Cleaning Engine: Advanced preprocessing that strips noise (numbers/special chars) while preserving linguistic integrity.
- Premium Web Interface: A sleek, glassmorphic UI built with Flask for real-time interaction.
- Technical Rigor: Follows a full ML lifecycle from EDA to deployment.
The core detection engine uses the Multinomial Naive Bayes (MNB) classifier.
- How it works: MNB is based on Bayes' Theorem and is particularly suited for text classification with discrete features (like word counts).
- Probabilistic Logic: It estimates the probability that a text belongs to each language from how often the text's words appear in that language's training samples, then picks the most probable language.
- Efficiency: Compared with deep learning models, MNB trains and predicts quickly and works well on medium-sized text datasets.
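
The idea can be illustrated with a minimal sketch using scikit-learn. The toy sentences and variable names below are hypothetical and are not taken from language.csv or from the project's training script; they only show how word counts feed a Multinomial Naive Bayes classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical, not from language.csv).
texts = [
    "the cat sat on the mat",
    "where is the train station",
    "el gato duerme en la casa",
    "donde esta la estacion de tren",
]
labels = ["English", "English", "Spanish", "Spanish"]

# Turn each sentence into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# MultinomialNB applies Laplace smoothing (alpha=1.0) by default,
# so unseen words do not force a probability of zero.
model = MultinomialNB()
model.fit(X, labels)

# Score a new sentence against every known language.
query = vectorizer.transform(["la casa de el gato"])
prediction = model.predict(query)[0]
probabilities = dict(zip(model.classes_, model.predict_proba(query)[0]))

print(prediction)
print(probabilities)
```

Every word in the query appears only in the Spanish samples, so the Spanish class receives the higher probability.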
To convert text into numerical data that the algorithm can understand, we use CountVectorizer:
- It creates a Vocabulary of all unique words across 22,000 samples.
- Every input text is converted into a Sparse Matrix representing word frequencies.
Before training, the raw dataset undergoes "Data Cleaning":
- Normalization: Converting all text to lowercase.
- Noise Removal: Stripping numbers and special characters to focus on alphabetic patterns unique to each language.
- Standardization: Removing extra whitespace for consistent vectorization.
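
The three steps above can be sketched as a single function. The exact rules used in the project are not shown here, so `clean_text` is a hypothetical helper. It assumes that "special characters" means anything that is not a Unicode letter, a combining mark, or whitespace, so that non-Latin scripts such as Hindi or Arabic keep their vowel signs and diacritics.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Lowercase, drop digits/punctuation/symbols, collapse whitespace."""
    # Normalization: lowercase everything.
    text = text.lower()

    # Noise removal: keep letters (L*) and combining marks (M*),
    # replace every other non-space character with a space.
    kept = "".join(
        ch if ch.isspace() or unicodedata.category(ch)[0] in ("L", "M") else " "
        for ch in text
    )

    # Standardization: collapse runs of whitespace and trim the ends.
    return " ".join(kept.split())


print(clean_text("Hello, World! 123"))
print(clean_text("  ¿Cómo   estás?  "))
```

Replacing noise with a space rather than deleting it avoids gluing neighbouring words together, for example when two words are separated only by a comma.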
The model is evaluated using:
- Accuracy Score: Achieving over 91% accuracy on unseen test data.
- Confusion Matrix: Visualizing precisely where the model might confuse similar languages.
- Classification Report: Precision, Recall, and F1-score for every individual language.
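
A minimal sketch of these three metrics with scikit-learn follows. The true and predicted labels are invented for illustration and are not the project's results.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Invented labels for illustration (not the project's test results).
y_true = ["English", "Spanish", "French", "Spanish", "English", "French"]
y_pred = ["English", "Spanish", "French", "French", "English", "French"]
labels = ["English", "French", "Spanish"]

# Fraction of predictions that match the true label.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true languages, columns are predicted languages.
matrix = confusion_matrix(y_true, y_pred, labels=labels)

# Per-language precision, recall, and F1-score as a text table.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)

print(accuracy)
print(matrix)
print(report)
```

In this toy example one Spanish sample is predicted as French, which shows up as an off-diagonal entry in the Spanish row of the confusion matrix.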
- language detection.ipynb: The research and development notebook (the "ML Back").
- app.py: The Flask production server for the web application.
- train_model.py: Utility script for model retraining and persistence.
- model.pkl & vectorizer.pkl: Serialized trained models for fast inference.
- language.csv: The core dataset (22,000 rows).
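
To show why both a model file and a vectorizer file are needed, here is a hedged sketch of saving and reloading the pair with joblib. It is not the contents of train_model.py or app.py. The toy data is invented and the files are written to a temporary folder rather than the project directory.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data (not from language.csv).
texts = ["good morning my friend", "bonjour mon ami"]
labels = ["English", "French"]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "model.pkl")
    vectorizer_path = os.path.join(folder, "vectorizer.pkl")

    # Persist both objects: the model alone cannot interpret raw text.
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Reload them as an inference process would.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# The reloaded vectorizer must be the one fitted at training time,
# otherwise the column order would not match what the model expects.
features = loaded_vectorizer.transform(["bonjour mon ami"])
reloaded_prediction = loaded_model.predict(features)[0]
print(reloaded_prediction)
```

Loading pre-trained objects at startup means a server can answer requests without retraining on the full dataset each time.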
- Clone the repository (or navigate to the project folder).
- Install Dependencies:
pip install pandas numpy scikit-learn flask joblib matplotlib seaborn
- Run the Training (Optional):
python train_model.py
- Launch the Application:
python app.py
- Access the UI: Open http://127.0.0.1:5000 in your browser.
The system supports 22 languages, including but not limited to:
- English
- Hindi
- Spanish
- French
- Chinese
- Russian
- Arabic
- Dutch
- Turkish
- ...and many more!
Built with ❤️