Arnav-Naive/language-Detector

🌍 Linguist AI - Language Detection System

Linguist AI is a machine learning application that identifies the language of a given text. It uses a Multinomial Naive Bayes classifier over Bag-of-Words features and distinguishes 22 languages, with over 91% accuracy reported on unseen test data.


🚀 Key Features

  • Multilingual Support: Detects 22 languages including English, Hindi, Spanish, French, Chinese, and more.
  • Micro-Cleaning Engine: Preprocessing that lowercases text, strips noise (numbers and special characters), and collapses extra whitespace while keeping the letters that distinguish each language.
  • Premium Web Interface: A sleek, glassmorphic UI built with Flask for real-time interaction.
  • Technical Rigor: Covers the full ML lifecycle, from exploratory data analysis (EDA) through training and evaluation to deployment.

🛠️ Technical Architecture

1. The Algorithm: Multinomial Naive Bayes

The core detection engine uses the Multinomial Naive Bayes (MNB) classifier.

  • How it works: MNB is based on Bayes' Theorem and is particularly suited for text classification with discrete features (like word counts).
  • Probabilistic Logic: For each language, it estimates the probability of the text from how often its words occur in that language's training samples, combines this with the language's prior probability, and predicts the most probable language.
  • Efficiency: MNB trains and predicts far faster than deep learning models and works well on medium-sized text datasets.
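
To make the probabilistic logic concrete, here is a minimal sketch using scikit-learn's MultinomialNB. The four sentences, their labels, and the variable names are made up for illustration; they are not the project's data or code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (hypothetical, for illustration only).
texts = [
    "the cat sits on the mat",
    "where is the train station",
    "el gato está en la casa",
    "dónde está la estación de tren",
]
labels = ["English", "English", "Spanish", "Spanish"]

# Turn each sentence into word counts, then fit the classifier on them.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = MultinomialNB()  # default Laplace smoothing (alpha=1.0)
clf.fit(X, labels)

# Score a new sentence: one posterior probability per language.
query = vectorizer.transform(["the station is here"])
probs = clf.predict_proba(query)[0]
prediction = clf.predict(query)[0]

for language, p in zip(clf.classes_, probs):
    print(f"{language}: {p:.3f}")
print("Predicted:", prediction)
```

The words "the", "station", and "is" appear only in the English samples, so the English posterior dominates. The word "here" was never seen during training and is simply ignored by the vectorizer.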

2. NLP Technique: Bag of Words (BoW)

To convert text into numerical data that the algorithm can understand, we use CountVectorizer:

  • It creates a Vocabulary of all unique words across 22,000 samples.
  • Every input text is converted into a sparse vector of word counts over that vocabulary; a batch of texts forms a Sparse Matrix with one row per text.

3. Data Preprocessing

Before training, the raw dataset undergoes "Data Cleaning":

  • Normalization: Converting all text to lowercase.
  • Noise Removal: Stripping numbers and special characters to focus on alphabetic patterns unique to each language.
  • Standardization: Removing extra whitespace for consistent vectorization.
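
The three steps above could be sketched as a single function. The name clean_text and the regular expression are assumptions for illustration; the project's actual cleaning code may differ.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, drop digits and special characters, collapse whitespace."""
    # Normalization: lowercase everything.
    text = text.lower()
    # Noise removal: replace each run of non-word characters, digits,
    # or underscores with a single space. \w is Unicode-aware in Python 3,
    # so accented letters such as "ñ" are kept.
    text = re.sub(r"[\W\d_]+", " ", text)
    # Standardization: runs were already collapsed above; trim the ends.
    return text.strip()


print(clean_text("  Hello, World! 123  "))
# hello world
```

One caveat with this simple pattern: Python's `\w` does not match combining marks, so vowel signs in scripts such as Devanagari would be stripped. A cleaner for all 22 languages would need a more careful character class.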

4. Evaluation Metrics

The model is evaluated using:

  • Accuracy Score: Over 91% accuracy on unseen test data.
  • Confusion Matrix: Showing which languages the model confuses with one another, typically closely related ones.
  • Classification Report: Precision, Recall, and F1-score for every individual language.
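
A minimal sketch of how these three metrics can be computed with scikit-learn follows. The train and test sentences are hypothetical toy data, so the numbers it prints say nothing about the project's reported accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy split; the real project evaluates on held-out rows of its dataset.
train_texts = [
    "the cat sits on the mat",
    "where is the train station",
    "el gato está en la casa",
    "dónde está la estación de tren",
]
train_labels = ["English", "English", "Spanish", "Spanish"]
test_texts = ["the train is on the mat", "la casa de el gato"]
test_labels = ["English", "Spanish"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # reuse the fitted vocabulary

clf = MultinomialNB().fit(X_train, train_labels)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(test_labels, y_pred)
# Rows are true languages, columns are predicted languages.
cm = confusion_matrix(test_labels, y_pred, labels=clf.classes_)
# Per-language precision, recall, and F1-score as a text table.
report = classification_report(test_labels, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2%}")
print(cm)
print(report)
```

Off-diagonal entries of the confusion matrix count texts of one language that were predicted as another, which is where confusions between similar languages show up.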

📦 Project Structure

  • language detection.ipynb: The research and development notebook (the "ML back end").
  • app.py: The Flask production server for the web application.
  • train_model.py: Utility script for model retraining and persistence.
  • model.pkl & vectorizer.pkl: The serialized trained classifier and fitted vectorizer, loaded at startup for fast inference.
  • language.csv: The core dataset (22,000 rows).
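
As a rough sketch of how the two .pkl files fit together, the example below saves a classifier and its vectorizer and loads them back for inference. It assumes joblib is used for serialization (joblib is in the dependency list, but the project may use pickle instead). It also trains a toy model and writes to a temporary directory so that it runs on its own; predict_language is a hypothetical helper name.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the real training step (hypothetical data).
texts = ["good morning to you", "buenos días a todos"]
labels = ["English", "Spanish"]
vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    vectorizer_path = os.path.join(tmp, "vectorizer.pkl")

    # Persist both objects: the classifier is useless without the
    # exact vocabulary it was trained on.
    joblib.dump(clf, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Later (for example at server startup), load them back.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)


def predict_language(text: str) -> str:
    """Vectorize one text with the loaded vocabulary and classify it."""
    features = loaded_vectorizer.transform([text])
    return str(loaded_model.predict(features)[0])


print(predict_language("good morning"))
# English
```

The vectorizer and the classifier must always be saved and loaded as a pair, because the classifier's feature columns are defined by the vectorizer's vocabulary.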

💻 Installation & Setup

  1. Clone the repository (or navigate to the project folder).
  2. Install Dependencies:
    pip install pandas numpy scikit-learn flask joblib matplotlib seaborn
  3. Run the Training (Optional):
    python train_model.py
  4. Launch the Application:
    python app.py
  5. Access the UI: Open http://127.0.0.1:5000 in your browser.
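
For orientation, a Flask prediction endpoint might look roughly like the sketch below. The /predict route, the JSON field names, and the toy model are all assumptions for illustration; the actual app.py is not shown here and may be structured differently, for example by rendering HTML templates.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy model standing in for model.pkl and vectorizer.pkl (hypothetical data).
vectorizer = CountVectorizer()
model = MultinomialNB().fit(
    vectorizer.fit_transform(["good morning to you", "buenos días a todos"]),
    ["English", "Spanish"],
)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text.strip():
        return jsonify(error="No text provided"), 400
    language = model.predict(vectorizer.transform([text]))[0]
    return jsonify(language=str(language))


# Exercise the endpoint in-process with Flask's test client.
with app.test_client() as client:
    response = client.post("/predict", json={"text": "good morning"})
    result = response.get_json()
    print(result)
    # {'language': 'English'}
```

Calling `app.run(host="127.0.0.1", port=5000)` instead of using the test client would serve this sketch at the address given in step 5.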

🌐 Supported Languages

The system supports 22 languages, including but not limited to:

  • English
  • Hindi
  • Spanish
  • French
  • Chinese
  • Russian
  • Arabic
  • Dutch
  • Turkish
  • ...and many more!

Built with ❤️
