Tahsine/CodeAlpha_FAQ_Chatbot

💬 FAQ Chatbot — Microsoft Clarity

CodeAlpha AI Internship · Task 2
Multilingual FAQ chatbot with hybrid NLP matching — EN · FR · ES

Python · Streamlit · scikit-learn · spaCy


Demo

🚀 Live app: codealphafaqchatbot-by-tahsine.streamlit.app


Overview

This chatbot answers questions about Microsoft Clarity (a free web analytics tool) using real FAQ data scraped directly from the official documentation. It combines two NLP techniques to find the best answer:

  1. Intent classification (Logistic Regression) — understands paraphrases and reformulations
  2. Cosine similarity (TF-IDF) — fallback when the classifier isn't confident enough

Both run entirely locally — no external AI API calls, no costs.
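
As a rough illustration of the second technique, here is a dependency-free sketch of TF-IDF cosine matching. Per the Dependencies list the project relies on scikit-learn for this, so the hand-rolled functions below (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `best_match`) are hypothetical stand-ins meant only to show the idea, not the repository's code.

```python
import math
from collections import Counter

PUNCT = "?.,!¿¡"


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Term count times IDF; terms unseen in the corpus are ignored."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(question, faq_questions):
    """Return (index, score) of the stored question closest to the input."""
    idf = build_idf(faq_questions)
    query_vec = tfidf_vector(question, idf)
    scores = [cosine(query_vec, tfidf_vector(q, idf)) for q in faq_questions]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]
```

A question that shares no vocabulary with the FAQ scores 0.0 against every entry, which is why a minimum score is needed before an answer is returned.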


Features

  • 🌍 Multilingual — EN, FR, ES with automatic language detection
  • 🔀 Hybrid matching — Intent Classifier → Cosine Similarity fallback
  • 📊 266 real FAQ entries scraped from learn.microsoft.com (EN), plus 29 FR and 29 ES entries
  • Disk cache — models trained once, restored on subsequent cold starts
  • 🎛️ Configurable thresholds — live sliders in the sidebar
  • 💬 Chat UI — Streamlit with confidence badges, method tags, and suggested questions

Architecture

flowchart TD
    A([User question])
    A --> B[Language Detection\nKeyword markers · langdetect fallback]
    B --> C[Preprocessing\nspaCy lemmatization · NLTK stopwords & tokenization]
    C --> D{Intent Classifier\nTF-IDF + LogisticRegression\nconfidence >= 0.5 ?}
    D -- Yes --> E([Answer])
    D -- No  --> F{Cosine Similarity\nTF-IDF vectorizer\nscore >= 0.35 ?}
    F -- Yes --> G([Answer])
    F -- No  --> H([No answer found])
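
The first step of the flowchart (keyword markers, then a library fallback) could look roughly like the sketch below. The marker word lists, the `detect_language` name, and the tie-breaking rule are assumptions made for illustration; the README does not show the project's real lists. The `fallback` argument is any callable that returns a language code, for example `detect` from the langdetect package.

```python
# Hypothetical marker words; the project's actual lists are not shown here.
MARKERS = {
    "en": {"what", "how", "is", "the", "where", "does", "my"},
    "fr": {"est", "comment", "où", "les", "des", "mes", "qu'est-ce"},
    "es": {"qué", "cómo", "dónde", "es", "los", "mis", "el"},
}


def detect_language(text, fallback=None, default="en"):
    """Pick a language from marker-word hits, deferring to `fallback` on ties."""
    words = [w.strip("¿?¡!.,").lower() for w in text.split()]
    hits = {lang: sum(w in markers for w in words) for lang, markers in MARKERS.items()}
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_hits), (_, second_hits) = ranked[0], ranked[1]
    if best_hits > second_hits:
        return best  # one language clearly has the most marker words
    if fallback is not None:
        try:
            guess = fallback(text)
        except Exception:
            return default  # statistical detectors can fail on very short input
        return guess if guess in MARKERS else default
    return default
```

Checking markers first keeps short questions fast and predictable; the fallback only runs when the markers are inconclusive.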

Project Structure

CodeAlpha_FAQ_Chatbot/
├── app.py                     # Streamlit entry point
├── scraper.py                 # Generic JSON-LD FAQPage scraper
├── requirements.txt
├── chatbot/
│   ├── config.py              # Thresholds + supported languages
│   ├── preprocessor.py        # NLTK + spaCy + langdetect
│   ├── intent_classifier.py   # LogisticRegression pipeline
│   └── similarity_matcher.py  # TF-IDF cosine similarity
└── data/
    ├── urls.txt               # URLs to scrape (EN only — FR/ES are pre-loaded)
    ├── en/faq.csv             # 267 entries — scraped from Microsoft Clarity docs
    ├── fr/faq.csv             # 29 entries — manually translated
    └── es/faq.csv             # 29 entries — manually translated

Setup

Prerequisites

  • Python 3.10+
  • pip

Install

git clone https://github.com/Tahsine/CodeAlpha_FAQ_Chatbot
cd CodeAlpha_FAQ_Chatbot
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The spaCy language models (EN, FR, ES) are included in requirements.txt and install automatically via pip — no python -m spacy download needed.

Run

streamlit run app.py

The app opens at http://localhost:8501. On first launch, models train and cache to disk (~10–20s). Subsequent launches restore from cache in under 3 seconds.
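
The train-once, restore-later behaviour can be sketched as follows. This is an assumption-laden illustration: the project lists joblib for its disk cache, while this sketch uses the standard library's `pickle` to stay dependency-free (`joblib.dump` and `joblib.load` would be the equivalent calls). The `load_or_train` helper and the idea of keying the cache on a hash of the FAQ file are hypothetical; the README does not say how the real cache is invalidated.

```python
import hashlib
import pickle
from pathlib import Path


def load_or_train(data_path, cache_dir, train):
    """Return a trained model, reusing a cached copy when the data is unchanged."""
    data_path, cache_dir = Path(data_path), Path(cache_dir)
    # Key the cache on the file contents so edited FAQ data triggers retraining.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()[:16]
    cache_file = cache_dir / f"model-{digest}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    model = train(data_path)  # slow path: only taken on a cache miss
    cache_dir.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(model, fh)
    return model
```

Only load cache files your own process wrote, since unpickling untrusted data can execute arbitrary code.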

Refresh FAQ data

To re-scrape EN data from Microsoft Clarity:

python scraper.py

Then click 🔄 Reload FAQs in the sidebar to retrain the models.


Technical Choices

NLP: NLTK + spaCy

Both libraries are used because they complement each other:

| Task | NLTK | spaCy |
| --- | --- | --- |
| Stopword removal | ✅ Multilingual corpus | ❌ Not built-in |
| Tokenization | word_tokenize() | ✅ Faster (Cython) |
| Lemmatization | ❌ WordNet (EN only) | ✅ 75+ languages |

Decision: NLTK handles stopwords and tokenization, spaCy handles multilingual lemmatization.

Hybrid Matching

The two methods cover each other's weaknesses:

| Method | Strength | Weakness |
| --- | --- | --- |
| Intent Classifier (LogisticRegression) | Handles paraphrases | Needs labeled training data |
| Cosine Similarity (TF-IDF) | Works with a single example | Sensitive to exact wording |

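The intent classifier runs first. If its confidence is below the threshold (default: 0.5), cosine similarity takes over; if the best similarity score is also below its own threshold (default: 0.35), the chatbot reports that no answer was found.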
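
The two-stage decision can be sketched as a small routing function. The names `route`, `Reply`, and `Scorer` are hypothetical, and both stages are passed in as plain callables so the sketch stays independent of any library; in a scikit-learn setup, the classifier confidence would typically be the highest class probability from `predict_proba`. The default thresholds mirror the values shown in the Architecture flowchart.

```python
from typing import Callable, NamedTuple, Optional, Tuple

# A scorer maps a question to (candidate answer, score between 0 and 1).
Scorer = Callable[[str], Tuple[str, float]]


class Reply(NamedTuple):
    answer: Optional[str]  # None means "no answer found"
    method: str            # "intent", "similarity", or "none"
    score: float


def route(question: str, classify: Scorer, match: Scorer,
          intent_threshold: float = 0.5,
          similarity_threshold: float = 0.35) -> Reply:
    """Try the intent classifier first, then fall back to similarity matching."""
    answer, confidence = classify(question)
    if confidence >= intent_threshold:
        return Reply(answer, "intent", confidence)
    answer, score = match(question)
    if score >= similarity_threshold:
        return Reply(answer, "similarity", score)
    return Reply(None, "none", score)
```

Returning the method and score alongside the answer is what would let a UI show which stage produced the reply and how confident it was.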

Generic Scraper

scraper.py is not hardcoded to Microsoft Clarity. It targets schema.org FAQPage structured data embedded as JSON-LD (<script type="application/ld+json">), a format that many documentation sites publish. To scrape a different source, add its URL to data/urls.txt.
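
To show what extracting FAQPage data involves, here is a minimal sketch that pulls question and answer pairs out of an HTML string. It is not the code in scraper.py: the project lists requests and beautifulsoup4 for scraping, whereas this sketch uses the standard library's `html.parser` so it runs without extra packages. The names `JsonLdCollector` and `extract_faq_pairs` are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_faq_pairs(html):
    """Return (question, answer) pairs from any FAQPage JSON-LD in the page."""
    collector = JsonLdCollector()
    collector.feed(html)
    pairs = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than abort the scrape
        for node in data if isinstance(data, list) else [data]:
            if not isinstance(node, dict) or node.get("@type") != "FAQPage":
                continue
            entities = node.get("mainEntity", [])
            if isinstance(entities, dict):
                entities = [entities]  # a single question may not be wrapped in a list
            for item in entities:
                answer = item.get("acceptedAnswer", {}).get("text", "")
                pairs.append((item.get("name", ""), answer))
    return pairs
```

Answer text in FAQPage markup often contains HTML, so a real scraper would likely strip tags before writing rows to a CSV file.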


Example Questions

🇬🇧 English

| Question | Expected topic |
| --- | --- |
| What is Clarity? | Product definition |
| How much does Clarity cost? | It's free |
| Is Clarity GDPR compliant? | Data compliance |
| How do I set up Clarity? | Setup guide |
| Where is my data stored? | Data centers |

🇫🇷 Français

| Question | Expected topic |
| --- | --- |
| Qu'est-ce que Microsoft Clarity ? | Definition |
| Clarity est-il gratuit ? | Pricing |
| Comment configurer Clarity ? | Setup |
| Où sont stockées mes données ? | Data centers |
| Clarity est-il conforme au RGPD ? | Compliance |

🇪🇸 Español

| Question | Expected topic |
| --- | --- |
| ¿Qué es Clarity? | Definition |
| ¿Es Clarity gratuito? | Pricing |
| ¿Cómo configuro Clarity? | Setup |
| ¿Dónde se almacenan mis datos? | Data centers |
| ¿Clarity cumple con el RGPD? | Compliance |

🧪 Out-of-scope (should return "no answer found")

| Question | Why it should fail |
| --- | --- |
| Who is the CEO of Microsoft? | Outside FAQ scope |
| What is the weather today? | Not in the dataset |
| Tell me a joke | Not a FAQ question |

Dependencies

| Package | Role |
| --- | --- |
| streamlit | Web UI |
| scikit-learn | TF-IDF, LogisticRegression, cosine similarity |
| spacy | Multilingual lemmatization |
| nltk | Stopwords, tokenization |
| langdetect | Language detection fallback |
| requests + beautifulsoup4 | FAQ scraping |
| pandas | CSV handling |
| joblib | Model disk cache |

License

MIT

About

A Streamlit-based FAQ chatbot for Microsoft Clarity that answers questions in English, French, and Spanish. Uses TF-IDF vectorization + Logistic Regression for intent classification, with cosine similarity fallback and automatic suggestion generation when no exact answer is found. Fully local — no external API calls.
