💬 FAQ Chatbot — Microsoft Clarity

CodeAlpha AI Internship · Task 2
Multilingual FAQ chatbot with hybrid NLP matching — EN · FR · ES

Demo

🚀 Live app: codealphafaqchatbot-by-tahsine.streamlit.app

Overview

This chatbot answers questions about Microsoft Clarity (a free web analytics tool) using real FAQ data scraped directly from the official documentation. It combines two NLP techniques to find the best answer:

Intent classification (Logistic Regression) — understands paraphrases and reformulations
Cosine similarity (TF-IDF) — fallback when the classifier isn't confident enough

Both run entirely locally — no external AI API calls, no costs.

Features

🌍 Multilingual — EN, FR, ES with automatic language detection
🔀 Hybrid matching — Intent Classifier → Cosine Similarity fallback
📊 266 real FAQ entries scraped from learn.microsoft.com (EN), plus 29 FR and 29 ES entries
⚡ Disk cache — models trained once, restored on subsequent cold starts
🎛️ Configurable thresholds — live sliders in the sidebar
💬 Chat UI — Streamlit with confidence badges, method tags, and suggested questions
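
The automatic language detection listed above is described in the Architecture section as keyword markers with a langdetect fallback. The sketch below shows one way that idea could work. The marker word lists, the `detect_language` name, and the English default are illustrative assumptions, not the project's actual code.

```python
import re

# Hypothetical marker words per language; the real project may use other lists.
MARKERS = {
    "en": {"what", "is", "how", "where", "the", "does", "my", "free"},
    "fr": {"est", "que", "comment", "où", "les", "des", "mes", "gratuit"},
    "es": {"qué", "es", "cómo", "dónde", "los", "mis", "cuánto", "gratuito"},
}


def detect_language(text, default="en"):
    """Pick a language from marker-word hits, falling back to langdetect."""
    tokens = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(tokens & words) for lang, words in MARKERS.items()}
    ranked = sorted(scores.values(), reverse=True)

    # Trust the markers only when one language clearly wins.
    if ranked[0] > 0 and ranked[0] > ranked[1]:
        return max(scores, key=scores.get)

    # Fallback: langdetect, if it is installed.
    try:
        from langdetect import detect
        from langdetect.lang_detect_exception import LangDetectException
    except ImportError:
        return default
    try:
        guess = detect(text)
    except LangDetectException:
        return default
    # Only EN, FR and ES are supported; anything else maps to the default.
    return guess if guess in MARKERS else default
```

Checking cheap keyword markers first keeps short questions, which statistical detectors often misclassify, on a deterministic path.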

Architecture

flowchart TD
    A([User question])
    A --> B[Language Detection\nKeyword markers · langdetect fallback]
    B --> C[Preprocessing\nspaCy lemmatization · NLTK stopwords & tokenization]
    C --> D{Intent Classifier\nTF-IDF + LogisticRegression\nconfidence >= 0.5 ?}
    D -- Yes --> E([Answer])
    D -- No  --> F{Cosine Similarity\nTF-IDF vectorizer\nscore >= 0.35 ?}
    F -- Yes --> G([Answer])
    F -- No  --> H([No answer found])
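
The two decision nodes in the flowchart reduce to a small routing function. The sketch below assumes the two scores have already been computed. The `Thresholds` and `route` names are hypothetical, and the defaults are the 0.5 and 0.35 values shown in the diagram.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Thresholds:
    intent: float = 0.5       # classifier confidence needed to answer directly
    similarity: float = 0.35  # cosine score needed for the fallback to answer


def route(
    intent_conf: float,
    sim_score: float,
    thresholds: Thresholds = Thresholds(),
) -> Tuple[str, Optional[float]]:
    """Return (method, score) following the flowchart's two decision nodes."""
    if intent_conf >= thresholds.intent:
        return "intent", intent_conf
    if sim_score >= thresholds.similarity:
        return "similarity", sim_score
    return "none", None
```

For example, `route(0.30, 0.41)` falls through the first check and is answered by the similarity fallback, while `route(0.30, 0.20)` ends at "No answer found".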

Project Structure

CodeAlpha_FAQ_Chatbot/
├── app.py                     # Streamlit entry point
├── scraper.py                 # Generic JSON-LD FAQPage scraper
├── requirements.txt
├── chatbot/
│   ├── config.py              # Thresholds + supported languages
│   ├── preprocessor.py        # NLTK + spaCy + langdetect
│   ├── intent_classifier.py   # LogisticRegression pipeline
│   └── similarity_matcher.py  # TF-IDF cosine similarity
└── data/
    ├── urls.txt               # URLs to scrape (EN only — FR/ES are pre-loaded)
    ├── en/faq.csv             # 266 entries — scraped from Microsoft Clarity docs
    ├── fr/faq.csv             # 29 entries — manually translated
    └── es/faq.csv             # 29 entries — manually translated

Setup

Prerequisites

Python 3.10+
pip

Install

git clone https://github.com/Tahsine/CodeAlpha_FAQ_Chatbot
cd CodeAlpha_FAQ_Chatbot
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The spaCy language models (EN, FR, ES) are included in requirements.txt and install automatically via pip — no python -m spacy download needed.

Run

streamlit run app.py

The app opens at http://localhost:8501. On first launch, models train and cache to disk (~10–20s). Subsequent launches restore from cache in under 3 seconds.
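
The train-once, restore-later behaviour can be sketched as follows. This is a minimal illustration, not the project's code: it uses the standard-library `pickle` module in place of joblib so that it runs without extra packages, and the `load_or_train` name and the idea of keying the cache on a hash of the CSV contents are assumptions.

```python
import hashlib
import pickle
from pathlib import Path


def load_or_train(csv_path, cache_dir, train_fn):
    """Return a trained model, reusing a cached copy when the CSV is unchanged."""
    csv_path, cache_dir = Path(csv_path), Path(cache_dir)

    # Key the cache on the file contents so edited FAQ data triggers a retrain.
    digest = hashlib.sha256(csv_path.read_bytes()).hexdigest()[:16]
    cache_file = cache_dir / f"{csv_path.stem}-{digest}.pkl"

    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)

    model = train_fn(csv_path)  # slow path: only taken on a cache miss
    cache_dir.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(model, fh)
    return model
```

With joblib, `joblib.dump(model, cache_file)` and `joblib.load(cache_file)` would take the place of the two `pickle` calls.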

Refresh FAQ data

To re-scrape EN data from Microsoft Clarity:

python scraper.py

Then click 🔄 Reload FAQs in the sidebar to retrain the models.

Technical Choices

NLP: NLTK + spaCy

Both libraries are used because they complement each other:

Task	NLTK	spaCy
Stopword removal	✅ Multilingual corpus	➖ Per-language lists available, not used here
Tokenization	✅ `word_tokenize()`	✅ Faster (Cython)
Lemmatization	❌ WordNet (EN only)	✅ Trained pipelines for EN, FR, ES

Decision: NLTK handles stopwords and tokenization, spaCy handles multilingual lemmatization.

Hybrid Matching

The two methods cover each other's weaknesses:

Method	Strength	Weakness
Intent Classifier (LogisticRegression)	Handles paraphrases	Needs labeled training data
Cosine Similarity (TF-IDF)	Works with a single example	Sensitive to exact wording

The intent classifier runs first. If its confidence is below the threshold (default: 0.5), cosine similarity takes over.
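
The sketch below shows how the two scores could be produced with scikit-learn and combined. It is a toy illustration under stated assumptions: the training phrases, intent labels, and the `hybrid_match` name are made up for the example, and the spaCy/NLTK preprocessing step is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical): a few phrasings per intent label.
TRAIN = [
    ("what is clarity", "definition"),
    ("tell me about clarity", "definition"),
    ("how much does clarity cost", "pricing"),
    ("is clarity free", "pricing"),
    ("where is my data stored", "storage"),
    ("which data centers hold my data", "storage"),
]
texts = [question for question, _ in TRAIN]
labels = [label for _, label in TRAIN]

# Method 1: TF-IDF features fed into a logistic regression classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(texts, labels)

# Method 2: TF-IDF vectors of the known questions for cosine similarity.
vectorizer = TfidfVectorizer().fit(texts)
faq_matrix = vectorizer.transform(texts)


def hybrid_match(query, intent_threshold=0.5, sim_threshold=0.35):
    """Return (method, label, score); the classifier is tried first."""
    probs = classifier.predict_proba([query])[0]
    best = int(probs.argmax())
    if probs[best] >= intent_threshold:
        return "intent", str(classifier.classes_[best]), float(probs[best])

    sims = cosine_similarity(vectorizer.transform([query]), faq_matrix)[0]
    idx = int(sims.argmax())
    if sims[idx] >= sim_threshold:
        return "similarity", labels[idx], float(sims[idx])
    return "none", None, 0.0
```

With so little training data the classifier's probabilities tend to stay close to uniform, so most queries here end up on the similarity path. That is the situation the fallback is meant to cover.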

Generic Scraper

scraper.py is not hardcoded to Microsoft Clarity. It targets schema.org FAQPage structured data embedded in <script type="application/ld+json"> tags, a format that many documentation and support sites publish. To scrape a different source, add the URL to data/urls.txt.
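
The extraction step can be sketched as below. This is an assumption-laden illustration rather than the contents of scraper.py: it parses an HTML string with the standard-library `html.parser` instead of requests and BeautifulSoup so that it runs offline, and the `extract_faq_pairs` name is hypothetical. The `mainEntity`, `name`, `acceptedAnswer`, and `text` keys come from the schema.org FAQPage vocabulary.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data


def extract_faq_pairs(html):
    """Return (question, answer) tuples from FAQPage JSON-LD found in html."""
    collector = JsonLdCollector()
    collector.feed(html)
    pairs = []
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than abort the page
        for node in doc if isinstance(doc, list) else [doc]:
            if not isinstance(node, dict) or node.get("@type") != "FAQPage":
                continue
            entities = node.get("mainEntity", [])
            if isinstance(entities, dict):
                entities = [entities]
            for item in entities:
                answer = item.get("acceptedAnswer") or {}
                if item.get("name") and isinstance(answer, dict) and answer.get("text"):
                    pairs.append((item["name"], answer["text"]))
    return pairs
```

Because the parser keys on the structured-data type rather than on page layout, the same function applies to any page that publishes FAQPage JSON-LD.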

Example Questions

🇬🇧 English

Question	Expected topic
What is Clarity?	Product definition
How much does Clarity cost?	It's free
Is Clarity GDPR compliant?	Data compliance
How do I set up Clarity?	Setup guide
Where is my data stored?	Data centers

🇫🇷 Français

Question	Sujet attendu
Qu'est-ce que Microsoft Clarity ?	Définition
Clarity est-il gratuit ?	Tarification
Comment configurer Clarity ?	Installation
Où sont stockées mes données ?	Centres de données
Clarity est-il conforme au RGPD ?	Conformité

🇪🇸 Español

Pregunta	Tema esperado
¿Qué es Clarity?	Definición
¿Es Clarity gratuito?	Precio
¿Cómo configuro Clarity?	Instalación
¿Dónde se almacenan mis datos?	Centros de datos
¿Clarity cumple con el RGPD?	Cumplimiento

🧪 Out-of-scope (should return "no answer found")

Question	Why it should fail
Who is the CEO of Microsoft?	Outside FAQ scope
What is the weather today?	Not in the dataset
Tell me a joke	Not a FAQ question

Dependencies

Package	Role
`streamlit`	Web UI
`scikit-learn`	TF-IDF, LogisticRegression, cosine similarity
`spacy`	Multilingual lemmatization
`nltk`	Stopwords, tokenization
`langdetect`	Language detection fallback
`requests` + `beautifulsoup4`	FAQ scraping
`pandas`	CSV handling
`joblib`	Model disk cache

License

MIT
