CodeAlpha AI Internship · Task 2
Multilingual FAQ chatbot with hybrid NLP matching — EN · FR · ES
🚀 Live app: codealphafaqchatbot-by-tahsine.streamlit.app
This chatbot answers questions about Microsoft Clarity (a free web analytics tool) using real FAQ data scraped directly from the official documentation. It combines two NLP techniques to find the best answer:
- Intent classification (Logistic Regression) — understands paraphrases and reformulations
- Cosine similarity (TF-IDF) — fallback when the classifier isn't confident enough
Both run entirely locally — no external AI API calls, no costs.
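The cosine-similarity fallback can be illustrated with a small, dependency-free sketch. It is not the project's code: the project lists scikit-learn for TF-IDF and cosine similarity, while this version computes both by hand so it runs standalone. The function names are hypothetical, the sample questions come from the test tables in this README, and the 0.35 threshold is the default shown in the flowchart.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters (no lemmatization here)."""
    return re.findall(r"\w+", text.lower())


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(question, faq_questions, threshold=0.35):
    """Return (index, score) of the closest FAQ, or (None, score) below threshold."""
    idf = build_idf(faq_questions)
    query = tfidf_vector(question, idf)
    scores = [cosine(query, tfidf_vector(q, idf)) for q in faq_questions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (best, scores[best]) if scores[best] >= threshold else (None, scores[best])


faqs = ["What is Clarity?", "How much does Clarity cost?", "Where is my data stored?"]
print(best_match("Where is my data stored?", faqs))  # matches the third entry
print(best_match("Tell me a joke", faqs))            # no shared terms, so no match
```

An out-of-scope question shares no vocabulary with the FAQ entries, so its score stays below the threshold and no answer is returned.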
- 🌍 Multilingual — EN, FR, ES with automatic language detection
- 🔀 Hybrid matching — Intent Classifier → Cosine Similarity fallback
- 📊 266 real FAQ entries scraped from learn.microsoft.com (EN), plus 29 FR and 29 ES entries
- ⚡ Disk cache — models trained once, restored on subsequent cold starts
- 🎛️ Configurable thresholds — live sliders in the sidebar
- 💬 Chat UI — Streamlit with confidence badges, method tags, and suggested questions
flowchart TD
A([User question])
A --> B[Language Detection\nKeyword markers · langdetect fallback]
B --> C[Preprocessing\nspaCy lemmatization · NLTK stopwords & tokenization]
C --> D{Intent Classifier\nTF-IDF + LogisticRegression\nconfidence >= 0.5 ?}
D -- Yes --> E([Answer])
D -- No --> F{Cosine Similarity\nTF-IDF vectorizer\nscore >= 0.35 ?}
F -- Yes --> G([Answer])
F -- No --> H([No answer found])
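The first stage of the flowchart (keyword markers, then a langdetect fallback) could look roughly like the sketch below. The marker sets and the `detect_language` name are hypothetical, since the project's actual keyword lists are not shown here. The `langdetect` import is optional in this sketch: if the package is missing or cannot decide, the function falls back to a default language.

```python
import re

# Hypothetical marker words per language; the real lists may differ.
MARKERS = {
    "en": {"what", "how", "is", "the", "where", "does"},
    "fr": {"est", "ce", "que", "comment", "où", "mes", "les"},
    "es": {"qué", "es", "cómo", "dónde", "mis", "los"},
}


def detect_language(text, default="en"):
    """Pick the language with the most marker hits; use langdetect on ties or no hits."""
    tokens = set(re.findall(r"\w+", text.lower()))
    hits = {lang: len(tokens & words) for lang, words in MARKERS.items()}
    best = max(hits, key=hits.get)
    # Accept the keyword result only when there is a single clear winner.
    if hits[best] > 0 and list(hits.values()).count(hits[best]) == 1:
        return best
    try:
        from langdetect import detect  # optional dependency
        guess = detect(text)
    except Exception:
        # Package not installed, or the text had no usable features.
        return default
    # Ignore guesses outside the supported languages.
    return guess if guess in MARKERS else default


print(detect_language("Où sont stockées mes données ?"))
print(detect_language("¿Dónde se almacenan mis datos?"))
print(detect_language("How do I set up Clarity?"))
```

Checking cheap keyword markers first keeps very short questions, where statistical detectors tend to be unreliable, from being misrouted.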
CodeAlpha_FAQ_Chatbot/
├── app.py                     # Streamlit entry point
├── scraper.py                 # Generic JSON-LD FAQPage scraper
├── requirements.txt
├── chatbot/
│   ├── config.py              # Thresholds + supported languages
│   ├── preprocessor.py        # NLTK + spaCy + langdetect
│   ├── intent_classifier.py   # LogisticRegression pipeline
│   └── similarity_matcher.py  # TF-IDF cosine similarity
└── data/
    ├── urls.txt               # URLs to scrape (EN only; FR/ES are pre-loaded)
    ├── en/faq.csv             # 267 entries, scraped from Microsoft Clarity docs
    ├── fr/faq.csv             # 29 entries, manually translated
    └── es/faq.csv             # 29 entries, manually translated
- Python 3.10+
- pip
git clone https://github.com/Tahsine/CodeAlpha_FAQ_Chatbot
cd CodeAlpha_FAQ_Chatbot
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The spaCy language models (EN, FR, ES) are included in requirements.txt and install automatically via pip, so no separate python -m spacy download step is needed.
streamlit run app.py

The app opens at http://localhost:8501. On first launch, models train and cache to disk (about 10–20 s). Subsequent launches restore from cache in under 3 seconds.
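The train-once, restore-later behaviour can be sketched as a small helper. This is an illustration under assumptions, not the app's code: `load_or_train` and its `force` flag are hypothetical names, and the sketch uses the standard-library `pickle` module so it runs without extra installs. The project lists joblib for its disk cache, and `joblib.dump` / `joblib.load` could be substituted for the pickle calls.

```python
import pickle
from pathlib import Path


def load_or_train(cache_file, train_fn, force=False):
    """Return (model, from_cache).

    Restores the model from disk when a cache file exists; otherwise calls
    train_fn() and writes the result to disk. force=True skips the cache,
    which is roughly what a "reload and retrain" action would need.
    """
    cache_file = Path(cache_file)
    if cache_file.exists() and not force:
        with cache_file.open("rb") as fh:
            return pickle.load(fh), True
    model = train_fn()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(model, fh)
    return model, False
```

Only load cache files the app wrote itself, since unpickling untrusted files can execute arbitrary code.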
To re-scrape EN data from Microsoft Clarity:
python scraper.py

Then click 🔄 Reload FAQs in the sidebar to retrain the models.
Both libraries are used because they complement each other:
| Task | NLTK | spaCy |
|---|---|---|
| Stopword removal | ✅ Multilingual corpus | ✅ Per-language lists (not used here) |
| Tokenization | ✅ word_tokenize() | ✅ Faster (Cython) |
| Lemmatization | ❌ WordNet (EN only) | ✅ 75+ languages |
Decision: NLTK handles stopwords and tokenization, spaCy handles multilingual lemmatization.
The two methods cover each other's weaknesses:
| Method | Strength | Weakness |
|---|---|---|
| Intent Classifier (LogisticRegression) | Handles paraphrases | Needs labeled training data |
| Cosine Similarity (TF-IDF) | Works with a single example | Sensitive to exact wording |
The intent classifier runs first. If its confidence is below the threshold (default: 0.5), cosine similarity takes over.
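The routing rule described above fits in a few lines. In this sketch the two matchers are passed in as plain callables returning `(answer, score)`, so the logic can be read and tested without training a model. `Reply`, `route`, and the callable signatures are hypothetical names chosen for illustration; the default thresholds (0.5 and 0.35) are the ones shown in the flowchart.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class Reply(NamedTuple):
    answer: Optional[str]  # None when neither method is confident enough
    method: str            # "intent", "similarity", or "none"
    score: float


Scorer = Callable[[str], Tuple[str, float]]


def route(question: str, classify: Scorer, most_similar: Scorer,
          intent_threshold: float = 0.5,
          similarity_threshold: float = 0.35) -> Reply:
    """Try the intent classifier first, then fall back to cosine similarity."""
    answer, confidence = classify(question)
    if confidence >= intent_threshold:
        return Reply(answer, "intent", confidence)
    answer, score = most_similar(question)
    if score >= similarity_threshold:
        return Reply(answer, "similarity", score)
    return Reply(None, "none", score)


# Stub matchers with made-up scores, only to exercise the three branches.
print(route("q", lambda q: ("A", 0.8), lambda q: ("B", 0.9)))
print(route("q", lambda q: ("A", 0.3), lambda q: ("B", 0.4)))
print(route("q", lambda q: ("A", 0.3), lambda q: ("B", 0.1)))
```

Returning the method and score alongside the answer is what lets a UI show method tags and confidence badges next to each reply.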
scraper.py is not hardcoded to Microsoft Clarity. It targets the FAQPage structured data format (<script type="application/ld+json">), a standard used by most major documentation sites. To scrape a different source, add the URL to data/urls.txt.
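To show what targeting FAQPage structured data involves, here is a minimal sketch that pulls question/answer pairs out of JSON-LD blocks in a page's HTML. It is not the contents of scraper.py: the project lists requests and beautifulsoup4 for scraping, while this version uses only the standard library (`html.parser` and `json`) and works on an HTML string, so fetching the page is left out. `JsonLdCollector` and `extract_faq_pairs` are hypothetical names, and the sample markup is made up for the demonstration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_faq_pairs(html):
    """Return [(question, answer), ...] from any FAQPage JSON-LD found in html."""
    collector = JsonLdCollector()
    collector.feed(html)
    pairs = []
    for raw in collector.blocks:
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            if not isinstance(item, dict) or item.get("@type") != "FAQPage":
                continue
            entities = item.get("mainEntity", [])
            if isinstance(entities, dict):  # a single question is allowed
                entities = [entities]
            for entity in entities:
                question = entity.get("name")
                answer = entity.get("acceptedAnswer", {}).get("text")
                if question and answer:
                    pairs.append((question.strip(), answer.strip()))
    return pairs


sample = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage", "mainEntity": [
  {"@type": "Question", "name": "What is Clarity?",
   "acceptedAnswer": {"@type": "Answer", "text": "A free web analytics tool."}}]}
</script></head><body></body></html>"""
print(extract_faq_pairs(sample))
```

Because the extraction depends only on the schema.org `FAQPage` shape (`mainEntity`, `name`, `acceptedAnswer.text`), the same function applies to any page that publishes its FAQ this way.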
| Question | Expected topic |
|---|---|
| What is Clarity? | Product definition |
| How much does Clarity cost? | It's free |
| Is Clarity GDPR compliant? | Data compliance |
| How do I set up Clarity? | Setup guide |
| Where is my data stored? | Data centers |
| Question | Expected topic |
|---|---|
| Qu'est-ce que Microsoft Clarity ? | Definition |
| Clarity est-il gratuit ? | Pricing |
| Comment configurer Clarity ? | Setup |
| Où sont stockées mes données ? | Data centers |
| Clarity est-il conforme au RGPD ? | Compliance |
| Question | Expected topic |
|---|---|
| ¿Qué es Clarity? | Definition |
| ¿Es Clarity gratuito? | Pricing |
| ¿Cómo configuro Clarity? | Setup |
| ¿Dónde se almacenan mis datos? | Data centers |
| ¿Clarity cumple con el RGPD? | Compliance |
| Question | Why it should fail |
|---|---|
| Who is the CEO of Microsoft? | Outside FAQ scope |
| What is the weather today? | Not in the dataset |
| Tell me a joke | Not a FAQ question |
| Package | Role |
|---|---|
| streamlit | Web UI |
| scikit-learn | TF-IDF, LogisticRegression, cosine similarity |
| spacy | Multilingual lemmatization |
| nltk | Stopwords, tokenization |
| langdetect | Language detection fallback |
| requests + beautifulsoup4 | FAQ scraping |
| pandas | CSV handling |
| joblib | Model disk cache |
MIT