-
π Text β Vector
- TFβIDF on unigrams + bigrams
- Configurable vocab size,
min_df,max_df
-
π Text cleaning
- Unicode fixes, lowercasing
- Optional removal of URLs / @mentions / #hashtags
- Caching of cleaned text for fast iteration
-
π€ Models (v1 baselines)
- Keyword / Bag-of-Words Weighted Classifier
- Nearest-Centroid (cosine) Classifier
-
π€ Evaluation
- Top-1 / Top-3 / Top-5 accuracy
- Macro + weighted precision / recall / F1
- Per-class reports and qualitative top-k examples
-
π¬ Trained models
- Logistic Regression (OvR) on TFβIDF
- Linear SVM & Multinomial Naive Bayes
-
π€¨ Better UX metrics
- Confusion matrices
- Per-emoji βfailure storiesβ (where the model gets the vibe wrong)
-
π Integration experiments
- Minimal REST API (FastAPI/Flask) for
/predictcalls - Tiny web demo: type a message, see top-5 emojis live
- Minimal REST API (FastAPI/Flask) for
-
π Stretch goals
- fastText-style baseline
- Tiny transformer/embedding model
- Browser / VS Code prototype extension for emoji suggestion
βββββββββββββββββββββββββ
β CSV Data β
β (TEXT, Label, Map) β
βββββββββββ¬ββββββββββββββ
β
βββββββββββΌββββββββββββββ
β Data Layer β
β load + clean + cache β
βββββββββββ¬ββββββββββββββ
β
βββββββββββΌββββββββββββββ
β Features Layer β
β TFβIDF (uni/bi-gram) β
βββββββββββ¬ββββββββββββββ
β
βββββββββββββββΌβββββββββββββββ
β Model Layer β
β Keyword / Centroid / LR β
βββββββββββββββ¬βββββββββββββββ
β
βββββββββββΌββββββββββββββ
β Evaluation & Reports β
β top-k, F1, plots, ex β
βββββββββββββββββββββββββ