💬 Text2Moji 😏

😏 Feature Overview

  • 😄 Text → Vector

    • TF–IDF on unigrams + bigrams
    • Configurable vocab size, min_df, max_df
  • 😇 Text cleaning

    • Unicode fixes, lowercasing
    • Optional removal of URLs / @mentions / #hashtags
    • Caching of cleaned text for fast iteration
  • 🤖 Models (v1 baselines)

    • Keyword / Bag-of-Words Weighted Classifier
    • Nearest-Centroid (cosine) Classifier
  • 🤓 Evaluation

    • Top-1 / Top-3 / Top-5 accuracy
    • Macro + weighted precision / recall / F1
    • Per-class reports and qualitative top-k examples
  • 😬 Trained models

    • Logistic Regression (OvR) on TF–IDF
    • Linear SVM & Multinomial Naive Bayes
  • 🤨 Better UX metrics

    • Confusion matrices
    • Per-emoji “failure stories” (where the model gets the vibe wrong)
  • 😃 Integration experiments

    • Minimal REST API (FastAPI/Flask) for /predict calls
    • Tiny web demo: type a message, see top-5 emojis live
  • 😈 Stretch goals

    • fastText-style baseline
    • Tiny transformer/embedding model
    • Browser / VS Code prototype extension for emoji suggestion
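
The text-cleaning bullets above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_text`, its flags, and the regular expressions are assumptions, NFKC normalization stands in for whatever "Unicode fixes" the project applies, and `functools.lru_cache` is an in-memory stand-in for the caching step.

```python
import re
import unicodedata
from functools import lru_cache

# Hypothetical patterns; the project's real cleaning rules are not shown in this README.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WHITESPACE_RE = re.compile(r"\s+")


@lru_cache(maxsize=100_000)  # assumed: in-memory cache so repeated texts are cleaned once
def clean_text(text, strip_urls=True, strip_mentions=True, strip_hashtags=False):
    """Normalize Unicode, lowercase, and optionally drop URLs, @mentions, #hashtags."""
    # NFKC folds compatibility forms (e.g. full-width letters) into plain ones.
    text = unicodedata.normalize("NFKC", text).lower()
    if strip_urls:
        text = URL_RE.sub(" ", text)
    if strip_mentions:
        text = MENTION_RE.sub(" ", text)
    if strip_hashtags:
        text = HASHTAG_RE.sub(" ", text)
    # Collapse the gaps left behind by the removals.
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "Check THIS https://example.com @bob #fun  "
print(clean_text(sample))
print(clean_text(sample, strip_hashtags=True))
```

Keeping each removal behind its own flag makes it easy to compare runs, for example to check whether hashtags carry useful signal for emoji prediction before deciding to strip them.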

😌 Architecture at a Glance

          ┌───────────────────────┐
          │       CSV Data        │
          │  (TEXT, Label, Map)   │
          └─────────┬─────────────┘
                    │
          ┌─────────▼─────────────┐
          │     Data Layer        │
          │  load + clean + cache │
          └─────────┬─────────────┘
                    │
          ┌─────────▼─────────────┐
          │   Features Layer      │
          │  TF–IDF (uni/bi-gram) │
          └─────────┬─────────────┘
                    │
      ┌─────────────▼──────────────┐
      │       Model Layer          │
      │ Keyword / Centroid / LR    │
      └─────────────┬──────────────┘
                    │
          ┌─────────▼─────────────┐
          │ Evaluation & Reports  │
          │  top-k, F1, plots, ex │
          └───────────────────────┘
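
To make the flow through the layers concrete, here is a small standard-library sketch of the features, model, and evaluation steps: TF–IDF over unigrams and bigrams, a nearest-centroid classifier scored by cosine similarity, and top-k accuracy. Everything in it is an assumption for illustration: the toy rows, the `EMOJI_MAP`, and all function names are made up, and the IDF uses the smoothed form that scikit-learn documents for its TF–IDF transformer rather than anything confirmed about this project.

```python
import math
from collections import Counter, defaultdict

# Made-up toy rows standing in for the CSV: (text, integer label).
ROWS = [
    ("i love you so much", 0),
    ("love this song", 0),
    ("so sad today", 1),
    ("this is sad news", 1),
    ("happy birthday to you", 2),
    ("let us party tonight", 2),
]
EMOJI_MAP = {0: "😍", 1: "😢", 2: "🎉"}  # hypothetical label-to-emoji map


def ngrams(text):
    """Unigrams plus bigrams from lowercase whitespace tokens."""
    tokens = text.lower().split()
    return tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def fit_idf(texts):
    """Smoothed inverse document frequency for every n-gram seen in training."""
    n_docs = len(texts)
    doc_freq = Counter(term for text in texts for term in set(ngrams(text)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def tfidf(text, idf):
    """Sparse, unit-length TF-IDF vector; n-grams unseen in training are dropped."""
    counts = Counter(t for t in ngrams(text) if t in idf)
    return l2_normalize({t: c * idf[t] for t, c in counts.items()})


def fit_centroids(texts, labels, idf):
    """One unit-length centroid per label (a normalized sum equals a normalized mean)."""
    sums = defaultdict(lambda: defaultdict(float))
    for text, label in zip(texts, labels):
        for term, weight in tfidf(text, idf).items():
            sums[label][term] += weight
    return {label: l2_normalize(vec) for label, vec in sums.items()}


def predict_top_k(text, idf, centroids, k=3):
    """Rank labels by cosine similarity; both sides are unit vectors, so a dot product suffices."""
    vec = tfidf(text, idf)
    scores = {
        label: sum(w * centroid.get(term, 0.0) for term, w in vec.items())
        for label, centroid in centroids.items()
    }
    return sorted(scores, key=lambda label: (-scores[label], label))[:k]


def top_k_accuracy(texts, labels, idf, centroids, k):
    """Fraction of examples whose true label appears among the top-k predictions."""
    hits = sum(
        label in predict_top_k(text, idf, centroids, k)
        for text, label in zip(texts, labels)
    )
    return hits / len(texts)


texts = [text for text, _ in ROWS]
labels = [label for _, label in ROWS]
idf = fit_idf(texts)
centroids = fit_centroids(texts, labels, idf)

top3 = predict_top_k("i love this", idf, centroids, k=3)
print([EMOJI_MAP[label] for label in top3])
print("train top-1 accuracy:", top_k_accuracy(texts, labels, idf, centroids, k=1))
```

Accuracy here is measured on the training rows only to keep the sketch short; a real evaluation would score a held-out split.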