This project builds a sentiment analysis system for Indonesian text using deep learning models. The goal is to classify user reviews into sentiment categories such as positive, negative, or neutral.
The project uses modern NLP models (transformers) and covers the full process from data preparation to model evaluation.
- Process and clean Indonesian text data
- Build a sentiment classification model
- Compare two models: IndoELECTRA and IndoBERT
- Evaluate how well the models perform
- IndoELECTRA: ChristopherA08/IndoELECTRA
- IndoBERT: indobenchmark/indobert-base-p1
Both are transformer models pre-trained on Indonesian text; in this project they are fine-tuned to classify sentiment.
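As a rough illustration of how the two checkpoints listed above could be loaded for three-class classification, here is a minimal sketch. It assumes both checkpoints load through the HuggingFace `Auto*` classes; the `load_classifier` helper, the `MODEL_IDS` keys, and the label order are hypothetical names chosen for this example, not taken from the project notebook.

```python
# Checkpoint names come from the model list above.
MODEL_IDS = {
    "indoelectra": "ChristopherA08/IndoELECTRA",
    "indobert": "indobenchmark/indobert-base-p1",
}

# Assumed label order: the README names the classes but not their ids.
LABELS = ["negative", "neutral", "positive"]


def load_classifier(name):
    """Return (tokenizer, model) for one of the two checkpoints.

    The import is deferred so this module can be imported without
    transformers installed. Calling the function downloads the checkpoint
    and attaches a freshly initialised classification head.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = MODEL_IDS[name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=len(LABELS)
    )
    return tokenizer, model
```

Calling `load_classifier("indobert")` or `load_classifier("indoelectra")` would then give a tokenizer and model pair ready for fine-tuning.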
- Python
- PyTorch
- HuggingFace Transformers
- Scikit-learn
- Pandas & NumPy
- Matplotlib & Seaborn
The dataset used in this project is obtained from Kaggle:
👉 https://www.kaggle.com/datasets/izzahfathiyyah123/dataset-ulasan-gojek-clean
This dataset contains Indonesian user reviews of the Gojek application collected from Google Play Store.
In general, datasets like this include:
- User review text (content)
- Rating score (usually 1–5)
- Additional metadata such as date or app version
In this project, the data is organized into:
- Raw data → original reviews
- Processed data → cleaned text after preprocessing
- Labeled data → reviews with sentiment labels
- Clean text (remove symbols and noise)
- Convert text to lowercase
- Remove stopwords
- Normalize text
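The four cleaning steps above might look roughly like the sketch below. The stopword set and slang dictionary are tiny illustrative samples, not the project's actual resources, and the `preprocess` name is hypothetical. Slang is normalized before stopwords are removed so that a form like "yg" is first mapped to "yang" and then dropped.

```python
import re

# Tiny illustrative samples (assumptions); a real run would use a full
# Indonesian stopword list and slang dictionary.
STOPWORDS = {"yang", "dan", "di", "ke", "ini", "itu"}
SLANG = {"gak": "tidak", "ga": "tidak", "bgt": "banget", "yg": "yang"}


def preprocess(text):
    """Lowercase, strip noise, normalize slang, and remove stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # drop digits, punctuation, emoji
    tokens = [SLANG.get(tok, tok) for tok in text.split()]
    tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return " ".join(tokens)
```

Negation words such as "tidak" are deliberately left out of the sample stopword set, since removing them would flip the sentiment of a review.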
- Assign sentiment labels (positive, negative, neutral)
- Labels can be derived from rating scores or manual labeling
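If labels are derived from rating scores, one common convention is sketched below. The thresholds (1 to 2 negative, 3 neutral, 4 to 5 positive) are an assumption made for illustration; the README does not state which mapping the project uses, and `rating_to_label` is a hypothetical helper name.

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    rating = int(rating)
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"
```

Star ratings are a noisy proxy for sentiment, which is one reason manual labeling is listed as an alternative.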
- Tokenize the text with each model's tokenizer
- Fine-tune IndoELECTRA and IndoBERT on the labeled reviews
- Optimize with AdamW
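The tokenize, train, and optimize steps above can be sketched as three small functions. This is a minimal outline, not the notebook's code: the helper names are hypothetical, and the hyperparameters (`max_length=128`, `lr=2e-5`, `weight_decay=0.01`) are common starting values assumed here rather than the project's settings. PyTorch is imported lazily inside the functions that need it.

```python
def encode_batch(tokenizer, texts, labels, max_length=128):
    """Tokenize a list of strings and attach integer labels as a tensor."""
    import torch

    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,  # assumed cap; app reviews are usually short
        return_tensors="pt",
    )
    enc["labels"] = torch.tensor(labels)
    return enc


def make_optimizer(model, lr=2e-5, weight_decay=0.01):
    """Build an AdamW optimizer over all model parameters (assumed values)."""
    import torch

    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)


def train_one_epoch(model, batches, optimizer):
    """Run one pass over the batches and return the mean training loss."""
    model.train()
    total, count = 0.0, 0
    for batch in batches:
        optimizer.zero_grad()
        loss = model(**batch).loss  # HF models return a loss when labels are passed
        loss.backward()
        optimizer.step()
        total += loss.item()
        count += 1
    return total / count if count else 0.0
```

Tracking the value returned by `train_one_epoch` across epochs is one simple way to see when training stops improving.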
- Accuracy
- Precision, Recall, F1-score
- Confusion Matrix
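To show how the metrics above relate to one another, here is a dependency-free sketch that builds a confusion matrix and derives accuracy, precision, recall, and F1 from it. In practice scikit-learn's `confusion_matrix` and `classification_report` provide the same numbers; the function names and label order below are assumptions made for this example.

```python
# Assumed label order; rows are true labels, columns are predictions.
LABELS = ["negative", "neutral", "positive"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Count how often each true label was predicted as each label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal of the matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_scores(matrix, labels=LABELS):
    """Precision, recall, and F1 for each class."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted as label, but wrong
        fn = sum(matrix[i]) - tp                 # true label, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Per-class scores matter here because review datasets are often imbalanced, so accuracy alone can hide poor performance on the smaller classes.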
- The model learns quickly at the beginning but improves slowly after several epochs
- IndoELECTRA performs well but needs proper tuning
- IndoBERT gives stable results and works well as a baseline
- Good preprocessing significantly improves results
- Indonesian text is often noisy (slang, typos, informal words)
- Model performance can stop improving after several epochs
- Environment issues (e.g., Colab vs Kaggle dependencies)
- Hyperparameter tuning (learning rate, batch size)
- Improve data quality or add more data
- Combine models (ensemble learning)
- Deploy as a web or API-based application
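For the ensemble idea in the list above, one simple option is soft voting: average the class probabilities from the two models and pick the highest. The sketch below is only an illustration of that technique, not something the project implements; `soft_vote`, the label order, and the equal default weighting are assumptions.

```python
# Assumed label order shared by both models.
LABELS = ["negative", "neutral", "positive"]


def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Blend two probability vectors and return (label, blended_probs)."""
    if len(probs_a) != len(probs_b):
        raise ValueError("probability vectors must have the same length")
    blended = [
        weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)
    ]
    best = max(range(len(blended)), key=blended.__getitem__)
    return LABELS[best], blended
```

The `weight_a` parameter could be tuned on a validation set if one model proves consistently more reliable than the other.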
sentiment-analysis-indoelectra/
│
├── data/
├── notebook/
├── results/
├── requirements.txt
└── README.md
- Install dependencies: `pip install -r requirements.txt`
- Run the notebook: `jupyter notebook`
- Dataset: https://www.kaggle.com/datasets/izzahfathiyyah123/dataset-ulasan-gojek-clean
- IndoELECTRA: https://huggingface.co/ChristopherA08/IndoELECTRA
- IndoBERT: https://huggingface.co/indobenchmark/indobert-base-p1
Wani Syafitri | LPDP Awardee | Master's in Computer Science
Focus:
- Artificial Intelligence
- Natural Language Processing (NLP)
- AI-based Education Systems
- Backend Development (PHP, JavaScript, MySQL)
- GitHub: https://github.com/wanisya
- Email: wanisyf@gmail.com
- Kaggle: https://www.kaggle.com/wanisyafitri20
- LinkedIn: https://www.linkedin.com/in/wani-syafitri-3bbb90258/
- ResearchGate: https://www.researchgate.net/profile/Wani-Syafitri?ev=hdr_xprf
Give it a star ⭐ on GitHub!