Hi, I’m João Saraiva, a Data Analyst with a strong interest in analytics, process improvement, and data-driven problem solving.
At the moment, I’m working on a risk prevention project focused on the analysis of operational incidents. By applying text mining techniques such as LDA and BERTopic, I aim to uncover patterns, detect recurring themes, and generate insights that can help strengthen prevention and decision-making processes.
My main tools include Python, KNIME, Power BI, Excel, SPSS, and Power Automate.
You can contact me via email at joaomariosaraiva.99@gmail.com or connect with me on LinkedIn.
Project description:
This is an HR analytics project where I explored employee performance evaluation, workforce segmentation, and rater-effect validation using a combination of dashboarding, clustering, and regression analysis.
- Data was based on an HR evaluation dataset covering multiple fiscal periods
- Built an interactive dashboard in Power BI to analyse workforce structure, score distribution, visibility needs, promotion patterns, and supervisory trends
- Applied K-Modes clustering to segment employees into distinct categorical profiles
- Used statistical tests and regression models to validate whether the Nine Block Score was meaningfully linked to talent-related outcomes
- Main tools used: Python, Pandas, Power BI, KModes, Logistic Regression
- Key findings showed clear employee segments with different performance, development, and promotion patterns, while also highlighting the limitations of the score in predicting attrition risk
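As a rough illustration of the K-Modes step listed above, here is a minimal pure-Python sketch rather than the project code. It assigns each record to the closest mode by simple mismatch count and then recomputes each mode attribute by attribute. The records, attribute names, and starting modes are hypothetical, and a real analysis would more likely rely on the KModes package.

```python
from collections import Counter


def mismatches(a, b):
    """Number of attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value of each attribute across the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, init_modes, max_iter=20):
    """Alternate between assigning records and updating modes until stable."""
    modes = list(init_modes)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatches(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster ends up empty
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical records: (performance band, potential band, promoted)
records = [
    ("high", "high", "yes"),
    ("high", "high", "no"),
    ("high", "medium", "yes"),
    ("low", "low", "no"),
    ("low", "medium", "no"),
    ("low", "low", "no"),
]
labels, modes = k_modes(records, init_modes=[records[0], records[3]])
print(labels, modes)
```

The mismatch count plays the role that Euclidean distance plays in K-Means, which is why the approach suits purely categorical HR attributes.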
Project description:
This is a wine analytics project where I explored product segmentation, catalogue profiling, and cluster interpretation using a combination of exploratory analysis, dimensionality reduction, clustering, and AI-assisted interpretation.
- Data was based on a wine catalogue dataset containing 178 wines described across 13 chemical attributes
- Performed data quality checks, descriptive analysis, and normalisation to prepare the dataset for modelling
- Applied correlation analysis and PCA to reduce dimensionality and identify the most relevant variables for segmentation
- Used K-Means clustering to segment the catalogue into 4 distinct wine profiles
- Complemented the analysis with silhouette plots, Self-Organizing Maps (SOMs), and a decision tree to validate and interpret the clusters
- Used the ChatGPT API to help translate technical outputs into business-oriented interpretations related to wine quality, classification, and marketing potential
- Main tools used: Python, Pandas, Scikit-learn, Seaborn, Plotly, PCA, K-Means, SOMs, Decision Trees, OpenAI API
- Key findings showed that the catalogue can be segmented into 4 distinct wine profiles, providing a clearer basis for targeted advertising, stronger product positioning, and more efficient catalogue promotion
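To show what the silhouette check mentioned above measures, here is a small sketch that computes silhouette coefficients by hand for a given cluster assignment. The points and labels are made up for illustration and are not the wine data; it also assumes at least two clusters. In practice scikit-learn's silhouette functions would normally do this work.

```python
from math import dist
from statistics import mean


def silhouette_scores(points, labels):
    """Silhouette coefficient of each point for a given cluster assignment."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = mean(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean([dist(p, q) for j, q in enumerate(points) if labels[j] == c])
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


# Hypothetical 2-D points forming two obvious groups
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean(silhouette_scores(points, [0, 0, 1, 1]))
bad = mean(silhouette_scores(points, [0, 1, 0, 1]))
print(f"well separated: {good:.2f}, mixed: {bad:.2f}")
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 suggest that points sit closer to another cluster than to their own.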
Project description:
This is a customer analytics project where I explored customer behavior, feature engineering, and predictive modeling to estimate the likelihood of purchasing a transfer service, using a combination of data preprocessing, classification models, and performance evaluation.
- Data was based on real-world operational datasets from a short-term rental company, combining reservations, apartments, and transfer records into a unified analytical dataset
- Performed extensive data cleaning, including handling missing values, standardizing categorical variables, and resolving inconsistencies typical of real business data
- Engineered relevant features such as check-in time categories, distance to airport, booking lead time, and seasonality indicators to capture behavioral patterns
- Applied one-hot encoding and dataset balancing techniques (RandomOverSampler) to improve model performance on imbalanced data
- Tested multiple classification models, including Decision Trees, Bagging, Random Forest, Gradient Boosting, and XGBoost
- Main tools used: Python, Pandas, Scikit-learn, XGBoost, Imbalanced-learn
- Key findings showed that customer purchase behavior is strongly influenced by booking timing, check-in period, and location-related factors. Gradient Boosting achieved the best performance in identifying potential buyers, enabling more effective targeting strategies and supporting data-driven decision-making for service promotion.
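The feature engineering step above could look roughly like the sketch below. The field names (`booked_at`, `checkin_at`), the time-of-day cut-offs, and the northern-hemisphere season mapping are all assumptions made for illustration, not the definitions used in the project.

```python
from datetime import datetime

# Assumed mapping of calendar month (1-12) to season, northern hemisphere
SEASONS = ("winter", "winter", "spring", "spring", "spring", "summer",
           "summer", "summer", "autumn", "autumn", "autumn", "winter")


def checkin_period(hour):
    """Bucket a check-in hour into a coarse period (assumed cut-offs)."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def engineer(reservation):
    """Derive lead time, check-in period, and season from one reservation."""
    booked = datetime.fromisoformat(reservation["booked_at"])
    checkin = datetime.fromisoformat(reservation["checkin_at"])
    return {
        "lead_time_days": (checkin.date() - booked.date()).days,
        "checkin_period": checkin_period(checkin.hour),
        "season": SEASONS[checkin.month - 1],
    }


# Hypothetical reservation record
example = engineer({
    "booked_at": "2023-05-01T10:30:00",
    "checkin_at": "2023-07-15T21:00:00",
})
print(example)
```

Categorical outputs such as `checkin_period` and `season` would then go through one-hot encoding before being passed to the classifiers.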
Project description:
This is a fraud analytics project where I explored anomaly detection and classification techniques using a combination of statistical methods, machine learning models, and deployment workflows in KNIME.
- Applied outlier detection (IQR), Decision Trees, Random Forest, Gradient Boosting, Logistic Regression, and Autoencoders
- Performed feature engineering and model-specific preprocessing to improve detection performance
- Designed a 3-layer hybrid architecture combining Autoencoder, Decision Tree, and Quartile-based validation
- Simulated deployment with email alerts for suspicious transactions
- Main tools used: KNIME Analytics Platform, Machine Learning, Feature Engineering
- Key findings showed that combining models improves fraud detection performance; the design focused on identifying the minority class and enabling real-time monitoring
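For reference, the IQR outlier rule mentioned above can be sketched in a few lines of Python. The project itself was built in KNIME, so this is only an equivalent illustration; the transaction amounts and the conventional 1.5 multiplier are assumptions.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical transaction amounts with one suspicious value
amounts = [12.0, 15.5, 14.0, 13.2, 16.1, 15.0, 14.8, 950.0]
flagged = iqr_outliers(amounts)
print(flagged)
```

A rule like this is cheap and easy to explain, which makes it a reasonable final validation layer after the model-based steps.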
Project description:
This project focuses on automating customer support email handling using AI and workflow automation tools, improving efficiency, reducing manual effort, and enhancing response time.
- Automated classification of emails into complaints or information requests using NLP
- Performed sentiment analysis, language detection, urgency detection, and TIN extraction
- Generated structured request summaries to support operational decision-making
- Implemented automated workflows with Power Automate for email processing and task orchestration
- Integrated AI Builder to extract invoice data from PDF attachments
- Stored and managed data using Excel and OneDrive
- Created tasks and approval flows for customer support teams
- Sent real-time alerts via Microsoft Teams for urgent requests
- Main tools used: Power Automate, AI Builder, NLP, Excel, OneDrive, Microsoft Teams
- The system streamlines customer support operations by automating repetitive tasks, improving data consistency, and enabling faster and more structured request resolution
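The TIN extraction and urgency detection steps above were built with Power Automate and AI tooling. The sketch below is only a simple rule-based stand-in to show the idea. The 9-digit TIN format and the keyword list are assumptions, and the sample email is invented.

```python
import re

# Assumption: a TIN is a standalone run of exactly 9 digits
TIN_PATTERN = re.compile(r"\b\d{9}\b")

# Assumption: a small keyword list is enough to flag urgency
URGENT_TERMS = ("urgent", "immediately", "asap", "as soon as possible")


def extract_tins(text):
    """Return every 9-digit token found in the text."""
    return TIN_PATTERN.findall(text)


def is_urgent(text):
    """Flag the message if it contains any of the urgency keywords."""
    lowered = text.lower()
    return any(term in lowered for term in URGENT_TERMS)


# Hypothetical customer email
email = ("Hello, my TIN is 123456789 and I need this invoice "
         "corrected urgently. Order 4471.")
summary = {"tins": extract_tins(email), "urgent": is_urgent(email)}
print(summary)
```

A keyword rule like this will miss paraphrased urgency, which is one reason a language-model-based classifier is the better fit in production.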
Project description:
This is a customer analytics and machine learning project focused on predicting airline passenger satisfaction and identifying the key drivers behind customer experience.
- Performed EDA, correlation analysis, and PCA, identifying 6 latent components of service experience
- Tested multiple models including Logistic Regression, Decision Trees, Random Forest, Bagging, Gradient Boosting, XGBoost and Neural Networks
- Applied GridSearchCV for model tuning
- Main tools used: Python, Pandas, Scikit-learn, XGBoost, SHAP
- Key findings showed that satisfaction is primarily driven by Seat comfort, Customer Type, Type of Travel and service-related variables, while PCA reduced model performance and neural networks were not necessary given the strong performance of tree-based models.