This project is a small machine learning exploration focused on predicting the energy value of meals from nutritional information. I created it as one of my high school ML projects for the home round of the Slovak AI Olympics 2025/26 competition.
The goal was not to build a production-ready nutrition system, but to understand the full pipeline: data preparation, feature engineering, model training, evaluation, and critical analysis of model results.
The original idea was to:
- estimate meal calories,
- compare a simple linear model with a neural-network-based regressor,
- and experiment with generating a healthier meal alternative.
The notebook follows this workflow:
- downloads food data from USDA FoodData Central,
- extracts relevant nutrient information,
- builds a small synthetic meal dataset,
- engineers additional nutritional ratio features,
- trains two regression models,
- compares their prediction quality using standard regression metrics.
The project uses the USDA FoodData Central Foundation Foods dataset as its nutritional source and generates a custom meal-level dataset from it:
- 100 synthetic meals were created from ingredient templates,
- each meal is represented by a text input listing ingredient names and gram amounts,
- nutrient totals are computed by matching ingredients to USDA entries,
- the generated meal dataset is saved as meals_dataset.csv.
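The nutrient-total step can be sketched as follows. This is a minimal illustration, not the notebook's code: the lookup table `NUTRIENTS_PER_100G`, the function `meal_totals`, and the per-100 g numbers are all hypothetical placeholders rather than real USDA values.

```python
# Illustrative per-100 g values; placeholders, NOT real USDA FoodData Central numbers.
NUTRIENTS_PER_100G = {
    "chicken breast": {"protein": 22.0, "carbohydrates": 0.0, "total_fats": 2.0},
    "cooked rice": {"protein": 3.0, "carbohydrates": 28.0, "total_fats": 0.5},
}


def meal_totals(ingredients):
    """Sum nutrients over (name, grams) pairs, scaling per-100 g values by weight."""
    totals = {}
    for name, grams in ingredients:
        per_100g = NUTRIENTS_PER_100G[name]  # raises KeyError if the ingredient is unmatched
        for nutrient, value in per_100g.items():
            totals[nutrient] = totals.get(nutrient, 0.0) + value * grams / 100.0
    return totals


# A hypothetical meal: 150 g of chicken breast and 200 g of cooked rice.
meal = [("chicken breast", 150), ("cooked rice", 200)]
totals = meal_totals(meal)
print(totals)
```

Raising on an unmatched ingredient is a deliberate choice in this sketch: silently skipping it would understate the meal's totals.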
The models use both direct nutrient totals and engineered ratios:
- protein
- carbohydrates
- total_fats
- saturated_fats
- fiber
- water
- protein_per_calorie
- fiber_per_calorie
- carbohydrates_per_calories
- saturated_fats_to_total_fats
- water_per_calories
- fiber_to_carbs
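A minimal sketch of how the ratio features could be derived from the nutrient totals. The helper names `safe_ratio` and `add_ratio_features`, the `calories` key, and the zero-denominator rule are assumptions made for illustration; the notebook may handle these differently.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def add_ratio_features(meal):
    """Return only the engineered ratio features for one meal (a dict of totals)."""
    calories = meal["calories"]  # assumed column name for the energy value
    return {
        "protein_per_calorie": safe_ratio(meal["protein"], calories),
        "fiber_per_calorie": safe_ratio(meal["fiber"], calories),
        "carbohydrates_per_calories": safe_ratio(meal["carbohydrates"], calories),
        "saturated_fats_to_total_fats": safe_ratio(meal["saturated_fats"], meal["total_fats"]),
        "water_per_calories": safe_ratio(meal["water"], calories),
        "fiber_to_carbs": safe_ratio(meal["fiber"], meal["carbohydrates"]),
    }


# Hypothetical meal totals (grams and kcal), for illustration only.
example_meal = {
    "protein": 39.0, "carbohydrates": 56.0, "total_fats": 4.0,
    "saturated_fats": 1.0, "fiber": 2.0, "water": 250.0, "calories": 416.0,
}
ratios = add_ratio_features(example_meal)
```

If the notebook computes the per-calorie ratios from the true calorie value, as the names suggest, those columns carry the target directly and are worth checking together with the macronutrient totals.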
The notebook compares three reference points:
- Baseline (train mean): a simple baseline that predicts the training-set mean for every test example.
- Linear Regression: chosen as a strong, interpretable model because calorie values can have a strong linear relationship with nutritional quantities.
- Neural Network (MLP): a multilayer perceptron pipeline with feature scaling, used as a more flexible nonlinear alternative.
Train/test split: 80 training samples / 20 test samples
| Model | MAE | RMSE | R² | MAPE (%) |
|---|---|---|---|---|
| Baseline (train mean) | 569.022 | 663.923 | -0.005 | 87.656 |
| Linear Regression | 0.713 | 1.001 | 1.000 | 0.152 |
| Neural Network (MLP) | 76.318 | 100.899 | 0.977 | 11.512 |
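For reference, the four metrics in the table can be written out from their standard definitions. This is a sketch with toy numbers, not the notebook's evaluation code (which may rely on library functions); `regression_metrics` is a hypothetical helper. It also shows why a train-mean baseline can score an R² slightly below zero: R² is measured against the test-set mean, which the training mean only approximates.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R² and MAPE (%) from their standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if all true values are identical
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE": mape}


# Toy numbers for illustration only, not values from the project.
y_train = [200.0, 400.0, 600.0]
y_test = [300.0, 500.0, 900.0]

# The baseline predicts the training-set mean for every test example.
baseline_pred = [sum(y_train) / len(y_train)] * len(y_test)
baseline = regression_metrics(y_test, baseline_pred)
print(baseline)
```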
The notebook includes three plots:
- one shows how close predictions are to true calorie values for both models,
- one compares residual error spread between Linear Regression and the Neural Network,
- one shows how calorie values are distributed across the synthetic meal dataset.
Important note: the extremely strong Linear Regression result is not realistic evidence of a near-perfect model. The dataset contains target-leaking information because calories can be reconstructed very closely from macronutrient totals such as protein, carbohydrates, and fats. In other words, the model is partially learning a nutritional identity rather than a genuinely difficult prediction task.
This was one of the most useful findings in the project.
At first, the Linear Regression metrics looked almost perfect, which would normally suggest an excellent model. After reviewing the feature set more carefully, I realized the result was heavily affected by data leakage.
In nutritional science, calories are calculated directly from macronutrients using the Atwater system (approximately 4 kcal per gram of protein, 4 kcal per gram of carbohydrates, and 9 kcal per gram of fat). Because Linear Regression fits a weighted sum of its input features, it could recover this 4-4-9 formula almost exactly. Rather than finding hidden patterns, it essentially decoded the rule used to calculate the calories in the dataset in the first place, which made the task unrealistically easy. This is a data leak.
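The effect is easy to reproduce on a tiny example. The sketch below is not the notebook's code: it swaps Linear Regression for an exact three-equation solve, and the meals and helper names (`atwater_kcal`, `solve_3x3`) are hypothetical. When the target is generated by the 4-4-9 rule, three meals are already enough to recover the coefficients.

```python
from fractions import Fraction


def atwater_kcal(protein_g, carbs_g, fat_g):
    """Approximate energy (kcal) from the 4-4-9 Atwater factors."""
    return 4 * protein_g + 4 * carbs_g + 9 * fat_g


def solve_3x3(rows, targets):
    """Solve a 3x3 linear system exactly with Gauss-Jordan elimination."""
    # Build the augmented matrix with exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(t)] for row, t in zip(rows, targets)]
    for col in range(3):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, 3) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Normalize the pivot row, then eliminate the column from the other rows.
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(3):
            if r != col:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][3] for i in range(3)]


# Hypothetical meals as (protein, carbohydrates, fat) in grams.
meals = [(30, 50, 10), (10, 80, 5), (25, 20, 30)]
calories = [atwater_kcal(*meal) for meal in meals]

coefficients = solve_3x3(meals, calories)
print([int(c) for c in coefficients])
```

Real USDA energy values do not follow the 4-4-9 rule exactly, which is consistent with the small but nonzero Linear Regression error in the results table.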
So the most important conclusion is not that the model is perfect, but that feature selection matters just as much as model choice.
The notebook also includes an experimental section that tries to propose a healthier alternative for a meal by:
- lowering calories,
- lowering carbohydrates,
- and increasing protein share.
This part should be treated as a prototype rather than a finished feature. The main completed part of the project is the calorie prediction and the analysis of why the best-looking result was misleading.
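One way a rule-based generator could rank candidate meals against the three goals above is a weighted score. This is a hedged sketch only: the weights, the dictionary keys, and the functions `protein_share`, `improvement_score`, and `pick_alternative` are hypothetical and are not taken from the notebook. It assumes the original meal has positive calories and carbohydrates.

```python
# Hypothetical weights; the notebook's actual weighting is not documented here.
WEIGHTS = {"calories": 0.4, "carbohydrates": 0.3, "protein_share": 0.3}


def protein_share(meal):
    """Fraction of macronutrient grams that come from protein."""
    total = meal["protein"] + meal["carbohydrates"] + meal["total_fats"]
    return meal["protein"] / total if total else 0.0


def improvement_score(original, candidate):
    """Weighted sum of relative calorie cut, relative carb cut, and protein-share gain."""
    calorie_cut = (original["calories"] - candidate["calories"]) / original["calories"]
    carb_cut = (original["carbohydrates"] - candidate["carbohydrates"]) / original["carbohydrates"]
    share_gain = protein_share(candidate) - protein_share(original)
    return (WEIGHTS["calories"] * calorie_cut
            + WEIGHTS["carbohydrates"] * carb_cut
            + WEIGHTS["protein_share"] * share_gain)


def pick_alternative(original, candidates):
    """Return the candidate with the highest improvement score."""
    return max(candidates, key=lambda c: improvement_score(original, c))


# Hypothetical meals (kcal and grams), for illustration only.
original = {"calories": 800, "carbohydrates": 100, "protein": 30, "total_fats": 30}
lighter = {"calories": 600, "carbohydrates": 60, "protein": 40, "total_fats": 20}
similar = {"calories": 780, "carbohydrates": 95, "protein": 30, "total_fats": 30}

best = pick_alternative(original, [lighter, similar])
```

Using relative rather than absolute reductions keeps the three terms on comparable scales, so no single goal dominates only because of its units.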
- energia_jedlo.ipynb — main notebook with data preparation, dataset generation, model training, evaluation, and visualizations
- meals_dataset.csv — generated meal dataset used for training
- README.md — project summary
- Open energia_jedlo.ipynb.
- Run the notebook cells in order.
- I practiced data preparation and learned how important high-quality data is.
- I compared a simple interpretable model with a neural network.
- I learned that excellent metrics can be misleading when the feature set leaks target information.
- I learned how to design a weighting algorithm and generate healthier alternatives without using machine learning.
If I continue this project, the next steps would be:
- remove leakage-prone features,
- predict calories from ingredient-level representations instead of direct nutrient totals,
- use a larger and more realistic meal dataset,
- finish and refine the healthier-alternative generator.
This project is best understood as a learning-focused ML case study. The strongest part is not the raw score itself, but the fact that the notebook identifies why that score is misleading and what should be improved next.


