Calorie Prediction Model

This project is a small machine learning exploration focused on predicting the energy value of meals from nutritional information. I created it as one of my high school ML projects for the home round of the Slovak AI Olympics 2025/26 competition.

The goal was not to build a production-ready nutrition system, but to understand the full pipeline: data preparation, feature engineering, model training, evaluation, and critical analysis of model results.

Project goal

The original idea was to:

estimate meal calories,
compare a simple linear model with a neural-network-based regressor,
and experiment with generating a healthier meal alternative.

What the project does

The notebook follows this workflow:

downloads food data from USDA FoodData Central,
extracts relevant nutrient information,
builds a small synthetic meal dataset,
engineers additional nutritional ratio features,
trains two regression models,
compares their prediction quality using standard regression metrics.

Dataset

The project uses the USDA FoodData Central Foundation Foods dataset as the nutritional source and generates a custom meal-level dataset from it.

Meal dataset construction

100 synthetic meals were created from ingredient templates,
each meal is represented as a text input listing ingredient names and gram amounts,
nutrient totals are computed by matching ingredients to USDA entries,
the generated meal dataset is saved as meals_dataset.csv.
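The aggregation step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the `NUTRIENTS_PER_100G` table, its values, and the `meal_totals` helper are hypothetical, and it assumes nutrient values are stored per 100 g and that each ingredient name matches a table entry exactly.

```python
# Hypothetical per-100 g nutrient values, made up for illustration only
# (these are NOT real USDA FoodData Central numbers).
NUTRIENTS_PER_100G = {
    "chicken breast": {"protein": 22.0, "carbohydrates": 0.0, "total_fats": 2.0},
    "rice": {"protein": 3.0, "carbohydrates": 28.0, "total_fats": 0.5},
}


def meal_totals(meal):
    """Sum nutrients for a meal given as {ingredient_name: grams}."""
    totals = {}
    for ingredient, grams in meal.items():
        # Raises KeyError if the ingredient has no matching entry.
        per_100g = NUTRIENTS_PER_100G[ingredient]
        for nutrient, value in per_100g.items():
            # Scale the per-100 g value to the actual gram amount.
            totals[nutrient] = totals.get(nutrient, 0.0) + value * grams / 100.0
    return totals


example_meal = {"chicken breast": 150, "rice": 200}
totals = meal_totals(example_meal)
print(totals)
```

In practice, matching free-text ingredient names to USDA entries is usually the harder part; the exact dictionary lookup here simply stands in for that step.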

Features used for modeling

The model uses both direct nutrient totals and engineered ratios:

protein
carbohydrates
total_fats
saturated_fats
fiber
water
protein_per_calorie
fiber_per_calorie
carbohydrates_per_calories
saturated_fats_to_total_fats
water_per_calories
fiber_to_carbs
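One plausible way to derive the ratio features from the nutrient totals is sketched below. The feature names come from the list above, but the formulas, the `calories` key, and the zero-denominator handling are assumptions rather than the notebook's confirmed implementation.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def add_ratio_features(row):
    """Return a copy of `row` extended with the engineered ratio features."""
    out = dict(row)
    # Assumed definitions: each "*_per_calorie(s)" feature divides by total kcal.
    out["protein_per_calorie"] = safe_ratio(row["protein"], row["calories"])
    out["fiber_per_calorie"] = safe_ratio(row["fiber"], row["calories"])
    out["carbohydrates_per_calories"] = safe_ratio(row["carbohydrates"], row["calories"])
    out["water_per_calories"] = safe_ratio(row["water"], row["calories"])
    out["saturated_fats_to_total_fats"] = safe_ratio(row["saturated_fats"], row["total_fats"])
    out["fiber_to_carbs"] = safe_ratio(row["fiber"], row["carbohydrates"])
    return out


# Made-up meal totals in grams, with calories in kcal.
row = {
    "protein": 39.0, "carbohydrates": 56.0, "total_fats": 4.0,
    "saturated_fats": 1.0, "fiber": 2.0, "water": 250.0, "calories": 416.0,
}
features = add_ratio_features(row)
```

If the features are computed this way, every ratio with calories in the denominator depends on the target itself, which is worth keeping in mind when reading the results.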

Models

The notebook compares three reference points:

1. Baseline

A simple baseline that predicts the training-set mean for every test example.

2. Linear Regression

This model was chosen as a strong, interpretable baseline because calories depend almost linearly on macronutrient quantities.

3. Neural Network (MLPRegressor)

A multilayer perceptron pipeline with feature scaling was used as a more flexible nonlinear alternative.
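A scaling-plus-MLP pipeline of this kind could look roughly like the sketch below, using scikit-learn. The hidden layer sizes, `max_iter`, the random seed, and the synthetic stand-in data are all assumptions chosen for illustration; the notebook's actual hyperparameters and features may differ.

```python
import random

from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

random.seed(0)

# Synthetic stand-in data: [protein, carbohydrates, total_fats] in grams.
X = [
    [random.uniform(0, 60), random.uniform(0, 120), random.uniform(0, 50)]
    for _ in range(100)
]
# Stand-in target in kcal, built from the 4-4-9 rule purely for the demo.
y = [4 * p + 4 * c + 9 * f for p, c, f in X]

# 80 / 20 split, mirroring the sample counts reported in the results.
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Scaling is fitted on the training data only, inside the pipeline.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
mlp.fit(X_train, y_train)
predictions = mlp.predict(X_test)
```

Wrapping the scaler and the regressor in one pipeline keeps test-set statistics out of the scaling step, which matters with a dataset as small as 100 meals.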

Results

Train/test split: 80 training samples / 20 test samples

Model	MAE	RMSE	R²	MAPE (%)
Baseline (train mean)	569.022	663.923	-0.005	87.656
Linear Regression	0.713	1.001	1.000	0.152
Neural Network (MLP)	76.318	100.899	0.977	11.512
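For reference, the four metrics in the table can be written out in plain Python as below. This is a sketch using the standard definitions; the function names and the tiny example values are made up, and the notebook may well use library implementations instead.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination, relative to the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no true value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up calorie values, only to exercise the functions.
y_train = [300.0, 500.0, 700.0]
y_test = [400.0, 800.0]

# Baseline: predict the training-set mean for every test example.
train_mean = sum(y_train) / len(y_train)
baseline_pred = [train_mean] * len(y_test)

print(mae(y_test, baseline_pred), rmse(y_test, baseline_pred))
print(r2(y_test, baseline_pred), mape(y_test, baseline_pred))
```

Because R² is measured against the mean of the test targets, a baseline that predicts the training mean scores at or slightly below zero whenever the two means differ, which is consistent with the small negative R² in the baseline row.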

Visualizations

Model performance (Real vs Predicted)

Shows how close predictions are to true calorie values for both models.

Error distribution (Residuals)

Compares residual error spread between Linear Regression and the Neural Network.

Data distribution (Calories)

Shows how calorie values are distributed across the synthetic meal dataset.

Important note on the results: the extremely strong Linear Regression score is not evidence of a genuinely near-perfect model. The feature set leaks the target, because calories can be reconstructed very closely from macronutrient totals such as protein, carbohydrates, and fats. In other words, the model is largely learning a nutritional identity rather than solving a genuinely difficult prediction task.

Interpretation

This was one of the most useful findings in the project.

At first, the Linear Regression metrics looked almost perfect, which would normally suggest an excellent model. After reviewing the feature set more carefully, I realized the result was heavily affected by data leakage.

In nutritional science, calories are commonly calculated from macronutrients using the Atwater system (approximately 4 kcal per gram of protein, 4 kcal per gram of carbohydrates, and 9 kcal per gram of fat). Because Linear Regression fits exactly this kind of weighted sum, it could recover something very close to the 4-4-9 rule from the nutrient totals. It essentially decoded the rule used to calculate the calories in the dataset in the first place, rather than finding hidden patterns, which made the task unrealistically easy. The engineered ratio features that use calories as the denominator likely add a second, more direct leak, since they are computed from the value being predicted.
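The effect can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the notebook's code: the targets are generated with the general 4-4-9 factors, the fit is done by hand through the normal equations, and no intercept is fitted (scikit-learn's `LinearRegression` fits one by default). The helper names are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(X, y):
    """Fit y ~ X w (no intercept) via the normal equations X^T X w = X^T y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


random.seed(0)
# Synthetic meals: [protein, carbohydrates, total_fats] in grams.
X = [
    [random.uniform(0, 60), random.uniform(0, 120), random.uniform(0, 50)]
    for _ in range(100)
]
# Calories generated with the 4-4-9 rule, so the target is a pure linear identity.
y = [4 * p + 4 * c + 9 * f for p, c, f in X]

weights = fit_least_squares(X, y)
print([round(w, 6) for w in weights])
```

When the target is an exact linear function of the inputs, least squares recovers the generating coefficients up to floating-point error, so near-zero test error says little about the model's ability to generalize to a harder task.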

So the most important conclusion is not that the model is perfect, but that feature selection matters just as much as model choice.

Healthier-alternative idea

The notebook also includes an experimental section that tries to propose a healthier alternative for a meal by:

lowering calories,
lowering carbohydrates,
and increasing protein share.

This part should be treated as a prototype rather than a finished feature. The main completed part of the project is the calorie prediction and the analysis of why the best-looking result was misleading.
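One simple, non-ML way to rank candidate alternatives against the three criteria above is a weighted score. The sketch below is hypothetical: the weights, the `improvement_score` function, and the example meals are invented for illustration, and it assumes the original meal has non-zero calories and carbohydrates.

```python
# Hypothetical weights; the notebook's actual weighting is not shown in this README.
WEIGHTS = {"calories": 0.4, "carbohydrates": 0.3, "protein_share": 0.3}


def protein_share(meal):
    """Fraction of calories coming from protein, using 4 kcal per gram."""
    return 4.0 * meal["protein"] / meal["calories"]


def improvement_score(original, candidate):
    """Higher is better: fewer calories, fewer carbs, larger protein share."""
    calorie_cut = (original["calories"] - candidate["calories"]) / original["calories"]
    carb_cut = (
        original["carbohydrates"] - candidate["carbohydrates"]
    ) / original["carbohydrates"]
    share_gain = protein_share(candidate) - protein_share(original)
    return (
        WEIGHTS["calories"] * calorie_cut
        + WEIGHTS["carbohydrates"] * carb_cut
        + WEIGHTS["protein_share"] * share_gain
    )


# Made-up meals: calories in kcal, nutrients in grams.
original = {"name": "original", "calories": 800.0, "carbohydrates": 100.0, "protein": 30.0}
candidates = [
    {"name": "A", "calories": 600.0, "carbohydrates": 60.0, "protein": 40.0},
    {"name": "B", "calories": 780.0, "carbohydrates": 95.0, "protein": 30.0},
]

best = max(candidates, key=lambda meal: improvement_score(original, meal))
print(best["name"])
```

A linear score like this is easy to inspect and tune by hand, but the choice of weights is subjective and would need to be justified for any real nutritional use.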

Project structure

energia_jedlo.ipynb — main notebook with data preparation, dataset generation, model training, evaluation, and visualizations
meals_dataset.csv — generated meal dataset used for training
README.md — project summary

How to run

Open energia_jedlo.ipynb.
Run the notebook cells in order.

Main takeaways

I practiced data preparation and learned how important high-quality data is.
I compared a simple interpretable model with a neural network.
I learned that excellent metrics can be misleading when the feature set leaks target information.
I learned how to design a weighting algorithm and generate healthier alternatives without using machine learning.

Future improvements

If I continue this project, the next steps would be:

remove leakage-prone features,
predict calories from ingredient-level representations instead of direct nutrient totals,
use a larger and more realistic meal dataset,
finish and refine the healthier-alternative generator.

Final note

This project is best understood as a learning-focused ML case study. The strongest part is not the raw score itself, but the fact that the notebook identifies why that score is misleading and what should be improved next.
