- University: University of Piraeus
- Department: Informatics
- Academic Year: 2022–2023
- Semester: 5th
- Course: Pattern Recognition
- Assignment: Computational Course Assignment
This project was developed in Python using Visual Studio Code and the following libraries:
- pandas
- matplotlib
- scikit-learn
- keras
- seaborn
- numpy
These libraries were installed through the terminal using standard pip commands (see bibliography in the report).
- Data Loading: We begin by loading the dataset and verifying its structure using the .head() function.
- Feature Separation: The target feature is median_house_value.
  - X: all features except median_house_value
  - z: only median_house_value
- Numerical vs. Categorical Features: Using .info(), we identify data types:
  - ocean_proximity: categorical
  - All others: numerical
- Scaling Data: We apply Min-Max Scaling to bring numerical features into the 0–1 range:

      scaler = MinMaxScaler()
      X_scaled = pd.DataFrame(scaler.fit_transform(X[numerical]), columns=numerical)
      X_temp = X.drop(numerical, axis=1)
      X = pd.concat([X_temp, X_scaled], axis=1)

  Similar scaling is applied to z.
- One-Hot Encoding: We one-hot encode the categorical feature:

      # Note: scikit-learn 1.2 and later name this parameter sparse_output instead of sparse
      encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
      X_enc = pd.DataFrame(encoder.fit_transform(X[categorical]), columns=oc_prox)
      X_temp = X.drop(categorical, axis=1)
      X = pd.concat([X_temp, X_enc], axis=1)
- Handling Missing Values: Using SimpleImputer(strategy='median'), we fill missing values (e.g., in total_bedrooms) with the median.
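The preparation steps above that come without code (separating the target, telling numerical from categorical columns, and median imputation) can be sketched as follows. This is a minimal sketch on a tiny made-up DataFrame, not the project's actual code: the sample values and the helper name `split_and_impute` are assumptions for illustration. Only the column names and `SimpleImputer(strategy='median')` come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def split_and_impute(data, target="median_house_value"):
    """Hypothetical helper: split off the target and median-fill numeric gaps."""
    X = data.drop(columns=[target])
    z = data[[target]]
    # Columns with a numeric dtype count as numerical, everything else as categorical.
    numerical = X.select_dtypes(include="number").columns.tolist()
    categorical = X.select_dtypes(exclude="number").columns.tolist()
    # Replace each missing numeric value with the median of its column.
    imputer = SimpleImputer(strategy="median")
    X = X.copy()
    X[numerical] = imputer.fit_transform(X[numerical])
    return X, z, numerical, categorical


# Made-up rows, used only to exercise the helper.
sample = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
    "median_house_value": [100000.0, 200000.0, 300000.0, 400000.0],
})
X, z, numerical, categorical = split_and_impute(sample)
print(X["total_bedrooms"].tolist())
```

In this sample the missing total_bedrooms entry is filled with 300.0, the median of the three observed values.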
- Histograms: With Seaborn, we visualize distributions:

      sns.histplot(data[column], bins=50, kde=True, lw=2)
- Scatter Plots: Using Pandas and Seaborn:

      dataset.plot(kind='scatter', x='longitude', y='median_house_value')
      sns.scatterplot(x=data['median_income'], y=data['median_house_value'], hue=data['NEAR OCEAN'])
These plots reveal correlations (e.g., between median_income and median_house_value) and geographic patterns.
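A correlation seen in a scatter plot can also be expressed as a single number. The sketch below computes the Pearson coefficient with pandas' `Series.corr` on synthetic data. The generated values, and the linear relation used to create them, are assumptions standing in for the real dataset, so the printed number says nothing about the actual housing data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): value grows linearly with income, plus noise.
rng = np.random.default_rng(0)
income = rng.uniform(0.5, 15.0, size=200)
value = 45_000 * income + rng.normal(0.0, 50_000.0, size=200)
data = pd.DataFrame({"median_income": income, "median_house_value": value})

# Pearson correlation coefficient between the two columns.
r = data["median_income"].corr(data["median_house_value"])
print(f"Pearson r = {r:.3f}")
```

Applied to the real DataFrame, the same call would put a number on what the scatter plot shows visually.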
We implemented two core functions:
    import numpy as np

    def least_squares_train(X, y):
        # Normal equations: w = pinv(X^T X) X^T y; pinv also copes with a singular X^T X
        mul1 = X.T.dot(X)
        inv1 = np.linalg.pinv(mul1)
        mul2 = X.T.dot(y)
        weight = np.matmul(inv1, mul2)
        return weight

    def least_squares_predict(X, w):
        # Predictions are the linear combination X w
        return np.matmul(X, w)

- Evaluation: We applied 10-fold cross-validation using scikit-learn’s KFold, calculating:
  - Mean Squared Error (MSE)
  - Mean Absolute Error (MAE)
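The evaluation loop described above could look roughly like the following. This is a sketch under stated assumptions rather than the project's code: the helper name `cross_validate`, the shuffling with a fixed seed, the explicit bias column, and the noise-free synthetic data are all illustrative choices that the text does not specify.

```python
import numpy as np
from sklearn.model_selection import KFold


def least_squares_train(X, y):
    # Normal equations solved with the pseudo-inverse.
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)


def least_squares_predict(X, w):
    return X @ w


def cross_validate(X, y, n_splits=10, seed=0):
    """Hypothetical helper: mean MSE and MAE over the held-out folds."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mse_scores, mae_scores = [], []
    for train_idx, test_idx in kfold.split(X):
        # Fit on the training folds, then score on the held-out fold.
        w = least_squares_train(X[train_idx], y[train_idx])
        residual = y[test_idx] - least_squares_predict(X[test_idx], w)
        mse_scores.append(np.mean(residual ** 2))
        mae_scores.append(np.mean(np.abs(residual)))
    return float(np.mean(mse_scores)), float(np.mean(mae_scores))


# Synthetic, noise-free linear data (assumption), so both errors should be close to zero.
rng = np.random.default_rng(0)
features = rng.uniform(0.0, 1.0, size=(100, 3))
X_demo = np.hstack([features, np.ones((100, 1))])  # append a bias column
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y_demo = X_demo @ true_w

mse, mae = cross_validate(X_demo, y_demo)
print(f"MSE = {mse:.3e}, MAE = {mae:.3e}")
```

Averaging the per-fold scores gives one MSE and one MAE figure for the model. On real data, the spread across folds is also worth inspecting.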
We used Keras Sequential to build a neural network with:
- Four dense layers
- relu and softmax activations
- Optimizer: adam
- Loss: mean_squared_error
- Metrics: mae
We applied K-Fold Cross-Validation for performance evaluation.
- main.py: Main script for data processing and model execution
- report.pdf: Detailed report with screenshots, visualizations, and analysis
- requirements.txt: List of required Python libraries
- Install dependencies:

      pip install -r requirements.txt

- Run the main script:

      python main.py