- This repository was created for the Machine Learning and Statistics module, Higher Diploma in Science in Data Analytics at Atlantic Technological University, September to December 2023.
- It contains five tasks, numbered 1 to 5, covering square root approximation, chi-squared testing, t-testing, separability of the Iris data set, and Principal Component Analysis, together with an independent Iris dataset project.
- Python 3.11.4 or later
- Anaconda
- Visual Studio Code
- Jupyter Notebooks (detailed below)
- Libraries: pandas, matplotlib, scikit-learn (imports detailed in notebook)
- Clone the Repository:
  `git clone https://github.com/nexlanglxm/machine-learning-and-statistics.git`
- Open the folder with:
  - Jupyter Notebook
  - Visual Studio Code (with Jupyter support installed)
- Navigate to the Specific Task:
  - For Tasks: Access the `tasks.ipynb` file.
  - For Iris Dataset Project: Open and explore the `project.ipynb` file.
Square roots are difficult to calculate. In Python, you typically use the power operator (a double asterisk) or a package such as math. In this task, you should write a function sqrt(x) to approximate the square root of a floating-point number x without using the power operator or a package.
Instead, you should use Newton's method: start with an initial guess for the square root and refine it iteratively.
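A minimal sketch of how Newton's method could be applied here, not necessarily the notebook's implementation. The starting guess, the `tolerance` and `max_iterations` parameters, and the stopping rule are assumptions made for illustration. Each step replaces the guess with the average of the guess and `x / guess`.

```python
def sqrt(x, tolerance=1e-12, max_iterations=200):
    """Approximate the square root of a non-negative number with Newton's method.

    Uses only basic arithmetic: no power operator and no imported package.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0

    # Assumed starting point: x itself for x >= 1, otherwise 1.0.
    guess = x if x >= 1 else 1.0

    for _ in range(max_iterations):
        # Newton update for f(z) = z*z - x.
        next_guess = (guess + x / guess) / 2
        # Stop once the relative change is below the tolerance.
        if abs(next_guess - guess) <= tolerance * next_guess:
            return next_guess
        guess = next_guess

    # Return the best estimate if the tolerance was not reached.
    return guess


print(sqrt(2.0))
```

The iteration cap is a safeguard only; very large inputs may need more iterations than the assumed default to reach the tolerance.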
Consider the contingency table given in the task, which is based on a survey asking respondents whether they prefer coffee or tea and whether they prefer plain or chocolate biscuits. Use scipy.stats to perform a chi-squared test to see whether there is any evidence of an association between drink preference and biscuit preference in this instance.
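As a rough sketch, the test could be run with `scipy.stats.chi2_contingency`. The counts below are placeholders invented for illustration, not the survey figures from the task, and the row and column ordering is likewise an assumption.

```python
from scipy.stats import chi2_contingency

# PLACEHOLDER counts for illustration only; replace with the table from the task.
# Assumed layout: rows = drink (coffee, tea), columns = biscuit (chocolate, plain).
observed = [
    [30, 20],
    [25, 25],
]

# For a 2x2 table, SciPy applies Yates' continuity correction by default.
statistic, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-squared statistic: {statistic:.3f}")
print(f"p-value: {p_value:.3f}")
print(f"degrees of freedom: {dof}")
print("expected counts under independence:")
print(expected)
```

A small p-value (commonly below 0.05) would count as evidence of an association; a large one would not.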
Perform a t-test on the famous penguins data set to investigate whether there is evidence of a significant difference in body mass of male and female gentoo penguins.
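One possible way to set up the comparison is sketched below. It assumes a DataFrame with `species`, `sex`, and `body_mass_g` columns, as in the copy of the penguins data distributed with seaborn; the function name `gentoo_body_mass_ttest` is hypothetical. It uses Welch's t-test (`equal_var=False`), which is a choice made here rather than a requirement of the task.

```python
import pandas as pd
from scipy.stats import ttest_ind


def gentoo_body_mass_ttest(penguins: pd.DataFrame):
    """Compare body mass of male and female Gentoo penguins with Welch's t-test."""
    # Keep Gentoo rows that have both a recorded sex and a body mass.
    gentoo = penguins[penguins["species"] == "Gentoo"].dropna(
        subset=["sex", "body_mass_g"]
    )

    # Normalise the labels so "Male"/"MALE"/"male" are treated alike.
    sex = gentoo["sex"].str.lower()
    male = gentoo.loc[sex == "male", "body_mass_g"]
    female = gentoo.loc[sex == "female", "body_mass_g"]

    # equal_var=False selects Welch's t-test (no equal-variance assumption).
    return ttest_ind(male, female, equal_var=False)
```

With seaborn installed, the data could be loaded with `seaborn.load_dataset("penguins")` and passed to the function; the returned object exposes `statistic` and `pvalue`.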
Using the famous iris data set, suggest whether the setosa class is easily separable from the other two classes. Provide evidence for your answer.
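One simple form of evidence is to check whether a single feature's range for setosa overlaps with the range for the other two classes. The sketch below assumes the copy of the iris data bundled with scikit-learn, where column index 2 is petal length and target 0 is setosa; the notebook may use a different source or approach.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Column 2 holds petal length (cm); target 0 is the setosa class.
petal_length = X[:, 2]
setosa = petal_length[y == 0]
others = petal_length[y != 0]

# If the largest setosa value is below the smallest value of the other
# classes, a single threshold on petal length separates setosa perfectly.
gap_exists = setosa.max() < others.min()

print(f"setosa petal length range: {setosa.min()} to {setosa.max()}")
print(f"other classes petal length range: {others.min()} to {others.max()}")
print(f"non-overlapping ranges: {gap_exists}")
```

A scatter plot of petal length against petal width would show the same information visually.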
Perform Principal Component Analysis on the iris data set, reducing the number of dimensions to two. Explain the purpose of the analysis and your results.
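The reduction to two dimensions can be sketched with scikit-learn as follows. Standardising the features before PCA is an assumption made here, not something the task specifies, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardise each feature to zero mean and unit variance so that no single
# measurement dominates the components (an assumed preprocessing step).
X_scaled = StandardScaler().fit_transform(X)

# Project the four original features onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute reports the share of total variance each component retains, which helps judge how much information the two-dimensional view preserves.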
This project explores classification algorithms using the famous Iris flower dataset associated with Ronald A. Fisher. The focus is on demonstrating an understanding of supervised learning and classification algorithms, and on implementing at least one common classification algorithm using the scikit-learn Python library.
- Introduction to Supervised Learning:
  - Definition and principles of supervised learning.
  - Importance of labeled data for training.
- Understanding Supervised Learning:
  - Explanation of classification algorithms in supervised learning.
  - Discussion of the significance of classifying data.
- Exploring the Iris Dataset:
  - Data visualization, statistical analysis, and characteristics of the Iris dataset.
- The Chosen Learning Algorithms:
  - Explanation of k-NN, with a brief section on Decision Trees, and their roles in classification tasks.
  - The principles behind each algorithm.
- Implementation of the Chosen Algorithms Using scikit-learn:
  - Code walkthrough demonstrating the implementation of k-NN and Decision Tree classifiers on the Iris dataset.
  - Evaluation metrics and visualizations to assess each classifier's performance.
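
The implementation step outlined above might look roughly like the following sketch. The 70/30 stratified split, `n_neighbors=5`, and `random_state=42` are illustrative assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Assumed hyperparameters, chosen only for illustration.
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = accuracy_score(y_test, predictions)
    print(f"{name} test accuracy: {scores[name]:.3f}")
```

Accuracy on a single split is only a first check; a confusion matrix or cross-validation gives a fuller picture of each classifier's performance.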
References are hyperlinked throughout the Jupyter notebook.