njen edited this page Feb 1, 2016 · 2 revisions

Welcome to the AppsOfDataAnalysis wiki!

A simple tutorial on applications of K Nearest Neighbors to several datasets.

1 / UCI Car dataset

DATA SET INFORMATION: The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:

 CAR: car acceptability 
 . PRICE: overall price 
 . . buying: buying price 
 . . maint: price of the maintenance 
 . TECH: technical characteristics 
 . . COMFORT: comfort 
 . . . doors: number of doors 
 . . . persons: capacity in terms of persons to carry 
 . . . lug_boot: the size of luggage boot 
 . . safety: estimated safety of the car

Input attributes are printed in lowercase.

ATTRIBUTE INFORMATION: Class Values: 

unacc, acc, good, vgood 

 Attributes: 

 buying: vhigh, high, med, low. 
 maint: vhigh, high, med, low. 
 doors: 2, 3, 4, 5more. 
 persons: 2, 4, more. 
 lug_boot: small, med, big. 
 safety: low, med, high.

EXPLORATORY ANALYSIS The data is clean and contains no missing values. The distribution of each variable is shown below:

We can see that all features are uniformly distributed except the target feature, where ‘unacc’ is dominant. Let’s break each variable down by the target feature to get a more detailed view.

It can be seen that, within each group, the class distribution mostly follows the overall distribution of the target feature. However, all cars with 2 seats or low safety are classified as ‘unacc’. This makes sense in practice, and it may help with further feature engineering on the dataset.
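The per-value breakdown described above can be checked with a small cross-tabulation. This is only a sketch: the function name `class_counts_by_value` is hypothetical, and the five rows are made-up toy records (persons, safety, class) for illustration, not real records from the dataset.

```python
from collections import Counter, defaultdict

def class_counts_by_value(rows, feature_index, class_index=-1):
    """Count class labels separately for each value of one feature."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[feature_index]][row[class_index]] += 1
    return {value: dict(counts) for value, counts in table.items()}

# Toy rows (persons, safety, class); invented for illustration only.
rows = [
    ("2", "low", "unacc"),
    ("2", "high", "unacc"),
    ("4", "high", "acc"),
    ("more", "med", "unacc"),
    ("more", "high", "vgood"),
]

by_persons = class_counts_by_value(rows, 0)
# Feature values whose rows all carry the 'unacc' label.
only_unacc = [v for v, counts in by_persons.items() if set(counts) == {"unacc"}]
print(only_unacc)  # ['2']
```

Running the same function on the real records, once for `persons` and once for `safety`, would confirm or refute the observation about 2-seat and low-safety cars.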

FEATURE EXTRACTION

Since all the features are ordinal, I encoded them as increasing numerical values, as shown below.

Class Values: 

unacc = 0, acc = 1, good = 2, vgood = 3 

 Attributes: 

 buying: vhigh = 3, high = 2, med = 1, low = 0. 
 maint: vhigh = 3, high = 2, med = 1, low = 0. 
 doors: 2 = 0, 3 = 1, 4 = 2, 5more = 3. 
 persons: 2 = 0, 4 = 1, more = 2. 
 lug_boot: small = 0, med = 1, big = 2. 
 safety: low = 0, med = 1, high = 2.
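A minimal sketch of this encoding in Python follows. The mapping values are copied from the lists above; the names `ENCODINGS`, `COLUMNS`, and `encode_row` are hypothetical, and the code assumes each record is a comma-separated line with the six attributes in the listed order followed by the class.

```python
# Ordinal codes copied from the encoding lists; names are illustrative only.
ENCODINGS = {
    "buying":   {"low": 0, "med": 1, "high": 2, "vhigh": 3},
    "maint":    {"low": 0, "med": 1, "high": 2, "vhigh": 3},
    "doors":    {"2": 0, "3": 1, "4": 2, "5more": 3},
    "persons":  {"2": 0, "4": 1, "more": 2},
    "lug_boot": {"small": 0, "med": 1, "big": 2},
    "safety":   {"low": 0, "med": 1, "high": 2},
    "class":    {"unacc": 0, "acc": 1, "good": 2, "vgood": 3},
}
# Assumed column order of one record: six attributes, then the class.
COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

def encode_row(line):
    """Turn one comma-separated record into (feature codes, class code)."""
    values = line.strip().split(",")
    codes = [ENCODINGS[col][val] for col, val in zip(COLUMNS, values)]
    return codes[:-1], codes[-1]

features, label = encode_row("vhigh,vhigh,2,2,small,low,unacc")
print(features, label)  # [3, 3, 0, 0, 0, 0] 0
```

Keeping the codes in a lookup table makes an unexpected category fail loudly with a `KeyError` instead of being silently mis-encoded.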

TRAIN & EVALUATE The figure below shows the overall misclassification rate for different values of K.

It can be observed that k = 7 generally performs best in this interval, with a misclassification rate of 0.05 on the test set. The misclassification rate stabilizes at about 30 percent as K increases. A plausible explanation: since ‘unacc’ is dominant (70%) among the values of the target feature, a large K makes the majority vote favor ‘unacc’ for nearly every car, so the errors fall mostly on the other classes. That gives the upper bound on the misclassification rate (30% = 100% - 70%). This could be tested by plotting the misclassification rate for each class.
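The 30% bound and the per-class check suggested above can be sketched as follows. The function names are hypothetical, and the ten labels are a toy stand-in with the same 70/30 split, not the real data.

```python
from collections import Counter

def majority_baseline_error(labels):
    """Error rate of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 1 - most_common_count / len(labels)

def per_class_error(y_true, y_pred):
    """Fraction of misclassified examples within each true class."""
    totals = Counter(y_true)
    wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {c: wrong[c] / totals[c] for c in totals}

# Toy labels: 70% class 0 ('unacc'), 30% class 1; every prediction is class 0,
# which is what a very large K would be expected to produce.
y_true = [0] * 7 + [1] * 3
y_pred = [0] * 10

print(round(majority_baseline_error(y_true), 2))  # 0.3
print(per_class_error(y_true, y_pred))            # {0: 0.0, 1: 1.0}
```

If the hypothesis holds on the real data, the per-class error for ‘unacc’ should stay near zero at large K while the error for the other classes approaches 1.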

PERFORMANCE The experiment loops 200 times over K and 10 times over shuffled folds, on roughly 1,700 rows. The first implementation, using native Python arrays and iterative matrix operations, runs in 4 hours. The second implementation, using NumPy's linear algebra utilities, runs in about 3 minutes.
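The kind of change that produces such a speedup can be illustrated with the distance computation at the heart of KNN. The sketch below is not the original implementation: the function names are hypothetical, squared Euclidean distance is an assumption, and the random arrays only stand in for encoded rows.

```python
import numpy as np

def pairwise_sq_dists_loop(test, train):
    """Squared Euclidean distances, one element at a time (slow in Python)."""
    out = [[0.0] * len(train) for _ in test]
    for i, a in enumerate(test):
        for j, b in enumerate(train):
            out[i][j] = sum((x - y) ** 2 for x, y in zip(a, b))
    return out

def pairwise_sq_dists_numpy(test, train):
    """Same distances via broadcasting; result shape is (n_test, n_train)."""
    diff = test[:, None, :] - train[None, :, :]
    return (diff ** 2).sum(axis=2)

def knn_predict(test, train, train_labels, k):
    """Majority vote among the k nearest training rows (integer labels)."""
    dists = pairwise_sq_dists_numpy(test, train)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_labels[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Random stand-in data with six features coded 0..3.
rng = np.random.default_rng(0)
train = rng.integers(0, 4, size=(50, 6)).astype(float)
test = rng.integers(0, 4, size=(5, 6)).astype(float)

# Both versions agree; only the vectorized one avoids the Python-level loops.
assert np.allclose(pairwise_sq_dists_loop(test, train),
                   pairwise_sq_dists_numpy(test, train))
```

The broadcast version builds an intermediate array of shape (n_test, n_train, n_features), which is small for a dataset of this size but would need chunking for much larger ones.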

2 / KDD Cup 99 Cyber Attack Classification dataset

3 / Regression predict Metal Ion in Water dataset
