Classification, Decision Tree Classifier, Accuracy, K-Folding, Overfitting, Random Forest Classifier

In this tutorial we will learn about the classification of data, how to use a Decision Tree to predict a new instance's class membership, how accuracy can be measured, what K-Folding is and how it works, and how a Random Forest can help us rely less on a single tree. We will learn all of this by implementing these concepts in Python.

There is a lot of work ahead of us - so let's get started!

Read CSV

What do we need to analyse data? Right - we need data! Here is where I got the data for this tutorial from: https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt. It is a txt file, but by changing the .txt suffix to .csv we can use the Pandas library's read_csv() function. Smart! The data is about the authentication of banknotes. If you like, you can read more about the data here: https://archive.ics.uci.edu/ml/datasets/banknote+authentication# - or you can just go on with this tutorial, because it provides all the necessary information. Hurray!

Attribute Information:

  1. variance of Wavelet Transformed image (continuous)
  2. skewness of Wavelet Transformed image (continuous)
  3. curtosis of Wavelet Transformed image (continuous)
  4. entropy of image (continuous)
  5. class (integer)

In the following code our first step is to load the Pandas library because it is great for building DataFrames and other cool stuff.

Afterwards we use the attribute information to make a list of column names, since the csv file does not contain any. Then, finally, we read the data using our attribute names as column names. As mentioned before, we will learn about the classification of data in this tutorial:
Into which classes could we divide our data if we want to figure out whether a banknote is real or fake? Right! Into the classes "real" and "fake"! In our data these classes are represented by "0" for real and "1" for fake. In other cases it can be useful to divide data into more classes, but for our data the two classes are enough. Since the data is ordered by class (all 0s first, then all 1s), we shuffle it, which means that we bring it into a random order.

In [1]:
#Import Pandas library
import pandas as pd

#List with attribute names (it is optional to do this but it gives a better understanding of the data for a human reader)
attribute_names = ['variance_wavelet_transformed_image', 'skewness_wavelet_transformed_image', 'curtosis_wavelet_transformed_image', 'entropy_image', 'class']

#Read csv-file
data = pd.read_csv('data_banknote_authentication.csv', names=attribute_names)

#Shuffle data
data = data.sample(frac=1)

#Shows the first 5 rows of the data
data.head()
Out[1]:
variance_wavelet_transformed_image skewness_wavelet_transformed_image curtosis_wavelet_transformed_image entropy_image class
1149 0.33325 3.3108 -4.5081 -4.01200 1
351 0.51950 -3.2633 3.0895 -0.98490 0
764 -1.66770 -7.1535 7.8929 0.96765 1
212 2.61400 8.0081 -3.7258 -1.30690 0
306 0.27451 9.2186 -3.2863 -4.84480 0

Understanding the Base Rate and Accuracy of a model

Great! Now we have data that we can use to build a model that helps us predict whether a new bill given to us is fake or real by analysing its feature values.

Before we create the model, let's first have a look at the percentages of our data being real and fake bills. Why this extra work? Let's think about an example! Imagine you have data where 90% of the values belong to class 1 and 10% to class 0. The model we create predicts the right class membership in 80% of the cases. 80% does not sound too bad. However, considering the class distribution of our data, using the model would actually be worse than using no model at all: if we did not use our model and just assumed that every new instance is a member of class 1, we would make accurate predictions in 90% of the cases, while our model only reaches an accuracy of 80%. Those 90% are our Base Rate, which we should try to exceed with our model.

So, let's do the calculations of the percentages and then hope that our model achieves a better performance.

In [2]:
#Get the absolute number of how many instances in our data belong to class zero
count_real = len(data.loc[data['class']==0])
print('Real bills absolute: ' + str(count_real))

#Get the absolute number of how many instances in our data belong to class one
count_fake = len(data.loc[data['class']==1])
print('Fake bills absolute: ' + str(count_fake))

#Get the relative number of how many instances in our data belong to class zero
percentage_real = count_real/(count_fake+count_real)
print('Real bills in percent: ' + str(round(percentage_real,3)))

#Get the relative number of how many instances in our data belong to class one
percentage_fake = count_fake/(count_real+count_fake)
print('Fake bills in percent: ' + str(round(percentage_fake,3)))
Real bills absolute: 762
Fake bills absolute: 610
Real bills in percent: 0.555
Fake bills in percent: 0.445
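
By the way, pandas can also give us these relative frequencies in a single line. This is just a small side note and computes the same numbers as above:

#Relative frequency of each class (0 = real, 1 = fake)
print(data['class'].value_counts(normalize=True))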

Splitting data into Training and Test Data

First, we have to define what our prediction goal is. As we found out before, our aim is to predict whether a bill is real (0) or fake (1). Therefore, we split our data into two parts: the column with the class memberships is our y-variable (the target variable) and the rest of our data forms our x-variables (the attributes/features).

In [3]:
#'class'-column
y_variable = data['class']

#all columns that are not the 'class'-column -> all columns that contain the attributes
x_variables = data.loc[:, data.columns != 'class']

If we tested our model with the same data that we used for building it, the test results would look far better than our model actually is. The reason is that our model is specialized for this data, and new instances will not fit our model as well as the data it was built on.

To avoid wrong estimations of our model's performance we split our data into test and training data in order to get a more accurate result on how good our model actually performs.

In order to split our data into test and training data with minimum effort, we use a function provided by Scikit-learn. Scikit-learn is a free software machine learning library for the Python programming language and we will use it many times in this and the upcoming tutorials.
To verify that the split was done correctly, we then have a look at the shapes of the training and test data sets and can see that it worked perfectly.

In [4]:
#import method from sklearn to split our data into training and test data
from sklearn.model_selection import train_test_split
import numpy as np

#splits into training and test data
x_train, x_test, y_train, y_test = train_test_split(x_variables, y_variable, test_size=0.2)

# shapes of our data splits
print(x_train.shape) 
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(1097, 4)
(275, 4)
(1097,)
(275,)

Decision Tree Classifier

Now that we have our training and test data we can finally create our model. Well, luckily there are libraries that we can use that will do most of the work for us. Nevertheless, it is always good to have a basic understanding of what the algorithm actually does in order to evaluate its outcomes.

Today we will use a Decision Tree Classifier to make our predictions. With this model the data is divided into smaller and smaller subsets, each of them represented by a node or a leaf of the tree (we will see later what such a tree can look like). The instances are split by their feature values, starting with the feature that has the strongest impact on the instances' class membership. Thus, if the data is split on that feature value, the probability of a wrong classification is smaller compared to all the other possible splits. Each node is then divided into smaller subsets until every instance is correctly classified (we will learn later why it can be useful to stop the tree from growing too many nodes). By default, sklearn uses the Gini Impurity as the criterion to split the data. Wikipedia explains Gini Impurity as follows: "Gini Impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset." Therefore, the smaller the Gini Impurity the better. This calculation is done for each variable, and the split with the smallest impurity builds the root. The same is done for every node that follows.

Several other criteria exist and can be used instead of Gini Impurity, e.g. Entropy. In order to use another criterion it has to be specified as a parameter while instantiating the DecisionTreeClassifier object, e.g. criterion="entropy". In this case the algorithm chooses the splits with the lowest entropies and thus the highest Information Gains.

In summary: both measures usually end up with very similar results when used as the criterion in a Decision Tree. When using Entropy the attributes are interpreted as categorical, while with Gini Impurity they are treated as continuous. Entropy can cause slightly higher computational costs since it involves logarithmic calculations.
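
To make these two measures a bit more concrete, here is a small sketch with two hypothetical helper functions (they are not part of sklearn) that compute the Gini Impurity and the Entropy of a list of class labels:

import numpy as np

def gini_impurity(labels):
    #Relative frequency of each class in the subset
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    #Gini Impurity: 1 minus the sum of the squared class probabilities
    return 1 - np.sum(p ** 2)

def entropy(labels):
    #Relative frequency of each class in the subset
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    #Entropy: negative sum of p * log2(p)
    return -np.sum(p * np.log2(p))

#A pure subset has impurity 0, a 50/50 subset has the highest possible impurity
print(gini_impurity([0, 0, 0, 0]))   #0.0
print(gini_impurity([0, 0, 1, 1]))   #0.5
print(entropy([0, 0, 1, 1]))         #1.0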

We will use the default, the Gini Impurity, and - out of curiosity - have a look at each feature's importance calculated by our model.

In [5]:
#import DecisionTreeClassifier from the Sklearn library
from sklearn.tree import DecisionTreeClassifier

#Create a classifier object 
classifier = DecisionTreeClassifier() 

#Classfier builds Decision Tree with training data
classifier = classifier.fit(x_train, y_train) 

#Shows importances of the attributes according to our model 
classifier.feature_importances_
Out[5]:
array([0.59718715, 0.23649959, 0.12485108, 0.04146218])
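
As a small extra step (not strictly necessary), we can pair each importance with its feature name to make the output easier to read:

#Print each feature name together with its importance according to our model
for name, importance in zip(attribute_names[0:4], classifier.feature_importances_):
    print(name + ': ' + str(round(importance, 3)))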

Alright, we built a Decision Tree! Nevertheless, so far we have no idea what it looks like. Therefore, let's create an image of our tree. After running the code, the image can be found in the folder of this notebook.

In [6]:
from sklearn.tree import export_graphviz

#Export as dot file
export_graphviz(classifier, out_file='tree_bills.dot', class_names = ['0','1'], feature_names = attribute_names[0:4])

#Export dot to png (requires the Graphviz 'dot' program to be installed on the system)
from subprocess import check_call
check_call(['dot','-Tpng','tree_bills.dot','-o','tree_bills.png'])
Out[6]:
0
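
As mentioned in the comment above, the png export requires the Graphviz 'dot' program to be installed on your system. If it is not, one possible alternative - a sketch assuming a reasonably recent sklearn version and that matplotlib is available - is to plot the tree directly:

#Plot the tree with matplotlib instead of exporting it via Graphviz
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(classifier, feature_names=attribute_names[0:4], class_names=['0', '1'], filled=True)
plt.savefig('tree_bills_matplotlib.png')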

Testing

Now that we have had a look at this beautiful Decision Tree, let's see if it actually does a good job. We give our model the test data and let it predict whether the instances of the test data are fake or real bills. To do so, we let our classifier predict the y-values from our x_test data set.

In [7]:
#Get predicted values from test data 
y_pred = classifier.predict(x_test) 

In order to find out how good the performance of our Decision Tree was, we compare our tree's predictions with the actual y-values of our test data.

Wow! Look at this accuracy! Remember that our Base Rate was only about 55.5%. What an improvement!

In [8]:
from sklearn.metrics import classification_report, confusion_matrix  

#Create the matrix that shows how often predictions were made correctly and how often they failed.
conf_mat = confusion_matrix(y_test, y_pred)

#The diagonal entries are the correctly predicted instances. Their sum divided by the number of all instances gives us the accuracy.
accuracy = (conf_mat[0,0] + conf_mat[1,1]) /(conf_mat[0,0]+conf_mat[0,1]+ conf_mat[1,0]+conf_mat[1,1])

print('Accuracy: ' + str(round(accuracy,4)))
print('Confusion matrix:')
print(conf_mat)  
print('classification report:')
print(classification_report(y_test, y_pred)) 
Accuracy: 0.9855
Confusion matrix:
[[135   3]
 [  1 136]]
classification report:
             precision    recall  f1-score   support

          0       0.99      0.98      0.99       138
          1       0.98      0.99      0.99       137

avg / total       0.99      0.99      0.99       275
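
By the way, sklearn can also compute the accuracy for us directly. The following small sketch should print the same number as our manual calculation from the confusion matrix:

from sklearn.metrics import accuracy_score

#Accuracy computed directly by sklearn
print('Accuracy: ' + str(round(accuracy_score(y_test, y_pred), 4)))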

K-Folding

Ok, so we reached an amazing score - but what if we were just extremely lucky with our test data? Maybe our test data was by chance extremely similar to our training data. To avoid relying on such luck, we use K-Folding to get a more reliable estimate of how well our model works. What is the idea behind K-Folding? First, the data - as before - is split into different parts. In our case we split it into five parts by passing the parameter n_splits=5 to the KFold constructor. We could then lean back and enjoy that all of the work is done for us, but let's instead have a quick overview of what exactly is done for us: the five splits are used to create different combinations of training and test data sets. For example, the first combination could look like this: training data splits -> 2,3,4,5, test data split -> 1. The second one could look like this: training data splits -> 1,2,3,4, test data split -> 5, and so on.
The function cross_val_score provides us with a list of the accuracies reached with those different sets of training and test data. In order to get a more reliable estimate of how well our model works, we then take the arithmetic mean of the accuracies. Looking at the new accuracy we see that it does not differ much from the accuracy we reached with our single training and test data set. I guess our tree must simply be awesome, or making predictions from our data is simply not so difficult. Or maybe both.

In [9]:
from sklearn.model_selection import KFold, cross_val_score

#k_fold object
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

#scores reached with different splits of training/test data 
k_fold_scores = cross_val_score(classifier, x_variables, y_variable, cv=k_fold, n_jobs=1)

#arithmetic mean of accuracy scores 
mean_accuracy = np.mean(k_fold_scores)

print(round(mean_accuracy, 4))
0.9818
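
If you are curious what cross_val_score does for us behind the scenes, the following sketch performs roughly the same steps by hand, reusing the k_fold object and the data from above:

#Manual cross-validation: fit and score a tree on each training/test split of the KFold object
scores = []
for train_index, test_index in k_fold.split(x_variables):
    x_tr, x_te = x_variables.iloc[train_index], x_variables.iloc[test_index]
    y_tr, y_te = y_variable.iloc[train_index], y_variable.iloc[test_index]
    fold_classifier = DecisionTreeClassifier()
    fold_classifier.fit(x_tr, y_tr)
    scores.append(fold_classifier.score(x_te, y_te))

#Arithmetic mean of the five fold accuracies
print(round(np.mean(scores), 4))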

While our tree makes fantastic predictions, in other cases predictions are usually not this good right away. This is often related to Overfitting. Overfitting means that the model fits the training data perfectly but usually does not achieve good results when predicting new instances' class memberships. The reason is a lack of generalization. So far, we just let the tree grow until every instance of the training data set was correctly classified. We did not give our tree any parameter for the maximum depth to which we allow it to grow, nor for the minimum number of samples per node, nor did we test the model with different criteria. Luckily, there is also a class from Sklearn (GridSearchCV) that will do this for us:
First, we define the parameters and the ranges of values which we want to be tested. The function will then test all possible models with these given parameters. Depending on the amount of data this can take a while and cause high computational costs. Luckily our data is not that big, so we do not have to be extremely patient for the calculations to be done. After everything has been calculated and compared, we have a look at the parameters which reached the highest score. We use these parameters for building a new tree, test it using K-Folding and then have a look at the arithmetic mean of the accuracies of our new tree. Due to the fantastic accuracy that we had already reached with our initial tree, there is only a tiny performance difference. Nevertheless, we can observe a slight improvement.

In [10]:
from sklearn.model_selection import GridSearchCV

#tree parameters which shall be tested
tree_para = {'criterion': ['gini', 'entropy'], 'max_depth': [i for i in range(1, 20)], 'min_samples_split': [i for i in range(2, 20)]}

#GridSearchCV object
grd_clf = GridSearchCV(classifier, tree_para, cv=5)

#creates different trees with all the different parameters out of our data
grd_clf.fit(x_variables, y_variable)

#best parameters that were found
best_parameters = grd_clf.best_params_  
print(best_parameters)  

#new tree object with best parameters
model_with_best_tree_parameters = grd_clf.best_estimator_

#k_fold object
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

#scores reached with different splits of training/test data 
k_fold_scores = cross_val_score(model_with_best_tree_parameters, x_variables, y_variable, cv=k_fold, n_jobs=1)

#arithmetic mean of accuracy scores 
mean_accuracy_best_parameters_tree = np.mean(k_fold_scores)

print(round(mean_accuracy_best_parameters_tree, 4))
{'criterion': 'entropy', 'max_depth': 10, 'min_samples_split': 5}
0.9854
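
If you are interested, the GridSearchCV object also stores the best mean cross-validation score that it found during the search. This is just a small extra look, not needed for the rest of the tutorial:

#Best mean cross-validation score found during the grid search
print(round(grd_clf.best_score_, 4))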

Random Forest

Another possibility to avoid depending too much on the result of a single tree is to use a Random Forest. There is also a class from Sklearn (RandomForestClassifier) that does most of the work for us. It creates different trees out of different training sets and then makes predictions by majority vote; in our case this means that if the majority of trees predicts that a new bill is real, it is considered real, and the other way around.

In [12]:
from sklearn.ensemble import RandomForestClassifier

#RandomForestClassifier object
random_forest_classifier = RandomForestClassifier(n_estimators=10)

#list with accuracies with different test and training sets of Random Forest
accuracies_rand_forest = cross_val_score(random_forest_classifier, x_variables, y_variable, cv=k_fold, n_jobs=1)

#arithmetic mean of the list with the accuracies of the Random Forest
accuracy_rand = np.mean(accuracies_rand_forest)

print('Accuracy Random Forest ' + str(round(accuracy_rand,4)))
print('Old accuracy: ' + str(round(mean_accuracy,4)))
print('Best tree accuracy: ' + str(round(mean_accuracy_best_parameters_tree,4)))
Accuracy Random Forest 0.992
Old accuracy: 0.9818
Best tree accuracy: 0.9854
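
To illustrate the majority vote, here is a small extra sketch (separate from the cross-validation above) that fits the forest on our training data and compares the votes of its individual trees for the first instance of the test data:

#Fit the forest on the training data so that we can inspect its individual trees
random_forest_classifier.fit(x_train, y_train)

#Each tree in the forest casts a vote for the first test instance
first_instance = x_test.iloc[[0]]
votes = [int(tree.predict(first_instance.values)[0]) for tree in random_forest_classifier.estimators_]

print('Votes of the single trees: ' + str(votes))
print('Majority vote of the forest: ' + str(random_forest_classifier.predict(first_instance)[0]))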

I hope you liked this tutorial - soon there will be more. Hurray! One last thing: remember that each time we run this program the results will differ slightly, because the training and test data sets are randomly shuffled at the beginning of this notebook. Therefore, don't be confused by this. Now, have fun building your own Decision Trees!
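
PS: If you prefer reproducible numbers, one option (just a sketch, not part of the notebook above) is to fix the random seeds at the points where randomness is involved:

#Fix the random seeds so that reruns produce the same numbers
data = data.sample(frac=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x_variables, y_variable, test_size=0.2, random_state=0)
classifier = DecisionTreeClassifier(random_state=0)
random_forest_classifier = RandomForestClassifier(n_estimators=10, random_state=0)

With fixed seeds, rerunning the notebook should give exactly the same accuracies every time.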