In this post, I’m going to look at the progressive performance of different tree-based classification methods in R, using the Kaggle Otto Group Product Classification Challenge as an example. This competition challenges participants to correctly classify products into 1 of 9 classes based on 93 features. I’ll start with basic decision trees and move on to ensemble methods: bagging, random forests, and boosting.
Basic Decision Tree – tree package
Let’s start by looking at one of the most basic decision tree learners in R, the tree function in the “tree” package. This is a very simple model to implement and run.
```r
# Clear the environment workspace
rm(list = ls())

# Load data
train <- read.csv("/Users/vabraham24/Documents/RStudio/kaggle_otto/data/train.csv")
test <- read.csv("/Users/vabraham24/Documents/RStudio/kaggle_otto/data/test.csv")
samplesub <- read.csv("/Users/vabraham24/Documents/RStudio/kaggle_otto/data/sampleSubmission.csv")

# Remove the id column so it doesn't get picked up by the classifier
train <- train[, -1]
summary(train)
summary(test)

# Create sample train and test datasets for prototyping new models from the train dataset
strain <- train[sample(nrow(train), 6000, replace = FALSE), ]
stest <- train[sample(nrow(train), 2000, replace = FALSE), ]

# Basic decision tree
# Install and load the tree package
install.packages('tree')
library(tree)

# Set a seed so you get the same results every time you run the model below;
# the number itself does not matter
set.seed(12)

# Create a decision tree model using the target field as the response and all 93 features as inputs (.)
fit1 <- tree(as.factor(target) ~ ., data = strain)

# Plot the decision tree
plot(fit1)
title(main = "tree")
text(fit1)

# Test the tree model on the holdout test dataset
fit1.pred <- predict(fit1, stest, type = "class")

# Create a confusion matrix of predictions vs actuals
table(fit1.pred, stest$target)

# Determine the error rate for the model
fit1$error <- 1 - (sum(fit1.pred == stest$target) / length(stest$target))
fit1$error
```
I was able to achieve an error rate of 0.402 (incorrect classifications) with this model, which seems alright at first glance. But you’ll immediately notice the deficiencies of this model when you look at the actual tree. The tree does not have any terminal nodes for Class_1, Class_3, Class_4, or Class_7, which is a major problem for any further prediction. Not being able to predict nearly half of the possible classes is a sign of under-fitting and a poor model. I think we can safely say this is not something worth using for new predictions.
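One quick way to confirm this (a minimal sketch, assuming the fit1.pred predictions and stest sample from the block above) is to tabulate the predicted classes; any class with a zero count is one the fitted tree can never output:

```r
# Count how often each class is predicted on the holdout sample;
# zero counts flag classes the fitted tree never produces
table(fit1.pred)

# Equivalently, list the classes that never appear among the predictions
setdiff(levels(as.factor(stest$target)), unique(as.character(fit1.pred)))
```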
Basic Decision Tree – rpart package
Let’s try another recursive partitioning tree, the rpart function from the “rpart” package. Now I don’t expect this model to behave much differently from the tree model above, but a lot of modeling is about trying different things and seeing what works.
```r
# Install and load the rpart package
install.packages('rpart')
library(rpart)

# Set a seed so you get the same results every time you run the model below;
# the number itself does not matter
set.seed(13)

# Create a decision tree model using the target field as the response and all 93 features as inputs (.)
fit2 <- rpart(as.factor(target) ~ ., data = train, method = "class")

# Plot the decision tree
par(xpd = TRUE)
plot(fit2, compress = TRUE)
title(main = "rpart")
text(fit2)

# Test the rpart (tree) model on the holdout test dataset
fit2.pred <- predict(fit2, stest, type = "class")

# Create a confusion matrix of predictions vs actuals
table(fit2.pred, stest$target)

# Determine the error rate for the model
fit2$error <- 1 - (sum(fit2.pred == stest$target) / length(stest$target))
fit2$error
```
This model achieved an error rate of 0.374, a slight improvement over the tree model. Looking at the plot of the rpart decision tree, it partitions slightly differently than the tree model, but the terminal nodes are the same story: this model also has no nodes that end in Class_1, Class_3, Class_4, or Class_7. From the last two models I can see that a single recursive-partitioning tree is not ideal when working with lots of features and classification categories. Once a split is made, each branch only ever considers the observations and splits that follow from it, so signal from features that would have mattered elsewhere in the tree never gets recovered. This leads to very simple models that discard lots of useful signal. Anyway, the point of this exercise was to compare the performance of different decision tree models. Basic decision trees have their place in that they are easy to understand, but as seen in their performance above, they can under-fit, lose good signal easily, and have low practical performance.
Bagging – adabag package
Now that we understand the deficiencies plaguing the basic decision tree models, we can start to look at more complex tree-based models that aim to correct those deficiencies by growing multiple trees and aggregating them together to make better predictions. Bagging, or bootstrap aggregating, grows many trees on bootstrap resamples of the training data, makes a prediction with each one, and selects the most commonly occurring response (or class, in our case) for each observation. Averaging over many trees mainly stabilizes the model by reducing the variance of a single decision tree, and the hope is that the aggregated ensemble can also recover some of the classes that a single tree misses.
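Before handing things over to the adabag package, here is a minimal hand-rolled sketch of the idea (assuming the strain/stest samples and the rpart package from earlier; adabag handles all of this, and more, internally):

```r
library(rpart)

set.seed(14)
n_trees <- 25
preds <- matrix(NA_character_, nrow = nrow(stest), ncol = n_trees)

for (b in 1:n_trees) {
  # Draw a bootstrap resample of the training data (same size, with replacement)
  boot <- strain[sample(nrow(strain), replace = TRUE), ]
  tree_b <- rpart(as.factor(target) ~ ., data = boot, method = "class")
  preds[, b] <- as.character(predict(tree_b, stest, type = "class"))
}

# Majority vote across the bootstrapped trees for each observation
vote <- apply(preds, 1, function(p) names(which.max(table(p))))

# Error rate of the hand-rolled ensemble on the holdout sample
mean(vote != stest$target)
```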
```r
# Install and load the adabag package
install.packages('adabag')
library(adabag)

# Set a seed so you get the same results every time you run the model below;
# the number itself does not matter
set.seed(14)

# Begin recording the time it takes to create the model
ptm3 <- proc.time()

# Create a bagging model using the target field as the response and all 93 features as inputs (.)
fit3 <- bagging(target ~ ., data = strain, mfinal = 50)

# Finish timing the model
fit3$time <- proc.time() - ptm3

# Test the bagging model on the holdout test dataset
fit3.pred <- predict(fit3, stest, newmfinal = 50)

# Create a confusion matrix of predictions vs actuals
table(as.factor(fit3.pred$class), stest$target)

# Determine the error rate for the model (predict.bagging reports it directly)
fit3.pred$error
```
Confusion Matrix – bagging
| Predicted \ Actual | Class_1 | Class_2 | Class_3 | Class_4 | Class_5 | Class_6 | Class_7 | Class_8 | Class_9 |
|---|---|---|---|---|---|---|---|---|---|
| Class_2 | 38 | 487 | 248 | 77 | 3 | 45 | 60 | 61 | 70 |
| Class_5 | 0 | 5 | 5 | 0 | 73 | 0 | 1 | 1 | 0 |
| Class_6 | 4 | 4 | 0 | 2 | 0 | 388 | 9 | 19 | 9 |
| Class_8 | 8 | 18 | 24 | 5 | 0 | 19 | 15 | 181 | 22 |
| Class_9 | 9 | 3 | 0 | 0 | 0 | 9 | 4 | 10 | 64 |
Wow, this model has an error rate of 0.741, even worse than the standard decision trees. It did not classify any of the observations as Class_1, Class_3, Class_4, or Class_7, and it leans heavily towards classifying observations as Class_2. My intuitive guess as to why this method performed so poorly is that, because it takes a majority vote across several trees and Class_2 is the most commonly occurring class, the ensemble ended up heavily skewed towards predicting Class_2.
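That guess is easy to check (a quick sketch, assuming the strain sample created earlier): look at how the classes are distributed in the prototyping sample.

```r
# Share of each class in the prototyping sample; a heavily skewed distribution
# makes a vote-based ensemble default to the dominant class
round(prop.table(table(strain$target)), 3)
```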
Random Forest – randomForest package
The random forest model should be an improvement over the bagging model. Random forests also use bootstrap aggregation to make multiple predictions for each observation. The difference compared to bagging is that at each split, only a random subset of the features is considered, and the strongest feature in that subset is chosen to perform the split. When I reviewed the basic decision tree models above (tree and rpart), I noted that one of their weaknesses was losing signal because each split commits the tree to a particular path through the features. With random forests, that deficiency is mitigated by sampling a fresh subset of features at every split, which decorrelates the trees and gives weaker features a chance to contribute.
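To make the connection to bagging concrete, here is a small sketch (assuming the strain/stest samples from earlier; the mtry values are illustrative): setting mtry to all 93 features makes randomForest behave like plain bagging of trees, while a smaller mtry forces each split to choose from a random subset.

```r
library(randomForest)
set.seed(16)

# mtry = 93: every split sees every feature, i.e. essentially bagging of trees
fit_bag_like <- randomForest(as.factor(target) ~ ., data = strain, ntree = 100, mtry = 93)

# mtry = 8: each split only sees a random sample of 8 features
fit_rf <- randomForest(as.factor(target) ~ ., data = strain, ntree = 100, mtry = 8)

# Compare holdout error rates
mean(predict(fit_bag_like, stest) != stest$target)
mean(predict(fit_rf, stest) != stest$target)
```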
```r
# Install and load the randomForest package
install.packages('randomForest')
library(randomForest)

# Set a seed so you get the same results every time you run the model below;
# the number itself does not matter
set.seed(16)

# Use the tuneRF function to determine an ideal value for the mtry parameter
mtry <- tuneRF(strain[, 1:93], strain[, 94], mtryStart = 1, ntreeTry = 50, stepFactor = 2,
               improve = 0.05, trace = TRUE, plot = TRUE, doBest = FALSE)
# The ideal mtry value was found to be 8

# Begin recording the time it takes to create the model
ptm4 <- proc.time()

# Create a random forest model using the target field as the response and all 93 features as inputs (.)
fit4 <- randomForest(as.factor(target) ~ ., data = strain, importance = TRUE, ntree = 100, mtry = 8)

# Finish timing the model
fit4.time <- proc.time() - ptm4

# Create a dotchart of variable/feature importance as measured by the random forest
varImpPlot(fit4)

# Test the randomForest model on the holdout test dataset
fit4.pred <- predict(fit4, stest, type = "response")

# Create a confusion matrix of predictions vs actuals
table(fit4.pred, stest$target)

# Determine the error rate for the model
fit4$error <- 1 - (sum(fit4.pred == stest$target) / length(stest$target))
fit4$error
```
Confusion Matrix – random forest
| Predicted \ Actual | Class_1 | Class_2 | Class_3 | Class_4 | Class_5 | Class_6 | Class_7 | Class_8 | Class_9 |
|---|---|---|---|---|---|---|---|---|---|
| Class_1 | 15 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| Class_2 | 6 | 454 | 154 | 46 | 1 | 8 | 24 | 7 | 8 |
| Class_3 | 0 | 52 | 118 | 14 | 0 | 0 | 4 | 0 | 0 |
| Class_4 | 0 | 1 | 2 | 20 | 0 | 0 | 1 | 0 | 0 |
| Class_5 | 0 | 2 | 0 | 0 | 75 | 0 | 1 | 1 | 0 |
| Class_6 | 8 | 2 | 0 | 3 | 0 | 430 | 11 | 16 | 6 |
| Class_7 | 1 | 1 | 1 | 1 | 0 | 4 | 32 | 0 | 1 |
| Class_8 | 14 | 3 | 2 | 0 | 0 | 12 | 12 | 244 | 13 |
| Class_9 | 15 | 1 | 0 | 0 | 0 | 7 | 3 | 4 | 136 |
The performance of the random forest model turned out to be pretty good. Its error rate is 0.238, the lowest we have scored so far. Looking at the confusion matrix, this model has made predictions for every class, which means we could actually use it on a larger test dataset to make useful predictions.
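The overall error rate hides how each class is doing, so a per-class breakdown is worth a quick look (a minimal sketch, assuming the fit4.pred predictions from the block above):

```r
# Per-class recall: correct predictions for a class divided by how often the
# class actually occurs (rows of the table are predictions, columns are actuals)
cm <- table(fit4.pred, stest$target)
round(diag(cm) / colSums(cm), 2)
```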
Boosting – gbm package
Boosting is the last tree aggregation method I’ll review. It works by creating an initial classification tree and then, on each iteration, giving more weight to the mis-classified observations before growing the next tree. The goal of this process is to reduce the error on the poorly classified observations. There are many papers and websites that explain this much better than I can, so I won’t go into further detail. Let’s run the model and look at the results.
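As a toy illustration of the reweighting idea only (this is AdaBoost/SAMME-style reweighting, not what gbm does internally, since gbm fits trees to gradients of a loss, but the intuition of focusing on badly classified observations is the same; it assumes the strain sample and the rpart package from earlier):

```r
library(rpart)

# Start with equal observation weights
w <- rep(1 / nrow(strain), nrow(strain))

for (m in 1:3) {
  # Fit a tree using the current weights
  fit_m <- rpart(as.factor(target) ~ ., data = strain, weights = w, method = "class")
  wrong <- predict(fit_m, strain, type = "class") != strain$target

  # Weighted error and tree weight (SAMME formula for K = 9 classes)
  err_m <- sum(w[wrong]) / sum(w)
  alpha <- log((1 - err_m) / err_m) + log(9 - 1)

  # Up-weight the misclassified observations so the next tree focuses on them
  w <- w * exp(alpha * wrong)
  w <- w / sum(w)

  cat(sprintf("round %d: weighted error %.3f\n", m, err_m))
}
```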
```r
# Install and load the gbm package
install.packages('gbm')
library(gbm)

# Set a seed so you get the same results every time you run the model below;
# the number itself does not matter
set.seed(17)

# Begin recording the time it takes to create the model
ptm5 <- proc.time()

# Create a boosting model using the target field as the response and all 93 features as inputs (.)
fit5 <- gbm(target ~ ., data = strain, distribution = "multinomial", n.trees = 1000,
            shrinkage = 0.05, interaction.depth = 12, cv.folds = 2)

# Finish timing the model
fit5.time <- proc.time() - ptm5

# Test the boosting model on the holdout test dataset
trees <- gbm.perf(fit5)
fit5.stest <- predict(fit5, stest, n.trees = trees, type = "response")
fit5.stest <- as.data.frame(fit5.stest)
names(fit5.stest) <- c("Class_1", "Class_2", "Class_3", "Class_4", "Class_5",
                       "Class_6", "Class_7", "Class_8", "Class_9")

# For each observation, predict the class with the highest probability
fit5.stest.pred <- rep(NA, 2000)
for (i in 1:nrow(stest)) {
  fit5.stest.pred[i] <- colnames(fit5.stest)[which.max(fit5.stest[i, ])]
}
fit5.pred <- as.factor(fit5.stest.pred)

# Create a confusion matrix of predictions vs actuals
table(fit5.pred, stest$target)

# Determine the error rate for the model
fit5$error <- 1 - (sum(fit5.pred == stest$target) / length(stest$target))
fit5$error
```
Confusion Matrix – boosting
| Predicted \ Actual | Class_1 | Class_2 | Class_3 | Class_4 | Class_5 | Class_6 | Class_7 | Class_8 | Class_9 |
|---|---|---|---|---|---|---|---|---|---|
| Class_1 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 4 |
| Class_2 | 2 | 483 | 119 | 34 | 5 | 8 | 13 | 4 | 5 |
| Class_3 | 2 | 59 | 113 | 10 | 0 | 0 | 10 | 2 | 0 |
| Class_4 | 0 | 9 | 5 | 42 | 0 | 0 | 0 | 0 | 0 |
| Class_5 | 1 | 0 | 0 | 1 | 86 | 0 | 0 | 0 | 1 |
| Class_6 | 8 | 0 | 0 | 1 | 0 | 411 | 9 | 5 | 1 |
| Class_7 | 2 | 4 | 8 | 0 | 0 | 3 | 54 | 1 | 0 |
| Class_8 | 15 | 2 | 0 | 0 | 0 | 8 | 3 | 245 | 10 |
| Class_9 | 4 | 2 | 0 | 1 | 0 | 7 | 2 | 6 | 141 |
The error rate from this model is 0.20, better than all the other models we’ve tried so far, which is impressive. When I first used this model I didn’t have very good results because I didn’t quite understand the parameter settings. After testing various models with alternate parameter values (especially interaction.depth), I was able to find some useful parameter values. You can see some examples of parameter testing in this paper to get an idea of what to do.
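For reference, here is a rough sketch of the kind of sweep involved (the grid values are illustrative, not the exact ones tested for the post; it assumes the strain/stest samples from earlier and will take a while to run with cross-validation):

```r
library(gbm)

# Illustrative grid of interaction depths and learning rates
grid <- expand.grid(depth = c(4, 8, 12), shrinkage = c(0.05, 0.1))
grid$error <- NA

for (i in 1:nrow(grid)) {
  set.seed(17)
  fit <- gbm(target ~ ., data = strain, distribution = "multinomial", n.trees = 500,
             shrinkage = grid$shrinkage[i], interaction.depth = grid$depth[i], cv.folds = 2)

  # Pick the CV-optimal number of trees and score the holdout sample
  best <- gbm.perf(fit, plot.it = FALSE)
  prob <- as.data.frame(predict(fit, stest, n.trees = best, type = "response"))
  names(prob) <- levels(as.factor(strain$target))  # assumes columns come back in class order, as in the block above
  pred <- names(prob)[apply(prob, 1, which.max)]
  grid$error[i] <- mean(pred != as.character(stest$target))
}

# Lowest holdout error first
grid[order(grid$error), ]
```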
Conclusion
I expected that the basic tree models probably wouldn’t perform that well. I had no idea that the bagging ensemble model would perform so poorly; I’m guessing that with some tweaking of the parameters the horrendous error rate I achieved could be reduced. The clear winner from the data above is the boosting model.
There are many factors, such as data size and type, which have a lot of bearing on model performance, and it’s not always the case that boosting performs this well relative to other ensemble decision tree classifiers. I also only tested the models on a small subset of the training data; they could behave completely differently once more data is added. The packages I used are some of the most popular for bagging, random forests, and boosting, but there are several others out there that are variations or combinations of what we used above and will attain slightly different results. Special thanks to WillemM, who provided some insight on the gbm parameters; I was able to make some simple tweaks that really improved the performance afterwards. If you’re interested in this competition, there are still 3 weeks left and I would definitely encourage you to get involved. All of the above code is combined into a single script on my GitHub for reference. Please note that the code in this post is not in the competition submission format; there’s a separate script for the random forest model on my GitHub which produces an output that can be submitted.
** Update as of 4/29/15
I’ve made some changes to the post to help clean things up a bit.
- Added more comments to the code snippets.
- Clarified the axis labels for the confusion matrices.
- Modified the boosting model based upon comments provided to the original post.
- Made general post edits corresponding to the changes made.