R – Decision Tree Example
Let us now examine this concept with the help of an example, which in this case is the most widely used “readingSkills” dataset by visualizing a decision tree for it and examining its accuracy.
Installing the required libraries
R
install.packages ( 'datasets' ) install.packages ( 'caTools' ) install.packages ( 'party' ) install.packages ( 'dplyr' ) install.packages ( 'magrittr' ) |
Import required libraries and Load the dataset readingSkills and execute head(readingSkills)
R
library (datasets) library (caTools) library (party) library (dplyr) library (magrittr) data ( "readingSkills" ) head (readingSkills) |
Output:
As you can see clearly there 4 columns nativeSpeaker, age, shoeSize, and score. Thus basically we are going to find out whether a person is a native speaker or not using the other criteria and see the accuracy of the decision tree model developed in doing so.
Splitting dataset into 4:1 ratio for train and test data
R
sample_data = sample.split (readingSkills, SplitRatio = 0.8) train_data <- subset (readingSkills, sample_data == TRUE ) test_data <- subset (readingSkills, sample_data == FALSE ) |
Separating data into training and testing sets is an important part of evaluating data mining models. Hence it is separated into training and testing sets. After a model has been processed by using the training set, you test the model by making predictions against the test set. Because the data in the testing set already contains known values for the attribute that you want to predict, it is easy to determine whether the model’s guesses are correct.
Create the decision tree model using ctree and plot the model
R
model<- ctree (nativeSpeaker ~ ., train_data) plot (model) |
The basic syntax for creating a decision tree in R is:
ctree(formula, data)
where, formula describes the predictor and response variables and data is the data set used. In this case, nativeSpeaker is the response variable and the other predictor variables are represented by, hence when we plot the model we get the following output.
Output:
From the tree, it is clear that those who have a score less than or equal to 31.08 and whose age is less than or equal to 6 are not native speakers and for those whose score is greater than 31.086 under the same criteria, they are found to be native speakers.
Making a prediction
R
# testing the people who are native speakers # and those who are not predict_model<- predict (ctree_, test_data) # creates a table to count how many are classified # as native speakers and how many are not m_at <- table (test_data$nativeSpeaker, predict_model) m_at |
Output
The model has correctly predicted 13 people to be non-native speakers but classified an additional 13 to be non-native, and the model by analogy has misclassified none of the passengers to be native speakers when actually they are not.
Determining the accuracy of the model developed
R
ac_Test < - sum ( diag (table_mat)) / sum (table_mat) print ( paste ( 'Accuracy for test is found to be' , ac_Test)) |
Output:
Here the accuracy-test from the confusion matrix is calculated and is found to be 0.74. Hence this model is found to predict with an accuracy of 74 %.
Decision Tree in R Programming
Decision Trees are useful supervised Machine learning algorithms that have the ability to perform both regression and classification tasks. It is characterized by nodes and branches, where the tests on each attribute are represented at the nodes, the outcome of this procedure is represented at the branches and the class labels are represented at the leaf nodes. Hence it uses a tree-like model based on various decisions that are used to compute their probable outcomes. These types of tree-based algorithms are one of the most widely used algorithms due to the fact that these algorithms are easy to interpret and use. Apart from this, the predictive models developed by this algorithm are found to have good stability and a descent accuracy due to which they are very popular
Contact Us