Decision tree

date
Aug 17, 2023
slug
ML_1
author
status
Public
tags
ML
summary
type
Post
thumbnail
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f7438445-af8f-44d9-aa83-f162acd3ff41/Untitled.png
updatedAt
Aug 20, 2023 09:56 PM
A decision tree is an interpretable machine learning algorithm that breaks a complex decision-making process down into a series of simpler decisions, leading to a final outcome. In this article, we explain how the CART algorithm builds a decision tree and walk through a quick example of building one with scikit-learn. We also discuss the pros and cons of decision trees at the end of the article.

CART Algorithm

CART (Classification and Regression Trees) is a variation of the decision tree algorithm that can handle both classification and regression tasks. It was introduced by Breiman et al. in 1984 [1]. Here's how the CART algorithm works:

CART Tree Building

  1. Data Splitting: CART starts with the entire dataset and selects a single feature and a threshold value that split the data into two subsets. The goal is to find the feature and threshold that best separate the data points with respect to the target variable.
  2. Recursive Process: CART repeats this search recursively in each subset, scoring candidate splits with Gini impurity (for classification) or mean squared error (for regression) and choosing the split that minimizes the impurity or error of the resulting subsets. Splitting stops when a criterion is met, such as a maximum tree depth, a minimum number of data points in a node, or the impurity/error falling below a threshold (see the sketch after this list).
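
To make the splitting criterion concrete, below is a minimal sketch of how a single split could be scored with Gini impurity. The helper names (gini, best_split) and the toy data are illustrative assumptions, not part of scikit-learn or the CART paper.

import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    """Find the threshold on one feature that minimizes the weighted Gini impurity of the two resulting subsets."""
    best_t, best_score = None, float('inf')
    for t in np.unique(feature):
        left, right = labels[feature <= t], labels[feature > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# toy example: one feature, two classes that separate cleanly at 3.0
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # best threshold 3.0 with weighted impurity 0.0

A full CART implementation repeats this search over every feature, picks the overall best split, and then recurses into the two resulting subsets until a stopping criterion is reached.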

Exercise

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the iris data set
X, y = load_iris(return_X_y=True)

# split into train and validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# build a tree
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# check the score
print('train data acc : ', accuracy_score(y_train, clf.predict(X_train)))
print('val data acc : ', accuracy_score(y_test, clf.predict(X_test)))
train data acc :  1.0
val data acc :  0.9555555555555556
# visualize the rules
import matplotlib.pyplot as plt
plt.figure(figsize=(60,60))  # set plot size (denoted in inches)
tree.plot_tree(clf, fontsize=60)
plt.show()
[Figure: plot of the fitted decision tree produced by plot_tree]
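
If the full plot is too large to read comfortably, the same rules can also be printed as indented text with scikit-learn's export_text. This is a small sketch that reuses the clf fitted above; the simplified feature names are just labels supplied here for readability.

from sklearn.tree import export_text

feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']
print(export_text(clf, feature_names=feature_names))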

Conclusion

Decision trees are advantageous for their interpretability, which makes them suitable when transparency is essential. They are also robust to outliers, since splits are based on thresholds rather than on the magnitude of the values. Moreover, decision trees are insensitive to feature scaling [3], which makes them convenient to apply. However, they are prone to overfitting, and the greedy search over candidate splits can make training computationally expensive.

Pros:

  • Interpretability: Easy to understand and explain, suitable for transparent models.
  • Robust to Outliers: Less influenced by outliers due to binary splitting.

Cons:

  • Overfitting: Prone to overfitting, especially with deep trees; constraining tree growth helps, as sketched below.
  • Computational Cost: The greedy split search can become expensive as the number of samples and features grows.
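
As one way to mitigate the overfitting noted above, the tree's growth can be constrained through its hyperparameters. This is a minimal sketch that reuses X_train, X_test, y_train, and y_test from the exercise; max_depth and min_samples_leaf are standard DecisionTreeClassifier parameters, and the chosen values are only illustrative.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# limit depth and require a minimum number of samples per leaf
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
pruned = pruned.fit(X_train, y_train)

print('train data acc : ', accuracy_score(y_train, pruned.predict(X_train)))
print('val data acc : ', accuracy_score(y_test, pruned.predict(X_test)))

scikit-learn also supports cost-complexity pruning of a grown tree through the ccp_alpha parameter.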

Reference

  • [1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.