Decision tree

date
Aug 17, 2023
slug
ML_1
author
status
Public
tags
ML
summary
type
Post
thumbnail
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f7438445-af8f-44d9-aa83-f162acd3ff41/Untitled.png
updatedAt
Aug 20, 2023 09:56 PM
A decision tree is an interpretable machine learning algorithm that breaks a complex decision-making process down into a series of simpler decisions, leading to a final outcome. In this article, we explain how the CART algorithm builds a decision tree and walk through a quick example of building one with scikit-learn. We also discuss the pros and cons of decision trees at the end of the article.

CART Algorithm

CART (Classification and Regression Trees) is a variation of the decision tree algorithm that can handle both classification and regression tasks. It was introduced by Breiman et al. in 1984 [1]. Here's how the CART algorithm works:

CART Tree Building

  1. Data Splitting: CART starts with the entire dataset and selects a single feature and a threshold value that split the data into two subsets. The goal is to find the feature and threshold that best separate the data points with respect to the target variable.
  2. Recursive Process: CART repeats this search recursively in each subset, scoring candidate splits with Gini impurity (for classification) or mean squared error (for regression) and choosing the split that minimizes the impurity or error of the resulting subsets. Splitting stops when a criterion is met, such as a maximum tree depth, a minimum number of data points in a node, or the impurity/error falling below a threshold (see the sketch after this list).
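
To make the splitting criterion concrete, below is a minimal sketch of how a single split could be scored with Gini impurity. The helper names (gini, best_split) and the toy data are illustrative assumptions, not part of scikit-learn or the CART paper.

import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    """Find the threshold on one feature that minimizes the weighted Gini impurity of the two resulting subsets."""
    best_t, best_score = None, float('inf')
    for t in np.unique(feature):
        left, right = labels[feature <= t], labels[feature > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# toy example: one feature, two classes that separate cleanly at 3.0
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # best threshold 3.0 with weighted impurity 0.0

A full CART implementation repeats this search over every feature, picks the overall best split, and then recurses into the two resulting subsets until a stopping criterion is reached.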

Exercise

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the iris data set
X, y = load_iris(return_X_y=True)

# split into train and validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# build a tree
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# check the score
print('train data acc : ', accuracy_score(y_train, clf.predict(X_train)))
print('val data acc : ', accuracy_score(y_test, clf.predict(X_test)))
train data acc :  1.0
val data acc :  0.9555555555555556
# visualize the rules
import matplotlib.pyplot as plt
plt.figure(figsize=(60,60))  # set plot size (denoted in inches)
tree.plot_tree(clf, fontsize=60)
plt.show()
[Figure: plot of the fitted decision tree produced by plot_tree]
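
If the full plot is too large to read comfortably, the same rules can also be printed as indented text with scikit-learn's export_text. This is a small sketch that reuses the clf fitted above; the simplified feature names are just labels supplied here for readability.

from sklearn.tree import export_text

feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']
print(export_text(clf, feature_names=feature_names))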

Conclusion

Decision trees are advantageous for their interpretability, which makes them suitable when transparency is essential. They are also robust to outliers, since splits are based on thresholds rather than on the magnitude of the values. Moreover, decision trees are insensitive to feature scaling [3], which makes them convenient to apply. However, they are prone to overfitting, and the greedy search over candidate splits can make training computationally expensive.

Pros:

  • Interpretability: Easy to understand and explain, suitable for transparent models.
  • Robust to Outliers: Less influenced by outliers due to binary splitting.

Cons:

  • Overfitting: Prone to overfitting, especially with deep trees; constraining tree growth helps, as sketched below.
  • Computational Cost: The greedy split search can become expensive as the number of samples and features grows.
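
As one way to mitigate the overfitting noted above, the tree's growth can be constrained through its hyperparameters. This is a minimal sketch that reuses X_train, X_test, y_train, and y_test from the exercise; max_depth and min_samples_leaf are standard DecisionTreeClassifier parameters, and the chosen values are only illustrative.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# limit depth and require a minimum number of samples per leaf
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
pruned = pruned.fit(X_train, y_train)

print('train data acc : ', accuracy_score(y_train, pruned.predict(X_train)))
print('val data acc : ', accuracy_score(y_test, pruned.predict(X_test)))

scikit-learn also supports cost-complexity pruning of a grown tree through the ccp_alpha parameter.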

Reference

  • [1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.