We first use classification trees to analyze the Carseats data set. In these data, Sales is a continuous variable, and so we begin by recoding it as a binary variable.
The data come from an organization that sells car seats; each row represents a region. For each region, the columns are:
Sales: unit sales in thousands
CompPrice: price charged by competitor at each location
Income: community income level in 1000s of dollars
Advertising: local ad budget at each location in 1000s of dollars
Population: regional population in thousands
Price: price for car seats at each site
ShelveLoc: Bad, Good or Medium indicates quality of shelving location
Age: age level of the population
Education: education level at each location
Urban: Yes/No
US: Yes/No
The company wants to set up business in a new region; given that region's information, we want to predict what its sales would be. To build the model, we use these data as the training set.
We create a variable, called High, which takes on a value of Yes if the Sales variable exceeds \(8\), and takes on a value of No otherwise.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Carseats = pd.read_csv('/content/drive/MyDrive/Lab/Data/Carseats.csv')
print(Carseats.shape)
# Check for missing values
assert Carseats.isnull().sum().sum() == 0
# Create binary variable High: 1 if Sales > 8
Carseats['High'] = (Carseats['Sales'] > 8).astype(np.float64)
Carseats.head()
The best found split may vary across different runs, even if max_features=n_features. That is the case if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
One of the most attractive properties of trees is that they can be graphically displayed. We use the plot_tree() function to display the tree structure, and the export_text function to display the node labels. The argument feature_names instructs Python to include the feature names for any predictors, rather than simply displaying a number.
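The fitting and plotting cell is not shown here; the following is a minimal sketch, assuming X holds the dummy-encoded predictors and y = Carseats['High'] (as in the setup cell at the end of this chapter), with the tree settings chosen as plausible assumptions rather than the exact ones used:
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
import matplotlib.pyplot as plt
# Fit a classification tree (settings are assumptions) and display it
clf_tree = DecisionTreeClassifier(max_leaf_nodes=30, criterion='entropy', random_state=1)
clf_tree.fit(X, y)
plt.figure(figsize=(20, 10))
plot_tree(clf_tree, feature_names=X.columns, filled=True)
plt.show()
# Text version with node labels
print(export_text(clf_tree, feature_names=list(X.columns)))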
The most important indicator of Sales appears to be shelving location, since the first branch differentiates Good locations from Bad and Medium locations.
We see that the training error rate is \(6.25\%\) when we set the max_leaf_nodes to 30. In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error. We split the observations into a training set and a test set, build the tree using the training set, and evaluate its performance on the test data. The predict() function can be used for this purpose. This approach leads to correct predictions for around \(73\,\%\) of the locations in the test data set.
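The corresponding cell is not shown; here is a minimal sketch, assuming a 50/50 split and the same 30-leaf tree (these choices are assumptions, not necessarily the original ones):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Split the observations, build the tree on the training set, evaluate on the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=1)
clf_tree = DecisionTreeClassifier(max_leaf_nodes=30, criterion='entropy', random_state=1)
clf_tree.fit(X_train, y_train)
y_pred = clf_tree.predict(X_test)
print(accuracy_score(y_test, y_pred))  # about 0.73, as noted above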
Again, if you re-run the fit() function you might get slightly different results, due to "ties": for instance, this can happen when the training observations corresponding to a terminal node are evenly split between Yes and No response values. So remember to set the random_state argument.
Next, we consider whether pruning the tree might lead to improved results. We use cross-validation to determine the optimal level of tree complexity, measured here by the number of terminal nodes (max_leaf_nodes).
tree_sizes = range(2, 50)
k = 5
kf5 = KFold(n_splits=k, shuffle=True, random_state=1)
ACC_scores = []
AUC_scores = []
# 5-fold CV for each tree size
for size in tree_sizes:
    clf_tree = DecisionTreeClassifier(max_leaf_nodes=size, random_state=1, criterion='entropy')
    ACC_s = cross_val_score(clf_tree, X, y, cv=kf5, scoring='accuracy')
    AUC_s = cross_val_score(clf_tree, X, y, cv=kf5, scoring='roc_auc')
    ACC_scores.append(np.mean(ACC_s))
    AUC_scores.append(np.mean(AUC_s))
# Plot CV accuracy and AUC scores w.r.t. tree size
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
fig.subplots_adjust(wspace=0.5)
ax1.plot(tree_sizes, ACC_scores)
ax1.set_title('5-CV ACC')
ax1.set_xlabel('Tree Size')
ax1.set_ylabel('Accuracy')
ax1.grid(True)
ax2.plot(tree_sizes, AUC_scores)
ax2.set_title('5-CV AUC scores')
ax2.set_xlabel('Tree Size')
ax2.set_ylabel('AUC scores')
ax2.grid(True)
np.array(ACC_scores).argmax()
16
How well does this pruned tree perform on the test data set? Once again, we apply the predict() function.
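The evaluation cell is not shown; a sketch using the best size found by cross-validation above (index 16 of tree_sizes, i.e. 18 terminal nodes; variable names are assumptions):
# Refit with the CV-selected number of leaves and evaluate on the test set
best_size = tree_sizes[np.array(ACC_scores).argmax()]  # 18
pruned_tree = DecisionTreeClassifier(max_leaf_nodes=best_size, criterion='entropy', random_state=1)
pruned_tree.fit(X_train, y_train)
y_pred = pruned_tree.predict(X_test)
print(accuracy_score(y_test, y_pred))  # about 0.745, as noted below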
Now \(74.5\,\%\) of the test observations are correctly classified, so not only has the pruning process produced a more interpretable tree, but it has also slightly improved the classification accuracy.
# Plot features by importance in this model
plot_df = pd.DataFrame({'feature': X.columns, 'importance': clf_tree.feature_importances_})
plt.figure(figsize=(10, 10))
sns.barplot(x='importance', y='feature', data=plot_df.sort_values('importance', ascending=False), color='b')
plt.xticks(rotation=90);
Cost complexity pruning can also be used to select a sequence of trees for consideration.
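The cell that computes ccp_alphas and impurities, used by the plot below, is not shown; it presumably calls cost_complexity_pruning_path, roughly as follows:
# Compute the effective alphas and the corresponding total leaf impurities on the training set
clf = DecisionTreeClassifier(random_state=1, criterion='entropy')
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities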
fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
Text(0.5, 1.0, 'Total Impurity vs effective alpha for training set')
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, criterion='entropy', ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.0974240009819235
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1)
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
Other parameters, such as min_samples_leaf and max_depth, can also be used to prevent a tree from overfitting.
8.3 Fitting Regression Trees
Here we fit a regression tree to the Boston data set. First, we create a training set, and fit the tree to the training data.
# Load 'Boston' data
Boston = pd.read_csv('/content/drive/MyDrive/Lab/Data/Boston.csv')
Boston = Boston.drop('Unnamed: 0', axis=1)
Boston.info()  # all numeric, no nulls
Boston.head()
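The cell that creates the training set and fits the tree is not shown; here is a minimal sketch (the response column name medv, the 50/50 split, and the 7-leaf tree are assumptions):
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
X = Boston.drop('medv', axis=1)
y = Boston['medv']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=1)
# Fit a small regression tree to the training data and display it
reg_tree = DecisionTreeRegressor(max_leaf_nodes=7, random_state=1)
reg_tree.fit(X_train, y_train)
plt.figure(figsize=(15, 8))
plot_tree(reg_tree, feature_names=X.columns, filled=True)
plt.show()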
The variable lstat measures the percentage of individuals with lower socioeconomic status, while the variable rm corresponds to the average number of rooms. The tree indicates that larger values of rm, or lower values of lstat, correspond to more expensive houses.
Now we use cross-validation to see whether pruning the tree improves performance.
tree_sizes = range(2, 50)
k = 5
kf5 = KFold(n_splits=k, shuffle=True, random_state=1)
MSE_scores = []
# 5-fold CV for each tree size
for size in tree_sizes:
    reg_tree = DecisionTreeRegressor(max_leaf_nodes=size, random_state=1)
    MSE_s = -cross_val_score(reg_tree, X, y, cv=kf5, scoring='neg_mean_squared_error')
    MSE_scores.append(np.mean(MSE_s))
# Plot CV-MSE as tree size grows
plt.plot(tree_sizes, MSE_scores)
plt.title('5-CV MSE')
plt.xlabel('Tree Size')
plt.ylabel('MSE')
plt.grid(True)
print("best tree size = ", tree_sizes[np.argmin(MSE_scores)])  # np.argmin(MSE_scores) = 5
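The cell that evaluates the pruned regression tree on the test set is not shown; a minimal sketch, using the best tree size of 7 found above (variable names are assumptions):
# Refit with the CV-selected number of terminal nodes and compute the test MSE
best_size = tree_sizes[np.argmin(MSE_scores)]  # 7 terminal nodes
pruned_reg_tree = DecisionTreeRegressor(max_leaf_nodes=best_size, random_state=1)
pruned_reg_tree.fit(X_train, y_train)
pred = pruned_reg_tree.predict(X_test)
mse = mean_squared_error(y_test, pred)
print(mse, np.sqrt(mse))  # roughly 17.92 and 4.234, as discussed below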
In other words, the test set MSE associated with the regression tree is \(17.92\). The square root of the MSE is therefore around \(4.234\), indicating that this model leads to test predictions that are (on average) within approximately \(\$4{,}234\) of the true median home value for the census tract.
8.4 Bagging and Random Forests
Here we apply bagging and random forests to the Boston data. Recall that bagging is simply a special case of a random forest with \(m = p\). Therefore, the RandomForestRegressor() function can be used to perform both random forests and bagging. Let’s start with bagging:
# Bagging: using all features
max_features = X.shape[1]
tree_count = 100
regr1 = RandomForestRegressor(max_features=max_features, random_state=2, n_estimators=tree_count)
regr1.fit(X_train, y_train)
The argument max_features=12 indicates that all 12 predictors should be considered for each split of the tree – in other words, that bagging should be done. How well does this bagged model perform on the test set?
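The evaluation cell is not shown; a sketch mirroring the random forest evaluation below:
# Test-set MSE for the bagged model
pred = regr1.predict(X_test)
mse = mean_squared_error(y_test, pred)
print(mse, np.sqrt(mse))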
We can grow a random forest in exactly the same way, except that we’ll use a smaller value of the max_features argument. Here we’ll use max_features = 6:
# Random forests: using 6 features
tree_count = 100
regr2 = RandomForestRegressor(max_features=6, random_state=2, n_estimators=tree_count)
regr2.fit(X_train, y_train)
pred = regr2.predict(X_test)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
print(mse, rmse)
11.210173703557313 3.3481597488108767
The test set MSE is even lower; this indicates that random forests yielded an improvement over bagging in this case.
Using the feature_importances_ attribute of the RandomForestRegressor, we can view the importance of each variable:
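The plotting cell is not shown; a sketch mirroring the earlier importance barplot (regr2 is the random forest fitted above):
import seaborn as sns
import matplotlib.pyplot as plt
# Variable importance from the random forest
plot_df = pd.DataFrame({'feature': X.columns, 'importance': regr2.feature_importances_})
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=plot_df.sort_values('importance', ascending=False), color='b')
plt.show()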
The results indicate that across all of the trees considered in the random forest, the wealth of the community (lstat) and the house size (rm) are by far the two most important variables.
8.5 Boosting
Here we use the GradientBoostingRegressor() function to fit boosted regression trees to the Boston data set.
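The fitting cell is not shown; a minimal sketch (the parameter values are assumptions, not the exact settings used):
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Boosted regression trees on the Boston training data
boost_reg = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01, max_depth=4, random_state=1)
boost_reg.fit(X_train, y_train)
pred = boost_reg.predict(X_test)
print(mean_squared_error(y_test, pred))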
We see that lstat and rm are by far the most important variables. We can also produce partial dependence plots for these two variables. These plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables. In this case, as we might expect, median house prices are increasing with rm and decreasing with lstat.
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function `plot_partial_dependence` is deprecated in 1.0 and will be removed in 1.2. Use PartialDependenceDisplay.from_estimator instead.
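As the warning above indicates, plot_partial_dependence has been replaced in newer scikit-learn versions; a sketch using the replacement API (boost_reg is the hypothetical model name from the sketch above):
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt
# Partial dependence of median house value on rm and lstat
PartialDependenceDisplay.from_estimator(boost_reg, X_train, features=['rm', 'lstat'])
plt.show()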
Check https://scikit-learn.org/stable/modules/ensemble.html#adaboost for more detail
8.6 Bayesian Additive Regression Trees
In this section we use the bartpy package, and within it the SklearnModel() function, to fit a Bayesian additive regression tree model to the Boston housing data set. On Colab, the following code may take 5-10 minutes to run with the default parameter values.
model = SklearnModel()  # Use default parameters
model.fit(X_train, y_train)  # Fit the model
pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
print(mse, rmse)
booster='gbtree': The booster is the base learner. It’s the machine learning model that is constructed during every round of boosting. You may have guessed that ‘gbtree’ stands for gradient boosted tree, the XGBoost default base learner. It’s uncommon but possible to work with other base learners, such as gblinear and dart.
objective='multi:softprob': Standard options for the objective can be viewed in the XGBoost official documentation, https://xgboost.readthedocs.io/en/latest/parameter.html, under Learning Task Parameters. The multi:softprob objective is a standard alternative to binary:logistic when the dataset includes multiple classes. It computes the probabilities of classification and chooses the highest one. If not explicitly stated, XGBoost will often find the right objective for you.
max_depth=6: The max_depth of a tree determines how deep each tree can grow, i.e. how many levels of branches it has. It’s one of the most important hyperparameters in making balanced predictions. XGBoost uses a default of 6, unlike random forests in scikit-learn, which place no limit on depth unless explicitly specified.
learning_rate=0.1: Within XGBoost, this hyperparameter is often referred to as eta. This hyperparameter limits the variance by reducing the weight of each tree to the given percentage.
n_estimators=100: Popular among ensemble methods, n_estimators is the number of boosted trees in the model. Increasing this number while decreasing learning_rate can lead to more robust results.
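The cell that fit the model and produced the score discussed below is not shown. As a rough, hedged reconstruction, a cross-validated RMSE baseline might be computed as follows (assumptions: scikit-learn's diabetes data, whose 442 rows match the summary below, and an XGBRegressor with default settings):
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
import numpy as np
# Cross-validated RMSE as a baseline score (sketch, not the original cell)
X, y = load_diabetes(return_X_y=True)
model = XGBRegressor(booster='gbtree', objective='reg:squarederror', random_state=42)
rmse = -cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
print('RMSE mean: %0.3f' % rmse.mean())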
Without a baseline of comparison, we have no idea what that score means. Converting the target column, y, into a pandas DataFrame and calling the .describe() method gives the quartiles and general statistics of the target, as follows:
pd.DataFrame(y).describe()
                0
count  442.000000
mean   152.133484
std     77.093005
min     25.000000
25%     87.000000
50%    140.500000
75%    211.500000
max    346.000000
A score of 63.124 is less than one standard deviation of the target (77.09), a respectable result.
8.7.3 Speed comparison
Let’s now compare the speed of GradientBoostingClassifier and XGBClassifier on the exoplanet dataset.
# Split data into X and y
X = df.iloc[:, 1:]
y = df.iloc[:, 0]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
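The timing cell for GradientBoostingClassifier is not shown; a sketch mirroring the XGBoost cell below:
import time
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
start = time.time()
gb_clf = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=42)
gb_clf.fit(X_train, y_train)
y_pred = gb_clf.predict(X_test)
print('Score: ' + str(accuracy_score(y_pred, y_test)))
end = time.time()
print('Run Time: ' + str(end - start) + ' seconds')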
Score: 0.9874213836477987
Run Time: 318.0388216972351 seconds
start = time.time()
# Instantiate the XGBClassifier, xg_reg
xg_reg = XGBClassifier(n_estimators=100, max_depth=2, random_state=42)
# Class labels have to start from 0 (required since version 1.3.2)
le = LabelEncoder()  # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
y_train = le.fit_transform(y_train)
y_test = le.fit_transform(y_test)
# Fit xg_reg to training set
xg_reg.fit(X_train, y_train)
# Predict labels of test set, y_pred
y_pred = xg_reg.predict(X_test)
score = accuracy_score(y_pred, y_test)
print('Score: ' + str(score))
end = time.time()
elapsed = end - start
print('Run Time: ' + str(elapsed) + ' seconds')
Score: 0.9913522012578616
Run Time: 71.92710447311401 seconds
When it comes to big data, an algorithm that runs four to five times as fast can save weeks or months of computational time and resources. This advantage is huge in the world of big data. In the world of boosting, XGBoost is the model of choice due to its unparalleled speed and impressive accuracy.
8.7.4 Hyperparameters
XGBoost has many hyperparameters. Its base learner hyperparameters include all the decision tree hyperparameters as a starting point, and there are also gradient boosting hyperparameters, since XGBoost is an enhanced version of gradient boosting.
# Split data into X and y
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
Before tuning hyperparameters, let’s build a classifier so that we can obtain a baseline score as a starting point.
When fine-tuning hyperparameters, GridSearchCV and RandomizedSearchCV are the standard options. However, cross_val_score and GridSearchCV/RandomizedSearchCV do not split data the same way. One solution is to use StratifiedKFold whenever cross-validation is used.
A stratified fold includes the same percentage of target values in each fold. If a dataset contains 60% 1s and 40% 0s in the target column, each stratified test set contains 60% 1s and 40% 0s. When folds are random, it’s possible that one test set contains a 70-30 split while another contains a 50-50 split of target values.
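The kfold object used in the next cell is presumably defined along these lines:
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)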
# The 'binary:logistic' objective is standard for binary classification in determining the loss function
model = XGBClassifier(booster='gbtree', objective='binary:logistic', random_state=42)
# Obtain scores of cross-validation
scores = cross_val_score(model, X, y, cv=kfold)
# Display accuracy
print('Accuracy:', np.round(scores, 2))
# Display mean accuracy
print('Accuracy mean: %0.2f' % (scores.mean()))
The point here is to use the same folds to obtain new scores when fine-tuning hyperparameters with GridSearchCV and RandomizedSearchCV so that the comparison of scores is fair.
def grid_search(params, random=False):
    xgb = XGBClassifier(booster='gbtree', objective='binary:logistic', random_state=42)
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    if random:
        grid = RandomizedSearchCV(xgb, params, cv=kfold, n_iter=20, n_jobs=-1, random_state=42)
    else:
        # Instantiate GridSearchCV as grid
        grid = GridSearchCV(xgb, params, cv=kfold, n_jobs=-1)
    # Fit grid on X and y
    grid.fit(X, y)
    # Extract and print best params
    best_params = grid.best_params_
    print("Best params:", best_params)
    # Compute and print best score
    best_score = grid.best_score_
    print("Best score: {:.5f}".format(best_score))
The XGBoost hyperparameters presented here are not meant to be exhaustive, but they are meant to be comprehensive. For a complete list of hyperparameters, read the official documentation, XGBoost Parameters, at https://xgboost.readthedocs.io/en/latest/parameter.html.
8.7.4.1 learning_rate
learning_rate shrinks the weights of trees for each round of boosting. By lowering learning_rate, more trees are required to produce better scores. Lowering learning_rate prevents overfitting because the size of the weights carried forward is smaller.
A default value of 0.3 is used. Here is a starting range for learning_rate as placed inside our grid_search function:
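The grid-search call itself is not shown; a plausible version (the candidate values are assumptions, chosen to include the winning value reported below):
grid_search(params={'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]})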
Best params: {'learning_rate': 0.5}
Best score: 0.80525
Lowering learning_rate may be advantageous when n_estimators goes up.
8.7.4.2 max_depth
max_depth determines the depth of the tree, equivalent to the number of rounds of splitting. Limiting max_depth prevents overfitting because the individual trees can only grow as far as max_depth allows. XGBoost provides a default max_depth value of six:
grid_search(params={'max_depth':[2, 3, 5, 6, 8]})
Best params: {'max_depth': 2}
Best score: 0.79552
Changing max_depth from 6 to 2 gave a better score. The lower value for max_depth means variance has been reduced.
8.7.4.3 gamma
Known as a Lagrange multiplier, gamma provides a threshold that nodes must surpass before making further splits according to the loss function. There is no upper limit to the value of gamma. The default is 0, and anything over 10 is considered very high. Increasing gamma results in a more conservative model:
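The corresponding call is not shown; a plausible version (the candidate values are assumptions):
grid_search(params={'gamma': [0, 0.1, 0.5, 1, 2, 5]})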
Changing gamma from 0 to 1 has resulted in a slight improvement.
8.7.4.4 min_child_weight
min_child_weight refers to the minimum sum of weights required for a node to split into a child. If the sum of the weights is less than the value of min_child_weight, no further splits are made. Increasing min_child_weight reduces overfitting:
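Again the call is not shown; a plausible version (candidate values are assumptions):
grid_search(params={'min_child_weight': [1, 2, 3, 4, 5]})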
Best params: {'min_child_weight': 5}
Best score: 0.81202
A slight adjustment to min_child_weight from 1 to 5 gives the best results yet.
8.7.4.5 subsample
The subsample hyperparameter limits the percentage of training instances (rows) for each boosting round. Decreasing subsample from 100% reduces overfitting:
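Again the call is not shown; a plausible version (candidate values are assumptions):
grid_search(params={'subsample': [0.5, 0.7, 0.8, 0.9, 1]})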
Best params: {'subsample': 0.5}
Best score: 0.82525
The score has improved by a slight amount once again, indicating a small presence of overfitting.
8.7.4.6 colsample_bytree
Similar to subsample, colsample_bytree randomly selects particular columns according to the given percentage. colsample_bytree is useful for limiting the influence of columns and reducing variance. Note that colsample_bytree takes a percentage as input, not the number of columns:
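Again the call is not shown; a plausible version (candidate values are assumptions):
grid_search(params={'colsample_bytree': [0.5, 0.7, 0.8, 0.9, 1]})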
Best params: {'colsample_bytree': 0.5}
Best score: 0.79874
You are encouraged to try colsample_bylevel and colsample_bynode on your own. colsample_bylevel randomly selects columns for each tree depth, and colsample_bynode randomly selects columns when evaluating each tree split.
8.7.4.7 n_estimators
Recall that n_estimators provides the number of trees in the ensemble. In the case of XGBoost, n_estimators is the number of trees trained on the residuals. Initialize a grid search of n_estimators with the default of 100, then double the number of trees through 800 as follows:
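The call is not shown; a plausible version matching the description:
grid_search(params={'n_estimators': [100, 200, 400, 800]})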
Best params: {'n_estimators': 200}
Best score: 0.79219
Since our dataset is small, increasing n_estimators did not produce better results.
8.7.4.8 Applying early stopping
early_stopping_rounds is not a hyperparameter, but a strategy for optimizing the n_estimators hyperparameter.
Normally when choosing hyperparameters, a test score is given after all boosting rounds are complete. To use early stopping, we need a test score after each round. eval_metric and eval_set may be used as parameters of .fit to generate test scores for each training round. eval_metric provides the scoring method, commonly ‘error’ for classification and ‘rmse’ for regression. eval_set provides the test set to be evaluated, commonly X_test and y_test.
The following steps display an evaluation metric for each round of training with the default n_estimators=100:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = XGBClassifier(booster='gbtree', objective='binary:logistic', random_state=42)
eval_set = [(X_test, y_test)]
eval_metric = 'error'
model.fit(X_train, y_train, eval_metric=eval_metric, eval_set=eval_set)
# Make predictions for test data
y_pred = model.predict(X_test)
# Evaluate predictions
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
/usr/local/lib/python3.7/dist-packages/xgboost/sklearn.py:797: UserWarning: `eval_metric` in `fit` method is deprecated for better compatibility with scikit-learn, use `eval_metric` in constructor or `set_params` instead.
We know that StratifiedKFold cross-validation gives a mean accuracy of 78% when n_estimators=100. The disparity in scores comes from the difference in test sets.
8.7.4.9 early_stopping_rounds
early_stopping_rounds is an optional parameter to include with eval_metric and eval_set when fitting a model. Let’s try early_stopping_rounds=10. The previous code is repeated with early_stopping_rounds=10 added in:
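The code cell is not shown; a sketch repeating the previous fit with early_stopping_rounds=10 added:
model = XGBClassifier(booster='gbtree', objective='binary:logistic', random_state=42)
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric='error', eval_set=eval_set, early_stopping_rounds=10)
y_pred = model.predict(X_test)
print("Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100.0))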
A more thorough approach is to use larger values, say, n_estimators = 5000 and early_stopping_rounds=100. By setting early_stopping_rounds=100, you are guaranteed to reach the default of 100 boosted trees presented by XGBoost. Here is the code that gives a maximum of 5,000 trees and that will stop after 100 consecutive rounds fail to find any improvement:
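Again the cell itself is not shown; a sketch following the description:
model = XGBClassifier(booster='gbtree', objective='binary:logistic', n_estimators=5000, random_state=42)
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric='error', eval_set=eval_set, early_stopping_rounds=100)
print('Best iteration:', model.best_iteration)
y_pred = model.predict(X_test)
print("Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100.0))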
# Select target
data = pd.read_csv('melb_data.csv')
y = data.Price
# To keep things simple, we'll split the columns into numerical and categorical features
melb_predictors = data.drop(['Price', 'Date', 'Address'], axis=1)
cat_col = melb_predictors.select_dtypes(exclude=['int64', 'float64'])
# Divide data into training and validation subsets
X, X_v, y_train, y_valid = train_test_split(melb_predictors, y, train_size=0.8, test_size=0.2, random_state=0)
X_train = X.select_dtypes(exclude=['object'])
X_valid = X_v.select_dtypes(exclude=['object'])
X_train_cat = X.select_dtypes(exclude=['int64', 'float64'])
X_valid_cat = X_v.select_dtypes(exclude=['int64', 'float64'])
for col in X_train_cat.columns:
    X[col] = X[col].astype('category')
    X_v[col] = X_v[col].astype('category')
xgb = XGBRegressor(booster='gbtree', objective='reg:squarederror', random_state=42, n_jobs=-1)  # You can either specify missing=-9999 or leave it as it is
xgb.fit(X_train, y_train)
df = pd.read_csv('exoplanets.csv')
# Split data into X and y
X = df.iloc[:, 1:]
y = df.iloc[:, 0]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
start = time.time()
# Instantiate the LGBMClassifier, lg_reg
lg_reg = lgb.LGBMClassifier(n_estimators=100, max_depth=2, random_state=42)
# Fit lg_reg to training set
lg_reg.fit(X_train, y_train)
# Predict labels of test set, y_pred
y_pred = lg_reg.predict(X_test)
score = accuracy_score(y_pred, y_test)
print('Score: ' + str(score))
end = time.time()
elapsed = end - start
print('Run Time: ' + str(elapsed) + ' seconds')
Score: 0.9913522012578616
Run Time: 10.266663074493408 seconds
8.8.4 Hyperparameters
The following practices can be used to improve model efficiency; a short configuration sketch follows the lists below.
num_leaves : This is the main parameter to control the complexity of the tree model. Ideally, the value of num_leaves should be less than or equal to 2^(max_depth); a larger value can result in overfitting.
min_data_in_leaf : Setting it to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, setting it to hundreds or thousands is enough for a large dataset.
max_depth : We can also use max_depth to limit the tree depth explicitly.
For Faster Speed
Use bagging by setting bagging_fraction and bagging_freq.
Use feature sub-sampling by setting feature_fraction.
Use small max_bin.
Use save_binary to speed up data loading in future learning.
For better accuracy
Use large max_bin (may be slower).
Use a small learning_rate with a large num_iterations.
Use a large num_leaves (may cause over-fitting).
Try to use categorical feature directly.
To deal with over-fitting
Use min_data_in_leaf and min_sum_hessian_in_leaf.
Try lambda_l1, lambda_l2 and min_gain_to_split for regularization.
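A short configuration sketch illustrating several of these settings through the sklearn wrapper; the specific values are arbitrary assumptions, and the comments note the LightGBM-native alias for each sklearn-API parameter name:
import lightgbm as lgb
# Hypothetical settings illustrating the tips above (values are not tuned)
lgb_clf = lgb.LGBMClassifier(
    n_estimators=200,        # large num_iterations ...
    learning_rate=0.05,      # ... paired with a small learning_rate
    num_leaves=31,           # keep <= 2**max_depth to avoid overfitting
    max_depth=6,             # limit the tree depth explicitly
    min_child_samples=100,   # alias of min_data_in_leaf
    colsample_bytree=0.8,    # alias of feature_fraction (feature sub-sampling)
    subsample=0.8,           # alias of bagging_fraction
    subsample_freq=5,        # alias of bagging_freq
    reg_alpha=0.1,           # alias of lambda_l1
    reg_lambda=0.1,          # alias of lambda_l2
    random_state=42,
)
lgb_clf.fit(X_train, y_train)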
df = pd.read_csv('exoplanets.csv')
# Split data into X and y
X = df.iloc[:, 1:]
y = df.iloc[:, 0]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
start = time.time()
# Instantiate the CatBoostClassifier, ca_reg
ca_reg = CatBoostClassifier(n_estimators=100, max_depth=2, random_state=42, logging_level='Silent')
# Fit ca_reg to training set
ca_reg.fit(X_train, y_train)
# Predict labels of test set, y_pred
y_pred = ca_reg.predict(X_test)
score = accuracy_score(y_pred, y_test)
print('Score: ' + str(score))
end = time.time()
elapsed = end - start
print('Run Time: ' + str(elapsed) + ' seconds')
Score: 0.9913522012578616
Run Time: 16.69258213043213 seconds
One of the differences between CatBoost and other gradient boosting libraries is its advanced processing of the categorical features (in fact “Cat” in the package name stands for “CATegorical”).
CatBoost deals with the categorical data quite well out-of-the-box. However, it also has a huge number of training parameters, which provide fine control over the categorical features preprocessing.
The number of parameters related to categorical feature processing in CatBoost is overwhelming. Here is, hopefully, the full list:
one_hot_max_size (int) - use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. No complex encoding is performed for such features.
model_size_reg (float from 0 to inf) - The model size regularization coefficient. The larger the value, the smaller the model size. This regularization is needed only for models with categorical features (other models are small). Models with categorical features might weigh tens of gigabytes or more if the categorical features have many values. If the value of the regularizer differs from zero, then using categorical features or feature combinations with many values incurs a penalty, so fewer of them are used in the resulting model. The default value is 0.5.
max_ctr_complexity - The maximum number of features that can be combined. Each resulting combination consists of one or more categorical features and can optionally contain binary features in the following form: "numeric feature > value". For a regression task on CPU, the default value is 4.
has_time (bool) - If True, the first step of categorical feature processing, permutation, is not performed. Useful when the objects in your dataset are ordered by time. For our dataset, we don't need it. The default value is False.
simple_ctr - Quantization settings for simple categorical features.
combinations_ctr - Quantization settings for combinations of categorical features.
per_feature_ctr - Per-feature quantization settings for categorical features.
counter_calc_method - Determines whether to use the validation dataset (provided through the eval_set parameter of the fit method) to estimate category frequencies with Counter. By default it is Full and the objects from the validation dataset are used; pass the SkipTest value to ignore the objects from the validation set.
ctr_target_border_count - The maximum number of borders to use in target quantization for categorical features that need it. The default for a regression task is 1.
ctr_leaf_count_limit - The maximum number of leaves with categorical features. The default value is None, i.e. no limit.
store_all_simple_ctr - If the previous parameter, ctr_leaf_count_limit, is set, then at some point the gradient boosting tree can no longer make splits on categorical features. With the default value False, the limitation applies both to the original categorical features and to the features that CatBoost creates by combining different features; if this parameter is set to True, only the number of splits made on combination features is limited.
The three parameters simple_ctr, combinations_ctr, and per_feature_ctr are complex parameters that control the second and the third steps of categorical features processing.
# Select target
data = pd.read_csv('melb_data.csv')
y = data.Price
# To keep things simple, we'll split the columns into numerical and categorical features
melb_predictors = data.drop(['Price', 'Date', 'Address'], axis=1)
cat_col = melb_predictors.select_dtypes(exclude=['int64', 'float64'])
# Divide data into training and validation subsets
X, X_v, y_train, y_valid = train_test_split(melb_predictors, y, train_size=0.8, test_size=0.2, random_state=0)
X_train = X.select_dtypes(exclude=['object'])
X_valid = X_v.select_dtypes(exclude=['object'])
X_train_cat = X.select_dtypes(exclude=['int64', 'float64'])
X_valid_cat = X_v.select_dtypes(exclude=['int64', 'float64'])
categorical_features_names = list(X_train_cat.columns)
for col in categorical_features_names:
    X[col] = X.loc[:, col].fillna(value='nan')
    X_v[col] = X_v.loc[:, col].fillna(value='nan')
    X[col] = X[col].astype('category')
    X_v[col] = X_v[col].astype('category')
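The cell that instantiates and fits the catbr model used below is not shown; a minimal sketch (parameter values are assumptions), passing the categorical columns through cat_features:
from catboost import CatBoostRegressor
# Fit a CatBoost regressor on the training split, declaring the categorical columns
catbr = CatBoostRegressor(random_state=42, logging_level='Silent')
catbr.fit(X, y_train, cat_features=categorical_features_names)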
<catboost.core.CatBoostRegressor at 0x7f6006532dd0>
preds = catbr.predict(X_v)
mean_absolute_error(y_valid, preds)
164721.46472522244
The first thing we try is to make CatBoost use one-hot encoding for all our categorical features. The documentation says that for features for which one-hot encoding is used, no other encodings are computed.
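The corresponding cell is not shown; a sketch (the one_hot_max_size value is an assumption, chosen to exceed the number of categories of most features):
from sklearn.metrics import mean_absolute_error
catbr_onehot = CatBoostRegressor(one_hot_max_size=100, random_state=42, logging_level='Silent')
catbr_onehot.fit(X, y_train, cat_features=categorical_features_names)
print(mean_absolute_error(y_valid, catbr_onehot.predict(X_v)))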
Let us try to set the model size regularization coefficient to 0, thus allowing our model to use as many categorical features and their combinations as it wants.
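Again the cell is not shown; a sketch:
catbr_no_reg = CatBoostRegressor(model_size_reg=0, random_state=42, logging_level='Silent')
catbr_no_reg.fit(X, y_train, cat_features=categorical_features_names)
print(mean_absolute_error(y_valid, catbr_no_reg.predict(X_v)))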
Note that any combination of several categorical features could be considered as a new one. Although it is not mentioned in the documentation, this parameter value has to be smaller than 15.
It is quite common to use several encodings for a single categorical feature; by default, CatBoost creates four different encodings for each one.
First it uses the Borders method with one target border, TargetBorderCount=1 (in our example, for each categorical feature we just want to see whether it makes the house more expensive). The obtained float encodings are further discretized into CtrBorderCount=15 different values. Three values of the Prior parameter are used to create three different encodings: Prior=0/1:Prior=0.5/1:Prior=1/1.
In addition, for each categorical feature an encoding is created with the Counter method. The number of categorical encoding value borders, CtrBorderCount, is also equal to 15, and only one value, Prior=0/1, is used.
<catboost.core.CatBoostRegressor at 0x7f6007ab7a10>
8.10 Voting and Stacking
Carseats = pd.read_csv('/content/drive/MyDrive/Lab/Data/Carseats.csv')
print(Carseats.shape)
# Check for missing values
assert Carseats.isnull().sum().sum() == 0
# Create binary variable High: 1 if Sales > 8
Carseats['High'] = (Carseats['Sales'] > 8).astype(np.float64)
predictors = Carseats.drop(["Sales", "High"], axis=1).columns
X = pd.get_dummies(Carseats[predictors], drop_first=True)  # sklearn does not have built-in support for categorical variable
y = Carseats["High"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=1)