5  Feature selection and extraction

Author

phonchi

Published

March 20, 2023



5.1 Setup

!pip install Boruta -qq
!pip install opentsne -qq
!pip install umap-learn -qq
# Scientific computing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline

# Preprocessing and datasets
from sklearn.datasets import load_iris
from sklearn.datasets import load_digits
from sklearn.datasets import fetch_california_housing
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline, Pipeline

# Modeling
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

# Feature selection
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance

from boruta import BorutaPy

# Feature extraction
from sklearn.decomposition import PCA
from openTSNE import TSNE as oTSNE
import umap
from sklearn.cluster import KMeans
from sklearn.cluster import FeatureAgglomeration

5.2 Feature selection

The classes in the sklearn.feature_selection module can be used for feature selection/extraction methods on datasets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.

5.2.1 Removing low variance features

Suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables with variance \(p(1-p)\), so we can select using the threshold 0.8 * (1 - 0.8) = 0.16:

X = pd.DataFrame([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
X
0 1 2
0 0 0 1
1 0 1 0
2 1 0 0
3 0 1 1
4 0 1 0
5 0 1 1
sel = VarianceThreshold(threshold=(.8 * (1 - .8))).set_output(transform="pandas")
sel.fit_transform(X)
x1 x2
0 0 1
1 1 0
2 0 0
3 1 1
4 1 0
5 1 1

As expected, VarianceThreshold() has removed the first column, which has a probability p = 5/6 > 0.8 of containing a zero.

5.2.2 Univariate feature selection

Scikit-learn exposes feature selection routines as objects that implement the transform() method. For instance, we can apply a \(\chi^2\) test to the samples to retrieve only the two best features as follows:

X, y = load_iris(return_X_y=True, as_frame=True) # Load the iris data set
X
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

150 rows × 4 columns

selector = SelectKBest(chi2, k=2).set_output(transform="pandas")
X_new = selector.fit_transform(X, y)
X_new
petal length (cm) petal width (cm)
0 1.4 0.2
1 1.4 0.2
2 1.3 0.2
3 1.5 0.2
4 1.4 0.2
... ... ...
145 5.2 2.3
146 5.0 1.9
147 5.2 2.0
148 5.4 2.3
149 5.1 1.8

150 rows × 2 columns

plt.scatter(X_new.iloc[:,0], X_new.iloc[:,1], c=y)
plt.show()

These objects take as input a scoring function that returns univariate scores/p-values (or only scores for SelectKBest() and SelectPercentile()):

  • For regression: r_regression, f_regression, mutual_info_regression
  • For classification: chi2, f_classif, mutual_info_classif
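
For illustration (this snippet is not in the original notebook), a scoring function can also be called on its own; chi2, for example, returns the per-feature test statistics together with their p-values for the iris data loaded above:

# A minimal sketch: calling a scoring function directly
chi2_stats, p_values = chi2(X, y)
print(chi2_stats)
print(p_values)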

The methods based on the F-test estimate the degree of linear dependency between two random variables. For example, we create a dataset with two informative features among a hundred. To simplify our example, we include neither redundant nor repeated features. In addition, we explicitly do not shuffle the dataset, so that the informative features correspond to the first two columns of X.

# Generate synthetic dataset

data, target = make_classification(
    n_samples=5000,
    n_features=100,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    random_state=0,
    shuffle = False  # So that we know the informative features are the first columns: X[:, :n_informative]
)

We will create two machine learning pipelines.

  1. The first is a random forest that uses all available features.

  2. The second is also a random forest, but with a feature selection step added before training the classifier.

# Let’s create the model without any feature selection
model_without_selection = RandomForestClassifier(n_jobs=2)
# Then, let’s create a pipeline where the first stage will make the feature selection processing.
model_with_selection = make_pipeline(
    SelectKBest(score_func=f_classif, k=2), # Feature selection
    RandomForestClassifier(n_jobs=2),       # Model
)

We will measure the average time spent training each pipeline and making predictions. In addition, we will compute the testing score of the model. We will collect these results via cross-validation.

# Let’s start with the random forest without feature selection. We will store the results into a dataframe.
cv_results_without_selection = cross_validate(model_without_selection, data, target)
cv_results_without_selection = pd.DataFrame(cv_results_without_selection)
# Now, we will repeat the process for the pipeline incorporating the feature selection.
cv_results_with_selection = cross_validate(model_with_selection, data, target, return_estimator=True)
cv_results_with_selection = pd.DataFrame(cv_results_with_selection)

To analyze the results, we merge the results from the two pipelines into a single pandas dataframe.

cv_results = pd.concat(
    [cv_results_without_selection, cv_results_with_selection],
    axis=1,
    keys=["Without feature selection", "With feature selection"],
)

# swap the level of the multi-index of the columns
cv_results = cv_results.swaplevel(axis="columns")

Let’s first analyze the train and score time for each pipeline.

color = {"whiskers": "black", "medians": "black", "caps": "black"}
cv_results["fit_time"].plot.box(color=color, vert=False)
plt.xlabel("Elapsed time (s)")
_ = plt.title("Time to fit the model")

cv_results["score_time"].plot.box(color=color, vert=False)
plt.xlabel("Elapsed time (s)")
_ = plt.title("Time to make prediction")

We can draw the same conclusion for both the training and scoring elapsed times: selecting the most informative features speeds up our pipeline. Of course, such a speed-up is beneficial only if the generalization performance in terms of metrics remains the same. Let’s check the testing score.

cv_results["test_score"].plot.box(color=color, vert=False)
plt.xlabel("Accuracy score")
_ = plt.title("Test score via cross-validation")

We observe that the generalization performance of the model with feature selection decreases compared with the model using all available features. Since we generated the dataset, we can infer that the decrease is caused by the selection: the feature selection algorithm did not choose the two informative features.

for idx, pipeline in enumerate(cv_results_with_selection["estimator"]):
    print(
        f"Fold #{idx} - features selected are: "
        f"{np.argsort(pipeline[0].scores_)[-2:]}"
    )
Fold #0 - features selected are: [30  1]
Fold #1 - features selected are: [10  1]
Fold #2 - features selected are: [10  1]
Fold #3 - features selected are: [30  1]
Fold #4 - features selected are: [61  1]

We see that feature 1 is always selected, while the other selected feature varies depending on the cross-validation fold.

If we would like to keep a similar generalization performance, we could choose another scoring function for the test or select more features. For instance, we could select the number of features based on a specific percentile of the highest scores, as sketched below.
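
As a sketch (the percentile below is an arbitrary illustrative value, not a tuned choice, and this cell is not part of the original notebook), SelectPercentile can simply replace SelectKBest in the pipeline:

from sklearn.feature_selection import SelectPercentile

model_with_percentile = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=10),  # keep the top 10% of features
    RandomForestClassifier(n_jobs=2),
)
cv_results_percentile = pd.DataFrame(cross_validate(model_with_percentile, data, target))
cv_results_percentile["test_score"].mean()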

5.2.2.1 Mutual information

The Automobile dataset consists of 193 cars from the 1985 model year. The goal for this dataset is to predict a car’s price (the target) from 23 of the car’s features, such as make, body_style, and horsepower. In this example, we’ll rank the features with mutual information and investigate the results by data visualization. (The original dataset requires data cleaning; you could refer to https://skill-lync.com/student-projects/project-1-1299 for more details.)

# Uncomment below if you are using Kaggle
#!pip install gdown
!gdown --fuzzy https://drive.google.com/file/d/1FoCRK2LQBo1hlPWK2fDDcOjjy-DnPiJw/view?usp=sharing
Downloading...
From: https://drive.google.com/uc?id=1FoCRK2LQBo1hlPWK2fDDcOjjy-DnPiJw
To: /content/autos.csv
  0% 0.00/21.6k [00:00<?, ?B/s]100% 21.6k/21.6k [00:00<00:00, 31.6MB/s]
df = pd.read_csv("autos.csv") # Clean version
df.head()
symboling make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base length ... engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 3 alfa-romero gas std 2 convertible rwd front 88.6 168.8 ... 130 mpfi 3.47 2.68 9 111 5000 21 27 13495
1 3 alfa-romero gas std 2 convertible rwd front 88.6 168.8 ... 130 mpfi 3.47 2.68 9 111 5000 21 27 16500
2 1 alfa-romero gas std 2 hatchback rwd front 94.5 171.2 ... 152 mpfi 2.68 3.47 9 154 5000 19 26 16500
3 2 audi gas std 4 sedan fwd front 99.8 176.6 ... 109 mpfi 3.19 3.40 10 102 5500 24 30 13950
4 2 audi gas std 4 sedan 4wd front 99.4 176.6 ... 136 mpfi 3.19 3.40 8 115 5500 18 22 17450

5 rows × 25 columns

The scikit-learn algorithms for MI treat discrete features differently from continuous features. Consequently, you need to tell them which are which. As a rule of thumb, anything with a float dtype is not discrete. Categoricals (object or categorical dtype) can be treated as discrete by giving them a label encoding.

X = df.copy()
y = X.pop("price")

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
# A way to perform label encoding, see https://pandas.pydata.org/docs/reference/api/pandas.factorize.html
    X[colname], _ = X[colname].factorize() 

# All discrete features should now have integer dtypes (double-check this before using MI!)
discrete_features = X.dtypes == int

Scikit-learn has two mutual information metrics in its feature_selection module: one for real-valued targets (mutual_info_regression()) and one for categorical targets (mutual_info_classif()). Our target, price, is real-valued. The next cell computes the MI scores for our features and wraps them up in a nice dataframe.

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show a few features with their MI scores
curb_weight          1.436041
highway_mpg          0.948495
length               0.615407
bore                 0.497058
stroke               0.385846
num_of_cylinders     0.331445
compression_ratio    0.132048
fuel_type            0.047279
Name: MI Scores, dtype: float64
def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)

# As we might expect, the high-scoring `curb_weight` feature exhibits a strong relationship with `price`, the target.
sns.relplot(x="curb_weight", y="price", data=df)

The fuel_type feature has a fairly low MI score, but as we can see from the figure below, it clearly separates two price populations with different trends within the horsepower feature. This indicates that fuel_type contributes an interaction effect and might not be unimportant after all. Before deciding a feature is unimportant from its MI score, it’s good to investigate any possible interaction effects – domain knowledge can offer a lot of guidance here.

sns.lmplot(x="horsepower", y="price", hue="fuel_type", data=df)

5.2.3 Sequential feature selection

Sequential Feature Selection is available in the SequentialFeatureSelector transformer. SFS can be either forward or backward:

  • Forward-SFS is a greedy procedure that iteratively finds the best new feature to add to the set of selected features. Concretely, we initially start with zero feature and find the one feature that maximizes a cross-validated score when an estimator is trained on this single feature. Once that first feature is selected, we repeat the procedure by adding a new feature to the set of selected features. The procedure stops when the desired number of selected features is reached, as determined by the n_features_to_select parameter.

  • Backward-SFS follows the same idea but works in the opposite direction: instead of starting with no feature and greedily adding features, we start with all the features and greedily remove features from the set. The direction parameter controls whether forward or backward SFS is used.

In general, forward and backward selection do not yield equivalent results. Also, one may be much faster than the other depending on the requested number of selected features: if we have 10 features and ask for 7 selected features, forward selection would need to perform 7 iterations while backward selection would only need to perform 3.

X, y = load_iris(return_X_y=True, as_frame=True)
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2, direction='forward').set_output(transform="pandas")
sfs.fit(X, y)
SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3),
                          n_features_to_select=2)
df = sfs.transform(X)
df
sepal length (cm) petal width (cm)
0 5.1 0.2
1 4.9 0.2
2 4.7 0.2
3 4.6 0.2
4 5.0 0.2
... ... ...
145 6.7 2.3
146 6.3 1.9
147 6.5 2.0
148 6.2 2.3
149 5.9 1.8

150 rows × 2 columns

plt.scatter(df.iloc[:,0], df.iloc[:,1],c=y)
plt.show()
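
For comparison, a backward pass only requires changing the direction parameter. A minimal sketch (not run in the original notebook), which may select a different pair of features:

sfs_backward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction='backward'
).set_output(transform="pandas")
sfs_backward.fit(X, y)
sfs_backward.get_feature_names_out()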

5.2.4 Feature selection from model

SelectFromModel is a meta-transformer that can be used alongside any estimator that assigns importance to each feature through a specific attribute (such as coef_, feature_importances_) or via an importance_getter callable after fitting. The features are considered unimportant and removed if the corresponding importance of the feature values are below the provided threshold parameter.

Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”. In combination with the threshold criteria, one can use the max_features parameter to set a limit on the number of features to select.

X, y = load_iris(return_X_y=True, as_frame=True)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False)
model = SelectFromModel(lsvc).set_output(transform="pandas")
X_new = model.fit_transform(X,y) # We rely on the default threshold here rather than max_features
X_new
/usr/local/lib/python3.9/dist-packages/sklearn/svm/_base.py:1244: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(
sepal length (cm) sepal width (cm) petal length (cm)
0 5.1 3.5 1.4
1 4.9 3.0 1.4
2 4.7 3.2 1.3
3 4.6 3.1 1.5
4 5.0 3.6 1.4
... ... ... ...
145 6.7 3.0 5.2
146 6.3 2.5 5.0
147 6.5 3.0 5.2
148 6.2 3.4 5.4
149 5.9 3.0 5.1

150 rows × 3 columns
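
To illustrate the string heuristics and max_features described above (the estimator and the values used here are arbitrary choices, not part of the original notebook):

lr = LogisticRegression(max_iter=1000)
# Keep features whose importance is above the median importance
sel_median = SelectFromModel(lr, threshold="median").set_output(transform="pandas")
print(sel_median.fit_transform(X, y).columns.tolist())
# Combine a scaled-mean threshold with a hard cap on the number of features
sel_capped = SelectFromModel(lr, threshold="0.5*mean", max_features=2).set_output(transform="pandas")
print(sel_capped.fit_transform(X, y).columns.tolist())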

5.2.5 A concrete example

The following dataset is our old friend: a record of neighborhoods in California districts. The task is to predict the median house value (target) given some information about each neighborhood (block), such as the average number of rooms, the latitude, the longitude, or the median income of people in the block.

X, y = fetch_california_housing(as_frame=True, return_X_y=True)
# To speed up the computation, we take the first 10,000 samples
X = X[:10000]
y = y[:10000]
X.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25

The features read as follows:

  • MedInc: median income in block
  • HouseAge: median house age in block
  • AveRooms: average number of rooms
  • AveBedrms: average number of bedrooms
  • Population: block population
  • AveOccup: average house occupancy
  • Latitude: house block latitude
  • Longitude: house block longitude
  • MedHouseVal: Median house value in 100k$ (target)

To assess the quality of our inspection techniques, let’s add some random features that won’t help the prediction (un-informative features):

# Adding random features
rng = np.random.RandomState(0)
bin_var = pd.Series(rng.randint(0, 2, X.shape[0]), name='rnd_bin')
num_var = pd.Series(np.arange(X.shape[0]), name='rnd_num')
X_with_rnd_feat = pd.concat((X, bin_var, num_var), axis=1)
X_with_rnd_feat
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude rnd_bin rnd_num
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 0 0
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 1 1
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 1 2
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 0 3
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 1 4
... ... ... ... ... ... ... ... ... ... ...
9995 4.0775 10.0 6.140900 1.025440 1275.0 2.495108 39.14 -121.03 0 9995
9996 4.0848 8.0 6.350394 1.091864 1977.0 2.594488 39.13 -121.07 1 9996
9997 3.6333 7.0 7.243455 1.107330 1143.0 2.992147 39.11 -121.05 0 9997
9998 3.4630 8.0 6.363636 1.166297 1307.0 2.898004 39.08 -121.04 1 9998
9999 3.0781 7.0 5.487500 1.050000 246.0 3.075000 39.09 -121.00 0 9999

10000 rows × 10 columns

X_train, X_test, y_train, y_test = train_test_split(X_with_rnd_feat, y, random_state=42)

In linear models, the target value is modeled as a linear combination of the features.
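
Concretely, for features \(x_1, \dots, x_p\) the prediction takes the form

\[
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p ,
\]

and RidgeCV estimates the weights \(w_j\) under an L2 penalty whose strength is chosen by its built-in cross-validation.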

model = RidgeCV()

model.fit(X_train, y_train)

print(f'model score on training data: {model.score(X_train, y_train)}')
print(f'model score on testing data: {model.score(X_test, y_test)}')
model score on training data: 0.6049524592207427
model score on testing data: 0.5863921053581754

Our linear model obtains a score of about 0.60, so it explains a significant part of the target. Its coefficients should therefore be somewhat relevant. Let’s look at the learned coefficients.

coefs = pd.DataFrame(
   model.coef_,
   columns=['Coefficients'], index=X_train.columns
)

coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Ridge model')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)

AveBedrms has the largest coefficient. However, we can’t compare the magnitudes of these coefficients directly, since they are not scaled. Indeed, Population is an integer which can be in the thousands, while AveBedrms is around 4 and Latitude is in degrees.

So the Population coefficient is expressed in “100k$/habitant” while the AveBedrms coefficient is expressed in “100k$/nb of bedrooms” and the Latitude coefficient in “100k$/degree”. We see that changing the population by one does not change the outcome, while as latitude increases (going north) the price becomes cheaper. Also, adding a bedroom (keeping all other features constant) should raise the price of the house by about 80k$.

So looking at the coefficient plot to gauge feature importance can be misleading, as some features vary on a small scale while others vary a lot more, over several orders of magnitude. Before any interpretation, we therefore need to scale each column (removing the mean and scaling the variance to 1).

model = make_pipeline(StandardScaler(), RidgeCV())

model.fit(X_train, y_train)

print(f'model score on training data: {model.score(X_train, y_train)}')
print(f'model score on testing data: {model.score(X_test, y_test)}')
model score on training data: 0.6049222473801685
model score on testing data: 0.586090835494786
coefs = pd.DataFrame(
   model[1].coef_,
   columns=['Coefficients'], index=X_train.columns
)

coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Ridge model')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)

Now that the coefficients have been scaled, we can safely compare them. The MedInc feature, together with Longitude and Latitude, are the three variables that most influence the model.

The plot above tells us about dependencies between a specific feature and the target when all other features remain constant, i.e., conditional dependencies. An increase of the HouseAge will induce an increase of the price when all other features remain constant. On the contrary, an increase of the AveRooms will induce a decrease of the price when all other features remain constant.

We can check the coefficient variability through cross-validation: it is a form of data perturbation.

cv_model = cross_validate(
   model, X_with_rnd_feat, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
   return_estimator=True, n_jobs=2
)
coefs = pd.DataFrame(
   [model[1].coef_
    for model in cv_model['estimator']],
   columns=X_with_rnd_feat.columns
)
plt.figure(figsize=(9, 7))
sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5)
plt.axvline(x=0, color='.5')
plt.xlabel('Coefficient importance')
plt.title('Coefficient importance and its variability')
plt.subplots_adjust(left=.3)

Now suppose we want to select the four features that are most important according to the coefficients. SelectFromModel() is meant for just that: it accepts a threshold parameter and will select the features whose importance (defined by the coefficients) is above this threshold.

model
Pipeline(steps=[('standardscaler', StandardScaler()), ('ridgecv', RidgeCV())])
importance = np.abs(model[1].coef_)
threshold = np.sort(importance)[-5] + 0.01
sfm = SelectFromModel(model[1], threshold=threshold).fit(X, y)
print(f"Features selected by SelectFromModel: {sfm.get_feature_names_out()}")
Features selected by SelectFromModel: ['MedInc' 'AveBedrms' 'Latitude' 'Longitude']

5.2.5.1 Linear models with sparse coefficients (Lasso)

It is important to keep in mind that the associations extracted depend on the model. To illustrate this point, we consider a Lasso model, which performs feature selection with an L1 penalty. Let us fit a Lasso model with a strong regularization parameter alpha.
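
For reference, the objective minimized by scikit-learn’s Lasso is

\[
\min_{w} \; \frac{1}{2 n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 ,
\]

where the \(\ell_1\) term is what drives some coefficients exactly to zero.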

model = make_pipeline(StandardScaler(), Lasso(alpha=.015))

model.fit(X_train, y_train)

print(f'model score on training data: {model.score(X_train, y_train)}')
print(f'model score on testing data: {model.score(X_test, y_test)}')
model score on training data: 0.5933235371761756
model score on testing data: 0.5673786563118284
coefs = pd.DataFrame(
   model[1].coef_,
   columns=['Coefficients'], index=X_train.columns
)

coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Lasso model, strong regularization')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)

Here the model score is a bit lower, because of the strong regularization. However, it has zeroed out 3 coefficients, selecting a small number of variables to make its prediction.

5.2.5.2 Random forest with feature importance

Some algorithms have feature importance measures inherently built into the model. This is the case for random forest models. Let’s investigate the built-in feature_importances_ attribute.

model = RandomForestRegressor()

model.fit(X_train, y_train)

print(f'model score on training data: {model.score(X_train, y_train)}')
print(f'model score on testing data: {model.score(X_test, y_test)}')
model score on training data: 0.9796643318943656
model score on testing data: 0.8429137479202747
importances = model.feature_importances_
indices = np.argsort(importances)

fig, ax = plt.subplots()
ax.barh(range(len(importances)), importances[indices])
ax.set_yticks(range(len(importances)))
_ = ax.set_yticklabels(np.array(X_train.columns)[indices])

MedInc is still the most important feature. However, the built-in importance has a small bias toward high-cardinality features, such as the noisy feature rnd_num, which is here assigned an importance of about 0.1, more than HouseAge (which has low cardinality).

5.2.5.3 Feature importance by permutation

We introduce here a new technique to evaluate the feature importance of any given fitted model. It basically shuffles a feature and measures how much the model’s score degrades; this drop in score corresponds to the feature importance.
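
More precisely, if \(s\) is the model’s score on the held-out data and \(s_{k,j}\) is the score after shuffling column \(j\) in repeat \(k\), the importance reported by permutation_importance() is

\[
i_j = s - \frac{1}{K} \sum_{k=1}^{K} s_{k,j} .
\]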

# Any model could be used here

model = RandomForestRegressor()
model.fit(X_train, y_train)

print(f'model score on training data: {model.score(X_train, y_train)}')
print(f'model score on testing data: {model.score(X_test, y_test)}')
model score on training data: 0.9795458070557226
model score on testing data: 0.8448700648905965
r = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=42) # Notice that the model is already fitted
fig, ax = plt.subplots()

indices = r.importances_mean.argsort()
plt.barh(range(len(indices)), r.importances_mean[indices], xerr=r.importances_std[indices])

ax.set_yticks(range(len(indices)))
_ = ax.set_yticklabels(np.array(X_train.columns)[indices])

We see again that the features MedInc, Latitude and Longitude are important for the model. We note that our random variable rnd_num is now less important than Latitude. Indeed, the feature importance built into random forests is biased toward continuous (high-cardinality) data, such as AveOccup and rnd_num.

5.2.5.4 Feature rejection using Boruta

Boruta is a wrapper method built around a random forest: it compares each feature’s importance against that of shuffled “shadow” copies of the features, and iteratively confirms or rejects features that do consistently better or worse than those shadows.

# define Boruta feature selection method
model = RandomForestRegressor()
feat_selector = BorutaPy(model, n_estimators='auto', verbose=2, random_state=1)
# find all relevant features 
feat_selector.fit(X_train.to_numpy(), y_train.to_numpy())
Iteration:  1 / 100
Confirmed:  0
Tentative:  10
Rejected:   0
Iteration:  2 / 100
Confirmed:  0
Tentative:  10
Rejected:   0
Iteration:  3 / 100
Confirmed:  0
Tentative:  10
Rejected:   0
Iteration:  4 / 100
Confirmed:  0
Tentative:  10
Rejected:   0
Iteration:  5 / 100
Confirmed:  0
Tentative:  10
Rejected:   0
Iteration:  6 / 100
Confirmed:  0
Tentative:  10
Rejected:   0
Iteration:  7 / 100
Confirmed:  0
Tentative:  10
Rejected:   0
Iteration:  8 / 100
Confirmed:  9
Tentative:  0
Rejected:   1


BorutaPy finished running.

Iteration:  9 / 100
Confirmed:  9
Tentative:  0
Rejected:   1
BorutaPy(estimator=RandomForestRegressor(n_estimators=44,
                                         random_state=RandomState(MT19937) at 0x7F60C3A0E640),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x7F60C3A0E640, verbose=2)
X_train
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude rnd_bin rnd_num
4901 1.3287 43.0 4.036723 1.090395 1412.0 3.988701 34.01 -118.25 0 4901
4375 2.3421 48.0 3.425532 1.046809 633.0 2.693617 34.10 -118.28 0 4375
6698 3.6572 26.0 4.160797 1.093023 3001.0 1.994020 34.14 -118.10 0 6698
9805 3.2750 52.0 8.357827 1.543131 582.0 1.859425 36.55 -121.92 1 9805
1101 3.5189 15.0 5.489011 1.027473 1786.0 2.453297 39.82 -121.68 0 1101
... ... ... ... ... ... ... ... ... ... ...
5734 5.1718 31.0 5.676417 1.063985 1359.0 2.484461 34.16 -118.23 0 5734
5191 1.5256 36.0 4.897778 1.097778 702.0 3.120000 33.93 -118.26 1 5191
5390 2.9344 36.0 3.986717 1.079696 1756.0 3.332068 34.03 -118.38 0 5390
860 5.7192 15.0 6.395349 1.067979 1777.0 3.178891 37.58 -121.96 0 860
7270 2.3900 25.0 3.928287 1.235060 1439.0 5.733068 33.98 -118.23 1 7270

7500 rows × 10 columns

# check selected features 
np.array(X_train.columns)[feat_selector.support_]
array(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
       'AveOccup', 'Latitude', 'Longitude', 'rnd_num'], dtype=object)
# check ranking of features
feat_selector.ranking_
array([1, 1, 1, 1, 1, 1, 1, 1, 2, 1])
# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X_train.to_numpy())
X_filtered.shape
(7500, 9)

5.3 Dimensionality reduction

We now look at a model-based method for feature engineering: principal component analysis (PCA). You can think of PCA as a partitioning of the variation in the data. PCA is a great tool to help you discover important relationships in the data and can also be used to create more informative features.

There are two ways you could use PCA for feature engineering.

  1. The first way is to use it as a descriptive technique. Since the components tell you about the variation, you could compute the MI scores for the components and see what kind of variation is most predictive of your target. That could give you ideas for kinds of features to create – a product of 'Height' and 'Diameter' if 'Size' is important, say, or a ratio of 'Height' and 'Diameter' if 'Shape' is important. You could even try clustering on one or more of the high-scoring components. A biplot is useful in this case.

  2. The second way is to use the components themselves as features. Because the components expose the variational structure of the data directly, they can often be more informative than the original features. Here are some use-cases:

  • Dimensionality reduction: When your features are highly redundant (multicollinear, specifically), PCA will partition out the redundancy into one or more near-zero variance components, which you can then drop since they will contain little or no information.
  • Anomaly detection: Unusual variation, not apparent from the original features, will often show up in the low-variance components. These components could be highly informative in an anomaly or outlier detection task.
  • Noise reduction: A collection of sensor readings will often share some common background noise. PCA can sometimes collect the (informative) signal into a smaller number of features while leaving the noise alone, thus boosting the signal-to-noise ratio.
  • Decorrelation: Some ML algorithms struggle with highly-correlated features. PCA transforms correlated features into uncorrelated components, which could be easier for your algorithm to work with.

PCA basically gives you direct access to the correlational structure of your data. You’ll no doubt come up with applications of your own!
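
As a small sketch of the second use (the dataset, classifier, and n_components value are illustrative choices, not part of the original notebook), PCA can be dropped into a pipeline so that the components themselves become the features:

# Keep enough components to explain ~95% of the variance, then classify on them
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=1000))
X_digits, y_digits = load_digits(return_X_y=True)
pd.DataFrame(cross_validate(pca_pipe, X_digits, y_digits))["test_score"].mean()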

def plot_variance(pca, width=8, dpi=100):
    # Create figure
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    # Explained variance
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(
        xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0)
    )
    # Cumulative Variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(
        xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0)
    )
    # Set up figure
    fig.set(figwidth=width, dpi=dpi)
    return axs
df = pd.read_csv("autos.csv")
df.head()
symboling make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base length ... engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 3 alfa-romero gas std 2 convertible rwd front 88.6 168.8 ... 130 mpfi 3.47 2.68 9 111 5000 21 27 13495
1 3 alfa-romero gas std 2 convertible rwd front 88.6 168.8 ... 130 mpfi 3.47 2.68 9 111 5000 21 27 16500
2 1 alfa-romero gas std 2 hatchback rwd front 94.5 171.2 ... 152 mpfi 2.68 3.47 9 154 5000 19 26 16500
3 2 audi gas std 4 sedan fwd front 99.8 176.6 ... 109 mpfi 3.19 3.40 10 102 5500 24 30 13950
4 2 audi gas std 4 sedan 4wd front 99.4 176.6 ... 136 mpfi 3.19 3.40 8 115 5500 18 22 17450

5 rows × 25 columns

We’ve selected four features that cover a range of properties. Each of these features also has a high MI score with the target, price. We’ll standardize the data since these features aren’t naturally on the same scale.

features = ["highway_mpg", "engine_size", "horsepower", "curb_weight"]

X = df.copy()
y = X.pop('price')
X = X.loc[:, features]

# Standardize
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

Now we can fit scikit-learn’s PCA estimator and create the principal components. You can see here the first few rows of the transformed dataset.

# Create principal components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Convert to dataframe
component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=component_names)

X_pca.head(10)
PC1 PC2 PC3 PC4
0 0.382486 -0.400222 0.124122 0.169539
1 0.382486 -0.400222 0.124122 0.169539
2 1.550890 -0.107175 0.598361 -0.256081
3 -0.408859 -0.425947 0.243335 0.013920
4 1.132749 -0.814565 -0.202885 0.224138
5 0.547265 -0.545141 0.139969 0.424955
6 0.869268 -0.472834 -0.294073 0.090174
7 0.974373 -0.449233 -0.435749 -0.019102
8 1.796553 -1.050783 -0.081821 -0.296071
9 -0.306514 -0.542020 0.138605 0.012612

After fitting, the PCA instance contains the loadings in its components_ attribute. We’ll wrap the loadings up in a dataframe.

loadings = pd.DataFrame(
    pca.components_.T,  # transpose the matrix of loadings
    columns=component_names,  # so the columns are the principal components
    index=X.columns,  # and the rows are the original features
)
loadings
PC1 PC2 PC3 PC4
highway_mpg -0.492347 0.770892 0.070142 -0.397996
engine_size 0.503859 0.626709 0.019960 0.594107
horsepower 0.500448 0.013788 0.731093 -0.463534
curb_weight 0.503262 0.113008 -0.678369 -0.523232
# biplot 
xvector = pca.components_[0] 
yvector = -pca.components_[1]

xs = X_pca.to_numpy()[:,0] 
ys = -X_pca.to_numpy()[:,1]

fig, ax1 = plt.subplots(nrows=1, ncols=1, figsize=(10, 9))

# First Plot : Principal Component Scores 
x_min, x_max = -8, 8
y_min, y_max = -3, 3
ax1.set_xlim(x_min, x_max)
ax1.set_ylim(y_min, y_max)

for i in range(len(xs)):
    plt.plot(xs[i], ys[i], 'bo')

ax1.set_xlabel("1'st Principal Component Scores")
ax1.set_ylabel("2'nd Principal Component Scores")

# Plot reference lines
ax1.hlines(0,-3.5,3.5, linestyles='dotted', colors='grey')
ax1.vlines(0,-3.5,3.5, linestyles='dotted', colors='grey')

# Second Plot : Principal Component Loadings 'PCs' 
x_min, x_max = -1, 1          
y_min, y_max = -1, 1

ax2 = ax1.twinx().twiny()
ax2.set_xlim(x_min, x_max)
ax2.set_ylim(y_min, y_max)

for i in range(len(xvector)):
    ax2.arrow(0, 0, xvector[i],  yvector[i], color='red', width=0.005, head_width=0.02)
    ax2.annotate(X.columns[i], (xvector[i]*1.05, yvector[i]*1.05), color='red', size=14)  

    
ax2.set_xlabel("1'st Principal Component (Loading)", color='red')
ax2.set_ylabel("2'nd Principal Component (Loading)", color='red')
Text(0, 0.5, "2'nd Principal Component (Loading)")

Recall that the signs and magnitudes of a component’s loadings tell us what kind of variation it’s captured. The first component (PC1) shows a contrast between large, powerful vehicles with poor gas mileage, and smaller, more economical vehicles with good gas mileage. We might call this the “Luxury/Economy” axis. The next figure shows that our four chosen features mostly vary along the Luxury/Economy axis.

# Look at explained variance
plot_variance(pca)
array([<Axes: title={'center': '% Explained Variance'}, xlabel='Component'>,
       <Axes: title={'center': '% Cumulative Variance'}, xlabel='Component'>],
      dtype=object)

Let’s also look at the MI scores of the components. Not surprisingly, PC1 is highly informative, though the remaining components, despite their small variance, still have a significant relationship with price. Examining those components could be worthwhile to find relationships not captured by the main Luxury/Economy axis.

mi_scores = make_mi_scores(X_pca, y, discrete_features=False)
mi_scores
PC1    1.013739
PC2    0.379649
PC3    0.306207
PC4    0.204905
Name: MI Scores, dtype: float64

The third component shows a contrast between horsepower and curb_weight – sports cars vs. wagons, it seems.

# Show dataframe sorted by PC3
idx = X_pca["PC3"].sort_values(ascending=False).index
cols = ["make", "body_style", "horsepower", "curb_weight"]
df.loc[idx, cols]
make body_style horsepower curb_weight
118 porsche hardtop 207 2756
117 porsche hardtop 207 2756
119 porsche convertible 207 2800
45 jaguar sedan 262 3950
96 nissan hatchback 200 3139
... ... ... ... ...
59 mercedes-benz wagon 123 3750
61 mercedes-benz sedan 123 3770
101 peugot wagon 95 3430
105 peugot wagon 95 3485
143 toyota wagon 62 3110

193 rows × 4 columns

To express this contrast, let’s create a new ratio feature:

df["sports_or_wagon"] = X.curb_weight / X.horsepower
sns.regplot(x="sports_or_wagon", y='price', data=df, order=2);

5.4 Manifold learning

5.4.0.1 t-SNE

t-SNE embeds high-dimensional data into two or three dimensions while preserving local neighborhood structure, which makes it a popular tool for visualizing datasets such as the handwritten digits below. Here we use the openTSNE implementation.

digits = load_digits()
X = digits.images.reshape(-1, digits.images.shape[1]*digits.images.shape[2])
X.shape
(1797, 64)
fig, ax_array = plt.subplots(5, 5)
axes = ax_array.flatten()
for i, ax in enumerate(axes):
    ax.imshow(digits.images[i], cmap='gray_r')
plt.setp(axes, xticks=[], yticks=[], frame_on=False)
plt.tight_layout(h_pad=0.5, w_pad=0.01)

otsne = oTSNE(
    n_components=2,
    perplexity=30,
    initialization='pca', 
    n_jobs=2,
    random_state=0,
    negative_gradient_method='auto', 
    verbose=True,
)
embedding = otsne.fit(X)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, n_jobs=2, random_state=0, verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
   --> Time elapsed: 0.79 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.08 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.01 seconds
===> Running optimization with exaggeration=12.00, lr=149.75 for 250 iterations...
Iteration   50, KL divergence 2.6162, 50 iterations in 0.5615 sec
Iteration  100, KL divergence 2.6249, 50 iterations in 0.4249 sec
Iteration  150, KL divergence 2.6110, 50 iterations in 0.5087 sec
Iteration  200, KL divergence 2.6054, 50 iterations in 0.4346 sec
Iteration  250, KL divergence 2.6028, 50 iterations in 0.4644 sec
   --> Time elapsed: 2.40 seconds
===> Running optimization with exaggeration=1.00, lr=1797.00 for 500 iterations...
Iteration   50, KL divergence 0.9203, 50 iterations in 0.4606 sec
Iteration  100, KL divergence 0.8272, 50 iterations in 0.4424 sec
Iteration  150, KL divergence 0.7951, 50 iterations in 0.4422 sec
Iteration  200, KL divergence 0.7783, 50 iterations in 0.4284 sec
Iteration  250, KL divergence 0.7685, 50 iterations in 0.4518 sec
Iteration  300, KL divergence 0.7625, 50 iterations in 1.4672 sec
Iteration  350, KL divergence 0.7580, 50 iterations in 1.2427 sec
Iteration  400, KL divergence 0.7539, 50 iterations in 0.4364 sec
Iteration  450, KL divergence 0.7514, 50 iterations in 0.4209 sec
Iteration  500, KL divergence 0.7492, 50 iterations in 1.7576 sec
   --> Time elapsed: 7.56 seconds
plt.figure(figsize=(10, 8))
plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('tSNE of the Digits dataset', fontsize=24)
Text(0.5, 1.0, 'tSNE of the Digits dataset')
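
One advantage of openTSNE over scikit-learn’s TSNE is that the fitted embedding object can also place new points into the existing embedding via its transform() method. A minimal sketch, assuming we hold out the last digits as “new” data (this split is illustrative and was not part of the original run):

X_seen, X_unseen = X[:1500], X[1500:]
embedding_seen = otsne.fit(X_seen)                    # learn an embedding on the "seen" digits
unseen_embedded = embedding_seen.transform(X_unseen)  # project the held-out digits into it
unseen_embedded.shape                                 # (297, 2)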

5.4.0.2 UMAP

UMAP is useful for generating visualisations, but if you want to make use of UMAP more generally for machine learning tasks, it is important to be able to train a model and then later pass new data to the model and have it transform that data into the learned space. For example, if we use UMAP to learn a latent space and then train a classifier on data transformed into the latent space, then the classifier is only useful for prediction if we can transform data for which we want a prediction into the latent space the classifier uses.

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, stratify=digits.target, random_state=42)

Now, to get a benchmark idea of what we are looking at, let’s train a couple of different classifiers and then see how well they score on the test set. For this example, let’s try a support vector classifier and a KNN classifier.

svc = SVC(gamma='auto').fit(X_train, y_train)
knn = KNeighborsClassifier().fit(X_train, y_train)
svc.score(X_test, y_test), knn.score(X_test, y_test)
(0.62, 0.9844444444444445)

The goal now is to make use of UMAP as a preprocessing step that one could potentially fit into a pipeline.

trans = umap.UMAP(n_neighbors=5, random_state=42).fit(X_train)
plt.figure(figsize=(10, 8))
plt.scatter(trans.embedding_[:, 0], trans.embedding_[:, 1], c=y_train, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('Umap of the Digits dataset', fontsize=24)
Text(0.5, 1.0, 'Umap of the Digits dataset')

This looks very promising! Most of the classes got very cleanly separated, and that gives us some hope that it could help with classifier performance. We can now train some new models (again an SVC and a KNN classifier) on the embedded training data. This looks exactly as before but now we pass it the embedded data.

svc = SVC(gamma='auto').fit(trans.embedding_, y_train)
knn = KNeighborsClassifier().fit(trans.embedding_, y_train)
test_embedding = trans.transform(X_test)

The next important question is what the transform did to our test data. In principle, we now have a new two-dimensional representation of the test set, and ideally this should be based on the existing embedding of the training set.

plt.figure(figsize=(10, 8))
plt.scatter(test_embedding[:, 0], test_embedding[:, 1], c=y_test, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('Umap of the Digits dataset', fontsize=24)
Text(0.5, 1.0, 'Umap of the Digits dataset')

The results look like what we should expect; the test data has been embedded into two dimensions in exactly the locations we should expect (by class) given the embedding of the training data visualised above. This means we can now try out models that were trained on the embedded training data by handing them the newly transformed test set.

trans.transform(X_test)
array([[ 4.73177004e+00,  2.46851373e+00],
       [ 1.91173019e+01,  3.47619963e+00],
       [ 7.86290264e+00,  1.09972172e+01],
       ...,
       [ 2.84418404e-01, -4.09361899e-01],
       [ 1.47742963e+00,  8.79598904e+00]])
       [ 1.94371700e+01,  2.68893600e+00],
       [ 6.51380682e+00,  1.46814518e+01],
       [ 1.20655746e+01, -6.61174297e+00],
       [ 2.61204028e+00, -4.18903923e+00],
       [ 1.27581778e+01,  1.74870110e+01],
       [ 7.28155994e+00,  1.42150888e+01],
       [ 7.64708996e+00,  1.37616825e+01],
       [ 1.20456152e+01, -5.99717808e+00],
       [ 5.90163994e+00, -5.59923708e-01],
       [ 9.67985809e-01, -1.10858858e+00],
       [ 1.89800892e+01,  3.22926378e+00],
       [ 1.28227119e+01, -6.45035076e+00],
       [ 2.41951394e+00,  1.07949047e+01],
       [ 1.95769138e+01,  3.33323169e+00],
       [ 1.95406742e+01,  3.64278507e+00],
       [ 4.94027615e+00,  3.08518648e+00],
       [ 1.16152763e+01, -7.42098475e+00],
       [ 1.89538881e-01, -2.01126289e+00],
       [ 3.81815052e+00,  4.59767532e+00],
       [ 2.12944388e+00,  1.01413727e+01],
       [ 5.31503677e+00,  1.56595840e+01],
       [ 7.29801846e+00,  1.01420183e+01],
       [ 2.09905934e+00,  5.28698969e+00],
       [ 2.07446499e+01,  3.95429325e+00],
       [ 8.20926762e+00,  1.09102173e+01],
       [ 1.08266563e+01,  7.70246744e-01],
       [ 6.07540846e+00, -7.29724586e-01],
       [ 1.33210297e+01,  1.75505371e+01],
       [ 1.24755239e+01,  1.77963142e+01],
       [ 8.17131042e+00,  9.57447624e+00],
       [ 1.85363064e+01,  2.59018946e+00],
       [ 1.92491379e+01,  1.91278136e+00],
       [ 1.18102551e+01,  1.75860405e+01],
       [ 1.96883640e+01,  2.18472767e+00],
       [ 9.52509582e-01, -1.43717289e+00],
       [ 6.90265989e+00,  1.45218258e+01],
       [ 5.67792988e+00,  1.57095594e+01],
       [ 4.19719028e+00,  4.65386629e+00],
       [ 1.18064594e+01,  1.68005753e+01],
       [ 4.59421349e+00,  4.83855247e+00],
       [ 1.28239794e+01, -6.72075176e+00],
       [-1.79536175e-02, -1.28720593e+00],
       [ 1.02263308e+01,  8.62763345e-01],
       [ 7.40937901e+00,  1.00602980e+01],
       [ 6.83069372e+00,  1.49400043e+01],
       [-1.60397947e-01, -1.87085092e+00],
       [ 1.31290855e+01,  1.73606892e+01],
       [-8.06395912e+00,  3.48659539e+00],
       [ 3.60813916e-01, -2.49946070e+00],
       [ 3.95752883e+00,  5.46306419e+00],
       [ 3.21658778e+00,  5.64122772e+00],
       [ 1.89867115e+01,  2.76880908e+00],
       [ 1.43908429e+00,  9.82541084e+00],
       [ 8.73421192e+00, -4.49178368e-01],
       [ 4.20121336e+00,  4.74164057e+00]], dtype=float32)
svc.score(trans.transform(X_test), y_test), knn.score(trans.transform(X_test), y_test)
(0.98, 0.98)

The results are pretty good. While the accuracy of the KNN classifier did not improve, there was not much room for improvement given the data. The SVC, on the other hand, improved to match the KNN classifier's accuracy!

For more complex datasets a higher-dimensional embedding might have been a significant gain; it is certainly worth exploring the embedding dimension as one of the parameters in a grid search across a pipeline that includes UMAP, as sketched below.
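A minimal sketch of what that search could look like, assuming the X_train/X_test split from the UMAP example above; the pipeline name, the candidate values for n_components and n_neighbors, and the 3-fold CV are illustrative choices, and the search can be slow because UMAP is refit for every candidate:

umap_pipe = Pipeline([
    ("umap", umap.UMAP(random_state=42)),  # embedding as a preprocessing step
    ("svc", SVC()),
])
umap_param_grid = {
    "umap__n_components": [2, 5, 10],      # embedding dimensions to explore
    "umap__n_neighbors": [10, 30],
}
umap_search = GridSearchCV(umap_pipe, umap_param_grid, cv=3)
umap_search.fit(X_train, y_train)
umap_search.best_params_, umap_search.score(X_test, y_test)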

5.5 Clustering

When used for feature engineering, clustering could help us discover groups of customers representing a market segment, for instance, or geographic areas that share similar weather patterns. Adding the cluster labels as a feature can help machine learning models untangle complicated relationships of space or proximity.

5.5.1 Cluster Labels as a feature

Applied to a single real-valued feature, clustering acts like a traditional “binning” or “discretization” transform. On multiple features, it’s like “multi-dimensional binning” (sometimes called vector quantization).

It’s important to remember that this Cluster feature is categorical. Here, it’s shown with a label encoding (that is, as a sequence of integers), as a typical clustering algorithm would produce; depending on your model, a one-hot encoding may be more appropriate. The motivating idea for adding cluster labels is that the clusters will break up complicated relationships across features into simpler chunks. Our model can then learn the simpler chunks one by one instead of having to learn the complicated whole all at once. It’s a “divide and conquer” strategy.
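As a tiny illustration of the two representations (the toy labels below are made up and unrelated to the California data used next), pandas can expand a label-encoded categorical column into one-hot columns:

# Hypothetical cluster labels, label-encoded as integers
toy = pd.DataFrame({"Cluster": [0, 2, 1, 0, 2]})
toy["Cluster"] = toy["Cluster"].astype("category")
# One-hot encoding of the same labels
pd.get_dummies(toy, columns=["Cluster"], prefix="Cluster")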

As spatial features, California Housing’s 'Latitude' and 'Longitude' make natural candidates for k-means clustering. In this example we’ll cluster them together with 'MedInc' (median income) to create economic segments in different regions of California. Since k-means clustering is sensitive to scale, it can be a good idea to rescale or normalize data with extreme values. Our features are already roughly on the same scale, so we’ll leave them as-is.
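If the features were on very different scales, one reasonable default (a sketch only, not fitted here) would be to standardize before clustering by chaining a scaler and KMeans in a pipeline:

# Standardize, then cluster (sketch; the 6 clusters match the choice below)
scaled_kmeans = make_pipeline(StandardScaler(), KMeans(n_clusters=6, n_init=10))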

df = fetch_california_housing(as_frame=True)['frame']
X = df.loc[:, ["MedInc", "Latitude", "Longitude"]]
X.head()
MedInc Latitude Longitude
0 8.3252 37.88 -122.23
1 8.3014 37.86 -122.22
2 7.2574 37.85 -122.24
3 5.6431 37.85 -122.25
4 3.8462 37.85 -122.25
# Create cluster feature
kmeans = KMeans(n_clusters=6, n_init=10)
X["Cluster"] = kmeans.fit_predict(X)
X["Cluster"] = X["Cluster"].astype("category")

X.head()
MedInc Latitude Longitude Cluster
0 8.3252 37.88 -122.23 5
1 8.3014 37.86 -122.22 5
2 7.2574 37.85 -122.24 5
3 5.6431 37.85 -122.25 5
4 3.8462 37.85 -122.25 1

Notice the difference between predict() and transform() in KMeans. predict() returns the closest cluster each sample in X belongs to, while transform() maps the data to a cluster-distance space in which each dimension is the distance to one of the cluster centers.
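A quick sketch of the two output shapes on the model fitted above (the variable names are only for illustration; the three columns are selected explicitly because the added Cluster column was not part of the fit):

# predict(): one cluster label per row; transform(): distances to each of the 6 centers
labels = kmeans.predict(X[["MedInc", "Latitude", "Longitude"]])
distances = kmeans.transform(X[["MedInc", "Latitude", "Longitude"]])
labels.shape, distances.shape  # expected: (20640,) and (20640, 6)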

Now let’s look at a couple of plots to see how effective this was. First, a scatter plot showing the geographic distribution of the clusters. It seems like the algorithm has created separate segments for higher-income areas on the coasts.

sns.relplot(
    x="Longitude", y="Latitude", hue="Cluster", data=X, height=6,
);

The target in this dataset is MedHouseVal (median house value). These box-plots show the distribution of the target within each cluster. If the clustering is informative, these distributions should, for the most part, separate across MedHouseVal, which is indeed what we see.

X["MedHouseVal"] = df["MedHouseVal"]
sns.catplot(x="MedHouseVal", y="Cluster", data=X, kind="boxen", height=6);

5.5.2 Cluster distance as a feature

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

Now let’s fit a Logistic Regression model and evaluate it on the test set:

log_reg = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)
log_reg.fit(X_train, y_train)
LogisticRegression(max_iter=5000, multi_class='ovr', random_state=42)
log_reg_score = log_reg.score(X_test, y_test)
log_reg_score
0.9688888888888889

Okay, that’s our baseline: 96.89% accuracy. Let’s see if we can do better by using K-Means as a preprocessing step. We will create a pipeline that first clusters the training set into 30 clusters and replaces each image with its distances to the 30 cluster centroids, then applies a logistic regression model:

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=30, random_state=42, n_init=10)),
    ("log_reg", LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)),
])
pipeline.fit(X_train, y_train)
Pipeline(steps=[('kmeans', KMeans(n_clusters=30, n_init=10, random_state=42)),
                ('log_reg',
                 LogisticRegression(max_iter=5000, multi_class='ovr',
                                    random_state=42))])
X_train.shape
(1347, 64)
pipeline_score = pipeline.score(X_test, y_test)
pipeline_score
0.9733333333333334

How much did the error rate drop?

1 - (1 - pipeline_score) / (1 - log_reg_score)
0.1428571428571439

We reduced the error rate by over 14%!

5.6 Guidelines for determining the optimal number of features or threshold

To determine the optimal hyperparameter, we can use cross-validation. For instance, in the above example we chose the number of clusters k completely arbitrarily; we can surely do better. Since K-Means is just a preprocessing step in a classification pipeline, finding a good value for k is straightforward: the best value of k is simply the one that results in the best classification performance.

# The following cell may take close to 20 minutes to run, or more depending on your hardware!
param_grid = dict(kmeans__n_clusters=range(2, 60))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)
grid_clf.best_params_
grid_clf.score(X_test, y_test)

In the same way, you can use cross-validation to evaluate model performance with different numbers of top-ranked features (or different selection thresholds) and choose the setting that gives the best performance metric (e.g., highest accuracy or lowest error), as sketched below.
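A minimal sketch of such a search on the digits split above, tuning the number of features kept by SelectKBest inside a pipeline; the candidate k values and the 3-fold CV are illustrative, and f_classif may emit warnings about the constant border pixels in digits:

select_pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("log_reg", LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)),
])
select_param_grid = {"select__k": [10, 20, 30, 40, 50, 64]}
select_search = GridSearchCV(select_pipe, select_param_grid, cv=3)
select_search.fit(X_train, y_train)
select_search.best_params_, select_search.score(X_test, y_test)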

5.6.1 Using Clustering for Semi-Supervised Learning

Another use case for clustering is in semi-supervised learning, when we have plenty of unlabeled instances and very few labeled instances.

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

Let’s look at the performance of a logistic regression model when we only have 50 labeled instances:

n_labeled = 50
log_reg = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)
log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])
log_reg.score(X_test, y_test)
0.8333333333333334

The model’s accuracy is just 83.33%, much lower than earlier, of course. Let’s see how we can do better. First, let’s cluster the training set into 50 clusters; then, for each cluster, let’s find the image closest to the centroid. We will call these images the representative images:

k = 50
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
X_digits_dist = kmeans.fit_transform(X_train)
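# argmin along axis 0: for each of the k clusters, the index of the training
# instance closest to that cluster's centroid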
representative_digit_idx = np.argmin(X_digits_dist, axis=0)
X_representative_digits = X_train[representative_digit_idx]

Now let’s plot these representative images and label them manually:

plt.figure(figsize=(8, 2))
for index, X_representative_digit in enumerate(X_representative_digits):
    plt.subplot(k // 10, 10, index + 1)
    plt.imshow(X_representative_digit.reshape(8, 8), cmap="binary", interpolation="bilinear")
    plt.axis('off')
plt.show()

# Assuming we manually label these digits
y_train[representative_digit_idx]
array([4, 8, 0, 6, 8, 3, 7, 7, 9, 2, 5, 5, 8, 5, 2, 1, 2, 9, 6, 1, 1, 6,
       9, 0, 8, 3, 0, 7, 4, 1, 6, 5, 2, 4, 1, 8, 6, 3, 9, 2, 4, 2, 9, 4,
       7, 6, 2, 3, 1, 1])
y_representative_digits = y_train[representative_digit_idx]

Now we have a dataset with just 50 labeled instances, but instead of being completely random instances, each of them is a representative image of its cluster. Let’s see if the performance is any better:

log_reg = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)
log_reg.fit(X_representative_digits, y_representative_digits)
log_reg.score(X_test, y_test)
0.9222222222222223

We jumped from 83.3% accuracy to 92.2%, even though we are still training the model on only 50 instances. Since it’s often costly and painful to label instances, especially when it has to be done manually by experts, it’s a good idea to have them label representative instances rather than just random instances.

But perhaps we can go one step further: what if we propagated the labels to all the other instances in the same cluster?

y_train_propagated = np.empty(len(X_train), dtype=np.int32)
for i in range(k):
    y_train_propagated[kmeans.labels_==i] = y_representative_digits[i]
log_reg = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)
log_reg.fit(X_train, y_train_propagated)
LogisticRegression(max_iter=5000, multi_class='ovr', random_state=42)
log_reg.score(X_test, y_test)
0.9333333333333333

We got a small accuracy boost. Better than nothing, but we should probably have propagated the labels only to the instances closest to their centroid, because by propagating to the full cluster we have certainly included some outliers. Let’s propagate the labels only to the 75% of instances that are closest to their cluster’s centroid:

percentile_closest = 75

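# Distance of each training instance to the centroid of its own (assigned) cluster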
X_cluster_dist = X_digits_dist[np.arange(len(X_train)), kmeans.labels_]
for i in range(k):
    in_cluster = (kmeans.labels_ == i)
    cluster_dist = X_cluster_dist[in_cluster]
    cutoff_distance = np.percentile(cluster_dist, percentile_closest)
    above_cutoff = (X_cluster_dist > cutoff_distance)
    X_cluster_dist[in_cluster & above_cutoff] = -1
partially_propagated = (X_cluster_dist != -1)
X_train_partially_propagated = X_train[partially_propagated]
y_train_partially_propagated = y_train_propagated[partially_propagated]
X_train.shape, X_train_partially_propagated.shape
((1347, 64), (1003, 64))
log_reg = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)
log_reg.fit(X_train_partially_propagated, y_train_partially_propagated)
LogisticRegression(max_iter=5000, multi_class='ovr', random_state=42)
log_reg.score(X_test, y_test)
0.9355555555555556

A bit better. With just 50 labeled instances (only 5 examples per class on average!), we got about 93.6% accuracy, which is getting closer to the performance of logistic regression on the fully labeled digits dataset.

Our propagated labels are actually pretty good: their accuracy is about 97.5%:

np.mean(y_train_partially_propagated == y_train[partially_propagated])
0.9750747756729811

You could also do a few iterations of active learning:

  1. Manually label the instances that the classifier is least sure about, if possible by picking them in distinct clusters (see the sketch after this list).
  2. Train a new model with these additional labels.
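
As a rough sketch of step 1 (the number of queried instances, 10, is an arbitrary illustrative choice), the least-confident training instances can be found from the classifier’s predicted class probabilities:

# Hypothetical selection of the least-confident instances for manual labeling
proba = log_reg.predict_proba(X_train)
uncertainty = 1 - proba.max(axis=1)        # low top-class probability = unsure
query_idx = np.argsort(uncertainty)[-10:]  # the 10 most uncertain instances
# These would be labeled manually, added to the labeled set, and the model refit.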

5.6.2 Feature agglomeration

cluster.FeatureAgglomeration applies hierarchical clustering to group together features that behave similarly.

X, y = load_iris(return_X_y=True)
# Set n_clusters to 2; the output will be two columns of agglomerated features (iris has 4 features)
agglo = FeatureAgglomeration(n_clusters=2).fit_transform(X)
plt.scatter(agglo[:, 0], agglo[:, 1], c=y)
plt.show()

5.7 References

  1. https://www.kaggle.com/learn/feature-engineering
  2. https://inria.github.io/scikit-learn-mooc/python_scripts/dev_features_importance.html#
  3. https://scikit-learn.org/stable/modules/feature_selection.html
  4. https://scikit-learn.org/stable/modules/preprocessing.html#
  5. https://scikit-learn.org/stable/modules/unsupervised_reduction.html
  6. https://github.com/ageron/handson-ml3/blob/main/09_unsupervised_learning.ipynb