6  Linear Models and Regularization Methods

Author

phonchi

Published

October 17, 2022



6.1 Subset Selection Methods

Some of the commands in this lab may take a while to run on your computer.

6.1.1 Best Subset Selection

Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

!pip install mlxtend --upgrade --no-deps -qq
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import itertools

import statsmodels.formula.api as smf
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
#import glmnet as gln

#from sklearn.feature_selection import SequentialFeatureSelector 
from sklearn.model_selection import train_test_split,KFold, cross_val_score
from sklearn.preprocessing import scale, StandardScaler 
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.cross_decomposition import PLSRegression, PLSSVD

%matplotlib inline
plt.style.use('seaborn-white')
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

First of all, we note that the Salary variable is missing for some of the players. The isnull() function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a True value for any elements that are missing, and a False value for non-missing elements. The sum() function can then be used to count all of the missing elements.
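
As a quick illustration on a toy Series (hypothetical values, not the Hitters data):

s = pd.Series([118.0, np.nan, 92.5])
print(s.isnull())        # False, True, False
print(s.isnull().sum())  # 1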

from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Hitter = pd.read_csv("/content/drive/MyDrive/Lab/Data/Hitters.csv")
Hitter.head()
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
0 293 66 1 30 29 14 1 293 66 1 30 29 14 A E 446 33 20 NaN A
1 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
2 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
3 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
4 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
print(Hitter["Salary"].isnull().sum())
59
# Print the dimensions of the original Hitters data (322 rows x 20 columns)
print(Hitter.shape)

# Drop any rows that contain missing values
Hitter = Hitter.dropna()

# Print the dimensions of the modified Hitters data (263 rows x 20 columns)
print(Hitter.shape)

# One last check: should return 0
print(Hitter["Salary"].isnull().sum())

# Create dummy variables for the qualitative features, since sklearn does not handle categorical variables directly
qual = ['League', 'Division', 'NewLeague']
Hitter = pd.get_dummies(Hitter, columns=qual, drop_first=True)

Hitter.head()
(322, 20)
(263, 20)
0
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors Salary League_N Division_W NewLeague_N
1 315 81 7 24 38 39 14 3449 835 69 321 414 375 632 43 10 475.0 1 1 1
2 479 130 18 66 72 76 3 1624 457 63 224 266 263 880 82 14 480.0 0 1 0
3 496 141 20 65 78 37 11 5628 1575 225 828 838 354 200 11 3 500.0 1 0 1
4 321 87 10 39 42 30 2 396 101 12 48 46 33 805 40 4 91.5 1 0 1
5 594 169 4 74 51 35 11 4408 1133 19 501 336 194 282 421 25 750.0 0 1 0

The itertools.combinations function generates all possible combinations containing exactly k elements from a list. We can then perform best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The following code implements the first stage of best subset selection.

list(itertools.combinations([1,2,3,4], 3))
[(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]
def get_models(k_features, X, y):
    """
    Fit all possible models that contain exactly k_features predictors.
    X contains the predictors and y is the response (target).
    """
    
    n_features = X.shape[1]
    
    X_combos = itertools.combinations(list(X.columns), k_features)
    
    best_score = np.inf
    
    for X_label in X_combos:
        X_smf = ' + '.join(X_label)
        f     = 'Salary ~ {}'.format(X_smf)
        # Fit model
        lin_reg = smf.ols(formula=f, data=pd.concat([X, y], axis=1)).fit()
        score = lin_reg.ssr
        if score < best_score:
            best_score, best_subset = score, X_label
            best_reg = lin_reg
    

    return best_score, best_reg, best_subset

Which model with 2 predictors yields the lowest RSS?

X = Hitter.drop('Salary', axis=1) #19 variables remain
y = Hitter['Salary'] # put Salary as response variable

# Set number for predictors in subset
k = 2

# Get the best model among all subsets of size k
subset = get_models(k, X, y)

# Display results
print('This model yields the lowest RSS score for the subset of models with {} predictors:'.format(k))
print(subset)
This model yields the lowest RSS score for the subset of models with 2 predictors:
(30646559.890372835, <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x7f81074d46d0>, ('Hits', 'CRBI'))

First we fit all possible models in each subset of models with \(k\) predictors. This turns out to be a very computationally expensive process, as the number of possible combinations without repetition is given by:

\(\frac{p!}{k!(p-k)!}\)

where \(p\) is the number of predictors to choose from and we choose \(k\) of them.
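
For example, with \(p = 19\) predictors there are \(\binom{19}{2} = 171\) two-variable models and \(2^{19}-1 = 524287\) non-empty subsets in total. A small sketch verifying this (math.comb requires Python 3.8+):

from math import comb
print(comb(19, 2))                             # 171 models with two predictors
print(comb(19, 10))                            # 92378 models with ten predictors
print(sum(comb(19, k) for k in range(1, 20)))  # 524287 models in total, excluding the null model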

# get all model results
rss = []
sub = []
reg = []
kft = []

for i in range(1,5):
    best_score, best_reg, best_subset = get_models(i, X, y)
    
    rss.append(best_score)
    sub.append(best_subset)
    reg.append(best_reg)
    kft.append(i)
    print('Progress: i = {}, done'.format(i))
Progress: i = 1, done
Progress: i = 2, done
Progress: i = 3, done
Progress: i = 4, done
results = pd.DataFrame({'kft': kft, 'rss': rss, 'reg': reg, 'sub': sub},
                           columns = ['rss', 'reg', 'sub', 'kft']).set_index('kft')
results
rss reg sub
kft
1 3.617968e+07 <statsmodels.regression.linear_model.Regressio... (CRBI,)
2 3.064656e+07 <statsmodels.regression.linear_model.Regressio... (Hits, CRBI)
3 2.924930e+07 <statsmodels.regression.linear_model.Regressio... (Hits, CRBI, PutOuts)
4 2.797085e+07 <statsmodels.regression.linear_model.Regressio... (Hits, CRBI, PutOuts, Division_W)

We can then choose the best model using different criteria.

plt.figure(figsize=(15,9))
plt.rcParams.update({'font.size': 14, 'lines.markersize': 10})

# Set up a 2x2 grid so we can look at 4 plots at once
plt.subplot(2, 2, 1)

# First, plot the RSS of the best model for each subset size
ax = sns.lineplot(x = "kft", y = "rss", data = results)
ax.set_xlabel('# Predictors')
ax.set_ylabel('RSS')

# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.
# The argmax() function can be used to identify the location of the maximum point of a vector

results["rsquared_adj"] = results.apply(lambda row: row[1].rsquared_adj, axis=1)

plt.subplot(2, 2, 2)
ax = sns.lineplot(x = "kft", y = "rsquared_adj", data = results)
plt.plot(results["rsquared_adj"].argmax()+1, results["rsquared_adj"].max(), "or")
ax.set_xlabel('# Predictors')
ax.set_ylabel('adjusted rsquared')

# We'll do the same for AIC and BIC, this time looking for the models with the SMALLEST statistic
results["aic"] = results.apply(lambda row: row[1].aic, axis=1)

plt.subplot(2, 2, 3)
ax = sns.lineplot(x = "kft", y = "aic", data = results)
plt.plot(results["aic"].argmin()+1, results["aic"].min(), "or")
ax.set_xlabel('# Predictors')
ax.set_ylabel('AIC')

results["bic"] = results.apply(lambda row: row[1].bic, axis=1)

plt.subplot(2, 2, 4)
ax = sns.lineplot(x = "kft", y = "bic", data = results)
plt.plot(results["bic"].argmin()+1, results["bic"].min(), "or")
ax.set_xlabel('# Predictors')
ax.set_ylabel('BIC')
Text(0, 0.5, 'BIC')

The mlxtend package can help us perform feature selection. It internally uses cross-validation to estimate the test error for all possible combinations of features, rather than dividing the procedure into two stages.

# Perform an exhaustive search. The EFS and SFS classes use 'neg_mean_squared_error'; the plain 'mean_squared_error' scoring name seems to have been deprecated. It is just the MSE with a negative sign.
lr = LinearRegression()
efs1 = EFS(lr, 
           min_features=1,
           max_features=4,
           scoring='neg_mean_squared_error',
           print_progress=True,
           n_jobs=-1,
           cv=5) #5-fold
efs1.fit(X, y)
Features: 5035/5035
ExhaustiveFeatureSelector(estimator=LinearRegression(),
                          feature_groups=[[0], [1], [2], [3], [4], [5], [6],
                                          [7], [8], [9], [10], [11], [12], [13],
                                          [14], [15], [16], [17], [18]],
                          max_features=4, n_jobs=-1,
                          scoring='neg_mean_squared_error')
efs1.best_feature_names_
('AtBat', 'Hits', 'Walks', 'CRBI')
## This is a time-consuming process, be careful. You may consider setting --ServerApp.iopub_msg_rate_limit to a larger value, or simply skip executing this cell.
r22 = []
sub = []
kft = []
for i in range(1,20):
    efs1 = EFS(lr, 
           min_features=i,
           max_features=i,
           #scoring='neg_mean_squared_error', to calculate AIC, BIC you will need rss and the estimation of sigma (also from rss) https://xavierbourretsicotte.github.io/subset_selection.html
           scoring='r2',
           print_progress=True,
           cv = 0, #no CV 
           n_jobs=-1) #parallelism
    efs1.fit(X, y)
    best_score, best_subset = efs1.best_score_, efs1.best_feature_names_
    
    r22.append(best_score)
    sub.append(best_subset)
    kft.append(i)
    print('Progress: i = {}, done'.format(i))
Features: 19/19
Progress: i = 1, done
Features: 171/171
Progress: i = 2, done
Features: 969/969
Progress: i = 3, done
Features: 3876/3876
Progress: i = 4, done
Features: 11628/11628
Progress: i = 5, done
Features: 27132/27132
Progress: i = 6, done
Features: 50388/50388
Progress: i = 7, done
Features: 75582/75582
Progress: i = 8, done
Features: 92378/92378
Progress: i = 9, done
Features: 92378/92378
Progress: i = 10, done
Features: 75582/75582
Progress: i = 11, done
Features: 50388/50388
Progress: i = 12, done
Features: 27132/27132
Progress: i = 13, done
Features: 11628/11628
Progress: i = 14, done
Features: 3876/3876
Progress: i = 15, done
Features: 969/969
Progress: i = 16, done
Features: 18/19
Progress: i = 17, done
Features: 1/1
Progress: i = 18, done
Progress: i = 19, done
results2 = pd.DataFrame({'kft': kft, 'r2': r22, 'sub': sub},
                           columns = ['r2', 'sub', 'kft']).set_index('kft')
results2['sub']
kft
1                                               (CRBI,)
2                                          (Hits, CRBI)
3                                 (Hits, CRBI, PutOuts)
4                     (Hits, CRBI, PutOuts, Division_W)
5              (AtBat, Hits, CRBI, PutOuts, Division_W)
6       (AtBat, Hits, Walks, CRBI, PutOuts, Division_W)
7     (Hits, Walks, CAtBat, CHits, CHmRun, PutOuts, ...
8     (AtBat, Hits, Walks, CHmRun, CRuns, CWalks, Pu...
9     (AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWal...
10    (AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWal...
11    (AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWal...
12    (AtBat, Hits, Runs, Walks, CAtBat, CRuns, CRBI...
13    (AtBat, Hits, Runs, Walks, CAtBat, CRuns, CRBI...
14    (AtBat, Hits, HmRun, Runs, Walks, CAtBat, CRun...
15    (AtBat, Hits, HmRun, Runs, Walks, CAtBat, CHit...
16    (AtBat, Hits, HmRun, Runs, RBI, Walks, CAtBat,...
17    (AtBat, Hits, HmRun, Runs, RBI, Walks, CAtBat,...
18    (AtBat, Hits, HmRun, Runs, RBI, Walks, Years, ...
19    (AtBat, Hits, HmRun, Runs, RBI, Walks, Years, ...
Name: sub, dtype: object
def adjust_r2(r2, num_examples, num_features):
    coef = (num_examples - 1) / (num_examples - num_features - 1) 
    return 1 - (1 - r2) * coef
adj_r2 = []
for i, r2 in enumerate(r22):
    adj_r2.append(adjust_r2(r2=r2,
                  num_examples=X.shape[0],
                  num_features=i+1))
adj_r2 = np.array(adj_r2)
sns.lineplot(x=range(1,20), y=adj_r2)
plt.plot(adj_r2.argmax()+1, adj_r2.max(), "or")
plt.xticks(range(1,20))
plt.xlabel('# Predictors')
plt.ylabel('adjusted rsquared')
Text(0, 0.5, 'adjusted rsquared')

According to the adjusted \(R^2\), we see that the eleven-variable model is the best model. The predictors associated with this model are listed below.

results2['sub'][11]
('AtBat',
 'Hits',
 'Walks',
 'CAtBat',
 'CRuns',
 'CRBI',
 'CWalks',
 'PutOuts',
 'Assists',
 'League_N',
 'Division_W')

6.1.2 Forward and Backward Stepwise Selection

In practice, best subset selection is quite time-consuming, so forward or backward stepwise selection is usually preferred. Note that we do not consider the null model below.

def processSubset(X_label):
    # Fit model on feature_set and calculate RSS
    X_smf = ' + '.join(X_label)
    f     = 'Salary ~ {}'.format(X_smf)
    # Fit model
    lin_reg = smf.ols(formula=f, data=pd.concat([X, y], axis=1)).fit()
    RSS = lin_reg.ssr
    return {'model': lin_reg, 'RSS': RSS}
def forward(predictors):

    # Pull out predictors we still need to process
    remaining_predictors = [p for p in X.columns if p not in predictors]
      
    results = []
    
    for p in remaining_predictors:
        results.append(processSubset(predictors+[p]))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].argmin()]
    
    # Return the best model, along with some other useful information about the model
    return best_model

def backward(predictors):
    
    results = []
    
    for combo in itertools.combinations(predictors, len(predictors)-1):
        results.append(processSubset(combo))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].argmin()]

    return best_model
models2 = pd.DataFrame(columns=['RSS', 'model'])

predictors = [] # we start with null model M0

for i in range(1,len(X.columns)+1):
    models2.loc[i] = forward(predictors)
    exog = models2.loc[i]['model'].model.exog_names.copy()
    exog.remove('Intercept') # smf automatically adds an intercept
    predictors = exog
    print(i, predictors)
1 ['CRBI']
2 ['CRBI', 'Hits']
3 ['CRBI', 'Hits', 'PutOuts']
4 ['CRBI', 'Hits', 'PutOuts', 'Division_W']
5 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat']
6 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks']
7 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks']
8 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns']
9 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat']
10 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists']
11 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists', 'League_N']
12 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists', 'League_N', 'Runs']
13 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists', 'League_N', 'Runs', 'Errors']
14 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists', 'League_N', 'Runs', 'Errors', 'HmRun']
15 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists', 'League_N', 'Runs', 'Errors', 'HmRun', 'CHits']
16 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists', 'League_N', 'Runs', 'Errors', 'HmRun', 'CHits', 'RBI']
17 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists', 'League_N', 'Runs', 'Errors', 'HmRun', 'CHits', 'RBI', 'NewLeague_N']
18 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists', 'League_N', 'Runs', 'Errors', 'HmRun', 'CHits', 'RBI', 'NewLeague_N', 'Years']
19 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns', 'CAtBat', 'Assists', 'League_N', 'Runs', 'Errors', 'HmRun', 'CHits', 'RBI', 'NewLeague_N', 'Years', 'CHmRun']

We can then use AIC or BIC to choose the best model in the second stage. Note that argmin() below returns a 0-based position, and bic_f[0] corresponds to the 1-variable model, so a value of 5 corresponds to the 6-variable model.

bic_f = []

for m in models2.model:
    bic_f.append(m.bic)
    
np.array(bic_f).argmin()
5
models3 = pd.DataFrame(columns=['RSS', 'model'], index = range(1, len(X.columns)))

predictors = X.columns # we start with full model Mp
models3.loc[len(predictors)] = processSubset(predictors)

while(len(predictors) > 1):  
    models3.loc[len(predictors)-1] = backward(predictors)
    exog = models3.loc[len(predictors)-1]['model'].model.exog_names.copy()
    exog.remove('Intercept')
    predictors = exog
    print(len(predictors), predictors)
18 ['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat', 'CHits', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League_N', 'Division_W', 'NewLeague_N']
17 ['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'CAtBat', 'CHits', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League_N', 'Division_W', 'NewLeague_N']
16 ['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'CAtBat', 'CHits', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League_N', 'Division_W']
15 ['AtBat', 'Hits', 'HmRun', 'Runs', 'Walks', 'CAtBat', 'CHits', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League_N', 'Division_W']
14 ['AtBat', 'Hits', 'HmRun', 'Runs', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League_N', 'Division_W']
13 ['AtBat', 'Hits', 'Runs', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League_N', 'Division_W']
12 ['AtBat', 'Hits', 'Runs', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'League_N', 'Division_W']
11 ['AtBat', 'Hits', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'League_N', 'Division_W']
10 ['AtBat', 'Hits', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Division_W']
9 ['AtBat', 'Hits', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Division_W']
8 ['AtBat', 'Hits', 'Walks', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Division_W']
7 ['AtBat', 'Hits', 'Walks', 'CRuns', 'CWalks', 'PutOuts', 'Division_W']
6 ['AtBat', 'Hits', 'Walks', 'CRuns', 'PutOuts', 'Division_W']
5 ['AtBat', 'Hits', 'Walks', 'CRuns', 'PutOuts']
4 ['AtBat', 'Hits', 'CRuns', 'PutOuts']
3 ['Hits', 'CRuns', 'PutOuts']
2 ['Hits', 'CRuns']
1 ['CRuns']
bic_b = []

for m in models3.model:
    bic_b.append(m.bic)
    
np.array(bic_b).argmin()
7

We see that using forward stepwise selection, the best one-variable model contains only CRBI, and the best two-variable model additionally includes Hits. For this data, the best one-variable through six-variable models are each identical for best subset and forward selection. However, the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are different.

print(results2['sub'][6]) #best subset
print(models2['model'][6].model.exog_names[1:]) # Do not print Intercept
('AtBat', 'Hits', 'Walks', 'CRBI', 'PutOuts', 'Division_W')
['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks']
print(results2['sub'][7])
print(models2['model'][7].model.exog_names[1:]) # Do not print Intercept
print(models3['model'][7].model.exog_names[1:])
('Hits', 'Walks', 'CAtBat', 'CHits', 'CHmRun', 'PutOuts', 'Division_W')
['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks']
['AtBat', 'Hits', 'Walks', 'CRuns', 'CWalks', 'PutOuts', 'Division_W']

There are also other feature selection methods available in sklearn (https://scikit-learn.org/stable/modules/feature_selection.html) and mlxtend (http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/).
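
For instance, a minimal sketch with sklearn's SequentialFeatureSelector (available in scikit-learn >= 0.24; the choice of 7 features and 5 folds is purely illustrative) mirrors the forward search above, but scores candidate features with cross-validated MSE rather than RSS:

from sklearn.feature_selection import SequentialFeatureSelector

sfs_sk = SequentialFeatureSelector(LinearRegression(),
                                   n_features_to_select=7,
                                   direction='forward',
                                   scoring='neg_mean_squared_error',
                                   cv=5)
sfs_sk.fit(X, y)
print(list(X.columns[sfs_sk.get_support()]))  # names of the selected predictors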

lr = LinearRegression()
sfs1 = SFS(lr, 
          k_features=19, 
          forward=True, 
          floating=False, 
          scoring='neg_mean_squared_error',
          cv=5)

sfs1.fit(X, y)
SequentialFeatureSelector(estimator=LinearRegression(), k_features=(19, 19),
                          scoring='neg_mean_squared_error')
pd.DataFrame.from_dict(sfs1.get_metric_dict()).T
feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err
1 (11,) [-69552.92336716877, -213009.92207213468, -118... -142142.865462 (CRBI,) 74428.969884 57908.267167 28954.133583
2 (1, 11) [-53798.49669675372, -163254.80704040316, -104... -124277.839415 (Hits, CRBI) 67031.533473 52152.810324 26076.405162
3 (1, 11, 17) [-62163.905140527975, -153941.83393405456, -97... -120082.156746 (Hits, CRBI, Division_W) 59591.942149 46364.555526 23182.277763
4 (1, 11, 13, 17) [-65576.39003922763, -141278.46897398788, -825... -117217.983933 (Hits, CRBI, PutOuts, Division_W) 56263.230127 43774.704486 21887.352243
5 (0, 1, 11, 13, 17) [-64199.08545469596, -133171.50527159323, -836... -114112.221253 (AtBat, Hits, CRBI, PutOuts, Division_W) 54818.569767 42650.709646 21325.354823
6 (0, 1, 5, 11, 13, 17) [-60075.53412214324, -132306.39833204867, -765... -110162.261438 (AtBat, Hits, Walks, CRBI, PutOuts, Division_W) 57689.191989 44884.151259 22442.075629
7 (0, 1, 3, 5, 11, 13, 17) [-60330.25929006559, -132346.01741313486, -766... -110553.231464 (AtBat, Hits, Runs, Walks, CRBI, PutOuts, Divi... 58252.099546 45322.112462 22661.056231
8 (0, 1, 3, 5, 6, 11, 13, 17) [-64303.54034551623, -130892.95408839498, -745... -110934.554619 (AtBat, Hits, Runs, Walks, Years, CRBI, PutOut... 57911.784603 45057.335875 22528.667937
9 (0, 1, 3, 5, 6, 10, 11, 13, 17) [-63508.043800155414, -137971.21159579078, -75... -111236.70073 (AtBat, Hits, Runs, Walks, Years, CRuns, CRBI,... 57344.878076 44616.263337 22308.131669
10 (0, 1, 3, 5, 6, 10, 11, 12, 13, 17) [-75371.33424690235, -128223.28240695987, -742... -110021.54195 (AtBat, Hits, Runs, Walks, Years, CRuns, CRBI,... 52372.497258 40747.582133 20373.791066
11 (0, 1, 3, 5, 6, 10, 11, 12, 13, 16, 17) [-75554.06873028616, -129427.52046763463, -741... -110792.912735 (AtBat, Hits, Runs, Walks, Years, CRuns, CRBI,... 52633.998179 40951.038733 20475.519367
12 (0, 1, 3, 5, 6, 10, 11, 12, 13, 16, 17, 18) [-75588.46350075006, -130320.07532800487, -754... -111570.449036 (AtBat, Hits, Runs, Walks, Years, CRuns, CRBI,... 52803.480655 41082.901857 20541.450929
13 (0, 1, 2, 3, 5, 6, 10, 11, 12, 13, 16, 17, 18) [-75578.82866274133, -130098.48628552802, -797... -112677.57758 (AtBat, Hits, HmRun, Runs, Walks, Years, CRuns... 51485.083739 40057.144267 20028.572133
14 (0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 16, 17, 18) [-75574.5961843588, -132822.10224589257, -8117... -113601.744452 (AtBat, Hits, HmRun, Runs, RBI, Walks, Years, ... 51592.950752 40141.068467 20070.534233
15 (0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 15, 16, ... [-75795.22141743562, -132827.4310618751, -8116... -115593.410667 (AtBat, Hits, HmRun, Runs, RBI, Walks, Years, ... 53008.106168 41242.107459 20621.053729
16 (0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, ... [-77460.55999200376, -130968.8896155133, -8015... -116290.756812 (AtBat, Hits, HmRun, Runs, RBI, Walks, Years, ... 51687.395454 40214.549654 20107.274827
17 (0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 1... [-74741.82159316978, -128751.90512264056, -799... -118642.76362 (AtBat, Hits, HmRun, Runs, RBI, Walks, Years, ... 56165.48438 43698.654974 21849.327487
18 (0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14... [-75598.10118706549, -128381.8825323173, -7998... -118980.278688 (AtBat, Hits, HmRun, Runs, RBI, Walks, Years, ... 56546.558941 43995.143928 21997.571964
19 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... [-76408.91409133923, -129937.76968228162, -799... -121136.310318 (AtBat, Hits, HmRun, Runs, RBI, Walks, Years, ... 59785.432226 46515.09738 23257.54869

6.1.3 Choosing Among Models Using the Validation-Set Approach and Cross-Validation

We just saw that it is possible to choose among a set of models of different sizes using AIC, BIC, and adjusted \(R^2\) in the second stage. We will now consider how to do this using the validation set and cross-validation approaches.

In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model-fitting—including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.
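
One way to guarantee this in practice is to wrap the selection step and the regression in a single Pipeline, so the selection is refit inside every training fold. The sketch below only illustrates the principle: it uses univariate SelectKBest, which is a different and much cheaper selection technique than the subset search in this lab, and k=7 is an arbitrary example value.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression

pipe = Pipeline([
    ('select', SelectKBest(score_func=f_regression, k=7)),  # selection refit on each training fold
    ('reg', LinearRegression()),
])
kf_demo = KFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(pipe, X, y, cv=kf_demo, scoring='neg_mean_squared_error')
print(-scores.mean())  # cross-validated MSE with no information leaking from the held-out folds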

In order to use the validation set approach, we begin by splitting the observations into a training set and a test set. We also set a random seed so that the user will obtain the same training set/test set split.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
def processSubset(X_label, X_train, y_train, X_test, y_test):
    # Fit model on feature_set and calculate RSS
    X_smf = ' + '.join(X_label)
    f     = 'Salary ~ {}'.format(X_smf)
    # Fit model
    lin_reg = smf.ols(formula=f, data=pd.concat([X_train, y_train], axis=1)).fit()
    RSS = ((lin_reg.predict(X_test[list(X_label)]) - y_test) ** 2).sum() #on test set
    return {'model': lin_reg, 'RSS': RSS}
def forward(predictors, X_train, y_train, X_test, y_test):

    # Pull out predictors we still need to process
    remaining_predictors = [p for p in X_train.columns if p not in predictors]
    
    results = []
    
    for p in remaining_predictors:
        results.append(processSubset(predictors+[p], X_train, y_train, X_test, y_test))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].argmin()]
    
    # Return the best model, along with some other useful information about the model
    return best_model
models_test = pd.DataFrame(columns=['RSS', 'model'])

predictors = []

for i in range(1,len(X.columns)+1):
    models_test.loc[i] = forward(predictors, X_train, y_train, X_test, y_test)
    exog = models_test.loc[i]['model'].model.exog_names.copy()
    exog.remove('Intercept')
    predictors = exog
    print(i, predictors)
1 ['CRuns']
2 ['CRuns', 'Hits']
3 ['CRuns', 'Hits', 'Walks']
4 ['CRuns', 'Hits', 'Walks', 'CWalks']
5 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat']
6 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun']
7 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years']
8 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat']
9 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists']
10 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun']
11 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun', 'Runs']
12 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun', 'Runs', 'League_N']
13 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun', 'Runs', 'League_N', 'NewLeague_N']
14 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun', 'Runs', 'League_N', 'NewLeague_N', 'Errors']
15 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun', 'Runs', 'League_N', 'NewLeague_N', 'Errors', 'Division_W']
16 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun', 'Runs', 'League_N', 'NewLeague_N', 'Errors', 'Division_W', 'CRBI']
17 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun', 'Runs', 'League_N', 'NewLeague_N', 'Errors', 'Division_W', 'CRBI', 'RBI']
18 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun', 'Runs', 'League_N', 'NewLeague_N', 'Errors', 'Division_W', 'CRBI', 'RBI', 'PutOuts']
19 ['CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years', 'CAtBat', 'Assists', 'HmRun', 'Runs', 'League_N', 'NewLeague_N', 'Errors', 'Division_W', 'CRBI', 'RBI', 'PutOuts', 'CHits']
sns.lineplot(x=range(1,20), y=models_test['RSS'])
plt.xlabel('# Predictors')
plt.ylabel('RSS')
plt.plot(models_test['RSS'].argmin()+1, models_test['RSS'].min(), 'or')

Now that we know what we’re looking for, let’s perform forward selection on the full dataset and select the best 7-predictor model. It is important that we make use of the full data set in order to obtain more accurate coefficient estimates. Note that we perform forward selection on the full data set and select the best 7-predictor model, rather than simply using the predictors that we obtained from the training set, because the best 7-predictor model on the full data set may differ from the corresponding model on the training set.

models_full = pd.DataFrame(columns=['RSS', 'model'])

predictors = []
   
for i in range(1,8):
    models_full.loc[i] = forward(predictors, X, y, X, y) #use full dataset
    exog = models_full.loc[i]['model'].model.exog_names.copy()
    exog.remove('Intercept')
    predictors = exog
    print(i, predictors)
1 ['CRBI']
2 ['CRBI', 'Hits']
3 ['CRBI', 'Hits', 'PutOuts']
4 ['CRBI', 'Hits', 'PutOuts', 'Division_W']
5 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat']
6 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks']
7 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks']
print(models_test.loc[7, 'model'].model.exog_names)
print(models_full.loc[7, 'model'].model.exog_names) # we will use this one as our final model!
['Intercept', 'CRuns', 'Hits', 'Walks', 'CWalks', 'AtBat', 'CHmRun', 'Years']
['Intercept', 'CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks']

6.1.4 Model selection using Cross-Validation

Now let’s try to choose among the models of different sizes using cross-validation. This approach is somewhat involved, as we must perform forward selection within each of the \(k\) training sets. Despite this, we see that with its clever subsetting syntax, Python makes this job quite easy. First, we create a vector that assigns each observation to one of \(k = 10\) folds, and we create a DataFrame in which we will store the results:

k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=2)
# Create a DataFrame to store the results of our upcoming calculations
cv_errors = pd.DataFrame(columns=range(1,k+1), index=range(1,20))
cv_errors = cv_errors.fillna(0)
models_cv = pd.DataFrame(columns=["RSS", "model"])

j = 0
# Outer loop iterates over all folds
for train_index, test_index in kf.split(X):
    j = j+1
    # Reset predictors
    predictors = []
    
    X_train2, X_test2 = X.iloc[train_index], X.iloc[test_index]
    y_train2, y_test2 = y.iloc[train_index], y.iloc[test_index]
    
    # Inner loop iterates over each size i
    for i in range(1,len(X.columns)+1):    
    
        # Perform forward selection on the full dataset minus the jth fold, and test on the jth fold
        models_cv.loc[i] = forward(predictors, X_train2, y_train2, X_test2, y_test2)
        
        # Save the cross-validated error for this fold
        cv_errors.loc[i, j] = models_cv.loc[i]["RSS"]

        exog = models_cv.loc[i]['model'].model.exog_names.copy()
        exog.remove('Intercept')
        predictors = exog
cv_mean = cv_errors.apply(np.mean, axis=1)

sns.lineplot(x=range(1,20), y=cv_mean)
plt.xlabel('# Predictors')
plt.ylabel('CV Error')
plt.plot(cv_mean.argmin()+1, cv_mean.min(), "or")

Cross-validation suggests that the 8-predictor model is the best. We now perform forward selection on the full data set in order to obtain this 8-variable model.

models_full = pd.DataFrame(columns=['RSS', 'model'])

predictors = []
   
for i in range(1,9):
    models_full.loc[i] = forward(predictors, X, y, X, y) #use full dataset
    exog = models_full.loc[i]['model'].model.exog_names.copy()
    exog.remove('Intercept')
    predictors = exog
    print(i, predictors)
1 ['CRBI']
2 ['CRBI', 'Hits']
3 ['CRBI', 'Hits', 'PutOuts']
4 ['CRBI', 'Hits', 'PutOuts', 'Division_W']
5 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat']
6 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks']
7 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks']
8 ['CRBI', 'Hits', 'PutOuts', 'Division_W', 'AtBat', 'Walks', 'CWalks', 'CRuns']

Other selection strategies are also available in mlxtend (http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/) and sklearn (https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection).

6.2 Ridge Regression and the Lasso

The glmnet algorithms in R optimize the objective function using cyclical coordinate descent, while scikit-learn Ridge regression uses linear least squares with L2 regularization. They are rather different implementations, but the general principles are the same.

The glmnet() function in R optimizes: \[ \frac{1}{N}|| X\beta-y||^2_2+\lambda\bigg(\frac{1}{2}(1-\alpha)||\beta||^2_2 \ +\ \alpha||\beta||_1\bigg) \] (See the R documentation and https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet_beta.pdf)
The function supports both L1 and L2 regularization. For pure ridge regression we need to use \(\alpha = 0\). This reduces the above cost function to \[ \frac{1}{N}|| X\beta-y||^2_2+\frac{1}{2}\lambda ||\beta||^2_2 \]

In python, you can also find similar functions in https://github.com/civisanalytics/python-glmnet
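
scikit-learn itself also exposes the mixed L1/L2 penalty through ElasticNet, where l1_ratio plays a role analogous to \(\alpha\) in the glmnet objective. A rough sketch only; the exact scaling of the penalty differs between the two implementations, and alpha=1.0 below is an arbitrary example value:

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000)  # l1_ratio=1 recovers the lasso; l1_ratio near 0 approaches ridge
enet.fit(scale(X), y)
print(pd.Series(enet.coef_, index=X.columns).head())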

The sklearn Ridge() function on the other hand optimizes: \[ ||X\beta - y||^2_2 + \alpha ||\beta||^2_2 \] which is equivalent to optimizing \[ \frac{1}{N}||X\beta - y||^2_2 + \frac{\alpha}{N} ||\beta||^2_2 \]
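
To convince ourselves that Ridge() really minimizes this criterion, we can compare its solution against the closed-form estimate \((X^TX + \alpha I)^{-1}X^Ty\) on the standardized predictors. A small sanity-check sketch, fit without an intercept so the algebra stays simple; alpha=10 is an arbitrary value:

Xs = scale(X)
alpha_chk = 10.0
ridge_chk = Ridge(alpha=alpha_chk, fit_intercept=False).fit(Xs, y)
beta_closed = np.linalg.solve(Xs.T @ Xs + alpha_chk * np.eye(Xs.shape[1]), Xs.T @ y)
print(np.allclose(ridge_chk.coef_, beta_closed))  # True, up to solver tolerance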

Unlike glmnet() in R, the sklearn Ridge() function fits a single value of \(\alpha\) (called \(\lambda\) in the textbook) at a time. Here we fit it over a grid of values ranging from \(\alpha=10^{10}\) to \(\alpha=10^{-2}\), essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit. In the following we plot the 19 coefficients (excluding the intercept) for each of the 100 values of alpha.

alphas = 10**np.linspace(10,-2,100)

ridge = Ridge()
coefs = []

for a in alphas:
    ridge.set_params(alpha=a)
    ridge.fit(scale(X), y) # We standardize features before ridge regression
    coefs.append(ridge.coef_)

ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
Text(0.5, 1.0, 'Ridge coefficients as a function of the regularization')
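
As a quick numerical check of this shrinkage (a small sketch reusing the coefficient paths just computed; recall that alphas is ordered from largest to smallest):

coefs_arr = np.array(coefs)
print(np.linalg.norm(coefs_arr[0]))    # l2 norm of the coefficients at the largest alpha (1e10): close to zero
print(np.linalg.norm(coefs_arr[-1]))   # l2 norm at the smallest alpha (1e-2): much larger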

The above plot shows that the Ridge coefficients get larger when we decrease alpha. We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

Note that if we had instead simply fit a model with just an intercept, we would have predicted each test observation using the mean of the training observations. In that case, we could compute the test set MSE like this:

np.mean((np.mean(y_train) - y_test)**2)
172862.23592080915

We could also get similar result by fitting a ridge regression model with a very large value of \(\lambda\). Note that 1e10 means \(10^{10}\). This big penalty shrinks the coefficients to a very large degree and makes the model more biased, resulting in a higher MSE.

scaler = StandardScaler()
scaler.fit(X_train)
StandardScaler()
model = Ridge()
model.set_params(alpha=10**10)
X_scale = scaler.transform(X_train)
model.fit(X_scale, y_train)
X_scale_t = scaler.transform(X_test)
pred = model.predict(X_scale_t)
mean_squared_error(y_test, pred)
172862.22059245987

Next we fit ridge regression models on the training set over the grid of alpha values, and evaluate the MSE of each on the test set.

validation_score = []
X_scale_t = scaler.transform(X_test)
for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_scale,y_train) #compare normalize=True vs standardize
    validation_score.append(mean_squared_error(model.predict(X_scale_t),y_test))
sns.lineplot(x=alphas,y=validation_score)
plt.xscale('log')

np.argmin(validation_score)
65

Therefore, we see that the value of alpha that results in the smallest validation-set error is the one at index 65. What is the test MSE associated with this value of alpha?

alphas[65], validation_score[65]
(132.19411484660287, 99557.32756698191)

We can also use the built-in cross-validation provided by RidgeCV.

kf = KFold(n_splits=10, shuffle=True, random_state=2)
ridgecv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=kf)
ridgecv.fit(X_scale, y_train)
RidgeCV(alphas=array([1.00000000e+10, 7.56463328e+09, 5.72236766e+09, 4.32876128e+09,
       3.27454916e+09, 2.47707636e+09, 1.87381742e+09, 1.41747416e+09,
       1.07226722e+09, 8.11130831e+08, 6.13590727e+08, 4.64158883e+08,
       3.51119173e+08, 2.65608778e+08, 2.00923300e+08, 1.51991108e+08,
       1.14975700e+08, 8.69749003e+07, 6.57933225e+07, 4.97702356e+07,
       3.76493581e+07, 2.84803587e+0...
       2.00923300e+00, 1.51991108e+00, 1.14975700e+00, 8.69749003e-01,
       6.57933225e-01, 4.97702356e-01, 3.76493581e-01, 2.84803587e-01,
       2.15443469e-01, 1.62975083e-01, 1.23284674e-01, 9.32603347e-02,
       7.05480231e-02, 5.33669923e-02, 4.03701726e-02, 3.05385551e-02,
       2.31012970e-02, 1.74752840e-02, 1.32194115e-02, 1.00000000e-02]),
        cv=KFold(n_splits=10, random_state=2, shuffle=True),
        scoring='neg_mean_squared_error')
ridgecv.alpha_
100.0
model.set_params(alpha=ridgecv.alpha_)
model.fit(X_scale, y_train)
mean_squared_error(y_test, model.predict(X_scale_t))
99586.56834382795

6.2.0.1 Fit model to full data set

model.fit(scale(X), y)
Ridge(alpha=100.0)
pd.Series(model.coef_.flatten(), index=X.columns)
AtBat          -0.006619
Hits           49.467498
HmRun          -0.859718
Runs           28.891970
RBI            22.425519
Walks          41.183554
Years          -2.737654
CAtBat         24.915049
CHits          44.636902
CHmRun         38.851734
CRuns          45.225776
CRBI           47.320463
CWalks          3.558932
PutOuts        56.671318
Assists         7.335151
Errors        -13.493719
League_N       14.878188
Division_W    -48.564908
NewLeague_N     2.817544
dtype: float64

As expected, none of the coefficients are zero—ridge regression does not perform variable selection!
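
A quick check confirms this, counting exact zeros in the coefficients of the full-data fit above:

print((model.coef_ == 0).sum())  # 0 -- no coefficient is exactly zero under ridge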

You can also fit ridge regression with leave-one-out cross-validation (LOOCV); this is efficient since there is a corresponding shortcut formula.

ridgeloocv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', gcv_mode='auto')
ridgeloocv.fit(X_scale, y_train)
RidgeCV(alphas=array([1.00000000e+10, 7.56463328e+09, 5.72236766e+09, 4.32876128e+09,
       3.27454916e+09, 2.47707636e+09, 1.87381742e+09, 1.41747416e+09,
       1.07226722e+09, 8.11130831e+08, 6.13590727e+08, 4.64158883e+08,
       3.51119173e+08, 2.65608778e+08, 2.00923300e+08, 1.51991108e+08,
       1.14975700e+08, 8.69749003e+07, 6.57933225e+07, 4.97702356e+07,
       3.76493581e+07, 2.84803587e+0...
       2.00923300e+00, 1.51991108e+00, 1.14975700e+00, 8.69749003e-01,
       6.57933225e-01, 4.97702356e-01, 3.76493581e-01, 2.84803587e-01,
       2.15443469e-01, 1.62975083e-01, 1.23284674e-01, 9.32603347e-02,
       7.05480231e-02, 5.33669923e-02, 4.03701726e-02, 3.05385551e-02,
       2.31012970e-02, 1.74752840e-02, 1.32194115e-02, 1.00000000e-02]),
        gcv_mode='auto', scoring='neg_mean_squared_error')
ridgeloocv.alpha_
75.64633275546291
model.set_params(alpha=ridgeloocv.alpha_)
model.fit(X_scale, y_train)
mean_squared_error(y_test, model.predict(X_scale_t))
99820.74305168974
model.fit(scale(X), y)
Ridge(alpha=75.64633275546291)

6.2.1 The Lasso

We saw that ridge regression with a wise choice of alpha may outperform least squares as well as the null model on the Hitters data set. We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In the following we plot the 19 coefficients (excluding the intercept) for 100 different values of alpha.

lasso = Lasso(max_iter=10000)
coefs = []

for a in alphas:
    lasso.set_params(alpha=a)
    lasso.fit(scale(X_train), y_train)
    coefs.append(lasso.coef_)

ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Lasso coefficients as a function of the regularization');

We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero. We now perform cross-validation and compute the associated test error.

kf = KFold(n_splits=10, shuffle=True, random_state=2)
lassocv = LassoCV(alphas=None, cv=kf, max_iter=10000)
lassocv.fit(X_scale, y_train.values.ravel())
LassoCV(cv=KFold(n_splits=10, random_state=2, shuffle=True), max_iter=10000)
lassocv.alpha_
25.649654020305032
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_scale, y_train)
mean_squared_error(y_test, lasso.predict(X_scale_t))
104853.47279766494
# fit the full model
lasso.fit(scale(X), y)
Lasso(alpha=25.649654020305032, max_iter=10000)
# Some of the coefficients are now reduced to exactly zero.
pd.Series(lasso.coef_, index=X.columns)
AtBat            0.000000
Hits            80.723891
HmRun            0.000000
Runs             0.000000
RBI              0.000000
Walks           45.833055
Years            0.000000
CAtBat           0.000000
CHits            0.000000
CHmRun           0.000000
CRuns           65.022016
CRBI           129.889483
CWalks           0.000000
PutOuts         55.241441
Assists         -0.000000
Errors          -0.000000
League_N         0.000000
Division_W     -43.731588
NewLeague_N      0.000000
dtype: float64

The test set MSE is very similar to the test MSE of ridge regression with \(\alpha\) chosen by cross-validation.

However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse. Here we see that 13 of the 19 coefficient estimates are exactly zero. So the lasso model with \(\alpha\) chosen by cross-validation contains only six variables.
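
We can verify the sparsity directly from the full-data fit above (a quick check):

print(np.sum(lasso.coef_ == 0))   # 13 coefficients are exactly zero
print(np.sum(lasso.coef_ != 0))   # 6 variables remain in the model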

6.3 PCR and PLS Regression

6.3.1 Principal Components Regression

Principal components regression (PCR) can be performed by combining the PCA() transformer with a linear regression. We now apply PCR to the Hitters data in order to predict Salary. Again, we ensure that the missing values have been removed from the data, as described in Section 6.1.1.

pca = PCA()
X_reduced = pca.fit_transform(scale(X)) #standardize before PCA

print(pca.components_.shape)
pd.DataFrame(pca.components_.T).loc[:4,:5]
(19, 19)
0 1 2 3 4 5
0 0.198290 -0.383784 0.088626 0.031967 0.028117 -0.070646
1 0.195861 -0.377271 0.074032 0.017982 -0.004652 -0.082240
2 0.204369 -0.237136 -0.216186 -0.235831 0.077660 -0.149646
3 0.198337 -0.377721 -0.017166 -0.049942 -0.038536 -0.136660
4 0.235174 -0.314531 -0.073085 -0.138985 0.024299 -0.111675
print(X_reduced.shape)
pd.DataFrame(X_reduced).loc[:4,:5]
(263, 19)
0 1 2 3 4 5
0 -0.009649 1.870522 1.265145 -0.935481 1.109636 1.211972
1 0.411434 -2.429422 -0.909193 -0.264212 1.232031 1.826617
2 3.466822 0.825947 0.555469 -1.616726 -0.857488 -1.028712
3 -2.558317 -0.230984 0.519642 -2.176251 -0.820301 1.491696
4 1.027702 -1.573537 1.331382 3.494004 0.983427 0.513675
# Variance explained by the principal components
np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
array([38.31, 60.15, 70.84, 79.03, 84.29, 88.63, 92.26, 94.96, 96.28,
       97.25, 97.97, 98.64, 99.14, 99.46, 99.73, 99.88, 99.95, 99.98,
       99.99])
plt.plot(np.cumsum(pca.explained_variance_ratio_)) #scree plot
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
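
For instance, we can read off the smallest number of components needed to reach a given fraction of the variance (a small sketch; the 90% threshold is just an illustrative choice):

cum_var = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cum_var >= 0.90) + 1)  # 7, matching the cumulative proportions printed above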

6.3.1.1 Fitting PCA with training data

In this approach, we fit the PCA on the training data only, choose the number of components by 10-fold cross-validation within the training set, and then evaluate on the test set, as in the lab in the textbook.

pca2 = PCA()
X_reduced_train = pca2.fit_transform(X_scale)
n = len(X_reduced_train)

# 10-fold CV, with shuffle
kf_10 = KFold(n_splits=10, shuffle=True, random_state=2)

mse = []
regr = LinearRegression()
# Calculate MSE with only the intercept (no principal components in regression) 
score = -1*cross_val_score(regr, np.ones((n,1)), y_train, cv=kf_10, scoring='neg_mean_squared_error').mean()    
mse.append(score)

# Calculate MSE using CV for the 19 principal components, adding one component at a time.
for i in np.arange(1, 20):
    score = -1*cross_val_score(regr, X_reduced_train[:,:i], y_train, cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse.append(score)

plt.plot(np.arange(0, 20), np.array(mse), '-v')
plt.xlabel('Number of principal components in regression')
plt.ylabel('MSE')
plt.title('Salary')
plt.xlim(xmin=-1);

np.array(mse).argmin() # note: mse[0] is the intercept-only model, so a value of 6 means 6 principal components
6

6.3.1.2 Transform test data with PCA loadings and fit regression on 6 principal components

X_reduced_test = pca2.transform(X_scale_t)[:,:6]

# Train regression model on training data 
regr = LinearRegression()
regr.fit(X_reduced_train[:,:6], y_train)

# Prediction with test data
pred = regr.predict(X_reduced_test)
mean_squared_error(y_test, pred)
108195.77957462525
# Fit regression model on the whole data 
regr = LinearRegression()
regr.fit(X_reduced[:,:6], y)
LinearRegression()

6.3.2 Partial Least Squares

Scikit-learn's PLSRegression gives the same results as the pls package in R when using method='oscorespls'.
See the documentation: http://scikit-learn.org/dev/modules/generated/sklearn.cross_decomposition.PLSRegression.html#sklearn.cross_decomposition.PLSRegression.

n = len(X_train)

# 10-fold CV, with shuffle
kf_10 = KFold(n_splits=10, shuffle=True, random_state=2)

mse = []

score = -1*cross_val_score(regr, np.ones((n,1)), y_train, cv=kf_10, scoring='neg_mean_squared_error').mean()    
mse.append(score)

for i in np.arange(1, 20):
    pls = PLSRegression(n_components=i)
    score = -1*cross_val_score(pls, X_scale, y_train, cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse.append(score)

plt.plot(np.arange(0, 20), np.array(mse), '-v')
plt.xlabel('Number of PLS directions in regression')
plt.ylabel('MSE')
plt.title('Salary')
plt.xlim(xmin=-1);

np.array(mse).argmin()
1

The lowest cross-validation error occurs when only \(M=1\) partial least squares directions are used. We now evaluate the corresponding test set MSE.

pls = PLSRegression(n_components=1)
pls.fit(X_scale, y_train)

mean_squared_error(y_test, pls.predict(X_scale_t))
101462.13107757183