Cross-Validation on Linear Models | Diabetes Dataset | Lasso Regression

Introduction

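In this lab, we apply cross-validation to linear models. We use the diabetes dataset and GridSearchCV to find the best alpha value for Lasso regression. We then plot the errors and use LassoCV to see how much we can trust the selected alpha.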

Load and Prepare the Dataset

First, we load and prepare the diabetes dataset. For this exercise we use only the first 150 samples.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]
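
As a quick sanity check on the slicing step, the sketch below prints the array shapes before and after taking the first 150 samples. The names X_small and y_small are illustrative only and are not used elsewhere in the lab.

```python
from sklearn import datasets

# Load the full diabetes dataset as (features, target) arrays.
X_full, y_full = datasets.load_diabetes(return_X_y=True)
print(X_full.shape, y_full.shape)  # (442, 10) (442,)

# Keep only the first 150 samples, as in the lab.
X_small, y_small = X_full[:150], y_full[:150]
print(X_small.shape, y_small.shape)  # (150, 10) (150,)
```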

Apply GridSearchCV

Next, we apply GridSearchCV to find the best alpha value for Lasso regression. We search 30 logarithmically spaced alpha values between 10^-4 and 10^-0.5 (about 0.316), using 5-fold cross-validation.

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso = Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)

tuned_parameters = [{"alpha": alphas}]
n_folds = 5

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
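
Because the search is run with refit=False, no final model is fitted on the full data, so the selected alpha has to be read from the search results. The following is a minimal sketch of one way to do that, assuming the default scorer for Lasso (R^2, where higher is better). The names search, best_index, and best_alpha are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]

# 30 log-spaced candidates from 10^-4 up to 10^-0.5.
alphas = np.logspace(-4, -0.5, 30)
print("grid endpoints:", alphas[0], alphas[-1])

search = GridSearchCV(
    Lasso(random_state=0, max_iter=10000),
    [{"alpha": alphas}],
    cv=5,
    refit=False,
)
search.fit(X, y)

# mean_test_score holds one averaged score per candidate, in grid order.
mean_scores = search.cv_results_["mean_test_score"]
best_index = int(np.argmax(mean_scores))
best_alpha = alphas[best_index]
print("best alpha:", best_alpha, "mean CV score:", mean_scores[best_index])
```

Picking the maximum of the mean score ignores how uncertain each mean is, which is why the lab goes on to look at the standard error.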

Plot the Errors

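We now plot the results to see where the best alpha lies. The plot shows the mean test score for each alpha (R^2, the default score for Lasso) together with a band of plus or minus one standard error of that score across the folds.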

scores = clf.cv_results_["mean_test_score"]
scores_std = clf.cv_results_["std_test_score"]

plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, "b--")
plt.semilogx(alphas, scores - std_error, "b--")

plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)

plt.ylabel("CV score +/- std error")
plt.xlabel("alpha")
plt.axhline(np.max(scores), linestyle="--", color=".5")
plt.xlim([alphas[0], alphas[-1]])
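
The band in the plot comes from dividing the per-alpha standard deviation of the fold scores by the square root of the number of folds. The sketch below works through that arithmetic for a single alpha, using made-up fold scores (the values in fold_scores are hypothetical, not results from the diabetes data). It uses the population standard deviation, which matches how std_test_score is computed.

```python
import math
import statistics

# Hypothetical R^2 scores from 5 folds for one alpha value.
fold_scores = [0.40, 0.35, 0.45, 0.30, 0.50]
n_folds = len(fold_scores)

mean_score = statistics.fmean(fold_scores)
std_score = statistics.pstdev(fold_scores)   # population std (ddof=0)
std_error = std_score / math.sqrt(n_folds)   # standard error of the mean

lower, upper = mean_score - std_error, mean_score + std_error
print(f"mean={mean_score:.4f}, std={std_score:.4f}, std_error={std_error:.4f}")
print(f"band: [{lower:.4f}, {upper:.4f}]")
```

With only 5 folds the standard error is itself a rough estimate, so the band is best read as an indication of the spread rather than a precise confidence interval.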

Use LassoCV to Check the Choice of Alpha

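Finally, we use LassoCV to see how much we can trust the selected alpha. We split the data with a 3-fold KFold. On each training split, LassoCV selects its own alpha by internal cross-validation (5-fold by default), and we score the resulting model on the held-out split.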

from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)

print("Answer to the bonus question:", "how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
    lasso_cv.fit(X[train], y[train])
    print(
        "[fold {0}] alpha: {1:.5f}, score: {2:.5f}".format(
            k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])
        )
    )

print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")
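
To put a number on "different alphas for different subsets", one option is to collect the per-fold alphas and scores and summarize their spread. This is a sketch under the same setup as the lab. The names fold_alphas and fold_scores and the max/min ratio summary are illustrative choices, not part of the original exercise.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

X, y = datasets.load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]
alphas = np.logspace(-4, -0.5, 30)

fold_alphas, fold_scores = [], []
for train, test in KFold(3).split(X, y):
    # LassoCV picks alpha by internal cross-validation on the training split.
    model = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
    model.fit(X[train], y[train])
    fold_alphas.append(model.alpha_)
    fold_scores.append(model.score(X[test], y[test]))

fold_alphas = np.array(fold_alphas)
fold_scores = np.array(fold_scores)

# A ratio far from 1 means the selected alpha is unstable across subsets.
print("selected alphas:", fold_alphas)
print("max/min alpha ratio:", fold_alphas.max() / fold_alphas.min())
print("held-out score range:", fold_scores.max() - fold_scores.min())
```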

Summary

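In this lab, we learned how to use cross-validation with linear models. We used GridSearchCV to find the best alpha value for Lasso regression and plotted the scores with their standard errors to visualize that choice. We also used LassoCV on separate subsets of the data to check how stable the selected alpha is.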