Cross-Validation with Linear Models | Diabetes Dataset | Lasso Regression

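Introduction

In this lab, we perform cross-validation with linear models. Using the diabetes dataset, we apply GridSearchCV to find the best alpha value for Lasso regression. We then plot the cross-validation scores with their errors and use LassoCV to check how much the selected alpha can be trusted.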

Loading and Preparing the Dataset

First, we load and prepare the diabetes dataset. This exercise uses only the first 150 samples.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]

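Applying GridSearchCV

Next, we apply GridSearchCV to find the best alpha value for Lasso regression. The grid contains 30 alpha values spaced logarithmically between 10^-4 and 10^-0.5, and each candidate is evaluated with 5-fold cross-validation.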

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso = Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)

tuned_parameters = [{"alpha": alphas}]
n_folds = 5

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
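Because the search is created with refit=False, no final model is refitted, so it can help to read the winning alpha directly from cv_results_. The following is a minimal, self-contained sketch of one way to do that; the names search, mean_scores, best_index, and best_alpha are illustrative and do not appear in the lab code.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Same data preparation as in the lab: first 150 samples only.
X, y = datasets.load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]

# 30 log-spaced candidates from 10^-4 up to 10^-0.5 (about 0.316).
alphas = np.logspace(-4, -0.5, 30)

search = GridSearchCV(
    Lasso(random_state=0, max_iter=10000),
    {"alpha": alphas},
    cv=5,
    refit=False,
)
search.fit(X, y)

# One mean test score per candidate; pick the candidate with the highest one.
mean_scores = search.cv_results_["mean_test_score"]
best_index = int(np.argmax(mean_scores))
best_alpha = search.cv_results_["params"][best_index]["alpha"]

print("smallest alpha in grid:", alphas[0])
print("largest alpha in grid:", alphas[-1])
print("alpha with the highest mean CV score:", best_alpha)
```

The highest mean score alone says nothing about how stable that choice is, which is why the lab goes on to look at the spread of the scores.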

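Plotting the Errors

Here, we plot the cross-validation results to see which alpha performs best. The plot shows the mean test score for each alpha, with a band of one standard error above and below it. Because no scoring argument is passed to GridSearchCV, the score is the estimator's default, the R^2 coefficient of determination, so higher values are better.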

scores = clf.cv_results_["mean_test_score"]
scores_std = clf.cv_results_["std_test_score"]

plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, "b--")
plt.semilogx(alphas, scores - std_error, "b--")

plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)

plt.ylabel("CV score +/- std error")
plt.xlabel("alpha")
plt.axhline(np.max(scores), linestyle="--", color=".5")
plt.xlim([alphas[0], alphas[-1]])
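The band in the plot comes from dividing the standard deviation of the fold scores by the square root of the number of folds. The small sketch below works through that arithmetic for a single alpha. The five fold scores are made-up numbers chosen for illustration, not results from the diabetes data, and the use of NumPy's default population standard deviation (ddof=0) is an assumption about how the spread is measured.

```python
import numpy as np

# Hypothetical per-fold scores for one alpha (illustrative values only).
fold_scores = np.array([0.40, 0.35, 0.50, 0.45, 0.30])
n_folds = fold_scores.size

mean_score = fold_scores.mean()
# Population standard deviation (ddof=0), NumPy's default.
score_std = fold_scores.std()
# Standard error of the mean: std divided by sqrt(number of folds).
std_error = score_std / np.sqrt(n_folds)

lower = mean_score - std_error
upper = mean_score + std_error
print("mean score:", mean_score)
print("standard error:", std_error)
print("band:", (lower, upper))
```

With only a handful of folds the standard error is itself a rough estimate, so the band is best read as an indication of uncertainty rather than a formal confidence interval.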

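Checking the Alpha Selection with LassoCV

Finally, we use LassoCV to check how much the alpha selection can be trusted. We split the data with a 3-fold KFold, let LassoCV choose an alpha on each training portion, and score the resulting model on the held-out portion.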

from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)

print("Answer to the bonus question:", "how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
    lasso_cv.fit(X[train], y[train])
    print(
        "[fold {0}] alpha: {1:.5f}, score: {2:.5f}".format(
            k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])
        )
    )

print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")
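To put a number on how much the selected alpha varies between folds, the per-fold results can be collected into arrays and summarized. This is one possible sketch rather than part of the original lab; fold_alphas, fold_scores, alpha_ratio, and score_range are illustrative names, and the two summary measures are an arbitrary choice.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Same data and alpha grid as in the lab.
X, y = datasets.load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]
alphas = np.logspace(-4, -0.5, 30)

fold_alphas = []
fold_scores = []
for train, test in KFold(3).split(X, y):
    # LassoCV runs its own inner cross-validation on the training portion.
    model = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
    model.fit(X[train], y[train])
    fold_alphas.append(model.alpha_)
    fold_scores.append(model.score(X[test], y[test]))

fold_alphas = np.array(fold_alphas)
fold_scores = np.array(fold_scores)

# Ratio of largest to smallest selected alpha, and spread of held-out scores.
alpha_ratio = fold_alphas.max() / fold_alphas.min()
score_range = fold_scores.max() - fold_scores.min()
print("selected alphas:", fold_alphas)
print("held-out scores:", fold_scores)
print("max/min alpha ratio:", alpha_ratio)
print("score range:", score_range)
```

A ratio close to 1 together with a small score range would suggest a stable selection; larger values point to the instability described in the printed answer.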

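Summary

In this lab, we learned how to use cross-validation with linear models. We used GridSearchCV to find the best alpha value for Lasso regression and plotted the scores with their standard errors to visualize the selection. We also used LassoCV on separate folds of the data to check how reliable that selection is.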