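Pruning Decision Trees for Optimal Performance

Introduction

Decision trees are a widely used model in machine learning. However, they tend to overfit the training data, which degrades performance on test data. One way to prevent overfitting is to prune the tree, and cost-complexity pruning is a common method for doing so. In this lab, we use scikit-learn to demonstrate cost-complexity pruning of decision trees.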

Load the Data

We use the breast cancer dataset bundled with scikit-learn. It has 30 features and a binary target that indicates whether each tumor is malignant or benign.

from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
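
To confirm what was loaded, a quick inspection along the following lines can help. This is a minimal sketch: it calls load_breast_cancer without return_X_y so that the feature and class names are available, and the variable name `data` is introduced here only for illustration.

```python
from sklearn.datasets import load_breast_cancer

# Load the full Bunch object (not just X, y) to access the metadata.
data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)                 # (n_samples, n_features)
print(data.target_names)       # class labels; index 0 is 'malignant', 1 is 'benign'
print(data.feature_names[:5])  # first few feature names
```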

Split the Data

Split the data into a training set and a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
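
By default, train_test_split holds out 25% of the samples for testing. The sketch below checks the resulting sizes and also shows the optional stratify argument, which keeps the class ratio similar in both subsets. The stratified variant and its variable names are illustrative only and are not used in the rest of this lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same split as in the lab: the default test_size is 0.25.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# Illustrative alternative: preserve class proportions in both subsets.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, random_state=0, stratify=y
)
print(y_tr_s.mean(), y_te_s.mean())  # fraction of class 1 in each subset
```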

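Determine Appropriate Alpha Values

Next, we find the alpha values that are worth considering for pruning. cost_complexity_pruning_path returns the effective alphas and the corresponding total leaf impurity at each step of the pruning process; plotting the total impurity of the leaves against the effective alpha shows how the two are related.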
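
For reference, the quantities involved can be written as follows. This is the standard formulation of minimal cost-complexity pruning as described in the scikit-learn documentation; the symbols are introduced here and are not used elsewhere in this lab.

```latex
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
\qquad
\alpha_{\mathrm{eff}}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}
```

Here R(T) is the total sample-weighted impurity of the leaves of tree T, |T~| is its number of leaves, T_t is the subtree rooted at node t, and R(t) is the impurity of t if it were turned into a leaf. The node with the smallest effective alpha is the "weakest link" and is pruned first.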

from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

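# Compute the cost-complexity pruning path for the training data.
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
# Effective alphas and the total leaf impurity at each pruning step.
ccp_alphas, impurities = path.ccp_alphas, path.impurities

fig, ax = plt.subplots()
# Skip the last point, which corresponds to the single-node tree.
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")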
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
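
The arrays returned by cost_complexity_pruning_path can also be inspected numerically. The sketch below repeats the loading and splitting steps so that it runs on its own, then checks that the effective alphas are sorted in increasing order and that the total leaf impurity does not decrease as more of the tree is pruned. The small tolerance is an assumption to absorb floating-point error.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# There is one impurity value per candidate alpha.
print(len(ccp_alphas), len(impurities))

# Alphas increase along the path; impurity grows (or stays equal) with pruning.
tol = 1e-12
alphas_sorted = bool(np.all(np.diff(ccp_alphas) >= -tol))
impurity_non_decreasing = bool(np.all(np.diff(impurities) >= -tol))
print(alphas_sorted, impurity_non_decreasing)
```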

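Train the Decision Trees

Next, we train one decision tree for each effective alpha. The last value in ccp_alphas is the alpha that prunes the entire tree, leaving a tree with a single node.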

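# Fit one tree per candidate alpha; a larger alpha prunes more aggressively.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
# The tree trained with the last alpha should consist of the root node only.
print(
    f"Number of nodes in the last tree is: {clfs[-1].tree_.node_count} "
    f"with ccp_alpha: {ccp_alphas[-1]}"
)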
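
As a sanity check on the statement about the last alpha, the sketch below fits only the two extreme trees: an unpruned one (ccp_alpha=0.0) and one trained with the last value in ccp_alphas. It repeats the earlier steps so that it runs standalone; the names `full_tree` and `root_only` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
ccp_alphas = path.ccp_alphas

# No pruning at all: the fully grown tree.
full_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.0)
full_tree.fit(X_train, y_train)

# Pruning with the largest alpha on the path collapses the tree to its root.
root_only = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alphas[-1])
root_only.fit(X_train, y_train)

print(full_tree.tree_.node_count, full_tree.get_depth())
print(root_only.tree_.node_count, root_only.get_depth())
```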

Remove the Trivial Tree

Remove the trivial single-node tree, along with its alpha value, from the lists.

clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

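Plot the Number of Nodes and Tree Depth

Plot how the number of nodes and the depth of the tree change as alpha increases. Both shrink as alpha grows, because a larger alpha prunes more of the tree.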

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1)
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
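
The trend in these plots can also be confirmed numerically. The following is a minimal sketch, repeating the earlier steps so that it is self-contained, that checks the node count and the depth never increase as alpha increases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
ccp_alphas = path.ccp_alphas[:-1]  # drop the alpha of the single-node tree

clfs = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
    for a in ccp_alphas
]
node_counts = [clf.tree_.node_count for clf in clfs]
depths = [clf.tree_.max_depth for clf in clfs]

# Each tree should be no larger and no deeper than the one before it.
nodes_non_increasing = all(a >= b for a, b in zip(node_counts, node_counts[1:]))
depth_non_increasing = all(a >= b for a, b in zip(depths, depths[1:]))
print(node_counts)
print(depths)
print(nodes_non_increasing, depth_non_increasing)
```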

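Determine the Optimal Alpha Value

Finally, we choose the alpha value to prune with by plotting accuracy on the training and test sets against alpha. With little or no pruning the tree fits the training set very closely, so the alpha of interest is the one where test accuracy is highest.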

train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
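
Reading the best alpha off the plot can be complemented by picking it programmatically. The sketch below first selects the alpha with the highest test accuracy, mirroring the plot. Because choosing a hyperparameter on the test set gives an optimistic estimate, the second half shows an alternative that is not part of this lab: 5-fold cross-validation on the training set with GridSearchCV. The clipping of alphas to zero is an added guard against tiny negative values from floating-point error.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the last alpha (single-node tree) and clip away any negative round-off.
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

# 1) Mirror the plot: pick the alpha with the highest test accuracy.
test_scores = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a)
    .fit(X_train, y_train)
    .score(X_test, y_test)
    for a in ccp_alphas
]
best_alpha_test = ccp_alphas[int(np.argmax(test_scores))]
print("best alpha by test accuracy:", best_alpha_test)

# 2) Alternative (not from the lab): choose alpha using the training data only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": np.unique(ccp_alphas)},
    cv=5,
)
search.fit(X_train, y_train)
print("best alpha by 5-fold CV:", search.best_params_["ccp_alpha"])
print("test accuracy of CV-selected tree:", search.score(X_test, y_test))
```

The two procedures need not agree on the same alpha; the cross-validated choice leaves the test set untouched for a final, unbiased accuracy estimate.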

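Summary

In this lab, we showed how to perform cost-complexity pruning of decision trees with scikit-learn. We split the data into training and test sets, computed the candidate alpha values from the pruning path, trained a decision tree for each effective alpha, plotted the number of nodes and the depth of the trees, and chose the alpha to prune with based on accuracy on the training and test sets.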
