Mastering Decision Tree Analysis

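Introduction

Decision trees are a popular machine learning algorithm for classification and regression problems. A decision tree is a tree-based model that partitions the feature space into a set of non-overlapping regions and predicts a target value for each region. In this lab, we will learn how to analyze the structure of a fitted decision tree to gain further insight into the relationship between the features and the target to be predicted.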

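Train a Decision Tree Classifier

First, we fit a decision tree classifier on the iris dataset, which is loaded with load_iris from scikit-learn. The dataset contains 3 classes of 50 instances each, where each class is a species of iris plant. We split the dataset into training and test sets and fit a decision tree classifier with at most 3 leaf nodes.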

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
clf.fit(X_train, y_train)
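
Before inspecting the tree, it can help to confirm that the fitted model respects the leaf limit and generalizes reasonably. The sketch below repeats the training steps so that it runs on its own. get_n_leaves, get_depth, and score are standard scikit-learn estimator methods; the variable names are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
clf.fit(X_train, y_train)

# Sanity checks on the fitted tree
n_leaves = clf.get_n_leaves()              # should not exceed max_leaf_nodes
depth = clf.get_depth()                    # longest root-to-leaf path
test_accuracy = clf.score(X_test, y_test)  # mean accuracy on held-out data
print(f"leaves={n_leaves}, depth={depth}, test accuracy={test_accuracy:.3f}")
```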

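Analyze the Binary Tree Structure

The fitted classifier has an attribute called tree_ that gives access to low-level attributes such as node_count (the total number of nodes) and max_depth (the maximum depth of the tree). It also stores the entire binary tree structure as a set of parallel arrays: the i-th element of each array holds information about node i, and node 0 is the root. Using these arrays, we can traverse the tree to compute properties such as the depth of each node and whether it is a leaf. The following code computes these properties: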

import numpy as np

n_nodes = clf.tree_.node_count
children_left = clf.tree_.children_left
children_right = clf.tree_.children_right
feature = clf.tree_.feature
threshold = clf.tree_.threshold

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, 0)]  # start with the root node id (0) and its depth (0)
while len(stack) > 0:
    # `pop` ensures each node is only visited once
    node_id, depth = stack.pop()
    node_depth[node_id] = depth

    # If the left and right child of a node differ, it is a split node
    is_split_node = children_left[node_id] != children_right[node_id]
    # If it is a split node, push its left and right children and their depth
    # onto `stack` so that we can traverse them
    if is_split_node:
        stack.append((children_left[node_id], depth + 1))
        stack.append((children_right[node_id], depth + 1))
    else:
        is_leaves[node_id] = True

print(
    "The binary tree structure has {n} nodes and has "
    "the following tree structure:\n".format(n=n_nodes)
)
for i in range(n_nodes):
    if is_leaves[i]:
        print(
            "{space}node={node} is a leaf node.".format(
                space=node_depth[i] * "\t", node=i
            )
        )
    else:
        print(
            "{space}node={node} is a split node: "
            "go to node {left} if X[:, {feature}] <= {threshold} "
            "else to node {right}.".format(
                space=node_depth[i] * "\t",
                node=i,
                left=children_left[i],
                feature=feature[i],
                threshold=threshold[i],
                right=children_right[i],
            )
        )
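
As a cross-check on the manual traversal above, here is a minimal sketch of two shortcuts. It relies on scikit-learn marking a missing child with -1 in children_left and children_right, so leaves can be found without a stack, and it uses export_text from sklearn.tree to print the same rules as indented text. The block refits the model so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X_train, y_train)

# A node with no left child is a leaf (-1 marks "no child").
is_leaf = clf.tree_.children_left == -1
leaf_ids = np.flatnonzero(is_leaf)
print("leaf node ids:", leaf_ids)

# Text rendering of the same split rules, with readable feature names.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

If the leaf ids printed here differ from the nodes reported as leaves by the loop above, the traversal code is worth re-checking.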

Visualize the Decision Tree

We can also visualize the decision tree with the plot_tree function from scikit-learn's tree module.

from sklearn import tree
import matplotlib.pyplot as plt

tree.plot_tree(clf)
plt.show()
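
The default plot labels features by column index. A variant that passes the iris feature and class names and colors each node by its majority class is sketched below. feature_names, class_names, and filled are documented plot_tree parameters, while the figure size and the output file name iris_tree.png are arbitrary choices. The Agg backend is selected only so that the sketch also runs without a display; inside a notebook you can drop that line and call plt.show() instead of saving.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(8, 5))
artists = plot_tree(
    clf,
    feature_names=list(iris.feature_names),  # label splits by feature name
    class_names=list(iris.target_names),     # show the majority class per node
    filled=True,                             # color nodes by majority class
    ax=ax,
)
fig.savefig("iris_tree.png")  # arbitrary file name chosen for this sketch
```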

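Retrieve the Decision Path and Leaf Nodes

We can use the decision_path method to retrieve the decision path of the samples of interest. It returns an indicator matrix in which a nonzero entry at position (i, j) means that sample i passes through node j. The apply method returns, for each sample, the ID of the leaf node it reaches. With the leaf IDs and the decision path, we can recover the split conditions used to predict a sample or a group of samples. The following code retrieves the decision path and leaf node for a single sample: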

node_indicator = clf.decision_path(X_test)
leaf_id = clf.apply(X_test)

sample_id = 0
# Get the ids of the nodes that sample `sample_id` passes through, i.e. row `sample_id`
node_index = node_indicator.indices[
    node_indicator.indptr[sample_id] : node_indicator.indptr[sample_id + 1]
]

print("Rules used to predict sample {id}:\n".format(id=sample_id))
for node_id in node_index:
    # Skip the leaf node: it has no split condition to report
    if leaf_id[sample_id] == node_id:
        continue

    # Check whether the sample's value for the split feature is at or below the threshold
    if X_test[sample_id, feature[node_id]] <= threshold[node_id]:
        threshold_sign = "<="
    else:
        threshold_sign = ">"

    print(
        "decision node {node} : (X_test[{sample}, {feature}] = {value}) "
        "{inequality} {threshold})".format(
            node=node_id,
            sample=sample_id,
            feature=feature[node_id],
            value=X_test[sample_id, feature[node_id]],
            inequality=threshold_sign,
            threshold=threshold[node_id],
        )
    )
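
The same idea can be wrapped in a small helper so that any test sample can be explained with feature names instead of column indices. Note that explain_sample is a hypothetical helper written for this sketch, not a scikit-learn function; it relies only on decision_path, apply, and the tree_ arrays used above, and it refits the model so that the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X_train, y_train)


def explain_sample(clf, X, sample_id, feature_names):
    """Return (path, leaf, rules) for one row of X (hypothetical helper)."""
    row = X[sample_id : sample_id + 1]
    # Single-row indicator matrix: every stored index belongs to this sample.
    path = clf.decision_path(row).indices.tolist()
    leaf = int(clf.apply(row)[0])
    rules = []
    for node_id in path:
        if node_id == leaf:
            continue  # leaves carry no split condition
        feat = clf.tree_.feature[node_id]
        thr = clf.tree_.threshold[node_id]
        value = X[sample_id, feat]
        sign = "<=" if value <= thr else ">"
        rules.append(f"{feature_names[feat]} = {value} {sign} {thr:.3f}")
    return path, leaf, rules


path, leaf, rules = explain_sample(clf, X_test, 0, iris.feature_names)
print(f"sample 0 visits nodes {path} and ends in leaf {leaf}")
for rule in rules:
    print("  " + rule)
```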

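Determine the Common Nodes for a Group of Samples

For a group of samples, we can determine the nodes they all pass through by converting the indicator matrix returned by decision_path to a dense array with toarray and keeping the nodes visited by every sample in the group.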

sample_ids = [0, 1]
# Boolean array marking the nodes that both samples pass through
common_nodes = node_indicator.toarray()[sample_ids].sum(axis=0) == len(sample_ids)
# Obtain the node ids from the positions in the array
common_node_id = np.arange(n_nodes)[common_nodes]

print(
    "\nThe following samples {samples} share the nodes {nodes} in the tree.".format(
        samples=sample_ids, nodes=common_node_id
    )
)
print("This is {prop}% of all nodes.".format(prop=100 * len(common_node_id) / n_nodes))
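
Converting the indicator matrix with toarray is fine for a tree this small, but for large trees or many samples the dense array can grow quickly. A sketch of the same computation that stays sparse is shown below. It relies on decision_path returning a SciPy CSR matrix and sums only the selected rows; the names visits and common_node_id_sparse are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X_train, y_train)

node_indicator = clf.decision_path(X_test)
sample_ids = [0, 1]

# Sum the selected rows without densifying the whole matrix;
# np.asarray(...).ravel() flattens the 1 x n_nodes result to a 1-D array.
visits = np.asarray(node_indicator[sample_ids].sum(axis=0)).ravel()
# A node is common when every selected sample visited it.
common_node_id_sparse = np.flatnonzero(visits == len(sample_ids))
print("common nodes:", common_node_id_sparse)
```

Every sample starts at the root, so node 0 always appears among the common nodes.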

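Summary

In this lab, we learned how to analyze the structure of a decision tree to gain further insight into the relationship between the features and the target to be predicted. We saw how to retrieve the binary tree structure, visualize the decision tree, and retrieve the decision path and leaf nodes for a single sample or a group of samples. These techniques help us understand how a decision tree classifier arrives at its predictions and can inform how we tune the model to improve its performance.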