掌握 Scikit-Learn 的迭代插补

简介

在本实验中，我们将学习如何使用 Scikit-Learn 的IterativeImputer类来填充数据集中的缺失值。我们将比较不同的估计器，以了解在加利福尼亚住房数据集上使用BayesianRidge估计器且每行随机删除一个值时，哪个估计器最适合IterativeImputer。

虚拟机提示

虚拟机启动完成后，点击左上角切换到笔记本标签页，以访问 Jupyter Notebook 进行练习。

有时，你可能需要等待几秒钟让 Jupyter Notebook 完成加载。由于 Jupyter Notebook 的限制，操作验证无法自动化。

如果你在学习过程中遇到问题，请随时向 Labby 提问。课程结束后提供反馈，我们将立即为你解决问题。

导入库

我们将首先为本实验导入必要的库。

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.experimental import enable_iterative_imputer
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.kernel_approximation import Nystroem
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

加载数据集

我们将从 Scikit-Learn 中加载加利福尼亚住房数据集。为了减少计算时间，我们只使用 2k 个样本。

N_SPLITS = 5

rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape

添加缺失值

我们将在数据集的每一行中添加一个缺失值。

X_missing = X_full.copy()
y_missing = y_full
missing_samples = np.arange(n_samples)
missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

使用简单插补器填充缺失值

我们将使用 Scikit-Learn 的SimpleImputer类，通过均值和中位数策略来填充缺失值。

score_simple_imputer = pd.DataFrame()
for strategy in ("mean", "median"):
    estimator = make_pipeline(
        SimpleImputer(missing_values=np.nan, strategy=strategy), BayesianRidge()
    )
    score_simple_imputer[strategy] = cross_val_score(
        estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=N_SPLITS
    )

使用迭代插补器填充缺失值

我们将使用 Scikit-Learn 的IterativeImputer类，通过不同的估计器来填充缺失值。

estimators = [
    BayesianRidge(),
    RandomForestRegressor(
        n_estimators=4,
        max_depth=10,
        bootstrap=True,
        max_samples=0.5,
        n_jobs=2,
        random_state=0,
    ),
    make_pipeline(
        Nystroem(kernel="polynomial", degree=2, random_state=0), Ridge(alpha=1e3)
    ),
    KNeighborsRegressor(n_neighbors=15),
]
score_iterative_imputer = pd.DataFrame()
tolerances = (1e-3, 1e-1, 1e-1, 1e-2)
for impute_estimator, tol in zip(estimators, tolerances):
    estimator = make_pipeline(
        IterativeImputer(
            random_state=0, estimator=impute_estimator, max_iter=25, tol=tol
        ),
        BayesianRidge(),
    )
    score_iterative_imputer[impute_estimator.__class__.__name__] = cross_val_score(
        estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=N_SPLITS
    )

比较结果

我们将使用柱状图来比较不同插补策略的结果。

scores = pd.concat(
    [score_full_data, score_simple_imputer, score_iterative_imputer],
    keys=["Original", "SimpleImputer", "IterativeImputer"],
    axis=1,
)

fig, ax = plt.subplots(figsize=(13, 6))
means = -scores.mean()
errors = scores.std()
means.plot.barh(xerr=errors, ax=ax)
ax.set_title("California Housing Regression with Different Imputation Methods")
ax.set_xlabel("MSE (smaller is better)")
ax.set_yticks(np.arange(means.shape[0]))
ax.set_yticklabels([" w/ ".join(label) for label in means.index.tolist()])
plt.tight_layout(pad=1)
plt.show()

总结

在本实验中，我们学习了如何使用 Scikit-Learn 的IterativeImputer类来填充数据集中的缺失值。我们使用SimpleImputer通过均值和中位数插补比较了不同的插补策略，并使用IterativeImputer结合不同的估计器进行了比较。我们发现，对于加利福尼亚住房数据集中这种特定的缺失值模式，BayesianRidge和RandomForestRegressor给出了最佳结果。

Scikit-Learn 迭代插补器

简介