Scikit-Learn 迭代插补器

Machine LearningMachine LearningBeginner
立即练习

This tutorial is from open-source community. Access the source code

💡 本教程由 AI 辅助翻译自英文原版。如需查看原文,您可以 切换至英文原版

简介

在本实验中,我们将学习如何使用Scikit-Learn的IterativeImputer类来填充数据集中的缺失值。我们将比较不同的估计器,以了解在加利福尼亚住房数据集上使用BayesianRidge估计器且每行随机删除一个值时,哪个估计器最适合IterativeImputer

虚拟机提示

虚拟机启动完成后,点击左上角切换到笔记本标签页,以访问Jupyter Notebook进行练习。

有时,你可能需要等待几秒钟让Jupyter Notebook完成加载。由于Jupyter Notebook的限制,操作验证无法自动化。

如果你在学习过程中遇到问题,请随时向Labby提问。课程结束后提供反馈,我们将立即为你解决问题。

导入库

我们将首先为本实验导入必要的库。

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.experimental import enable_iterative_imputer
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.kernel_approximation import Nystroem
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

加载数据集

我们将从Scikit-Learn中加载加利福尼亚住房数据集。为了减少计算时间,我们只使用2k个样本。

N_SPLITS = 5

rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape

添加缺失值

我们将在数据集的每一行中添加一个缺失值。

X_missing = X_full.copy()
y_missing = y_full
missing_samples = np.arange(n_samples)
missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

使用简单插补器填充缺失值

我们将使用Scikit-Learn的SimpleImputer类,通过均值和中位数策略来填充缺失值。

score_simple_imputer = pd.DataFrame()
for strategy in ("mean", "median"):
    estimator = make_pipeline(
        SimpleImputer(missing_values=np.nan, strategy=strategy), BayesianRidge()
    )
    score_simple_imputer[strategy] = cross_val_score(
        estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=N_SPLITS
    )

使用迭代插补器填充缺失值

我们将使用Scikit-Learn的IterativeImputer类,通过不同的估计器来填充缺失值。

estimators = [
    BayesianRidge(),
    RandomForestRegressor(
        n_estimators=4,
        max_depth=10,
        bootstrap=True,
        max_samples=0.5,
        n_jobs=2,
        random_state=0,
    ),
    make_pipeline(
        Nystroem(kernel="polynomial", degree=2, random_state=0), Ridge(alpha=1e3)
    ),
    KNeighborsRegressor(n_neighbors=15),
]
score_iterative_imputer = pd.DataFrame()
tolerances = (1e-3, 1e-1, 1e-1, 1e-2)
for impute_estimator, tol in zip(estimators, tolerances):
    estimator = make_pipeline(
        IterativeImputer(
            random_state=0, estimator=impute_estimator, max_iter=25, tol=tol
        ),
        BayesianRidge(),
    )
    score_iterative_imputer[impute_estimator.__class__.__name__] = cross_val_score(
        estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=N_SPLITS
    )

比较结果

我们将使用柱状图来比较不同插补策略的结果。

scores = pd.concat(
    [score_full_data, score_simple_imputer, score_iterative_imputer],
    keys=["Original", "SimpleImputer", "IterativeImputer"],
    axis=1,
)

fig, ax = plt.subplots(figsize=(13, 6))
means = -scores.mean()
errors = scores.std()
means.plot.barh(xerr=errors, ax=ax)
ax.set_title("California Housing Regression with Different Imputation Methods")
ax.set_xlabel("MSE (smaller is better)")
ax.set_yticks(np.arange(means.shape[0]))
ax.set_yticklabels([" w/ ".join(label) for label in means.index.tolist()])
plt.tight_layout(pad=1)
plt.show()

总结

在本实验中,我们学习了如何使用Scikit-Learn的IterativeImputer类来填充数据集中的缺失值。我们使用SimpleImputer通过均值和中位数插补比较了不同的插补策略,并使用IterativeImputer结合不同的估计器进行了比较。我们发现,对于加利福尼亚住房数据集中这种特定的缺失值模式,BayesianRidgeRandomForestRegressor给出了最佳结果。