使用 Set_output API

Machine LearningMachine LearningBeginner
立即练习

This tutorial is from open-source community. Access the source code

💡 本教程由 AI 辅助翻译自英文原版。如需查看原文,您可以 切换至英文原版

简介

在本实验中,我们将学习如何使用Scikit-Learn中的set_output API来配置变换器,使其输出pandas DataFrame。在处理Scikit-Learn中的异构数据和管道时,此功能非常有用。

虚拟机使用提示

虚拟机启动完成后,点击左上角切换到“笔记本”标签,以访问Jupyter Notebook进行练习。

有时,你可能需要等待几秒钟,以便Jupyter Notebook完成加载。由于Jupyter Notebook的限制,操作验证无法自动化。

如果你在学习过程中遇到问题,请随时向Labby提问。课程结束后提供反馈,我们将立即为你解决问题。


Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL sklearn(("Sklearn")) -.-> sklearn/DataPreprocessingandFeatureEngineeringGroup(["Data Preprocessing and Feature Engineering"]) sklearn(("Sklearn")) -.-> sklearn/ModelSelectionandEvaluationGroup(["Model Selection and Evaluation"]) sklearn(("Sklearn")) -.-> sklearn/CoreModelsandAlgorithmsGroup(["Core Models and Algorithms"]) sklearn(("Sklearn")) -.-> sklearn/UtilitiesandDatasetsGroup(["Utilities and Datasets"]) ml(("Machine Learning")) -.-> ml/FrameworkandSoftwareGroup(["Framework and Software"]) sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/linear_model("Linear Models") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/preprocessing("Preprocessing and Normalization") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/feature_selection("Feature Selection") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/pipeline("Pipeline") sklearn/DataPreprocessingandFeatureEngineeringGroup -.-> sklearn/impute("Impute") sklearn/ModelSelectionandEvaluationGroup -.-> sklearn/model_selection("Model Selection") sklearn/ModelSelectionandEvaluationGroup -.-> sklearn/compose("Composite Estimators") sklearn/UtilitiesandDatasetsGroup -.-> sklearn/datasets("Datasets") ml/FrameworkandSoftwareGroup -.-> ml/sklearn("scikit-learn") subgraph Lab Skills sklearn/linear_model -.-> lab-49285{{"使用 Set_output API"}} sklearn/preprocessing -.-> lab-49285{{"使用 Set_output API"}} sklearn/feature_selection -.-> lab-49285{{"使用 Set_output API"}} sklearn/pipeline -.-> lab-49285{{"使用 Set_output API"}} sklearn/impute -.-> lab-49285{{"使用 Set_output API"}} sklearn/model_selection -.-> lab-49285{{"使用 Set_output API"}} sklearn/compose -.-> lab-49285{{"使用 Set_output API"}} sklearn/datasets -.-> lab-49285{{"使用 Set_output API"}} ml/sklearn -.-> lab-49285{{"使用 Set_output API"}} end

加载鸢尾花数据集

首先,我们将把鸢尾花数据集加载为一个DataFrame,以演示set_output API。

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train.head()

配置变换器以输出DataFrame

要配置像preprocessing.StandardScaler这样的估计器以返回DataFrame,请调用set_output

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head()

在拟合后配置transform

可以在fit之后调用set_output,以便事后配置transform

scaler2 = StandardScaler()

scaler2.fit(X_train)
X_test_np = scaler2.transform(X_test)
print(f"默认输出类型: {type(X_test_np).__name__}")

scaler2.set_output(transform="pandas")
X_test_df = scaler2.transform(X_test)
print(f"配置后的pandas输出类型: {type(X_test_df).__name__}")

配置管道以输出DataFrame

pipeline.Pipeline中,set_output会将所有步骤配置为输出DataFrame。

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectPercentile

clf = make_pipeline(
    StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()
)
clf.set_output(transform="pandas")
clf.fit(X_train, y_train)

加载泰坦尼克号数据集

接下来,我们将加载泰坦尼克号数据集,以演示如何使用compose.ColumnTransformer和异构数据来使用set_output

from sklearn.datasets import fetch_openml

X, y = fetch_openml(
    "titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

全局配置set_output

可以通过使用set_config并将transform_output设置为"pandas"来全局配置set_output API。

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn import set_config

set_config(transform_output="pandas")

num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
num_cols = ["age", "fare"]
ct = ColumnTransformer(
    (
        ("numerical", num_pipe, num_cols),
        (
            "categorical",
            OneHotEncoder(
                sparse_output=False, drop="if_binary", handle_unknown="ignore"
            ),
            ["embarked", "sex", "pclass"],
        ),
    ),
    verbose_feature_names_out=False,
)
clf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())
clf.fit(X_train, y_train)

使用config_context配置set_output

当使用config_context配置输出类型时,调用transformfit_transform时的配置才是关键。

scaler = StandardScaler()
scaler.fit(X_train[num_cols])

with config_context(transform_output="pandas"):
    X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled.head()

总结

在本实验中,我们学习了如何使用Scikit-Learn中的set_output API来配置转换器,使其输出pandas DataFrame。我们展示了如何配置一个估计器以输出DataFrame,配置一个管道以输出DataFrame,以及使用set_config全局配置set_output。我们还学习了如何使用config_context配置set_output