多类稀疏逻辑回归

Machine LearningMachine LearningBeginner
立即练习

This tutorial is from open-source community. Access the source code

💡 本教程由 AI 辅助翻译自英文原版。如需查看原文,您可以 切换至英文原版

简介

在本实验中,我们将探索如何使用scikit-learn在“20新闻组”数据集上使用多类稀疏逻辑回归。我们将比较多项逻辑回归与一对其余L1逻辑回归的性能。

虚拟机提示

虚拟机启动完成后,点击左上角切换到“笔记本”标签以访问Jupyter Notebook进行练习。

有时,你可能需要等待几秒钟让Jupyter Notebook完成加载。由于Jupyter Notebook的限制,操作验证无法自动化。

如果你在学习过程中遇到问题,请随时向Labby提问。课程结束后提供反馈,我们将立即为你解决问题。


Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL sklearn(("Sklearn")) -.-> sklearn/UtilitiesandDatasetsGroup(["Utilities and Datasets"]) ml(("Machine Learning")) -.-> ml/FrameworkandSoftwareGroup(["Framework and Software"]) sklearn(("Sklearn")) -.-> sklearn/CoreModelsandAlgorithmsGroup(["Core Models and Algorithms"]) sklearn(("Sklearn")) -.-> sklearn/ModelSelectionandEvaluationGroup(["Model Selection and Evaluation"]) sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/linear_model("Linear Models") sklearn/ModelSelectionandEvaluationGroup -.-> sklearn/model_selection("Model Selection") sklearn/UtilitiesandDatasetsGroup -.-> sklearn/datasets("Datasets") sklearn/UtilitiesandDatasetsGroup -.-> sklearn/exceptions("Exceptions and Warnings") ml/FrameworkandSoftwareGroup -.-> ml/sklearn("scikit-learn") subgraph Lab Skills sklearn/linear_model -.-> lab-49296{{"多类稀疏逻辑回归"}} sklearn/model_selection -.-> lab-49296{{"多类稀疏逻辑回归"}} sklearn/datasets -.-> lab-49296{{"多类稀疏逻辑回归"}} sklearn/exceptions -.-> lab-49296{{"多类稀疏逻辑回归"}} ml/sklearn -.-> lab-49296{{"多类稀疏逻辑回归"}} end

导入库

我们首先导入本实验中将会用到的必要库和模块。

import timeit
import warnings

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")

加载并准备数据

接下来,我们加载“20新闻组”数据集,并为训练和测试准备数据。

## 我们使用SAGA求解器
solver = "saga"

## 降低此值以加快运行速度
n_samples = 5000

X, y = fetch_20newsgroups_vectorized(subset="all", return_X_y=True)
X = X[:n_samples]
y = y[:n_samples]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y, test_size=0.1
)
train_samples, n_features = X_train.shape
n_classes = np.unique(y).shape[0]

print(
    "Dataset 20newsgroup, train_samples=%i, n_features=%i, n_classes=%i"
    % (train_samples, n_features, n_classes)
)

定义并训练模型

我们将定义两个模型,即多项逻辑回归和一对其余L1逻辑回归,并使用不同的轮数对它们进行训练。

models = {
    "ovr": {"name": "One versus Rest", "iters": [1, 2, 3]},
    "multinomial": {"name": "Multinomial", "iters": [1, 2, 5]},
}

for model in models:
    ## 为绘图目的添加初始随机水平值
    accuracies = [1 / n_classes]
    times = [0]
    densities = [1]

    model_params = models[model]

    ## 为快速运行设置较少的轮数
    for this_max_iter in model_params["iters"]:
        print(
            "[model=%s, solver=%s] Number of epochs: %s"
            % (model_params["name"], solver, this_max_iter)
        )
        lr = LogisticRegression(
            solver=solver,
            multi_class=model,
            penalty="l1",
            max_iter=this_max_iter,
            random_state=42,
        )
        t1 = timeit.default_timer()
        lr.fit(X_train, y_train)
        train_time = timeit.default_timer() - t1

        y_pred = lr.predict(X_test)
        accuracy = np.sum(y_pred == y_test) / y_test.shape[0]
        density = np.mean(lr.coef_!= 0, axis=1) * 100
        accuracies.append(accuracy)
        densities.append(density)
        times.append(train_time)
    models[model]["times"] = times
    models[model]["densities"] = densities
    models[model]["accuracies"] = accuracies
    print("Test accuracy for model %s: %.4f" % (model, accuracies[-1]))
    print(
        "%% non-zero coefficients for model %s, per class:\n %s"
        % (model, densities[-1])
    )
    print(
        "Run time (%i epochs) for model %s:%.2f"
        % (model_params["iters"][-1], model, times[-1])
    )

可视化结果

最后,我们将使用折线图来可视化训练模型的结果。

fig = plt.figure()
ax = fig.add_subplot(111)

for model in models:
    name = models[model]["name"]
    times = models[model]["times"]
    accuracies = models[model]["accuracies"]
    ax.plot(times, accuracies, marker="o", label="Model: %s" % name)
    ax.set_xlabel("Train time (s)")
    ax.set_ylabel("Test accuracy")
ax.legend()
fig.suptitle("Multinomial vs One-vs-Rest Logistic L1\nDataset %s" % "20newsgroups")
fig.tight_layout()
fig.subplots_adjust(top=0.85)
run_time = timeit.default_timer() - t0
print("Example run in %.3f s" % run_time)
plt.show()

总结

在本实验中,我们使用scikit-learn对“20新闻组”数据集执行多类稀疏逻辑回归。我们比较了多项逻辑回归与一对其余L1逻辑回归的性能,并使用折线图可视化了结果。