高斯过程回归 | 机器学习教程

简介

在本实验中，我们将学习如何使用高斯过程回归为数据集拟合模型。我们将生成一个合成数据集，并使用高斯过程回归为其拟合模型。我们将使用scikit-learn库来执行高斯过程回归。

虚拟机使用提示

虚拟机启动完成后，点击左上角切换到“笔记本”标签页，以访问Jupyter Notebook进行练习。

有时，你可能需要等待几秒钟让Jupyter Notebook完成加载。由于Jupyter Notebook的限制，操作验证无法自动化。

如果你在学习过程中遇到问题，随时向Labby提问。课程结束后提供反馈，我们会及时为你解决问题。

Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL sklearn(("Sklearn")) -.-> sklearn/CoreModelsandAlgorithmsGroup(["Core Models and Algorithms"]) ml(("Machine Learning")) -.-> ml/FrameworkandSoftwareGroup(["Framework and Software"]) sklearn/CoreModelsandAlgorithmsGroup -.-> sklearn/gaussian_process("Gaussian Processes") ml/FrameworkandSoftwareGroup -.-> ml/sklearn("scikit-learn") subgraph Lab Skills sklearn/gaussian_process -.-> lab-49145{{"拟合高斯过程回归模型"}} ml/sklearn -.-> lab-49145{{"拟合高斯过程回归模型"}} end

数据集生成

我们将生成一个合成数据集。真实的生成过程定义为f(x) = x sin(x)。

import numpy as np

X = np.linspace(start=0, stop=10, num=1_000).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))

import matplotlib.pyplot as plt

plt.plot(X, y, label=r"$f(x) = x \sin(x)$", linestyle="dotted")
plt.legend()
plt.xlabel("$x$")
plt.ylabel("$f(x)$")
_ = plt.title("True generative process")

无噪声目标

在这一步中，我们将使用真实的生成过程，不添加任何噪声。为了训练高斯过程回归，我们只选择少数样本。

rng = np.random.RandomState(1)
training_indices = rng.choice(np.arange(y.size), size=6, replace=False)
X_train, y_train = X[training_indices], y[training_indices]

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

kernel = 1 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2))
gaussian_process = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gaussian_process.fit(X_train, y_train)
gaussian_process.kernel_

预测与置信区间

在拟合模型之后，我们可以看到核函数的超参数已经得到了优化。现在，我们将使用我们的核函数来计算整个数据集的平均预测值，并绘制95%的置信区间。

mean_prediction, std_prediction = gaussian_process.predict(X, return_std=True)

plt.plot(X, y, label=r"$f(x) = x \sin(x)$", linestyle="dotted")
plt.scatter(X_train, y_train, label="Observations")
plt.plot(X, mean_prediction, label="Mean prediction")
plt.fill_between(
    X.ravel(),
    mean_prediction - 1.96 * std_prediction,
    mean_prediction + 1.96 * std_prediction,
    alpha=0.5,
    label=r"95% confidence interval",
)
plt.legend()
plt.xlabel("$x$")
plt.ylabel("$f(x)$")
_ = plt.title("Gaussian process regression on noise-free dataset")

带噪声的目标

这次我们可以重复一个类似的实验，给目标添加额外的噪声。这将有助于观察噪声对拟合模型的影响。

noise_std = 0.75
y_train_noisy = y_train + rng.normal(loc=0.0, scale=noise_std, size=y_train.shape)

gaussian_process = GaussianProcessRegressor(
    kernel=kernel, alpha=noise_std**2, n_restarts_optimizer=9
)
gaussian_process.fit(X_train, y_train_noisy)
mean_prediction, std_prediction = gaussian_process.predict(X, return_std=True)

plt.plot(X, y, label=r"$f(x) = x \sin(x)$", linestyle="dotted")
plt.errorbar(
    X_train,
    y_train_noisy,
    noise_std,
    linestyle="None",
    color="tab:blue",
    marker=".",
    markersize=10,
    label="Observations",
)
plt.plot(X, mean_prediction, label="Mean prediction")
plt.fill_between(
    X.ravel(),
    mean_prediction - 1.96 * std_prediction,
    mean_prediction + 1.96 * std_prediction,
    color="tab:orange",
    alpha=0.5,
    label=r"95% confidence interval",
)
plt.legend()
plt.xlabel("$x$")
plt.ylabel("$f(x)$")
_ = plt.title("Gaussian process regression on a noisy dataset")

总结

在这个实验中，我们学习了如何使用高斯过程回归来将模型拟合到一个数据集上。我们生成了一个合成数据集，并使用高斯过程回归对其进行模型拟合。我们使用scikit-learn库来执行高斯过程回归，并绘制了平均预测值和95%置信区间。