Create a Synthetic Dataset
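In this lab, we will create a synthetic dataset with three categorical features: an informative feature with medium cardinality, an uninformative feature with medium cardinality, and an uninformative feature with high cardinality. We will use the KBinsDiscretizer class from Scikit-learn to generate the informative feature. Run the following code to create the synthetic dataset: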
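import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

n_samples = 50_000
rng = np.random.RandomState(42)
# Continuous target, plus the noise added to it before binning
y = rng.randn(n_samples)
noise = 0.5 * rng.randn(n_samples)
n_categories = 100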
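# Informative feature: discretize the noisy target into equal-width bins
kbins = KBinsDiscretizer(
    n_bins=n_categories, encode="ordinal", strategy="uniform", random_state=rng
)
X_informative = kbins.fit_transform((y + noise).reshape(-1, 1))
# Relabel the bins with a random permutation, so the codes no longer follow bin order
permuted_categories = rng.permutation(n_categories)
X_informative = permuted_categories[X_informative.astype(np.int32)]
# Uninformative, medium-cardinality feature: a row-wise shuffle of the informative one
X_shuffled = rng.permutation(X_informative)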
# Uninformative, high-cardinality feature: random draws from int(0.9 * n_samples) values
X_near_unique_categories = rng.choice(
    int(0.9 * n_samples), size=n_samples, replace=True
).reshape(-1, 1)
X = pd.DataFrame(
    np.concatenate(
        [X_informative, X_shuffled, X_near_unique_categories],
        axis=1,
    ),
    columns=["informative", "shuffled", "near_unique"],
)
# Hold out a test set (train_test_split keeps 25% of the rows for testing by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
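
As a rough sanity check on what "informative" means here, the following NumPy-only sketch rebuilds a smaller version of the first two features by hand and measures how much of the target each one explains. It is an illustration under assumptions, not the code above: the equal-width binning with np.linspace and np.digitize is a stand-in for KBinsDiscretizer with strategy="uniform", and the sample size, the seed, and the helper explained_variance are hypothetical choices made for this example.

```python
import numpy as np

# Hypothetical, smaller setup chosen for this illustration only
rng = np.random.RandomState(0)
n_samples, n_categories = 10_000, 100
y = rng.randn(n_samples)
signal = y + 0.5 * rng.randn(n_samples)

# Equal-width bins over the observed range (assumed stand-in for strategy="uniform")
edges = np.linspace(signal.min(), signal.max(), n_categories + 1)
ordered = np.digitize(signal, edges[1:-1])  # integer codes in 0..n_categories-1

# Relabel the bins at random, then shuffle the rows to break the link with y
permuted = rng.permutation(n_categories)[ordered]
shuffled = rng.permutation(permuted)


def explained_variance(codes, target):
    """Share of the variance of target captured by per-category means."""
    sums = np.bincount(codes, weights=target, minlength=n_categories)
    counts = np.bincount(codes, minlength=n_categories)
    # Empty bins keep a mean of 0; they are never looked up below
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return 1.0 - np.var(target - means[codes]) / np.var(target)


corr_ordered = np.corrcoef(ordered, y)[0, 1]
corr_permuted = np.corrcoef(permuted, y)[0, 1]
r2_ordered = explained_variance(ordered, y)
r2_permuted = explained_variance(permuted, y)
r2_shuffled = explained_variance(shuffled, y)

print(f"linear correlation with y: ordered={corr_ordered:.3f}, permuted={corr_permuted:.3f}")
print(
    "variance explained by category means: "
    f"ordered={r2_ordered:.3f}, permuted={r2_permuted:.3f}, shuffled={r2_shuffled:.3f}"
)
```

If the sketch behaves as intended, the ordered and permuted codes explain the same share of variance, because relabeling does not change which samples fall into the same category. The shuffled codes should explain almost none, which is what makes that column uninformative despite having the same cardinality.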