Mastering Data Scaling and Transformation in Python

Introduction

This lab uses Python's scikit-learn library to demonstrate several scaling and transformation techniques on a dataset that contains outliers.

Importing the Libraries and Dataset

First, import the required libraries and load the California housing dataset from scikit-learn.

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import (
    MinMaxScaler,
    Normalizer,
    PowerTransformer,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
    minmax_scale,
)

## Load the California housing dataset
dataset = fetch_california_housing()
X_full, y_full = dataset.data, dataset.target
feature_names = dataset.feature_names

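Selecting Features and Defining a Label Mapping

Next, select two features from the dataset so that the data can be plotted in two dimensions, and define a mapping from the feature names to readable axis labels.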

## Select two features by name and look up their column indices
features = ["MedInc", "AveOccup"]
features_idx = [feature_names.index(feature) for feature in features]
X = X_full[:, features_idx]

## Map feature names to readable axis labels
feature_mapping = {
    "MedInc": "Median income in block",
    "AveOccup": "Average house occupancy",
}
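
The name-to-column lookup above can be tried on a small array without downloading the dataset. This is only a minimal sketch: `toy_names`, `toy_data`, and the numbers in them are made-up stand-ins for `dataset.feature_names` and `dataset.data`, not values from the California housing data.

```python
import numpy as np

# Hypothetical stand-ins for dataset.feature_names and dataset.data
toy_names = ["MedInc", "HouseAge", "AveOccup"]
toy_data = np.array([
    [8.3, 41.0, 2.6],
    [7.2, 21.0, 2.1],
    [5.6, 52.0, 2.8],
])

wanted = ["MedInc", "AveOccup"]
# list.index returns the position of each name, which is its column index
wanted_idx = [toy_names.index(name) for name in wanted]

# Indexing with a list of column indices keeps every row
toy_X = toy_data[:, wanted_idx]
print(wanted_idx)   # [0, 2]
print(toy_X.shape)  # (3, 2)
```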

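Defining the Distributions

Next, apply several scalers, transformers, and a normalizer to the selected features, and store each result together with a descriptive title in a list named distributions.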

## Define the distributions as (title, transformed data) pairs
distributions = [
    ("Unscaled data", X),
    ("Data after standard scaling", StandardScaler().fit_transform(X)),
    ("Data after min-max scaling", MinMaxScaler().fit_transform(X)),
    ("Data after robust scaling", RobustScaler(quantile_range=(25, 75)).fit_transform(X)),
    ("Data after sample-wise L2 normalizing", Normalizer().fit_transform(X)),
    ("Data after quantile transformation (uniform pdf)", QuantileTransformer(output_distribution="uniform").fit_transform(X)),
    ("Data after quantile transformation (gaussian pdf)", QuantileTransformer(output_distribution="normal").fit_transform(X)),
    ("Data after power transformation (Yeo-Johnson)", PowerTransformer(method="yeo-johnson").fit_transform(X)),
    ("Data after power transformation (Box-Cox)", PowerTransformer(method="box-cox").fit_transform(X)),
]
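
To get a feel for why the choice of scaler matters when outliers are present, the following sketch applies three of the scalers above to a made-up single feature with one extreme value. The array `toy` and the dictionary `scaled` are illustrative names, not part of the lab code. The idea is that the mean, standard deviation, minimum, and maximum are all pulled by the outlier, whereas the median and interquartile range used by RobustScaler are much less affected.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up feature: nine ordinary values plus one extreme outlier
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

scaled = {
    "standard": StandardScaler().fit_transform(toy),
    "minmax": MinMaxScaler().fit_transform(toy),
    "robust": RobustScaler(quantile_range=(25, 75)).fit_transform(toy),
}

for name, values in scaled.items():
    inliers = values[:-1, 0]  # every row except the outlier
    spread = inliers.max() - inliers.min()
    print(f"{name:>8}: inlier spread = {spread:.4f}, outlier = {values[-1, 0]:.2f}")
```

With standard and min-max scaling, the nine ordinary values end up squeezed into a narrow band, while robust scaling keeps them spread out and leaves the outlier far from the rest.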

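Plotting the Distributions

Finally, define helper functions that plot a distribution, and call them for every entry in the list. Each scaler, transformer, or normalizer gets one figure with two plots. The left plot is a scatter plot of the full dataset. The right plot zooms in on the values below the 99th percentile of each feature, which excludes the marginal outliers. The marginal distribution of each feature is shown as a histogram alongside each scatter plot.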

## Plot one distribution: a scatter plot with marginal histograms
def plot_distribution(axes, X, y, hist_nbins=50, title="", x0_label="", x1_label=""):
    ax, hist_X1, hist_X0 = axes

    ax.set_title(title)
    ax.set_xlabel(x0_label)
    ax.set_ylabel(x1_label)

    ## Scatter plot, colored by the target via the module-level cmap defined below
    colors = cmap(y)
    ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker="o", s=5, lw=0, c=colors)

    ## Remove the top and right spines for aesthetics
    ## and create a clean axis layout
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines["left"].set_position(("outward", 10))
    ax.spines["bottom"].set_position(("outward", 10))

    ## Histogram for axis X1 (feature 5 of the full dataset)
    hist_X1.set_ylim(ax.get_ylim())
    hist_X1.hist(
        X[:, 1], bins=hist_nbins, orientation="horizontal", color="grey", ec="grey"
    )
    hist_X1.axis("off")

    ## Histogram for axis X0 (feature 0 of the full dataset)
    hist_X0.set_xlim(ax.get_xlim())
    hist_X0.hist(
        X[:, 0], bins=hist_nbins, orientation="vertical", color="grey", ec="grey"
    )
    hist_X0.axis("off")


## Scale the target to the range [0, 1] so it can be mapped to colors
y = minmax_scale(y_full)

## plasma does not exist in matplotlib < 1.5
cmap = getattr(cm, "plasma_r", cm.hot_r)

def create_axes(title, figsize=(16, 6)):
    fig = plt.figure(figsize=figsize)
    fig.suptitle(title)

    ## Define the axes for the first plot
    left, width = 0.1, 0.22
    bottom, height = 0.1, 0.7
    bottom_h = height + 0.15
    left_h = left + width + 0.02

    rect_scatter = [left, bottom, width, height]
    rect_histx = [left, bottom_h, width, 0.1]
    rect_histy = [left_h, bottom, 0.05, height]

    ax_scatter = plt.axes(rect_scatter)
    ax_histx = plt.axes(rect_histx)
    ax_histy = plt.axes(rect_histy)

    ## Define the axes for the zoomed-in plot
    left = width + left + 0.2
    left_h = left + width + 0.02

    rect_scatter = [left, bottom, width, height]
    rect_histx = [left, bottom_h, width, 0.1]
    rect_histy = [left_h, bottom, 0.05, height]

    ax_scatter_zoom = plt.axes(rect_scatter)
    ax_histx_zoom = plt.axes(rect_histx)
    ax_histy_zoom = plt.axes(rect_histy)

    ## Define the axes for the colorbar
    left, width = width + left + 0.13, 0.01

    rect_colorbar = [left, bottom, width, height]
    ax_colorbar = plt.axes(rect_colorbar)

    return (
        (ax_scatter, ax_histy, ax_histx),
        (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),
        ax_colorbar,
    )

def make_plot(item_idx):
    title, X = distributions[item_idx]
    ax_zoom_out, ax_zoom_in, ax_colorbar = create_axes(title)
    axarr = (ax_zoom_out, ax_zoom_in)
    plot_distribution(
        axarr[0],
        X,
        y,
        hist_nbins=200,
        x0_label=feature_mapping[features[0]],
        x1_label=feature_mapping[features[1]],
        title="Full data",
    )

    ## Zoom in: keep only the samples strictly inside the 0th-99th percentile range
    zoom_in_percentile_range = (0, 99)
    cutoffs_X0 = np.percentile(X[:, 0], zoom_in_percentile_range)
    cutoffs_X1 = np.percentile(X[:, 1], zoom_in_percentile_range)

    non_outliers_mask = np.all(X > [cutoffs_X0[0], cutoffs_X1[0]], axis=1) & np.all(
        X < [cutoffs_X0[1], cutoffs_X1[1]], axis=1
    )
    plot_distribution(
        axarr[1],
        X[non_outliers_mask],
        y[non_outliers_mask],
        hist_nbins=50,
        x0_label=feature_mapping[features[0]],
        x1_label=feature_mapping[features[1]],
        title="Zoom-in",
    )

    norm = mpl.colors.Normalize(y_full.min(), y_full.max())
    mpl.colorbar.ColorbarBase(
        ax_colorbar,
        cmap=cmap,
        norm=norm,
        orientation="vertical",
        label="Color mapping for values of y",
    )

## Plot all distributions
for i in range(len(distributions)):
    make_plot(i)

plt.show()
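
The zoom-in step in make_plot relies on a per-feature percentile mask. The sketch below reproduces that idea on made-up random data so it can be inspected without any plotting. The names `toy`, `low`, `high`, and `mask` are illustrative and not part of the lab code, and the planted row of 50s is an assumption chosen only to act as an obvious outlier.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 1000 two-feature samples with one planted extreme row
toy = rng.normal(size=(1000, 2))
toy[0] = [50.0, 50.0]

# Per-column 0th and 99th percentiles; the result has one row per percentile
low, high = np.percentile(toy, (0, 99), axis=0)

# Keep rows where every feature lies strictly inside its own cutoffs
mask = np.all(toy > low, axis=1) & np.all(toy < high, axis=1)
print(mask.sum(), "of", len(toy), "rows kept")
```

Because the comparison is strict and the 0th percentile is the column minimum, the row holding each feature's minimum is dropped along with the values at or above the 99th percentile.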

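Summary

This lab used Python's scikit-learn library to demonstrate several scaling and transformation techniques on a dataset that contains outliers. It covered selecting features, defining a label mapping for them, and plotting the resulting distributions, and it compared how each scaling and transformation technique changes the data.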