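How to Handle Duplicate Labels in Pandas

Introduction

In this lab, we will learn how to handle duplicate labels in pandas. Pandas is a powerful data manipulation library for Python. Real-world data often contains duplicate row or column labels, so it is important to understand how to detect and handle them.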

Import the Required Libraries

First, we need to import the pandas and numpy libraries, which help us create and manipulate data.

# Import the necessary libraries
import pandas as pd
import numpy as np

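Understanding the Consequences of Duplicate Labels

Duplicate labels can change the behavior of certain pandas operations. For example, some methods do not work when duplicates are present: calling reindex on a Series whose index contains duplicates raises an error.

# Create a pandas Series with duplicate labels ("b" appears twice)
s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

# Attempt to reindex the Series; this fails because the index is not unique
try:
    s1.reindex(["a", "b", "c"])
except ValueError as e:
    print(e)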
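One way to get past the reindex error is to make the labels unique first. The sketch below shows two possible approaches: keeping only the first occurrence of each label, or aggregating the values that share a label. The variable names and the choice of sum as the aggregation are assumptions made for illustration; which strategy is appropriate depends on what the duplicates mean in your data.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

# Option 1: keep only the first occurrence of each label.
first_only = s1[~s1.index.duplicated(keep="first")]

# Option 2: combine values that share a label (sum is an arbitrary choice).
summed = s1.groupby(level=0).sum()

# Both results have unique labels, so reindex now succeeds.
# The new label "c" has no data and is filled with NaN.
reindexed_first = first_only.reindex(["a", "b", "c"])
reindexed_sum = summed.reindex(["a", "b", "c"])

print(reindexed_first)
print(reindexed_sum)
```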

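Duplicates in Indexing

Next, let's look at how duplicates can lead to unexpected results when indexing. Selecting a label that appears once returns a Series, while selecting a label that appears more than once returns a DataFrame.

# Create a DataFrame with duplicate column labels ("A" appears twice)
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])

# Indexing 'B' returns a Series
print(df1["B"])

# Indexing 'A' returns a DataFrame
print(df1["A"])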
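To make the difference concrete, the following sketch inspects the type of each result. It also shows the same effect for row labels, using a small hypothetical DataFrame named df_rows that is not part of the lab data. Code that assumes a fixed return type can break silently when a label turns out to be duplicated.

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])

unique_result = df1["B"]     # "B" appears once
duplicate_result = df1["A"]  # "A" appears twice

print(type(unique_result).__name__)
print(type(duplicate_result).__name__)
print(duplicate_result.shape)

# The same applies to row labels (df_rows is a hypothetical example).
df_rows = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

scalar_value = df_rows.loc["b", "A"]  # unique row label: a single value
series_value = df_rows.loc["a", "A"]  # duplicated row label: a Series

print(scalar_value)
print(series_value)
```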

Detecting Duplicate Labels

You can check for duplicate labels with the Index.is_unique attribute and the Index.duplicated() method.

# Check whether the row index has unique labels
print(df1.index.is_unique)

# Check whether the columns have unique labels
print(df1.columns.is_unique)

# Flag duplicate labels in the row index (True marks a repeated label)
print(df1.index.duplicated())
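In df1 the duplicates are in the columns rather than the row index, so the same check can be applied to df1.columns. The sketch below, with illustrative variable names, uses the resulting boolean mask to drop the repeated column. Note that this keeps only the first "A" column and discards the data in the second one, which may or may not be what you want.

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])

# True marks a label that has already appeared earlier in the columns.
dup_mask = df1.columns.duplicated()
print(dup_mask.tolist())

# Invert the mask to keep only the first occurrence of each column label.
deduped = df1.loc[:, ~dup_mask]
print(deduped)
print(deduped.columns.is_unique)
```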

Disallowing Duplicate Labels

If needed, you can disallow duplicate labels by calling set_flags(allows_duplicate_labels=False). If the object already contains duplicate labels, the call raises an error.

# Disallow duplicate labels in a Series; this raises because "b" is duplicated
try:
    pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
except pd.errors.DuplicateLabelError as e:
    print(e)

# Disallow duplicate labels in a DataFrame; the labels are unique, so nothing is printed
try:
    pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"]).set_flags(allows_duplicate_labels=False)
except pd.errors.DuplicateLabelError as e:
    print(e)
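The flag is also carried over to objects derived from the original, so a later operation that would introduce duplicates is rejected as well. The following is a minimal sketch based on the behavior described in the pandas user guide. The variable names are illustrative, and the feature is marked experimental in pandas, so details may vary between versions.

```python
import pandas as pd

s = pd.Series(0, index=["a", "b"]).set_flags(allows_duplicate_labels=False)

# The flag propagates to the result of head().
head_flag = s.head().flags.allows_duplicate_labels
print(head_flag)

# Renaming "a" to "b" would create two "b" labels, so pandas raises.
raised = False
try:
    s.head().rename({"a": "b"})
except pd.errors.DuplicateLabelError as e:
    raised = True
    print(e)
```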

Checking and Setting the Duplicate Labels Flag

Finally, we can check and set the allows_duplicate_labels flag on a DataFrame. Note that set_flags returns a new object and leaves the original unchanged.

# Create a DataFrame and set allows_duplicate_labels to False
df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(allows_duplicate_labels=False)

# Check the allows_duplicate_labels flag
print(df.flags.allows_duplicate_labels)

# Set allows_duplicate_labels to True on a new DataFrame
df2 = df.set_flags(allows_duplicate_labels=True)
print(df2.flags.allows_duplicate_labels)
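The mixed-case index in this example is useful for seeing what the relaxed flag permits. In the sketch below (variable names are illustrative), upper-casing the labels turns "x" and "X" into the same label. This succeeds on the relaxed copy, while the original DataFrame keeps its flag set to False.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(allows_duplicate_labels=False)

# set_flags returns a new object; df itself is not modified.
relaxed = df.set_flags(allows_duplicate_labels=True)
print(df.flags.allows_duplicate_labels)
print(relaxed.flags.allows_duplicate_labels)

# With duplicates allowed, upper-casing the row labels is accepted
# even though it produces repeated labels.
upper = relaxed.rename(str.upper)
print(upper.index.tolist())
print(upper.index.is_unique)
```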

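Summary

In this lab, we learned how to handle duplicate labels in pandas. We looked at the consequences of having duplicate labels, how to detect them, and how to disallow them when needed. This is an essential skill when working with large datasets, where duplicate labels can lead to incorrect analysis and results.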