PyArrow | Pandas Data Science | Performance Improvement

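Introduction

This lab walks through using PyArrow in pandas to extend the functionality and improve the performance of various APIs. PyArrow enhances pandas with a broader range of data types, missing-data support for all data types, IO reader integration, and interoperability with other dataframe libraries.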

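VM Tips

Once the VM has finished starting, click the top-left corner to switch to the Notebook tab and open Jupyter Notebook for the exercises.

Jupyter Notebook may take a few seconds to finish loading. Because of limitations in Jupyter Notebook, validation of the operations cannot be automated.

If you run into problems while learning, feel free to ask Labby. If you leave feedback after the session, we will resolve the problem promptly.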

Installing PyArrow

Before starting, make sure you have installed at least the minimum supported version of PyArrow. The following command installs it into your Python environment:

## Install PyArrow with pip.
## In a Jupyter notebook, prefix the command with ! to run it from a cell:
## !pip install pyarrow
pip install pyarrow
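
To confirm which versions are actually installed, a quick check along the following lines may help. This is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical, and the minimum PyArrow version that pandas requires depends on your pandas release, so check its installation notes.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the two packages this lab relies on.
for name in ("pandas", "pyarrow"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either package is reported as not installed, run the pip command above and repeat the check.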

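Data Structure Integration

With PyArrow, a pandas Series, Index, or the columns of a DataFrame can be backed directly by a PyArrow ChunkedArray, which plays a role similar to a NumPy array. To do this, append "[pyarrow]" to the type name in the dtype parameter: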

## Import pandas
import pandas as pd

## Create a pandas Series, Index and DataFrame with PyArrow data type
ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
idx = pd.Index([True, None], dtype="bool[pyarrow]")
df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")

Using PyArrow Types with Parameters

For PyArrow types that accept parameters, build the PyArrow type with those parameters and wrap it in ArrowDtype, then pass the result as the dtype parameter.

## Import PyArrow
import pyarrow as pa

## Create a pandas Series with PyArrow list type
list_str_type = pa.list_(pa.string())
ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))

Converting a PyArrow Array to a pandas Data Structure

If you already have a PyArrow Array or ChunkedArray, wrap it in pandas.arrays.ArrowExtensionArray to construct a pandas data structure such as a Series, Index, or DataFrame.

## Create a PyArrow array
pa_array = pa.array([{"1": "2"}, {"10": "20"}, None], type=pa.map_(pa.string(), pa.string()))

## Convert the PyArrow array to a pandas Series
ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))

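PyArrow Operations

PyArrow data structure integration is implemented through pandas' ExtensionArray interface, so an operation is supported wherever that interface is integrated into the pandas API. Numeric aggregations, arithmetic, comparisons, and missing-data handling all work on PyArrow-backed data: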

## Create a pandas Series with PyArrow data type
ser = pd.Series([-1.545, 0.2, None], dtype="float32[pyarrow]")

## Perform various operations
ser.mean()
ser + ser
ser > (ser + 1)
ser.dropna()
ser.isna()
ser.fillna(0)

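Reading Data with PyArrow

PyArrow also provides IO reading functionality that is integrated into several pandas IO readers. Pass engine="pyarrow" to a supported reader to use it.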

## Import IO module
import io

## Create a StringIO object
data = io.StringIO("""a,b,c\n1,2.5,True\n3,4.5,False""")

## Read the data into a pandas DataFrame using PyArrow as the engine
df = pd.read_csv(data, engine="pyarrow")

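Summary

This lab covered how to use PyArrow with pandas to extend functionality and improve performance. You learned how to create pandas data structures with PyArrow data types and perform various operations on them. You also saw how to read data using PyArrow's IO reading functionality.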
