Enhance Pandas with PyArrow

PythonPythonBeginner
Practice Now

This tutorial is from open-source community. Access the source code

Introduction

This lab will guide you through the process of utilizing PyArrow in pandas to extend functionality and improve the performance of various APIs. PyArrow enhances pandas with more extensive data types, missing data support for all data types, IO reader integration, and interoperability with other data frame libraries.

VM Tips

After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.

Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.

If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.

Installing PyArrow

Before starting, make sure you have installed the minimum supported PyArrow version. You can do this by running the following command in your Python environment:

## This is a comment
## Use pip to install PyArrow
pip install pyarrow

Data Structure Integration

PyArrow allows pandas data structures to be directly backed by a PyArrow ChunkedArray, similar to a NumPy array. Here is how to do this:

## Import pandas
import pandas as pd

## Create a pandas Series, Index and DataFrame with PyArrow data type
ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
idx = pd.Index([True, None], dtype="bool[pyarrow]")
df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")

Using PyArrow Types with Parameters

For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters into ArrowDtype to use in the dtype parameter.

## Import PyArrow
import pyarrow as pa

## Create a pandas Series with PyArrow list type
list_str_type = pa.list_(pa.string())
ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))

Converting PyArrow Array to pandas Data Structures

If you have a PyArrow Array or ChunkedArray, you can convert it into pandas data structures like Series, Index or DataFrame.

## Create a PyArrow array
pa_array = pa.array([{"1": "2"}, {"10": "20"}, None], type=pa.map_(pa.string(), pa.string()))

## Convert the PyArrow array to a pandas Series
ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))

PyArrow Operations

PyArrow data structure integration is implemented through pandas' ExtensionArray interface. Supported functionality exists where this interface is integrated within the pandas API.

## Create a pandas Series with PyArrow data type
ser = pd.Series([-1.545, 0.2, None], dtype="float32[pyarrow]")

## Perform various operations
ser.mean()
ser + ser
ser > (ser + 1)
ser.dropna()
ser.isna()
ser.fillna(0)

Reading Data with PyArrow

PyArrow provides IO reading functionality that has been integrated into several pandas IO readers.

## Import IO module
import io

## Create a StringIO object
data = io.StringIO("""a,b,c\n1,2.5,True\n3,4.5,False""")

## Read the data into a pandas DataFrame using PyArrow as the engine
df = pd.read_csv(data, engine="pyarrow")

Summary

In this lab, we explored how PyArrow can be used with pandas to extend its functionality and improve performance. We learned how to convert pandas data structures into PyArrow data types and perform various operations. We also saw how to read data using PyArrow's IO reading functionality.

Other Python Tutorials you may like