Using Sparse Structures in Pandas

Introduction

This lab shows how to use sparse data structures in the pandas library. They are useful when a large dataset consists mostly of one repeated value (such as zero or NaN), because that value does not have to be stored for every element and the data takes far less memory. We will cover SparseArray, SparseDtype, the .sparse accessor, sparse calculations, and interaction with scipy sparse matrices.

Creating a SparseArray

First, we create a sparse array, a pandas data structure for efficiently storing data in which one value dominates. That dominant value, called the fill value (NaN by default for floating-point data), is not stored. Only the values that differ from it are kept, together with their positions.

## Importing necessary libraries
import pandas as pd
import numpy as np

## Creating a numpy array with random values
arr = np.random.randn(10)

## Setting some values to NaN
arr[2:-2] = np.nan

## Creating a sparse array with pandas
ts = pd.Series(pd.arrays.SparseArray(arr))

## Output the sparse array
print(ts)
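
To make the idea of "values that are not stored" concrete, the following minimal sketch inspects a small SparseArray. The input values and variable names are illustrative and not part of the lab.

```python
import numpy as np
import pandas as pd

# Illustrative input: two real numbers, the rest NaN
values = np.array([1.5, np.nan, np.nan, np.nan, -0.5])
sparse_values = pd.arrays.SparseArray(values)

# Only the values that differ from the fill value are physically stored
print(sparse_values.sp_values)

# The fill value is the value that is left out (NaN for floats by default)
print(sparse_values.fill_value)

# Density is the fraction of elements that are actually stored
print(sparse_values.density)

# np.asarray converts the sparse array back to a regular dense ndarray
dense_again = np.asarray(sparse_values)
print(dense_again)
```

Any sparse array can be turned back into an ordinary NumPy array this way, at the cost of materializing every element again.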

Checking Memory Efficiency

Next, we will check the memory efficiency of using sparse data structures. We will create a large DataFrame, convert it to sparse, and then compare the memory usage.

## Creating a large DataFrame with random values
df = pd.DataFrame(np.random.randn(10000, 4))

## Setting all but the last two rows to NaN
df.iloc[:9998] = np.nan

## Converting the DataFrame to sparse
sdf = df.astype(pd.SparseDtype("float", np.nan))

## Comparing memory usage (in kB) of the dense vs. sparse DataFrame
print('dense : {:0.2f} kB'.format(df.memory_usage().sum() / 1e3))
print('sparse: {:0.2f} kB'.format(sdf.memory_usage().sum() / 1e3))
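
The size of the saving follows from the density, that is, the fraction of values that actually have to be stored. The sketch below rebuilds a similar DataFrame under its own names (`dense_df` and `sparse_df` are illustrative, and the fixed random seed is an assumption made for reproducibility) and reports the density next to the raw byte counts, with the index excluded.

```python
import numpy as np
import pandas as pd

# Same shape as in the lab; a seeded generator keeps the run reproducible
rng = np.random.default_rng(0)
dense_df = pd.DataFrame(rng.standard_normal((10000, 4)))
dense_df.iloc[:9998] = np.nan

sparse_df = dense_df.astype(pd.SparseDtype("float", np.nan))

# Fraction of values that are stored (2 of 10000 rows in each column)
density = sparse_df.sparse.density
print(density)

# Memory used by the column data alone, in bytes
dense_bytes = dense_df.memory_usage(index=False).sum()
sparse_bytes = sparse_df.memory_usage(index=False).sum()
print(dense_bytes, sparse_bytes)
```

The lower the density, the larger the benefit. For data in which most values differ from the fill value, a sparse representation brings little or no gain.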

Understanding SparseDtype

The SparseDtype stores the dtype of the non-sparse values and the scalar fill value. We can construct it by passing only a dtype, or also an explicit fill value.

## Creating a SparseDtype
print(pd.SparseDtype(np.dtype('datetime64[ns]')))

## Creating a SparseDtype with an explicit fill value
print(pd.SparseDtype(np.dtype('datetime64[ns]'), fill_value=pd.Timestamp('2017-01-01')))
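
When no fill value is given, pandas chooses a default based on the dtype. The following sketch prints those defaults for a few common dtypes and shows the "Sparse[int]" string alias, which can be used wherever a dtype is accepted. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Default fill values depend on the underlying dtype
float_dtype = pd.SparseDtype(float)
int_dtype = pd.SparseDtype(int)
bool_dtype = pd.SparseDtype(bool)
datetime_dtype = pd.SparseDtype(np.dtype("datetime64[ns]"))

print(float_dtype.fill_value)     # NaN for floats
print(int_dtype.fill_value)       # 0 for integers
print(bool_dtype.fill_value)      # False for booleans
print(datetime_dtype.fill_value)  # NaT for datetimes

# The string alias "Sparse[int]" builds the same dtype as pd.SparseDtype(int)
int_series = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
print(int_series.dtype)
```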

Using the Sparse Accessor

We can use the .sparse accessor to get attributes and methods specific to sparse data.

## Creating a Series with sparse values
s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

## Using the sparse accessor
print(s.sparse.density)
print(s.sparse.fill_value)
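
Besides density and fill_value, the accessor also exposes the stored values and their count. The sketch below uses the same data under an illustrative name and also shows that the accessor is only available on sparse data: on a regular dense Series, accessing .sparse raises an AttributeError.

```python
import pandas as pd

sparse_series = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

# Number of stored (non-fill) values and the values themselves
print(sparse_series.sparse.npoints)
print(sparse_series.sparse.sp_values)

# A dense Series has no sparse data, so the accessor is rejected
dense_series = pd.Series([0, 0, 1, 2])
try:
    dense_series.sparse.density
    accessor_available = True
except AttributeError as exc:
    accessor_available = False
    print(exc)
```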

Performing Sparse Calculations

We can apply NumPy ufuncs to SparseArray and get a SparseArray as a result.

## Creating a SparseArray
arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

## Applying a NumPy ufunc
print(np.abs(arr))
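
The ufunc is applied to the fill value as well as to the stored values, which is what keeps the dense result correct. A minimal sketch with an explicit fill value of -1 (the values and names are illustrative):

```python
import numpy as np
import pandas as pd

# Here -1 is the fill value, so only 1.0 and -2.0 are stored
sparse_input = pd.arrays.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

result = np.abs(sparse_input)

# The result is still sparse, and its fill value is abs(-1) = 1
print(type(result).__name__)
print(result.fill_value)

# Converting to dense shows that every element was transformed
dense_result = np.asarray(result)
print(dense_result)
```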

Converting Between Sparse and Dense

We can easily convert data from sparse to dense, and vice versa.

## Converting from sparse to dense
print(sdf.sparse.to_dense())

## Converting from dense to sparse
dense = pd.DataFrame({"A": [1, 0, 0, 1]})
dtype = pd.SparseDtype(int, fill_value=0)
print(dense.astype(dtype))
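
As a quick sanity check, a dense-to-sparse-to-dense round trip should give back the original data. The sketch below uses illustrative names and spells out "int64" as the subtype, an assumption made here so that the restored dtype matches the original on every platform.

```python
import numpy as np
import pandas as pd

dense_frame = pd.DataFrame({"A": [1, 0, 0, 1]})

# Dense -> sparse, with 0 as the value that is not stored
sparse_frame = dense_frame.astype(pd.SparseDtype("int64", fill_value=0))
print(sparse_frame.dtypes)

# Sparse -> dense again
restored = sparse_frame.sparse.to_dense()
print(restored.dtypes)

# The round trip preserves both values and dtype
print(restored.equals(dense_frame))
```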

Interacting with scipy sparse

Lastly, we can create a DataFrame with sparse values from a scipy sparse matrix, and vice versa.

## Importing necessary libraries
from scipy.sparse import csr_matrix

## Creating a sparse matrix with scipy
arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)

## Creating a DataFrame from the sparse matrix
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

## Printing the DataFrame
print(sdf.head())
print(sdf.dtypes)

## Converting back to a scipy sparse matrix in COO format
print(sdf.sparse.to_coo())
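
To check that nothing is lost on the way, the round trip can be compared against the original array. This is a minimal sketch with a smaller matrix, a seeded random generator, and illustrative names, none of which come from the lab.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Roughly 90% of the entries are set to zero
rng = np.random.default_rng(0)
matrix = rng.random(size=(100, 5))
matrix[matrix < 0.9] = 0
sp_matrix = csr_matrix(matrix)

# scipy sparse matrix -> sparse DataFrame -> scipy COO matrix
sparse_frame = pd.DataFrame.sparse.from_spmatrix(sp_matrix)
coo = sparse_frame.sparse.to_coo()

print(coo.shape)

# Densifying the COO matrix should reproduce the original array
round_trip_ok = np.array_equal(coo.toarray(), matrix)
print(round_trip_ok)
```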

Summary

In this lab, we learned how to use sparse data structures in pandas for memory-efficient storage and computation. We worked with SparseArray, SparseDtype, and the .sparse accessor, performed sparse calculations, converted between dense and sparse representations, and exchanged data with scipy sparse matrices.
