Introduction
This lab will guide you on how to use sparse data structures in the pandas library. This is useful in scenarios where we have large volumes of data, most of which are similar (like zero or NaN), hence can be represented more efficiently in memory. We will learn about the SparseArray, SparseDtype, sparse accessor, sparse calculation, and interaction with scipy sparse matrices.
VM Tips
After the VM startup is done, click the top left corner to switch to the Notebook tab to access Jupyter Notebook for practice.
Sometimes, you may need to wait a few seconds for Jupyter Notebook to finish loading. The validation of operations cannot be automated because of limitations in Jupyter Notebook.
If you face issues during learning, feel free to ask Labby. Provide feedback after the session, and we will promptly resolve the problem for you.
Creating a SparseArray
Firstly, we create a sparse array, which is a pandas data structure for efficiently storing an array of sparse values. Sparse values are those that are not stored because they are similar to the majority of the values, hence considered redundant.
## Importing necessary libraries
import pandas as pd
import numpy as np
## Creating a numpy array with random values
arr = np.random.randn(10)
## Setting some values to NaN
arr[2:-2] = np.nan
## Creating a sparse array with pandas
ts = pd.Series(pd.arrays.SparseArray(arr))
## Output the sparse array
print(ts)
Checking Memory Efficiency
Next, we will check the memory efficiency of using sparse data structures. We will create a large DataFrame, convert it to sparse, and then compare the memory usage.
## Creating a large DataFrame with random values
df = pd.DataFrame(np.random.randn(10000, 4))
## Setting majority of the DataFrame to NaN
df.iloc[:9998] = np.nan
## Converting the DataFrame to sparse
sdf = df.astype(pd.SparseDtype("float", np.nan))
## Checking memory usage of dense vs sparse DataFrame
print('dense : {:0.2f} bytes'.format(df.memory_usage().sum() / 1e3))
print('sparse: {:0.2f} bytes'.format(sdf.memory_usage().sum() / 1e3))
Understanding SparseDtype
The SparseDtype stores the dtype of the non-sparse values and the scalar fill value. We can construct it by passing only a dtype, or also an explicit fill value.
## Creating a SparseDtype
print(pd.SparseDtype(np.dtype('datetime64[ns]')))
## Creating a SparseDtype with an explicit fill value
print(pd.SparseDtype(np.dtype('datetime64[ns]'), fill_value=pd.Timestamp('2017-01-01')))
Using the Sparse Accessor
We can use the .sparse accessor to get attributes and methods specific to sparse data.
## Creating a Series with sparse values
s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
## Using the sparse accessor
print(s.sparse.density)
print(s.sparse.fill_value)
Performing Sparse Calculations
We can apply NumPy ufuncs to SparseArray and get a SparseArray as a result.
## Creating a SparseArray
arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])
## Applying a NumPy ufunc
print(np.abs(arr))
Converting Between Sparse and Dense
We can easily convert data from sparse to dense, and vice versa.
## Converting from sparse to dense
print(sdf.sparse.to_dense())
## Converting from dense to sparse
dense = pd.DataFrame({"A": [1, 0, 0, 1]})
dtype = pd.SparseDtype(int, fill_value=0)
print(dense.astype(dtype))
Interacting with scipy sparse
Lastly, we can create a DataFrame with sparse values from a scipy sparse matrix, and vice versa.
## Importing necessary libraries
from scipy.sparse import csr_matrix
## Creating a sparse matrix with scipy
arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
## Creating a DataFrame from the sparse matrix
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
## Printing the DataFrame
print(sdf.head())
print(sdf.dtypes)
## Converting back to sparse matrix
print(sdf.sparse.to_coo())
Summary
In this lab, we have learnt how to use sparse data structures in pandas for memory-efficient storage and computation. We have used SparseArray, SparseDtype, and performed sparse calculations. We also learnt how to convert between dense and sparse, and how to interact with scipy sparse matrices.