Basic Statistics on a Pandas DataFrame
Pandas, Python's data manipulation and analysis library, provides a range of statistical functions for the DataFrame, its core data structure. These functions let you summarize the center, spread, and relationships in your data before moving on to deeper analysis. Here are some of the most common statistical operations you can perform on a Pandas DataFrame:
Descriptive Statistics
Descriptive statistics provide a summary of the key characteristics of your data. Some of the most commonly used descriptive statistics in Pandas include:
describe(): Generates summary statistics for each numerical column in the DataFrame: count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum.
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({'Age': [25, 30, 35, 40, 45],
                   'Salary': [50000, 60000, 70000, 80000, 90000]})
print(df.describe())
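The result of describe() is itself a DataFrame, so individual statistics can be read out of it by label. The following is a minimal sketch using the same sample data; the variable names (summary, custom) and the chosen percentiles are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({'Age': [25, 30, 35, 40, 45],
                   'Salary': [50000, 60000, 70000, 80000, 90000]})

summary = df.describe()

# Rows are the statistics, columns are the original numeric columns
mean_age = summary.loc['mean', 'Age']
median_salary = summary.loc['50%', 'Salary']
print(f"Mean age from describe(): {mean_age}")
print(f"Median salary from describe(): {median_salary}")

# The percentiles argument requests other cut points (illustrative values)
custom = df.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```

Reading values by label in this way avoids recomputing a statistic that describe() has already produced.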
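info(): Prints a concise summary of the DataFrame, including the number of rows and columns, each column's data type and non-null count, and memory usage.
# info() prints its report directly and returns None, so wrapping it in print() would also print "None"
df.info()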
value_counts(): Counts the number of occurrences of each unique value in a column, which can be useful for understanding the distribution of categorical data.
print(df['Age'].value_counts())
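Because every age in the sample data is unique, the call above returns a count of 1 for each value. A column with repeated categories shows the function more clearly. The sketch below assumes a hypothetical Department column that is not in the original DataFrame.

```python
import pandas as pd

# Hypothetical categorical data, introduced only for illustration
departments = pd.Series(['Sales', 'IT', 'Sales', 'HR', 'Sales', 'IT'],
                        name='Department')

# Absolute counts, sorted from most to least frequent
counts = departments.value_counts()
print(counts)

# normalize=True returns relative frequencies instead of counts
shares = departments.value_counts(normalize=True)
print(shares)
```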
Measures of Central Tendency
Measures of central tendency describe the center or typical value of a dataset. The most common measures of central tendency in Pandas are:
mean(): Calculates the arithmetic mean (average) of the values in a column.
median(): Calculates the median (middle value) of the values in a column.
mode(): Calculates the mode (most frequent value) of the values in a column. It returns a Series rather than a single number, because several values can tie for most frequent.
print(f"Mean age: {df['Age'].mean()}")
print(f"Median age: {df['Age'].median()}")
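# mode() returns a Series sorted in ascending order; [0] takes the first (smallest) of any tied values
print(f"Modal age: {df['Age'].mode()[0]}")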
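In the sample data every age occurs exactly once, so all five values tie as modes. The sketch below makes that visible and also shows that mean() can be called on the whole DataFrame to get one value per column. The repeated series is a hypothetical addition for contrast.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40, 45],
                   'Salary': [50000, 60000, 70000, 80000, 90000]})

# Calling mean() on the DataFrame returns one mean per numeric column
column_means = df.mean()
print(column_means)

# All ages are unique, so mode() returns every value
all_unique_modes = df['Age'].mode()
print(all_unique_modes.tolist())

# Hypothetical data with one repeated value yields a single mode
repeated = pd.Series([25, 30, 30, 40, 45])
single_mode = repeated.mode()
print(single_mode.tolist())
```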
Measures of Dispersion
Measures of dispersion describe the spread or variability of the data. Some common measures of dispersion in Pandas include:
std(): Calculates the standard deviation of the values in a column.
var(): Calculates the variance of the values in a column.
min() and max(): Determine the minimum and maximum values in a column, respectively.
quantile(): Calculates the specified quantile (e.g., 25th, 50th, 75th) of the values in a column.
print(f"Standard deviation of age: {df['Age'].std()}")
print(f"Variance of age: {df['Age'].var()}")
print(f"Minimum age: {df['Age'].min()}")
print(f"Maximum age: {df['Age'].max()}")
print(f"25th percentile of age: {df['Age'].quantile(0.25)}")
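Two details are worth checking when using these functions. By default, std() and var() compute the sample statistics (dividing by n - 1), and the ddof parameter switches to the population version. quantile() also accepts a list of quantiles, which makes the interquartile range easy to compute. The following is a minimal sketch on the same sample data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40, 45],
                   'Salary': [50000, 60000, 70000, 80000, 90000]})

# Default ddof=1 gives the sample variance; ddof=0 gives the population variance
sample_var = df['Age'].var()
population_var = df['Age'].var(ddof=0)
print(f"Sample variance: {sample_var}, population variance: {population_var}")

# Passing a list returns a Series with one entry per requested quantile
q1, q3 = df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
print(f"Interquartile range of age: {iqr}")

# Range as a simple spread measure
age_range = df['Age'].max() - df['Age'].min()
print(f"Range of age: {age_range}")
```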
Correlation and Covariance
Pandas also provides functions to calculate the correlation and covariance between columns in a DataFrame:
corr(): Calculates the Pearson correlation coefficient between columns.
cov(): Calculates the covariance matrix between columns.
print(f"Correlation between age and salary: {df['Age'].corr(df['Salary'])}")
print(f"Covariance matrix:\n{df.cov()}")
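The two measures are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on the sample data, where salary rises by a fixed amount with each age step, so the correlation is expected to be 1. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40, 45],
                   'Salary': [50000, 60000, 70000, 80000, 90000]})

# Covariance between two individual columns
cov_age_salary = df['Age'].cov(df['Salary'])

# Pearson r = cov(X, Y) / (std(X) * std(Y))
r_manual = cov_age_salary / (df['Age'].std() * df['Salary'].std())

# The same value computed directly by pandas
r_pandas = df['Age'].corr(df['Salary'])

print(f"Covariance: {cov_age_salary}")
print(f"Manual r: {r_manual}, pandas r: {r_pandas}")
```

A correlation of exactly 1 here reflects the perfectly linear sample data rather than anything typical of real datasets.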
By using these basic statistical functions in Pandas, you can gain a deeper understanding of your data and make more informed decisions. The Mermaid diagram below summarizes the key statistical operations covered in this answer:
Remember, these are just the basic statistical operations you can perform on a Pandas DataFrame. Depending on your specific data and analysis needs, you may need to explore more advanced statistical techniques and libraries.