The .describe() method in Pandas is used to generate descriptive statistics for a DataFrame or Series. It provides a quick overview of the central tendency, dispersion, and shape of the dataset's distribution. Here are some key points about the .describe() method:
Key Features:
- Numerical Data: By default, it computes statistics for numerical columns, including:
  - Count: Number of non-null entries
  - Mean: Average value
  - Standard Deviation (std): Sample standard deviation, a measure of spread around the mean
  - Minimum (min): Smallest value
  - 25th Percentile (25%): First quartile
  - 50th Percentile (50%): Median
  - 75th Percentile (75%): Third quartile
  - Maximum (max): Largest value
- Categorical Data: To include non-numeric columns as well, use the include parameter: df.describe(include='all'). For string or categorical columns, the summary reports count, unique, top (the most frequent value), and freq (how often that value occurs).
- Custom Percentiles: The percentiles parameter controls which percentiles are reported, not which other statistics are computed. It takes values between 0 and 1, for example df.describe(percentiles=[.25, .5, .75]), which matches the default.
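The include and percentiles parameters described above can be sketched as follows. The DataFrame and its column names (price, color) are made up for illustration and do not come from the original example.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 15.0, 11.0],
    "color": ["red", "blue", "red", "green", "red"],
})

# include='all' summarizes every column; statistics that do not apply
# to a column's type (e.g. mean for strings) are filled with NaN.
summary_all = df.describe(include="all")

# percentiles takes values between 0 and 1 and controls which
# percentile rows appear in the result.
summary_pct = df["price"].describe(percentiles=[0.1, 0.5, 0.9])

print(summary_all)
print(summary_pct)
```

In this sketch, the color column gets count, unique, top, and freq rows, while the price summary shows 10%, 50%, and 90% rows in place of the default quartiles.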
Example:
Here’s a simple example of how to use .describe():
import pandas as pd

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': ['a', 'b', 'c', 'd', 'e']
}
df = pd.DataFrame(data)

# Generate descriptive statistics
stats = df.describe()
print(stats)
This prints the descriptive statistics for columns A and B. Column C is excluded by default because it holds strings rather than numeric values.
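As a possible follow-up, the result of .describe() is itself a DataFrame, so individual statistics can be looked up by label. The sketch below reuses the same sample data; the variable names numeric_columns, mean_a, median_b, and c_summary are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [5, 6, 7, 8, 9],
    "C": ["a", "b", "c", "d", "e"],
})

stats = df.describe()

# Only the numeric columns appear in the default summary.
numeric_columns = list(stats.columns)

# Individual statistics can be read with .loc[row_label, column_label].
mean_a = stats.loc["mean", "A"]
median_b = stats.loc["50%", "B"]

# Column C can still be summarized by calling describe() on the Series.
c_summary = df["C"].describe()

print(numeric_columns)
print(mean_a, median_b)
print(c_summary)
```

Calling describe() on a single string column is one way to summarize it without switching the whole DataFrame to include='all'.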
Conclusion:
The .describe() method summarizes each column's count, center, spread, and range in a single call, which makes it a convenient first step in data analysis and exploration.
