Pandas Descriptive Statistics: A Beginner's Guide

Introduction

Welcome to the lab on Pandas Descriptive Statistics. Descriptive statistics are fundamental to data analysis: they condense a dataset into a few numbers that describe its center, spread, and range. With Pandas, a data manipulation library for Python, each of these statistics takes a single method call.

In this lab, you will learn how to:

Calculate the mean (average) of a dataset.
Find the median (middle value).
Determine the minimum and maximum values.
Generate a full summary of statistics with a single command.
Count unique values in a categorical column.

You will perform these operations on a sample DataFrame, writing and executing Python code in the WebIDE. Let's get started!

Compute mean using mean method

In this step, you will learn how to calculate the mean (average) of a numerical column in a Pandas DataFrame. The mean is the sum of the values divided by the number of values, and it's one of the most common measures of central tendency.

Pandas provides the .mean() method, which can be called on a Series (a column of a DataFrame) to compute its mean.

First, open the main.py file from the file explorer on the left side of the WebIDE. You will see the initial code that creates our sample DataFrame.
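The starter code itself is not reproduced in this lab text. As a rough sketch, a main.py that produces the "Original DataFrame" output shown in the steps below could look like this. The values are reconstructed from that printed output; the variable name data and the exact separator line are assumptions, so your file may differ in detail.

```python
import pandas as pd

# Assumed starter data, reconstructed from the lab's printed output.
data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [24, 27, 22, 32, 29],
    "score": [85, 90, 78, 95, 88],
    "grade": ["B", "A", "C", "A", "B"],
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Separator between the DataFrame and the statistics (width is a guess).
print("\n" + "=" * 30 + "\n")
```

Every later step appends code below these lines, which is why the DataFrame is printed again on each run.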

Add the following code to the end of the main.py file to calculate the mean of the score column and print it.

# Calculate the mean of the 'score' column
score_mean = df['score'].mean()
print(f"Mean Score: {score_mean}")

Now, let's run the script. Open a terminal in the WebIDE (Terminal -> New Terminal) and execute the following command:

python3 main.py

You should see the original DataFrame, a separator, and the calculated mean score.

Original DataFrame:
      name  age  score grade
0    Alice   24     85     B
1      Bob   27     90     A
2  Charlie   22     78     C
3    David   32     95     A
4      Eve   29     88     B

==============================

Mean Score: 87.2
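If you want to confirm that .mean() matches the definition given at the start of this step, you can compute the sum and the count yourself. The following is a small standalone sketch, not part of main.py; it rebuilds the score column as its own Series so it runs independently.

```python
import pandas as pd

# The five scores from the sample DataFrame.
scores = pd.Series([85, 90, 78, 95, 88], name="score")

total = scores.sum()      # sum of the values: 436
count = scores.count()    # number of non-missing values: 5
manual_mean = total / count

print(f"Manual mean: {manual_mean}")    # 87.2
print(f"Series.mean(): {scores.mean()}")  # 87.2
```

Note that .count() ignores missing values, and so does .mean() by default, so the two approaches agree even when a column contains gaps.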

Calculate median with median method

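In this step, you will calculate the median of a numerical column. The median is the middle value of a dataset that has been sorted in ascending order; when the dataset has an even number of values, it is the average of the two middle values. It is often a better measure of central tendency than the mean when the data contains outliers, because a single extreme value pulls the mean toward it but barely moves the median.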

Pandas makes this easy with the .median() method.

Continue editing the main.py file. Add the following lines to the end of the script to compute and print the median of the score column.

# Calculate the median of the 'score' column
score_median = df['score'].median()
print(f"Median Score: {score_median}")

Save the file and run the script again from the terminal:

python3 main.py

The output will now include both the mean and the median.

Original DataFrame:
      name  age  score grade
0    Alice   24     85     B
1      Bob   27     90     A
2  Charlie   22     78     C
3    David   32     95     A
4      Eve   29     88     B

==============================

Mean Score: 87.2
Median Score: 88.0
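To see why the median holds up better than the mean when outliers are present, consider the sketch below. It is a standalone illustration, not part of main.py, and the value 950 is a made-up data-entry error rather than anything from the lab's dataset.

```python
import pandas as pd

scores = pd.Series([85, 90, 78, 95, 88])

# Hypothetical typo: 95 was entered as 950.
with_outlier = pd.Series([85, 90, 78, 950, 88])

print(scores.mean(), scores.median())              # 87.2 88.0
print(with_outlier.mean(), with_outlier.median())  # 258.2 88.0

# With an even number of values, the median is the average
# of the two middle values: (85 + 88) / 2.
even_scores = pd.Series([78, 85, 88, 90])
print(even_scores.median())                        # 86.5
```

The single bad value almost triples the mean, while the median stays at 88.0.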

Find min and max values

In this step, you'll find the minimum and maximum values in a column. Together they show the range your data covers and make it easy to spot values that should not be there, such as a negative age. Pandas provides the .min() and .max() methods for this purpose.

Let's find the lowest and highest scores in our dataset. Add the following code to the end of your main.py script.

# Find the minimum and maximum scores
score_min = df['score'].min()
score_max = df['score'].max()
print(f"Minimum Score: {score_min}")
print(f"Maximum Score: {score_max}")

Save the file and execute it from the terminal:

python3 main.py

Your output will now show the mean, median, minimum, and maximum scores.

Original DataFrame:
      name  age  score grade
0    Alice   24     85     B
1      Bob   27     90     A
2  Charlie   22     78     C
3    David   32     95     A
4      Eve   29     88     B

==============================

Mean Score: 87.2
Median Score: 88.0
Minimum Score: 78
Maximum Score: 95
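Once you have the minimum and maximum, the range is simply their difference. The sketch below also goes one small step beyond the lab by using .idxmin() and .idxmax() to look up which student holds each extreme. It is a standalone example, and the DataFrame is rebuilt from the values printed above.

```python
import pandas as pd

# Sample data reconstructed from the lab's printed output.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "score": [85, 90, 78, 95, 88],
})

# Range: the distance between the highest and lowest score.
score_range = df["score"].max() - df["score"].min()
print(f"Score range: {score_range}")  # 17

# idxmin()/idxmax() return the row label of the extreme value,
# which .loc can then use to fetch the matching name.
lowest_name = df.loc[df["score"].idxmin(), "name"]
highest_name = df.loc[df["score"].idxmax(), "name"]
print(f"Lowest: {lowest_name}, Highest: {highest_name}")
```

With this data, the lowest score belongs to Charlie and the highest to David.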

Generate summary stats with describe

In this step, you will use the .describe() method. Called on a DataFrame, it summarizes every numerical column in a single table: count, mean, standard deviation (the sample standard deviation, which divides by n - 1), min, max, and the 25%, 50%, and 75% quartiles. The 50% row is the median.

This is a huge time-saver for getting a quick overview of your data. Add the following code to the end of main.py.

# Generate a summary of descriptive statistics
summary_stats = df.describe()
print("Descriptive Statistics Summary:")
print(summary_stats)

Save the file and run the script:

python3 main.py

You will see a well-formatted table containing the summary statistics for the age and score columns. The name and grade columns are left out because they are not numeric.

... (previous output) ...

Descriptive Statistics Summary:
             age      score
count   5.000000   5.000000
mean   26.800000  87.200000
std     3.962323   6.300794
min    22.000000  78.000000
25%    24.000000  85.000000
50%    27.000000  88.000000
75%    29.000000  90.000000
max    32.000000  95.000000
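Each row of the .describe() table can also be computed on its own, which is a useful way to check what the table is telling you. The sketch below is a standalone example with the DataFrame rebuilt from the printed values. It compares the std row with .std(), reproduces the quartile rows with .quantile(), and shows what .describe() returns for a text column.

```python
import pandas as pd

# Sample data reconstructed from the lab's printed output.
df = pd.DataFrame({
    "age": [24, 27, 22, 32, 29],
    "score": [85, 90, 78, 95, 88],
    "grade": ["B", "A", "C", "A", "B"],
})

summary = df.describe()

# The 'std' row is the sample standard deviation (ddof=1),
# which is also the default for Series.std().
print(summary.loc["std", "score"])
print(df["score"].std())
print(df["score"].std(ddof=0))  # population version, slightly smaller

# The 25%, 50%, and 75% rows correspond to these quantiles.
quartiles = df["score"].quantile([0.25, 0.5, 0.75])
print(quartiles)

# On a text column, describe() reports count, unique, top, and freq instead.
grade_summary = df["grade"].describe()
print(grade_summary)
```

Because A and B are tied as the most frequent grade, do not rely on which of the two appears as top.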

Count unique values with value_counts

In this step, you will learn how to count the occurrences of unique values in a column, which is particularly useful for categorical data. The .value_counts() method returns a Series whose index holds the unique values and whose values are their counts, sorted from most to least frequent by default.

Let's count how many students received each grade. Add the following code to the end of main.py.

# Count the occurrences of each grade
grade_counts = df['grade'].value_counts()
print("Grade Counts:")
print(grade_counts)

Save the file and run the script one final time:

python3 main.py

The final output will include the counts for each unique grade.

... (previous output) ...

Grade Counts:
grade
B    2
A    2
C    1
Name: count, dtype: int64

This shows that grades 'A' and 'B' each appear twice, and grade 'C' appears once.
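Counts are often easier to compare as proportions, or when listed in label order rather than by frequency. The following standalone sketch uses the normalize parameter of .value_counts() and the .sort_index() method for this. The grade column is rebuilt as its own Series so the example runs independently of main.py.

```python
import pandas as pd

# The grade column from the sample DataFrame.
grades = pd.Series(["B", "A", "C", "A", "B"], name="grade")

# Raw counts, most frequent first.
counts = grades.value_counts()

# normalize=True returns each grade's share of the total instead.
shares = grades.value_counts(normalize=True)
print(shares)  # A and B are 0.4 each, C is 0.2

# sort_index() orders the result by grade label (A, B, C).
by_label = counts.sort_index()
print(by_label)
```

The shares always sum to 1, which makes them convenient for comparing columns of different lengths.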

Summary

Congratulations on completing the lab! You have successfully learned how to perform fundamental descriptive statistical analysis using the Pandas library.

In this lab, you practiced using several key Pandas methods:

.mean() to calculate the average.
.median() to find the middle value.
.min() and .max() to determine the range of data.
.describe() to get a quick and comprehensive statistical summary.
.value_counts() to count unique values in a categorical column.

These functions are essential tools for any data analyst or scientist and form the basis of exploratory data analysis (EDA). Keep practicing these skills to become more proficient in your data analysis journey.