Creating Pandas DataFrames from Dictionaries

Introduction

Welcome to the world of data manipulation with Pandas! A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is one of the most commonly used data structures in modern data analysis.

In this lab, you will learn the fundamental ways to create and inspect a Pandas DataFrame. We will start by creating a DataFrame from a simple Python dictionary, then customize its columns and index, access a single column, and print a summary of the result. You will perform all tasks within the WebIDE, writing and executing Python scripts.

Create DataFrame from dictionary

In this step, you will learn one of the most common ways to create a Pandas DataFrame: from a Python dictionary. When you use a dictionary, the keys become the column names, and the values (typically lists or arrays, all of the same length) become the data in those columns.

First, open the main.py file from the file explorer on the left side of your WebIDE.

Now, add the following code to the main.py file. This code imports the Pandas library and defines a dictionary of student data. Then, it uses pd.DataFrame() to convert the dictionary into a DataFrame and prints the result.

import pandas as pd

## Data in a dictionary
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78]
}

## Create DataFrame from the dictionary
df = pd.DataFrame(student_data)

## Print the DataFrame
print(df)

To run your script, open a terminal in the WebIDE (Terminal -> New Terminal) and execute the following command. All your work should be done within the ~/project directory.

python3 main.py

You should see the following output, which shows your dictionary data organized into a table with a default integer row index starting at 0.

      Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78
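
To confirm how the dictionary maps onto the table, you can inspect the column labels and the shape of the result. The following is a minimal sketch rather than part of the lab's main.py; the variable names column_names and mismatch_rejected are introduced here only for illustration. It also shows that lists of different lengths are rejected.

```python
import pandas as pd

student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78]
}
df = pd.DataFrame(student_data)

# Dictionary keys become the column labels
column_names = list(df.columns)
print(column_names)

# shape is a (rows, columns) tuple
print(df.shape)

# Lists of unequal length cannot form a rectangular table,
# so pandas raises a ValueError
mismatch_rejected = False
try:
    pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85]})
except ValueError as exc:
    mismatch_rejected = True
    print('Rejected:', exc)
```

Checking df.shape right after construction is a quick way to catch data that did not load the way you expected.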

Specify column names in DataFrame

In this step, you will learn how to control the order of columns in your DataFrame. By default, the columns appear in the order in which the keys were inserted into the dictionary (dictionaries preserve insertion order in Python 3.7 and later). You can define a different order by passing a list of column names to the columns parameter.

Let's modify the main.py file to specify the column order. We will swap the 'Name' and 'Score' columns.

Update your main.py file with the following code. Notice the addition of the columns parameter in the pd.DataFrame() function.

import pandas as pd

## Data in a dictionary
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78]
}

## Create DataFrame and specify column order
df = pd.DataFrame(student_data, columns=['Score', 'Name'])

## Print the DataFrame
print(df)

Now, run the script again in your terminal:

python3 main.py

The output will now show the 'Score' column first, as you specified.

   Score     Name
0     85    Alice
1     92      Bob
2     78  Charlie
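
The columns parameter also selects which keys are used, not just their order. The sketch below illustrates two behaviors worth knowing; 'Grade' is a hypothetical column name that does not exist in the dictionary, and df_extra and df_subset are names chosen for this example only.

```python
import pandas as pd

student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78]
}

# 'Grade' (hypothetical) is not a key in the dictionary,
# so pandas creates the column and fills it with NaN
df_extra = pd.DataFrame(student_data, columns=['Score', 'Name', 'Grade'])
print(df_extra)

# Keys left out of the columns list are dropped from the result
df_subset = pd.DataFrame(student_data, columns=['Name'])
print(df_subset)
```

Because a misspelled name silently produces an empty column instead of an error, it is worth checking the output when you pass columns by hand.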

Add index labels to DataFrame

In this step, you'll learn how to replace the default numeric index (0, 1, 2, ...) with more meaningful labels. This is done using the index parameter, which takes one label per row. The number of labels must match the number of rows.

Let's assign unique student IDs as the index for our DataFrame. Modify your main.py file to include a list of index labels.

Update the code in main.py as follows:

import pandas as pd

## Data in a dictionary
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78]
}

## Define custom index labels
index_labels = ['ID1', 'ID2', 'ID3']

## Create DataFrame with custom index
df = pd.DataFrame(student_data, index=index_labels)

## Print the DataFrame
print(df)

Execute the script from your terminal:

python3 main.py

You will now see the default numeric index replaced by your custom 'ID' labels.

        Name  Score
ID1    Alice     85
ID2      Bob     92
ID3  Charlie     78
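
One practical benefit of meaningful index labels is that rows can be looked up by label. The following sketch assumes the same student data and uses the .loc indexer; the variable names row and score are illustrative.

```python
import pandas as pd

student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78]
}
df = pd.DataFrame(student_data, index=['ID1', 'ID2', 'ID3'])

# Select a whole row by its label; the result is a Series
row = df.loc['ID2']
print(row)

# Select a single value by row label and column name
score = df.loc['ID2', 'Score']
print(score)
```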

Access DataFrame columns using dot notation

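In this step, you will learn a convenient way to access a single column of a DataFrame: dot notation. If a column's name is a valid Python identifier (no spaces, doesn't start with a number, etc.) and does not clash with an existing DataFrame attribute or method such as count or shape, you can access it as an attribute of the DataFrame object. Bracket notation, such as df['Name'], works for any column name.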

Let's use dot notation to select and print only the 'Name' column from our DataFrame.

Modify your main.py file to access the Name column and print it.

import pandas as pd

## Data in a dictionary
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78]
}

## Create DataFrame
df = pd.DataFrame(student_data)

## Access and print the 'Name' column using dot notation
print(df.Name)

Run the script in your terminal:

python3 main.py

The output will be a Pandas Series, which is essentially a single column of a DataFrame. The last line shows the name of the Series and its data type.

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
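
Dot notation has limits, and bracket notation covers the cases it cannot. The sketch below uses two hypothetical columns that are not part of the lab's data: 'Final Score', which contains a space, and 'count', which shares its name with a DataFrame method.

```python
import pandas as pd

# 'Final Score' and 'count' are hypothetical columns for illustration
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Final Score': [85, 92, 78],
    'count': [1, 2, 3]
}
df = pd.DataFrame(data)

# Dot and bracket notation return the same column when both are valid
same_column = df.Name.equals(df['Name'])
print(same_column)

# A name with a space is not a valid identifier, so use brackets
final_scores = df['Final Score']
print(final_scores)

# df.count resolves to the DataFrame method, not the column
count_is_method = callable(df.count)
print(count_is_method)

# Brackets always return the column
count_column = df['count']
print(count_column)
```

For this reason, many style guides prefer bracket notation in scripts and reserve dot notation for quick interactive exploration.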

Display DataFrame info using info method

In this step, you will learn to use the .info() method. It prints a concise summary of a DataFrame, including the index range, the data type of each column, the number of non-null values per column, and the approximate memory usage. It's a great first step when exploring a new dataset.

Let's apply the .info() method to our student DataFrame.

Modify the main.py file to call this method. Note that .info() prints the summary directly and returns None, so you don't need to wrap it in a print() call.

import pandas as pd

## Data in a dictionary
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78]
}

## Create DataFrame
df = pd.DataFrame(student_data)

## Display a summary of the DataFrame
df.info()

Run the script from your terminal:

python3 main.py

The output gives you a detailed overview of your DataFrame's structure and content.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    3 non-null      object
 1   Score   3 non-null      int64
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
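
Because .info() prints rather than returns its summary, you may sometimes want the text as a string, for example to write it to a log. One way to do this, sketched below, is to pass a text buffer through the buf parameter. The variable names buffer and summary_text are illustrative, and the dtypes attribute is shown as a complementary way to get the column types as data.

```python
import io

import pandas as pd

student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78]
}
df = pd.DataFrame(student_data)

# Redirect the summary into an in-memory text buffer
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()
print(summary_text)

# dtypes returns the column types as a Series you can inspect in code
score_dtype = str(df.dtypes['Score'])
print(score_dtype)
```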

Summary

Congratulations on completing this lab! You have learned the fundamental techniques for creating and inspecting Pandas DataFrames.

Specifically, you practiced:

Creating a DataFrame from a Python dictionary.
Specifying and reordering columns using the columns parameter.
Assigning custom row labels using the index parameter.
Accessing a specific column using convenient dot notation.
Getting a concise summary of a DataFrame's structure with the .info() method.

These skills are the essential first steps for any data analysis task using Pandas. You are now well-equipped to start creating your own datasets for further exploration.
