Various Data Analysis Problems

Intermediate

This tutorial comes from the open-source community, where you can access the source code.

Introduction

In this lab, you will learn to work with various Python data containers and utilize list, set, and dictionary comprehensions. You'll also explore the collections module, which provides useful tools for data handling.

Python offers powerful tools for data manipulation and analysis. In this lab, you will practice using Python's built-in data structures and specialized tools to analyze different datasets. Starting with a simple portfolio dataset, you'll progress to analyzing Chicago Transit Authority bus data to extract meaningful insights.

This is a Guided Lab, which provides step-by-step instructions to help you learn and practice. Follow the instructions carefully to complete each step and gain hands-on experience. Historical data shows that this lab has an 82% completion rate and a 96% positive review rate from learners.

Working with Dictionaries and CSV Data

Let's start by examining a simple dataset about stock holdings. In this step, you'll learn how to read data from a CSV file and store it in a structured format using dictionaries.

A CSV (Comma-Separated Values) file is a common way to store tabular data, where each line represents a row and values are separated by commas. Dictionaries in Python are a powerful data structure that allows you to store key-value pairs. By using dictionaries, we can organize the data from the CSV file in a more meaningful way.

First, create a new Python file in the WebIDE by following these steps:

  1. Click on the "New File" button in the WebIDE
  2. Name the file readport.py
  3. Copy and paste the following code into the file:
## readport.py

import csv

## A function that reads a file into a list of dictionaries
def read_portfolio(filename):
    portfolio = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)   ## Skip the header row
        for row in rows:
            record = {
                'name': row[0],
                'shares': int(row[1]),
                'price': float(row[2])
            }
            portfolio.append(record)
    return portfolio

This code defines a function read_portfolio that performs several important tasks:

  1. It opens a CSV file specified by the filename parameter. The open function is used to access the file, and the with statement ensures that the file is properly closed after we're done reading it.
  2. It skips the header row. The header row usually contains the names of the columns in the CSV file. We use next(rows) to move the iterator to the next row, effectively skipping the header.
  3. For each data row, it creates a dictionary. The keys of the dictionary are 'name', 'shares', and 'price'. These keys will help us access the data in a more intuitive way.
  4. It converts the shares to integers and prices to floating-point numbers. This is important because the data read from the CSV file is initially in string format, and we need numerical values for calculations.
  5. It adds each dictionary to a list called portfolio. This list will contain all the records from the CSV file.
  6. Finally, it returns the complete list of dictionaries.
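As a side note, the standard library's csv.DictReader can build the per-row dictionaries for you, using the header row as keys. Here is a minimal sketch of that alternative; the CSV text is embedded via io.StringIO so the example is self-contained (the data values are hypothetical):

```python
import csv
import io

## Hypothetical CSV text standing in for portfolio.csv
data = """name,shares,price
AA,100,32.2
IBM,50,91.1
"""

def read_portfolio(f):
    ## csv.DictReader uses the header row as dictionary keys
    portfolio = []
    for row in csv.DictReader(f):
        portfolio.append({
            'name': row['name'],
            'shares': int(row['shares']),
            'price': float(row['price'])
        })
    return portfolio

portfolio = read_portfolio(io.StringIO(data))
print(portfolio)
```

You still have to convert the numeric columns yourself, since DictReader returns every field as a string.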

Now let's create a file for the transit data. Create a new file called readrides.py with this content:

## readrides.py

import csv

def read_rides_as_dicts(filename):
    """
    Read the CTA bus data as a list of dictionaries
    """
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)   ## Skip header
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = {
                'route': route,
                'date': date,
                'daytype': daytype,
                'rides': rides
            }
            records.append(record)
    return records

This read_rides_as_dicts function works in a similar way to the read_portfolio function. It reads a CSV file related to CTA bus data, skips the header row, creates a dictionary for each data row, and stores these dictionaries in a list.

Now, let's test the read_portfolio function by opening a terminal in the WebIDE:

  1. Click on the "Terminal" menu and select "New Terminal"
  2. Start the Python interpreter by typing python3
  3. Execute the following commands:
>>> from readport import read_portfolio
>>> portfolio = read_portfolio('/home/labex/project/portfolio.csv')
>>> from pprint import pprint
>>> pprint(portfolio)
[{'name': 'AA', 'price': 32.2, 'shares': 100},
 {'name': 'IBM', 'price': 91.1, 'shares': 50},
 {'name': 'CAT', 'price': 83.44, 'shares': 150},
 {'name': 'MSFT', 'price': 51.23, 'shares': 200},
 {'name': 'GE', 'price': 40.37, 'shares': 95},
 {'name': 'MSFT', 'price': 65.1, 'shares': 50},
 {'name': 'IBM', 'price': 70.44, 'shares': 100}]

The pprint function (pretty print) is used here to display the data in a more readable format. Each item in the list is a dictionary representing one stock holding. The dictionary has the following keys:

  • name: The stock symbol, the abbreviation used to identify the stock.
  • shares: The number of shares owned of that stock.
  • price: The purchase price per share.

Notice that some stocks like 'MSFT' and 'IBM' appear multiple times. These represent different purchases of the same stock, which might have been made at different times and prices.

Using List, Set, and Dictionary Comprehensions

Python comprehensions are a concise way to create new collections from existing ones. A collection in Python can be a list, set, or dictionary: a container that holds data. Comprehensions allow you to filter out certain data, transform it, and organize it more efficiently. In this part, we'll use our portfolio data to explore how these comprehensions work.

First, you need to open a Python terminal, just like you did in the previous step. Once the terminal is open, you'll enter the following examples one by one. This hands-on approach will help you understand how comprehensions work in practice.

List Comprehensions

A list comprehension is a special syntax in Python that creates a new list. It does this by applying an expression to each item in an existing collection.

Let's start with an example. First, we'll import a function to read our portfolio data. Then we'll use list comprehension to filter out certain holdings from the portfolio.

>>> from readport import read_portfolio
>>> portfolio = read_portfolio('/home/labex/project/portfolio.csv')

## Find all holdings with more than 100 shares
>>> large_holdings = [s for s in portfolio if s['shares'] > 100]
>>> print(large_holdings)
[{'name': 'CAT', 'shares': 150, 'price': 83.44}, {'name': 'MSFT', 'shares': 200, 'price': 51.23}]

In this code, we first import the read_portfolio function and use it to read the portfolio data from a CSV file. Then, the list comprehension [s for s in portfolio if s['shares'] > 100] goes through each item s in the portfolio collection. It only includes the item s in the new list large_holdings if the number of shares in that holding is greater than 100.

List comprehensions can also be used to perform calculations. Here are some examples:

## Calculate the total cost of each holding (shares * price)
>>> holding_costs = [s['shares'] * s['price'] for s in portfolio]
>>> print(holding_costs)
[3220.0, 4555.0, 12516.0, 10246.0, 3835.15, 3255.0, 7044.0]

## Calculate the total cost of the entire portfolio
>>> total_portfolio_cost = sum([s['shares'] * s['price'] for s in portfolio])
>>> print(total_portfolio_cost)
44671.15

In the first example, the list comprehension [s['shares'] * s['price'] for s in portfolio] calculates the total cost of each holding by multiplying the number of shares by the price for each item in the portfolio. In the second example, we use the sum function along with the list comprehension to calculate the total cost of the entire portfolio.
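When the intermediate list is only needed by sum(), you can drop the square brackets and pass a generator expression instead, which avoids building the temporary list in memory. A small self-contained sketch (the two holdings are sample data):

```python
## Sample portfolio data for illustration
portfolio = [
    {'name': 'AA', 'shares': 100, 'price': 32.2},
    {'name': 'IBM', 'shares': 50, 'price': 91.1},
]

## A generator expression feeds sum() one value at a time
total = sum(s['shares'] * s['price'] for s in portfolio)
print(total)
```

For small datasets the difference is negligible, but for large ones the generator version uses constant memory.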

Set Comprehensions

A set comprehension is used to create a set from an existing collection. A set is a collection that only contains unique values.

Let's see how it works with our portfolio data:

## Find all unique stock names
>>> unique_stocks = {s['name'] for s in portfolio}
>>> print(unique_stocks)
{'MSFT', 'IBM', 'AA', 'GE', 'CAT'}

In this code, the set comprehension {s['name'] for s in portfolio} goes through each item s in the portfolio and adds the stock name (s['name']) to the set unique_stocks. Since sets only store unique values, we end up with a set of all the different stocks in our portfolio without any duplicates.

Dictionary Comprehensions

A dictionary comprehension creates a new dictionary by applying expressions to create key-value pairs.

Here's an example of using a dictionary comprehension to count the total number of shares for each stock in our portfolio:

## Create a dictionary to count total shares for each stock
>>> totals = {s['name']: 0 for s in portfolio}
>>> for s in portfolio:
...     totals[s['name']] += s['shares']
...
>>> print(totals)
{'AA': 100, 'IBM': 150, 'CAT': 150, 'MSFT': 250, 'GE': 95}

In the first line, the dictionary comprehension {s['name']: 0 for s in portfolio} creates a dictionary where each stock name (s['name']) is a key, and the initial value for each key is 0. Then, we use a for loop to go through each item in the portfolio. For each item, we add the number of shares (s['shares']) to the corresponding value in the totals dictionary.
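The same totals can also be computed in a single dictionary comprehension, nesting a set comprehension for the unique names. A self-contained sketch with sample data:

```python
## Sample portfolio data for illustration
portfolio = [
    {'name': 'AA', 'shares': 100, 'price': 32.2},
    {'name': 'IBM', 'shares': 50, 'price': 91.1},
    {'name': 'IBM', 'shares': 100, 'price': 70.44},
]

## Unique names as keys, summed shares as values
totals = {name: sum(s['shares'] for s in portfolio if s['name'] == name)
          for name in {s['name'] for s in portfolio}}
print(totals)
```

Note that this rescans the portfolio once per unique name, so the loop-based version above is more efficient on large datasets; the one-liner is mainly useful for its readability.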

These comprehensions are very powerful because they allow you to transform and analyze data with just a few lines of code. They are a great tool to have in your Python programming toolkit.

Exploring the Collections Module

In Python, the built-in containers such as lists, dictionaries, and sets are very useful. However, Python's collections module takes it a step further by providing specialized container datatypes that extend the functionality of these built-in containers. Let's take a closer look at some of these useful datatypes.

You'll continue working in your Python terminal and follow along with the examples below.

Counter

The Counter class is a subclass of dict. Its main purpose is to count hashable objects. It offers a convenient way to count items and supports a variety of operations.

First, we need to import the Counter class and a function to read a portfolio. Then we'll read a portfolio from a CSV file.

>>> from collections import Counter
>>> from readport import read_portfolio
>>> portfolio = read_portfolio('/home/labex/project/portfolio.csv')

Now, we'll create a Counter object to count the number of shares for each stock by its name.

## Create a counter to count shares by stock name
>>> totals = Counter()
>>> for s in portfolio:
...     totals[s['name']] += s['shares']
...
>>> print(totals)
Counter({'MSFT': 250, 'IBM': 150, 'CAT': 150, 'AA': 100, 'GE': 95})

One of the great features of the Counter object is that it automatically initializes new keys with a count of 0. This means you don't have to check if a key exists before incrementing its count, which simplifies the code for accumulating counts.

Counters also come with special methods. For example, the most_common() method is very useful for data analysis.

## Get the two stocks with the most shares
>>> most_common_stocks = totals.most_common(2)
>>> print(most_common_stocks)
[('MSFT', 250), ('IBM', 150)]

In addition, counters can be combined using arithmetic operations.

## Create another counter
>>> more = Counter()
>>> more['IBM'] = 75
>>> more['AA'] = 200
>>> more['ACME'] = 30
>>> print(more)
Counter({'AA': 200, 'IBM': 75, 'ACME': 30})

## Add two counters together
>>> combined = totals + more
>>> print(combined)
Counter({'AA': 300, 'MSFT': 250, 'IBM': 225, 'CAT': 150, 'GE': 95, 'ACME': 30})
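Subtraction works too, with one quirk worth knowing: keys whose result would be zero or negative are dropped from the result. A quick self-contained sketch using the same counts as above:

```python
from collections import Counter

totals = Counter({'MSFT': 250, 'IBM': 150, 'CAT': 150, 'AA': 100, 'GE': 95})
more = Counter({'AA': 200, 'IBM': 75, 'ACME': 30})

## Subtraction keeps only keys with a positive result:
## AA would be 100 - 200 = -100, so it disappears entirely
diff = totals - more
print(diff)
```

If you need signed differences instead, use the subtract() method, which modifies a Counter in place and keeps negative counts.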

defaultdict

The defaultdict is similar to a regular dictionary, but it has a unique feature. It provides a default value for keys that don't exist yet. This can simplify your code, as you no longer need to check if a key exists before using it.

>>> from collections import defaultdict

## Group portfolio entries by stock name
>>> byname = defaultdict(list)
>>> for s in portfolio:
...     byname[s['name']].append(s)
...
>>> print(byname['IBM'])
[{'name': 'IBM', 'shares': 50, 'price': 91.1}, {'name': 'IBM', 'shares': 100, 'price': 70.44}]
>>> print(byname['AA'])
[{'name': 'AA', 'shares': 100, 'price': 32.2}]

When you create a defaultdict(list), it automatically creates a new empty list for each new key. So, you can directly append to a key's value even if the key didn't exist before. This eliminates the need to check if the key exists and create an empty list manually.

You can also use other default factory functions. For example, you can use int, float, or even your own custom function.

## Use defaultdict with int to count items
>>> word_counts = defaultdict(int)
>>> words = ['apple', 'orange', 'banana', 'apple', 'orange', 'apple']
>>> for word in words:
...     word_counts[word] += 1
...
>>> print(word_counts)
defaultdict(<class 'int'>, {'apple': 3, 'orange': 2, 'banana': 1})
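A custom factory just needs to be a zero-argument callable, so a lambda works well for building a fresh structure per key. Here is a small sketch that accumulates per-route statistics; the route names and ride counts are made up for illustration:

```python
from collections import defaultdict

## Each missing key gets a fresh stats dict from the lambda factory
stats = defaultdict(lambda: {'count': 0, 'total': 0})

## Hypothetical (route, rides) samples
samples = [('22', 5000), ('9', 7000), ('22', 4500)]
for route, rides in samples:
    stats[route]['count'] += 1
    stats[route]['total'] += rides

print(dict(stats))
```

This pattern is handy whenever each key needs more than a single counter, such as tracking both a count and a running total.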

These specialized container types from the collections module can make your code more concise and efficient when you're working with data.

Data Analysis Challenge with Chicago Transit Authority Data

Now that you've practiced working with different Python data structures and the collections module, it's time to put these skills to use in a real-world data analysis task. In this experiment, we'll be analyzing the bus ridership data from the Chicago Transit Authority (CTA). This practical application will help you understand how to use Python to extract meaningful information from real-world datasets.

Understanding the Data

First, let's take a look at the transit data we'll be working with. In your Python terminal, you'll run some code to load the data and understand its basic structure.

>>> import readrides
>>> rows = readrides.read_rides_as_dicts('/home/labex/project/ctabus.csv')
>>> print(len(rows))
## This will show the number of records in the dataset

>>> ## Let's look at the first record to understand the structure
>>> import pprint
>>> pprint.pprint(rows[0])

The import readrides statement imports a custom module that has a function to read the data from the CSV file. The readrides.read_rides_as_dicts function reads the data from the specified CSV file and converts each row into a dictionary. The len(rows) gives us the total number of records in the dataset. By printing the first record using pprint.pprint(rows[0]), we can see the structure of each record clearly.

The data contains daily ridership records for different bus routes. Each record includes:

  • route: The bus route number
  • date: The date in the format "YYYY-MM-DD"
  • daytype: Either "W" for weekday, "A" for Saturday, or "U" for Sunday/holiday
  • rides: The number of riders that day

Analysis Tasks

Let's solve each of the challenge questions one by one:

Question 1: How many bus routes exist in Chicago?

To answer this question, we need to find all the unique route numbers in the dataset. We'll use a set comprehension for this task.

>>> ## Get all unique route numbers using a set comprehension
>>> unique_routes = {row['route'] for row in rows}
>>> print(len(unique_routes))

A set comprehension is a concise way to create a set. In this case, we iterate over each row in the rows list and extract the route value. Since a set only stores unique elements, we end up with a set of all unique route numbers. Printing the length of this set gives us the total number of unique bus routes.

We can also see what some of these routes are:

>>> ## Print a few of the route numbers
>>> print(list(unique_routes)[:10])

Here, we convert the set of unique routes to a list and then print the first 10 elements of that list.
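Keep in mind that route identifiers are strings, and some (such as express routes like 'X28') aren't numeric at all, so a plain sorted() orders them lexicographically. A self-contained sketch with a few sample route names:

```python
## Sample route identifiers for illustration
routes = {'22', '9', '151', 'X28'}

## Lexicographic order: '151' sorts before '22' because '1' < '2'
print(sorted(routes))  ## ['151', '22', '9', 'X28']

## Sort numeric routes by value, then non-numeric ones alphabetically
print(sorted(routes, key=lambda r: (not r.isdigit(),
                                    int(r) if r.isdigit() else 0, r)))
```

The key function returns a tuple, so numeric routes (False sorts first) come before alphanumeric ones.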

Question 2: How many people rode the number 22 bus on February 2, 2011?

For this question, we need to filter the data to find the specific record that matches the given route and date.

>>> ## Find rides on route 22 on February 2, 2011
>>> target_date = "2011-02-02"
>>> target_route = "22"
>>>
>>> for row in rows:
...     if row['route'] == target_route and row['date'] == target_date:
...         print(f"Rides on route {target_route} on {target_date}: {row['rides']}")
...         break

We first define the target_date and target_route variables. Then, we iterate over each row in the rows list. For each row, we check if the route and date match our target values. If a match is found, we print the number of rides and break out of the loop since we've found the record we're looking for.

You can modify this to check any route on any date by changing the target_date and target_route variables.
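If you plan to look up many route/date combinations, a linear scan per query gets slow. One option is to build an index keyed by (route, date) tuples once, after which each lookup is a constant-time dictionary access. A self-contained sketch (the ride counts here are made up):

```python
## Hypothetical sample records standing in for the full dataset
rows = [
    {'route': '22', 'date': '2011-02-02', 'daytype': 'W', 'rides': 5055},
    {'route': '9', 'date': '2011-02-02', 'daytype': 'W', 'rides': 7433},
]

## Build the index once with a dictionary comprehension
by_route_date = {(row['route'], row['date']): row['rides'] for row in rows}

## Each lookup is now O(1) instead of a full scan
print(by_route_date[('22', '2011-02-02')])
```

Tuples work as dictionary keys because they are hashable, which makes them a natural fit for compound lookups like this.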

Question 3: What is the total number of rides taken on each bus route?

Let's use a Counter to calculate the total rides per route. A Counter is a dictionary subclass from the collections module that's used to count hashable objects.

>>> from collections import Counter
>>>
>>> ## Initialize a counter
>>> total_rides_by_route = Counter()
>>>
>>> ## Sum up rides for each route
>>> for row in rows:
...     total_rides_by_route[row['route']] += row['rides']
...
>>> ## View the top 5 routes by total ridership
>>> for route, rides in total_rides_by_route.most_common(5):
...     print(f"Route {route}: {rides:,} total rides")

We first import the Counter class from the collections module. Then, we initialize an empty counter called total_rides_by_route. As we iterate over each row in the rows list, we add the number of rides for each route to the counter. Finally, we use the most_common(5) method to get the top 5 routes with the highest total ridership and print the results.

Question 4: What five bus routes had the greatest ten-year increase in ridership from 2001 to 2011?

This is a more complex task. We need to compare the ridership in 2001 with that in 2011 for each route.

>>> ## Create dictionaries to store total annual rides by route
>>> rides_2001 = Counter()
>>> rides_2011 = Counter()
>>>
>>> ## Collect data for each year
>>> for row in rows:
...     if row['date'].startswith('2001-'):
...         rides_2001[row['route']] += row['rides']
...     elif row['date'].startswith('2011-'):
...         rides_2011[row['route']] += row['rides']
...
>>> ## Calculate increases
>>> increases = {}
>>> for route in unique_routes:
...     if route in rides_2001 and route in rides_2011:
...         increase = rides_2011[route] - rides_2001[route]
...         increases[route] = increase
...
>>> ## Find the top 5 routes with the biggest increases
>>> import heapq
>>> top_5_increases = heapq.nlargest(5, increases.items(), key=lambda x: x[1])
>>>
>>> ## Display the results
>>> print("Top 5 routes with the greatest ridership increase from 2001 to 2011:")
>>> for route, increase in top_5_increases:
...     print(f"Route {route}: increased by {increase:,} rides")
...     print(f"  2001 rides: {rides_2001[route]:,}")
...     print(f"  2011 rides: {rides_2011[route]:,}")
...     print()

We first create two Counter objects, rides_2001 and rides_2011, to store the total rides for each route in 2001 and 2011 respectively. As we iterate over each row in the rows list, we check if the date starts with '2001-' or '2011-' and add the rides to the appropriate counter.

Then, we create an empty dictionary increases to store the increase in ridership for each route. We iterate over the unique routes and calculate the increase by subtracting the 2001 rides from the 2011 rides for each route.

To find the top 5 routes with the biggest increases, we use the heapq.nlargest function. This function takes the number of elements to return (5 in this case), the iterable (increases.items()), and a key function (lambda x: x[1]) that specifies how to compare the elements.

Finally, we print the results, showing the route number, the increase in ridership, and the number of rides in 2001 and 2011.

This analysis identifies which bus routes experienced the most growth in ridership over the decade, which could indicate changing population patterns, service improvements, or other interesting trends.
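Since Counter subtraction already discards non-positive results, the increase calculation can also be written more compactly, at the cost of silently treating routes absent in 2001 as starting from zero. A self-contained sketch with hypothetical yearly totals:

```python
from collections import Counter

## Hypothetical yearly totals per route for illustration
rides_2001 = Counter({'22': 100000, '9': 300000, '151': 250000})
rides_2011 = Counter({'22': 250000, '9': 280000, '151': 260000})

## Counter subtraction keeps only positive differences,
## i.e. exactly the routes whose ridership grew
increases = rides_2011 - rides_2001
print(increases.most_common(2))
```

Route '9' declined, so it drops out of the result entirely; most_common() then replaces the heapq.nlargest step.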

You can extend these analyses in many ways. For example, you might want to:

  • Analyze ridership patterns by day of the week
  • Find routes with declining ridership
  • Compare seasonal variations in ridership
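As a starting point for the first extension, the daytype field already splits records into weekdays, Saturdays, and Sundays/holidays, so a Counter can total rides per day type directly. A self-contained sketch with made-up sample records:

```python
from collections import Counter

## Hypothetical sample records; ride counts are made up
rows = [
    {'route': '22', 'date': '2011-02-02', 'daytype': 'W', 'rides': 5000},
    {'route': '22', 'date': '2011-02-03', 'daytype': 'W', 'rides': 5200},
    {'route': '22', 'date': '2011-02-05', 'daytype': 'A', 'rides': 3000},
    {'route': '22', 'date': '2011-02-06', 'daytype': 'U', 'rides': 2000},
]

## Total rides per day type: 'W' weekday, 'A' Saturday, 'U' Sunday/holiday
rides_by_daytype = Counter()
for row in rows:
    rides_by_daytype[row['daytype']] += row['rides']

print(rides_by_daytype)
```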

The techniques you've learned in this lab provide a solid foundation for this kind of data exploration and analysis.

Summary

In this lab, you have learned several important Python data manipulation techniques. These include reading and processing CSV data into dictionaries, using list, set, and dictionary comprehensions for data transformation, and leveraging specialized container types from the collections module. You also applied these skills to perform meaningful data analysis.

These techniques are fundamental for Python data analysis and are valuable in various real-world scenarios. The ability to process, transform, and extract insights from data is crucial for Python programmers. Keep practicing with your own datasets to enhance your expertise in Python data analysis.