How to use re.findall() in Python to find all matching substrings


Introduction

In this tutorial, we will explore Python's re.findall() function, a powerful tool for extracting matching substrings from text. This function is part of the built-in regular expression (regex) module in Python and is essential for text processing tasks.

By the end of this lab, you will be able to use re.findall() to extract various patterns from text, such as email addresses, phone numbers, and URLs. These skills are valuable in data analysis, web scraping, and text processing applications.

Whether you are new to Python or looking to enhance your text processing capabilities, this step-by-step guide will equip you with practical knowledge to effectively use regular expressions in your Python projects.

Getting Started with re.findall()

In this first step, we will learn about the re.findall() function and how to use it for basic pattern matching.

Understanding Regular Expressions

Regular expressions (regex) are special text strings used for describing search patterns. They are particularly useful when you need to:

  • Find specific character patterns in text
  • Validate text format (like email addresses)
  • Extract information from text
  • Replace text

The re Module in Python

Python provides a built-in module called re for working with regular expressions. One of its most useful functions is re.findall().

Let's start by creating a simple Python script to see how re.findall() works.

  1. First, open the terminal and navigate to our project directory:
cd ~/project
  2. Create a new Python file named basic_findall.py using the code editor. In VSCode, click the "Explorer" icon (usually the first icon in the sidebar), then click the "New File" button and name it basic_findall.py.

  3. In the basic_findall.py file, write the following code:

import re

# Sample text
text = "Python is amazing. Python is versatile. I love learning Python programming."

# Using re.findall() to find all occurrences of "Python"
matches = re.findall(r"Python", text)

# Print the results
print("Original text:")
print(text)
print("\nMatches found:", len(matches))
print("Matching substrings:", matches)
  4. Save the file and run it from the terminal:
python3 ~/project/basic_findall.py

You should see output similar to this:

Original text:
Python is amazing. Python is versatile. I love learning Python programming.

Matches found: 3
Matching substrings: ['Python', 'Python', 'Python']

Breaking Down the Code

Let's understand what's happening in our code:

  • We imported the re module with import re
  • We defined a sample text with multiple occurrences of the word "Python"
  • We used re.findall(r"Python", text) to find all occurrences of "Python" in the text
  • The r before the string denotes a raw string, which is recommended when working with regular expressions
  • The function returned a list of all matching substrings
  • We printed the results, showing that "Python" appeared 3 times in our text
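To see why the raw string prefix matters, compare what happens when the same pattern is written with and without it. A minimal illustration (the sample sentence here is our own):

```python
import re

text = "The cat sat on the catalog."

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees it, so the pattern matches nothing
print(re.findall("\bcat\b", text))   # []

# With a raw string, \b reaches the regex engine as a word boundary
print(re.findall(r"\bcat\b", text))  # ['cat']
```

Note that "catalog" is not matched in the second call, because \b requires the word to end immediately after "cat".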

Finding Different Patterns

Now, let's try finding a different pattern. Create a new file named findall_words.py:

import re

text = "The rain in Spain falls mainly on the plain."

# Find all words ending with 'ain'
matches = re.findall(r"\w+ain\b", text)

print("Original text:")
print(text)
print("\nWords ending with 'ain':", matches)

Run this script:

python3 ~/project/findall_words.py

The output should be:

Original text:
The rain in Spain falls mainly on the plain.

Words ending with 'ain': ['rain', 'Spain', 'plain']

In this example:

  • \w+ matches one or more word characters (letters, digits, or underscores)
  • ain matches the literal characters "ain"
  • \b represents a word boundary, ensuring we match complete words that end with "ain"

Experiment with these examples to get a feel for how re.findall() works with basic patterns.
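One way to see what \b contributes is to drop it and run the same search; without the boundary, "ain" can also match inside a longer word such as "mainly". A quick sketch:

```python
import re

text = "The rain in Spain falls mainly on the plain."

# Without the trailing \b, the engine matches 'main' inside 'mainly'
print(re.findall(r"\w+ain", text))    # ['rain', 'Spain', 'main', 'plain']

# With \b, 'mainly' is excluded because 'ain' is followed by more letters
print(re.findall(r"\w+ain\b", text))  # ['rain', 'Spain', 'plain']
```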

Working with More Complex Patterns

In this step, we will explore more complex patterns with re.findall() and learn how to use character classes and quantifiers to create flexible search patterns.

Finding Numbers in Text

First, let's write a script to extract all numbers from a text. Create a new file named extract_numbers.py:

import re

text = "There are 42 apples, 15 oranges, and 123 bananas in the basket. The price is $9.99."

# Find all numbers (integers and decimals)
numbers = re.findall(r'\d+\.?\d*', text)

print("Original text:")
print(text)
print("\nNumbers found:", numbers)

# Finding only whole numbers
whole_numbers = re.findall(r'\b\d+\b', text)
print("Whole numbers only:", whole_numbers)

Run the script:

python3 ~/project/extract_numbers.py

You should see output similar to:

Original text:
There are 42 apples, 15 oranges, and 123 bananas in the basket. The price is $9.99.

Numbers found: ['42', '15', '123', '9.99']
Whole numbers only: ['42', '15', '123', '9', '99']

Let's break down the patterns used:

  • \d+\.?\d* matches:

    • \d+: One or more digits
    • \.?: An optional decimal point
    • \d*: Zero or more digits after the decimal point
  • \b\d+\b matches:

    • \b: Word boundary
    • \d+: One or more digits
    • \b: Another word boundary. Note that the decimal point is itself a word boundary, so this pattern picks out both '9' and '99' from 9.99 rather than skipping the decimal number
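If you want decimals kept whole instead of split at the dot, one option is a non-capturing group, (?:...), which groups part of a pattern without affecting what findall() returns. A small sketch:

```python
import re

text = "There are 42 apples and the price is $9.99."

# (?:...) groups without capturing, so findall still returns whole matches;
# the decimal part participates only when a digit actually follows the dot
numbers = re.findall(r'\d+(?:\.\d+)?', text)
print(numbers)  # ['42', '9.99']
```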

Finding Words of a Specific Length

Let's create a script to find all four-letter words in a text. Create find_word_length.py:

import re

text = "The quick brown fox jumps over the lazy dog. A good day to code."

# Find all 4-letter words
four_letter_words = re.findall(r'\b\w{4}\b', text)

print("Original text:")
print(text)
print("\nFour-letter words:", four_letter_words)

# Find all words between 3 and 5 letters
words_3_to_5 = re.findall(r'\b\w{3,5}\b', text)
print("Words with 3 to 5 letters:", words_3_to_5)

Run this script:

python3 ~/project/find_word_length.py

The output should be:

Original text:
The quick brown fox jumps over the lazy dog. A good day to code.

Four-letter words: ['over', 'lazy', 'good', 'code']
Words with 3 to 5 letters: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'good', 'day', 'code']

In these patterns:

  • \b\w{4}\b matches exactly 4 word characters surrounded by word boundaries
  • \b\w{3,5}\b matches 3 to 5 word characters surrounded by word boundaries
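The quantifier syntax also supports an open upper bound: {n,} means "n or more". For example (the sample sentence here is our own):

```python
import re

text = "Regular expressions are extremely powerful text tools."

# {7,} matches seven or more word characters
long_words = re.findall(r'\b\w{7,}\b', text)
print(long_words)  # ['Regular', 'expressions', 'extremely', 'powerful']
```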

Using Character Classes

Character classes allow us to match specific sets of characters. Let's create character_classes.py:

import re

text = "The temperature is 72°F or 22°C. Contact us at: info@example.com"

# Find runs of lowercase letters and digits
mixed_words = re.findall(r'\b[a-z0-9]+\b', text.lower())

print("Original text:")
print(text)
print("\nWords with letters and digits:", mixed_words)

# Find all email addresses
emails = re.findall(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', text)
print("Email addresses:", emails)

Run the script:

python3 ~/project/character_classes.py

The output should be similar to:

Original text:
The temperature is 72°F or 22°C. Contact us at: info@example.com

Words with letters and digits: ['the', 'temperature', 'is', '72', 'f', 'or', '22', 'c', 'contact', 'us', 'at', 'info', 'example', 'com']
Email addresses: ['info@example.com']

These patterns demonstrate:

  • \b[a-z0-9]+\b: Runs of lowercase letters and digits. Symbols such as °, @, and . are not in the class, so "72°f" splits into '72' and 'f', and the email address splits into its parts
  • The email pattern matches the typical user@domain.tld format (a practical approximation, not a full RFC-compliant validator)

Experiment with these examples to understand how different pattern components work together to create powerful search patterns.
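Character classes can also be negated: a ^ immediately after the opening bracket matches any character not in the class. This is handy for extracting delimited text, for example (sample string our own):

```python
import re

text = 'Values: "alpha", "beta", "gamma"'

# [^"]+ matches one or more characters that are not a double quote,
# so each quoted chunk is matched without running past its closing quote
quoted = re.findall(r'"[^"]+"', text)
print(quoted)  # ['"alpha"', '"beta"', '"gamma"']
```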

Using Flags and Capturing Groups

In this step, we will learn how to use flags to modify the behavior of regular expressions and how to use capturing groups to extract specific parts of matched patterns.

Understanding Flags in Regular Expressions

Flags modify how the regular expression engine performs its search. Python's re module provides several flags that can be passed as an optional parameter to re.findall(). Let's explore some common flags.

Create a new file named regex_flags.py:

import re

text = """
Python is a great language.
PYTHON is versatile.
python is easy to learn.
"""

# Case-sensitive search (default)
matches_case_sensitive = re.findall(r"python", text)

# Case-insensitive search using re.IGNORECASE flag
matches_case_insensitive = re.findall(r"python", text, re.IGNORECASE)

print("Original text:")
print(text)
print("\nCase-sensitive matches:", matches_case_sensitive)
print("Case-insensitive matches:", matches_case_insensitive)

# Using the multiline flag
multiline_text = "First line\nSecond line\nThird line"
# Find lines starting with 'S'
starts_with_s = re.findall(r"^S.*", multiline_text, re.MULTILINE)
print("\nMultiline text:")
print(multiline_text)
print("\nLines starting with 'S':", starts_with_s)

Run the script:

python3 ~/project/regex_flags.py

The output should be similar to:

Original text:

Python is a great language.
PYTHON is versatile.
python is easy to learn.


Case-sensitive matches: ['python']
Case-insensitive matches: ['Python', 'PYTHON', 'python']

Multiline text:
First line
Second line
Third line

Lines starting with 'S': ['Second line']

Common flags include:

  • re.IGNORECASE (or re.I): Makes the pattern case-insensitive
  • re.MULTILINE (or re.M): Makes ^ and $ match the start/end of each line
  • re.DOTALL (or re.S): Makes . match any character including newlines
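Flags are bit masks, so several can be combined with the | operator. For instance, a case-insensitive, line-anchored search over a sample like the one above:

```python
import re

text = "Python is great.\nPYTHON is versatile.\npython is easy."

# Combine flags with | (bitwise OR)
matches = re.findall(r"^python", text, re.IGNORECASE | re.MULTILINE)
print(matches)  # ['Python', 'PYTHON', 'python']
```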

Using Capturing Groups

Capturing groups allow you to extract specific parts of the matched text. They are created by placing part of the regular expression inside parentheses.

Create a file named capturing_groups.py:

import re

# Sample text with dates in various formats
text = "Important dates: 2023-11-15, 12/25/2023, and Jan 1, 2024."

# Extract dates in YYYY-MM-DD format
iso_dates = re.findall(r'(\d{4})-(\d{1,2})-(\d{1,2})', text)

# Extract dates in MM/DD/YYYY format
us_dates = re.findall(r'(\d{1,2})/(\d{1,2})/(\d{4})', text)

print("Original text:")
print(text)
print("\nISO dates (Year, Month, Day):", iso_dates)
print("US dates (Month, Day, Year):", us_dates)

# Extract month names with capturing groups
month_dates = re.findall(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2}),\s+(\d{4})', text)
print("Month name dates (Month, Day, Year):", month_dates)

Run the script:

python3 ~/project/capturing_groups.py

The output should be:

Original text:
Important dates: 2023-11-15, 12/25/2023, and Jan 1, 2024.

ISO dates (Year, Month, Day): [('2023', '11', '15')]
US dates (Month, Day, Year): [('12', '25', '2023')]
Month name dates (Month, Day, Year): [('Jan', '1', '2024')]

In this example:

  • Each set of parentheses () creates a capturing group
  • The function returns a list of tuples, where each tuple contains the captured groups
  • This allows us to extract and organize structured data from text
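One behavior worth remembering: what re.findall() returns depends on how many groups the pattern contains. A short sketch (sample string our own):

```python
import re

text = "Order #123 shipped, order #456 pending."

# No groups: findall returns the whole match
print(re.findall(r'#\d+', text))      # ['#123', '#456']

# Exactly one group: it returns only the captured part
print(re.findall(r'#(\d+)', text))    # ['123', '456']

# Two or more groups: it returns tuples of the captured parts
print(re.findall(r'(#)(\d+)', text))  # [('#', '123'), ('#', '456')]
```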

Practical Example: Parsing Log Files

Now, let's apply what we've learned to a practical example. Imagine we have a log file with entries we want to parse. Create a file named log_parser.py:

import re

# Sample log entries
logs = """
[2023-11-15 08:30:45] INFO: System started
[2023-11-15 08:35:12] WARNING: High memory usage (85%)
[2023-11-15 08:42:11] ERROR: Connection timeout
[2023-11-15 09:15:27] INFO: Backup completed
"""

# Extract timestamp, level, and message from log entries
log_pattern = r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.+)'
log_entries = re.findall(log_pattern, logs)

print("Original logs:")
print(logs)
print("\nParsed log entries (timestamp, level, message):")
for entry in log_entries:
    timestamp, level, message = entry
    print(f"Time: {timestamp} | Level: {level} | Message: {message}")

# Find all ERROR logs
error_logs = re.findall(r'\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\] ERROR: (.+)', logs)
print("\nError messages:", error_logs)

Run the script:

python3 ~/project/log_parser.py

The output should be similar to:

Original logs:

[2023-11-15 08:30:45] INFO: System started
[2023-11-15 08:35:12] WARNING: High memory usage (85%)
[2023-11-15 08:42:11] ERROR: Connection timeout
[2023-11-15 09:15:27] INFO: Backup completed


Parsed log entries (timestamp, level, message):
Time: 2023-11-15 08:30:45 | Level: INFO | Message: System started
Time: 2023-11-15 08:35:12 | Level: WARNING | Message: High memory usage (85%)
Time: 2023-11-15 08:42:11 | Level: ERROR | Message: Connection timeout
Time: 2023-11-15 09:15:27 | Level: INFO | Message: Backup completed

Error messages: ['Connection timeout']

This example demonstrates:

  • Using capturing groups to extract structured information
  • Processing and displaying the captured information
  • Filtering for specific types of log entries

Flags and capturing groups enhance the power and flexibility of regular expressions, allowing for more precise and structured data extraction.
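When patterns grow, positional tuples get hard to read. One alternative, sketched here on a single log line of the same shape, is to name each group with (?P<name>...) and iterate with re.finditer(), which yields match objects instead of tuples:

```python
import re

log = "[2023-11-15 08:42:11] ERROR: Connection timeout"

# (?P<name>...) labels each group; finditer() yields match objects,
# so fields can be read by name instead of by position
pattern = r'\[(?P<timestamp>[^\]]+)\] (?P<level>\w+): (?P<message>.+)'
for match in re.finditer(pattern, log):
    print(f"{match.group('level')}: {match.group('message')}")
# Prints: ERROR: Connection timeout
```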

Real-world Applications of re.findall()

In this final step, we will explore practical, real-world applications of re.findall(). We will write code to extract emails, URLs, and perform data cleaning tasks.

Extracting Email Addresses

Email extraction is a common task in data mining, web scraping, and text analysis. Create a file named email_extractor.py:

import re

# Sample text with email addresses
text = """
Contact information:
- Support: support@example.com
- Sales: sales@example.com, international.sales@example.co.uk
- Technical team: tech.team@subdomain.example.org
Personal email: john.doe123@gmail.com
"""

# Extract all email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)

print("Original text:")
print(text)
print("\nExtracted email addresses:")
for i, email in enumerate(emails, 1):
    print(f"{i}. {email}")

# Extract specific domain emails
gmail_emails = re.findall(r'\b[A-Za-z0-9._%+-]+@gmail\.com\b', text)
print("\nGmail addresses:", gmail_emails)

Run the script:

python3 ~/project/email_extractor.py

The output should be similar to:

Original text:

Contact information:
- Support: support@example.com
- Sales: sales@example.com, international.sales@example.co.uk
- Technical team: tech.team@subdomain.example.org
Personal email: john.doe123@gmail.com


Extracted email addresses:
1. support@example.com
2. sales@example.com
3. international.sales@example.co.uk
4. tech.team@subdomain.example.org
5. john.doe123@gmail.com

Gmail addresses: ['john.doe123@gmail.com']

Extracting URLs

URL extraction is useful for web scraping, link validation, and content analysis. Create a file named url_extractor.py:

import re

# Sample text with various URLs
text = """
Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png
"""

# Extract all URLs
url_pattern = r'https?://[^\s]+'
urls = re.findall(url_pattern, text)

print("Original text:")
print(text)
print("\nExtracted URLs:")
for i, url in enumerate(urls, 1):
    print(f"{i}. {url}")

# Extract specific domain URLs
github_urls = re.findall(r'https?://github\.com/[^\s]+', text)
print("\nGitHub URLs:", github_urls)

# Extract image URLs (non-capturing group, so findall returns full matches)
image_urls = re.findall(r'https?://[^\s]+\.(?:jpg|jpeg|png|gif)', text)
print("\nImage URLs:", image_urls)

Run the script:

python3 ~/project/url_extractor.py

The output should be similar to:

Original text:

Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png


Extracted URLs:
1. https://www.example.com
2. http://docs.example.org/guide
3. https://github.com/user/project
4. https://community.example.net/forum
5. https://images.example.com/logo.png

GitHub URLs: ['https://github.com/user/project']

Image URLs: ['https://images.example.com/logo.png']
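If you want both the full URL and its extension in one pass, groups can be nested: the outer group captures the whole URL while the inner one captures just the extension, so findall() returns (url, extension) tuples. A sketch with made-up URLs:

```python
import re

text = "logo: https://images.example.com/logo.png and https://cdn.example.com/banner.jpg"

# Outer group: the full URL; inner group: the extension alone
pairs = re.findall(r'(https?://\S+\.(png|jpe?g|gif))', text)
print(pairs)
# [('https://images.example.com/logo.png', 'png'),
#  ('https://cdn.example.com/banner.jpg', 'jpg')]
```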

Data Cleaning with re.findall()

Let's create a script to clean and extract information from a messy dataset. Create a file named data_cleaning.py:

import re

# Sample messy data
data = """
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023
"""

# Extract product information
product_pattern = r'Product: (.*?), Price: \$([\d.]+), SKU: ([A-Z0-9-]+)'
products = re.findall(product_pattern, data)

print("Original data:")
print(data)
print("\nExtracted and structured product information:")
print("Name\t\tPrice\t\tSKU")
print("-" * 50)
for product in products:
    name, price, sku = product
    print(f"{name}\t${price}\t{sku}")

# Calculate total price
total_price = sum(float(price) for _, price, _ in products)
print(f"\nTotal price of all products: ${total_price:.2f}")

# Extract only products above $500
expensive_products = [name for name, price, _ in products if float(price) > 500]
print("\nExpensive products (>$500):", expensive_products)

Run the script:

python3 ~/project/data_cleaning.py

The output should be similar to:

Original data:

Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023


Extracted and structured product information:
Name		Price		SKU
--------------------------------------------------
Laptop X200	$899.99	LP-X200-2023
Smartphone S10+	$699.50	SP-S10P-2023
Tablet T7	$299.99	TB-T7-2023
Wireless Earbuds	$129.95	WE-PRO-2023

Total price of all products: $2029.43

Expensive products (>$500): ['Laptop X200', 'Smartphone S10+']

Combining re.findall() with Other String Functions

Finally, let's see how we can combine re.findall() with other string functions for advanced text processing. Create a file named combined_processing.py:

import re

# Sample text with mixed content
text = """
Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
"""

# Extract all temperature readings in Fahrenheit
fahrenheit_pattern = r'(\d+)°F'
fahrenheit_temps = re.findall(fahrenheit_pattern, text)

# Convert to integers
fahrenheit_temps = [int(temp) for temp in fahrenheit_temps]

print("Original text:")
print(text)
print("\nFahrenheit temperatures:", fahrenheit_temps)

# Calculate average temperature
avg_temp = sum(fahrenheit_temps) / len(fahrenheit_temps)
print(f"Average temperature: {avg_temp:.1f}°F")

# Extract city and temperature pairs
city_temp_pattern = r'- ([A-Za-z\s]+): (\d+)°F'
city_temps = re.findall(city_temp_pattern, text)

print("\nCity and temperature pairs:")
for city, temp in city_temps:
    print(f"{city}: {temp}°F")

# Find the hottest and coldest cities
hottest_city = max(city_temps, key=lambda x: int(x[1]))
coldest_city = min(city_temps, key=lambda x: int(x[1]))

print(f"\nHottest city: {hottest_city[0]} ({hottest_city[1]}°F)")
print(f"Coldest city: {coldest_city[0]} ({coldest_city[1]}°F)")

Run the script:

python3 ~/project/combined_processing.py

The output should be similar to:

Original text:

Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)


Fahrenheit temperatures: [72, 59, 80, 68]
Average temperature: 69.8°F

City and temperature pairs:
New York: 72°F
London: 59°F
Tokyo: 80°F
Sydney: 68°F

Hottest city: Tokyo (80°F)
Coldest city: London (59°F)

These examples demonstrate how re.findall() can be combined with other Python functionality to solve real-world text processing problems. The ability to extract structured data from unstructured text is an essential skill for data analysis, web scraping, and many other programming tasks.

Summary

In this tutorial, you have learned how to use the powerful re.findall() function in Python for text pattern matching and extraction. You have gained practical knowledge in several key areas:

  1. Basic Pattern Matching - You learned how to find simple substrings and use basic regular expression patterns to match specific text patterns.

  2. Complex Patterns - You explored more complex patterns including character classes, word boundaries, and quantifiers to create flexible search patterns.

  3. Flags and Capturing Groups - You discovered how to modify search behavior using flags like re.IGNORECASE and how to extract structured data using capturing groups.

  4. Real-world Applications - You applied your knowledge to practical scenarios such as extracting email addresses and URLs, parsing log files, and cleaning data.

The skills you've developed in this lab are valuable for a wide range of text processing tasks including:

  • Data extraction and cleaning
  • Content analysis
  • Web scraping
  • Log file parsing
  • Data validation

With regular expressions and the re.findall() function, you now have a powerful tool for handling text data in your Python projects. As you continue to practice and apply these techniques, you'll become more proficient at creating efficient patterns for your specific text processing needs.