How to use regex capture groups in Python


Introduction

Regular expression capture groups let you extract and manipulate specific parts of matched text in Python. In this lab, you will learn how to define capture groups, read back what they capture, and apply them to string parsing and data extraction tasks.

Regex Capture Groups Basics

Capture groups are a powerful feature in regular expressions that allow you to extract and group specific parts of a matched pattern. In Python, they are defined using parentheses () within a regex pattern.

Let's start by creating a Python script to demonstrate basic capture group usage.

Open the integrated terminal in the WebIDE and navigate to the project directory if you are not already there.

cd ~/project

Create a new file named basic_capture.py using the touch command.

touch basic_capture.py

Open basic_capture.py in the WebIDE editor and add the following Python code:

import re

text = "Contact email: john.doe@example.com"
pattern = r"(\w+)\.(\w+)@(\w+)\.(\w+)"

match = re.search(pattern, text)
if match:
    username = match.group(1)
    lastname = match.group(2)
    domain = match.group(3)
    tld = match.group(4)

    print(f"Username: {username}")
    print(f"Lastname: {lastname}")
    print(f"Domain: {domain}")
    print(f"TLD: {tld}")
else:
    print("No match found.")

Save the file.

Now, run the script using the python command.

python basic_capture.py

You should see the following output:

Username: john
Lastname: doe
Domain: example
TLD: com

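This output shows that the script extracted the different parts of the email address using capture groups. The pattern assumes an address of the form first.last@domain.tld, so the value labeled Username is only the part of the local name before the dot.

You can also access all captured groups as a tuple using the groups() method. Modify basic_capture.py by adding the following lines inside the if match: block, directly after the last print() call and with the same indentation: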

    all_groups = match.groups()
    print(f"All groups: {all_groups}")

Save the file and run the script again.

python basic_capture.py

The output will now include the tuple of all captured groups:

Username: john
Lastname: doe
Domain: example
TLD: com
All groups: ('john', 'doe', 'example', 'com')

This demonstrates how to use basic capture groups and access the captured data.
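Besides group(n) and groups(), a match object exposes a few related accessors. The sketch below reuses the email pattern from basic_capture.py; the names EMAIL_PATTERN and describe_match are hypothetical and exist only for this illustration.

```python
import re

# Same pattern as in basic_capture.py; the constant name is illustrative only
EMAIL_PATTERN = r"(\w+)\.(\w+)@(\w+)\.(\w+)"

def describe_match(text):
    """Return a few views of the same match, or None if nothing matches."""
    match = re.search(EMAIL_PATTERN, text)
    if match is None:
        return None
    start, end = match.span(3)            # character offsets of group 3 in text
    return {
        "whole": match.group(0),          # group 0 is the entire matched text
        "names": match.group(1, 2),       # several indices return a tuple
        "domain_slice": text[start:end],  # same text as match.group(3)
    }

info = describe_match("Contact email: john.doe@example.com")
print(info["whole"])         # john.doe@example.com
print(info["names"])         # ('john', 'doe')
print(info["domain_slice"])  # example
```

Returning None for a failed search mirrors the if match: check used in the script, since re.search() itself returns None when the pattern is not found.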

Named Capture Groups

Named capture groups provide a more readable way to access captured data by assigning a name to each group. The syntax for a named capture group is (?P<name>...).

Let's create a new Python script to demonstrate named capture groups.

Create a new file named named_capture.py in the ~/project directory.

touch ~/project/named_capture.py

Open named_capture.py in the WebIDE editor and add the following Python code:

import re

text = "Product: Laptop, Price: $999.99"
pattern = r"Product: (?P<product>\w+), Price: \$(?P<price>\d+\.\d+)"

match = re.search(pattern, text)
if match:
    product = match.group('product')
    price = match.group('price')
    print(f"Product: {product}, Price: ${price}")
else:
    print("No match found.")

Save the file.

Run the script using the python command.

python ~/project/named_capture.py

You should see the following output:

Product: Laptop, Price: $999.99

This output shows that the script extracted the product name and price using named capture groups. You access the captured data by passing the group name as a string to the group() method, for example match.group('price').

Named capture groups make your regex patterns and the subsequent code more understandable, especially for complex patterns with many capture groups.
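When a pattern has several named groups, the groupdict() method returns all of them at once as a dictionary. The following is a minimal sketch built on the pattern from named_capture.py; the function name parse_product and the conversion of the price to float are assumptions made for the example.

```python
import re

# Pattern from named_capture.py, compiled once so it can be reused
PRODUCT_PATTERN = re.compile(
    r"Product: (?P<product>\w+), Price: \$(?P<price>\d+\.\d+)"
)

def parse_product(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = PRODUCT_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()                # keys are the group names
    fields["price"] = float(fields["price"])  # captured text is always a str
    return fields

print(parse_product("Product: Laptop, Price: $999.99"))
# {'product': 'Laptop', 'price': 999.99}
```

Named groups are still numbered, so match.group(1) and match.group('product') return the same text for this pattern.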

Practical Capture Group Usage

Capture groups are widely used for data extraction from various text formats like log files, URLs, and structured data.

Let's create a script to parse a log entry using capture groups.

Create a new file named log_parser.py in the ~/project directory.

touch ~/project/log_parser.py

Open log_parser.py in the WebIDE editor and add the following Python code:

import re

log_entry = '2023-06-15 14:30:45 [ERROR] Database connection failed'
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[(\w+)\] (.+)'

match = re.match(pattern, log_entry)
if match:
    date = match.group(1)
    time = match.group(2)
    log_level = match.group(3)
    message = match.group(4)

    print(f"Date: {date}")
    print(f"Time: {time}")
    print(f"Level: {log_level}")
    print(f"Message: {message}")
else:
    print("No match found.")

Save the file.

Run the script using the python command.

python ~/project/log_parser.py

You should see the following output:

Date: 2023-06-15
Time: 14:30:45
Level: ERROR
Message: Database connection failed

This script successfully parsed the log entry and extracted the date, time, log level, and message using capture groups.
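Real log files contain many lines, some of which may not follow the expected format. One possible way to extend the single-entry example is sketched below, using re.finditer() with the re.MULTILINE flag. The sample text, the second log line, and the names LOG_PATTERN and parse_log are invented for the illustration.

```python
import re

# Same four groups as in log_parser.py, anchored to whole lines.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
LOG_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[(\w+)\] (.+)$",
    re.MULTILINE,
)

def parse_log(text):
    """Return one (date, time, level, message) tuple per matching line."""
    return [match.groups() for match in LOG_PATTERN.finditer(text)]

# Hypothetical sample input; the middle line is deliberately malformed
sample = (
    "2023-06-15 14:30:45 [ERROR] Database connection failed\n"
    "this line is not a log entry\n"
    "2023-06-15 14:31:02 [INFO] Retrying connection\n"
)

for date, time_of_day, level, message in parse_log(sample):
    print(f"{level}: {message}")
```

Lines that do not fit the pattern are skipped silently here; a stricter parser might collect or report them instead.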

Another common use case is extracting information from URLs. Create a new file named url_parser.py in the ~/project directory.

touch ~/project/url_parser.py

Open url_parser.py and add the following code:

import re

def parse_url(url):
    pattern = r'(https?://)?([^/]+)(/.*)?'
    match = re.match(pattern, url)

    if match:
        protocol = match.group(1) or 'http://'
        domain = match.group(2)
        path = match.group(3) or '/'

        return {
            'protocol': protocol,
            'domain': domain,
            'path': path
        }
    return None

# Example usage
url = 'https://www.example.com/path/to/page'
parsed_url = parse_url(url)
if parsed_url:
    print(f"Protocol: {parsed_url['protocol']}")
    print(f"Domain: {parsed_url['domain']}")
    print(f"Path: {parsed_url['path']}")
else:
    print("Invalid URL format.")

url_no_protocol = 'example.org/another/path'
parsed_url_no_protocol = parse_url(url_no_protocol)
if parsed_url_no_protocol:
    print(f"\nProtocol: {parsed_url_no_protocol['protocol']}")
    print(f"Domain: {parsed_url_no_protocol['domain']}")
    print(f"Path: {parsed_url_no_protocol['path']}")
else:
    print("\nInvalid URL format.")

Save the file.

Run the script.

python ~/project/url_parser.py

The output will show the parsed components of the URLs:

Protocol: https://
Domain: www.example.com
Path: /path/to/page

Protocol: http://
Domain: example.org
Path: /another/path

These examples demonstrate the practical application of capture groups in extracting structured data from text.
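Capture groups can also be used to rewrite text, not only to extract it. In the replacement string of re.sub(), \g<name> refers to a named group (and \1, \2, and so on refer to numbered groups). The sketch below reorders the date from the log example; the function name to_day_first and the day-first target format are assumptions chosen for the illustration.

```python
import re

# Named groups for the three parts of an ISO-style date
DATE_PATTERN = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def to_day_first(text):
    """Rewrite every YYYY-MM-DD date in text as DD/MM/YYYY."""
    # \g<name> inserts the text captured by the named group
    return DATE_PATTERN.sub(r"\g<day>/\g<month>/\g<year>", text)

print(to_day_first("2023-06-15 14:30:45 [ERROR] Database connection failed"))
# 15/06/2023 14:30:45 [ERROR] Database connection failed
```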

Advanced Capture Group Techniques

Beyond basic capture groups, Python's re module offers more advanced features such as nested capture groups, non-capturing groups, and lookarounds. This section covers the first two.

Nested Capture Groups

Capture groups can be nested within other capture groups to extract more granular information.

Create a new file named nested_capture.py in the ~/project directory.

touch ~/project/nested_capture.py

Open nested_capture.py and add the following code:

import re

def parse_complex_data(text):
    pattern = r'((\w+)\s(\w+))\s\[(\d+)\]'
    match = re.match(pattern, text)

    if match:
        full_name = match.group(1)
        first_name = match.group(2)
        last_name = match.group(3)
        id_number = match.group(4)

        return {
            'full_name': full_name,
            'first_name': first_name,
            'last_name': last_name,
            'id': id_number
        }
    return None

text = 'John Doe [12345]'
result = parse_complex_data(text)
if result:
    print(f"Full Name: {result['full_name']}")
    print(f"First Name: {result['first_name']}")
    print(f"Last Name: {result['last_name']}")
    print(f"ID: {result['id']}")
else:
    print("No match found.")

Save the file.

Run the script.

python ~/project/nested_capture.py

The output will show the extracted data, including the full name and its components:

Full Name: John Doe
First Name: John
Last Name: Doe
ID: 12345

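Here, ((\w+)\s(\w+)) is an outer capture group that contains two inner ones. Groups are numbered by the position of their opening parenthesis, counting from left to right, so group(1) is the outer group and captures the entire "John Doe", group(2) captures "John", and group(3) captures "Doe". group(4) captures the ID.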
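To check how Python numbers nested groups, you can list every group alongside its index. This is a small sketch using the pattern from nested_capture.py; the helper name numbered_groups is hypothetical.

```python
import re

# Pattern from nested_capture.py
NESTED_PATTERN = re.compile(r"((\w+)\s(\w+))\s\[(\d+)\]")

def numbered_groups(text):
    """Map each group index (starting at 1) to the text it captured."""
    match = NESTED_PATTERN.match(text)
    if match is None:
        return {}
    return {index: value for index, value in enumerate(match.groups(), start=1)}

print(NESTED_PATTERN.groups)  # number of capture groups in the pattern: 4
for index, value in numbered_groups("John Doe [12345]").items():
    print(index, value)
```

The groups attribute of a compiled pattern reports how many capture groups it defines, which is a quick sanity check when a pattern grows.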

Non-Capturing Groups

Sometimes you need to group parts of a pattern for applying quantifiers or alternatives, but you don't need to capture the content. Non-capturing groups (?:...) are used for this purpose.

Create a new file named non_capturing.py in the ~/project directory.

touch ~/project/non_capturing.py

Open non_capturing.py and add the following code:

import re

def extract_domain_info(url):
    # (?:...) creates a non-capturing group
    pattern = r'https?://(?:www\.)?([^/]+)'
    match = re.match(pattern, url)

    if match:
        domain = match.group(1)  # Only the domain is captured
        return domain
    return None

url1 = 'https://www.example.com/path'
domain1 = extract_domain_info(url1)
print(f"Domain from '{url1}': {domain1}")

url2 = 'http://example.org/another/path'
domain2 = extract_domain_info(url2)
print(f"Domain from '{url2}': {domain2}")

Save the file.

Run the script.

python ~/project/non_capturing.py

The output will show the extracted domain names:

Domain from 'https://www.example.com/path': example.com
Domain from 'http://example.org/another/path': example.org

In this example, (?:www\.)? matches "www." if it exists but does not capture it, so group(1) directly captures the domain name.

Non-capturing groups keep the group indices clean when you only need to capture specific parts of a larger pattern. They may also be marginally faster because the regex engine does not have to record the group's text, although the difference is usually negligible.
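The effect on group numbering is easy to see with re.findall(), which returns a list of tuples when a pattern has more than one capture group and a list of plain strings when it has exactly one. The comparison below is a sketch with made-up input text; the patterns are variants of the one in non_capturing.py, with \s added to the character class so that each match stops at whitespace.

```python
import re

# Hypothetical input containing two URLs separated by a space
text = "http://example.org https://www.example.com"

# Two capture groups: findall returns a tuple per match, and a group that
# did not take part in the match shows up as an empty string.
capturing = re.findall(r"https?://(www\.)?([^/\s]+)", text)

# One capture group: the optional "www." is grouped with (?:...) but not
# captured, so findall returns the domains directly.
non_capturing = re.findall(r"https?://(?:www\.)?([^/\s]+)", text)

print(capturing)      # [('', 'example.org'), ('www.', 'example.com')]
print(non_capturing)  # ['example.org', 'example.com']
```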

Summary

In this lab, you have learned how to use regex capture groups in Python. You started with basic capture groups, then explored named capture groups for better readability. You also practiced using capture groups for practical data extraction tasks like parsing log files and URLs. Finally, you were introduced to advanced techniques like nested and non-capturing groups. By mastering these concepts, you can effectively extract and manipulate specific parts of text data using regular expressions in Python.