Introduction
Regular expression capture groups let you extract and manipulate specific parts of text in Python. In this lab, you will learn the essential techniques for using capture groups and see how they simplify complex string parsing and data extraction tasks.
Regex Capture Groups Basics
Capture groups let you extract specific parts of the text matched by a pattern. In Python, you define them by wrapping part of a regex pattern in parentheses ().
Let's start by creating a Python script to demonstrate basic capture group usage.
Open the integrated terminal in the WebIDE and navigate to the project directory if you are not already there.
cd ~/project
Create a new file named basic_capture.py using the touch command.
touch basic_capture.py
Open basic_capture.py in the WebIDE editor and add the following Python code:
import re

text = "Contact email: john.doe@example.com"
pattern = r"(\w+)\.(\w+)@(\w+)\.(\w+)"

match = re.search(pattern, text)
if match:
    username = match.group(1)
    lastname = match.group(2)
    domain = match.group(3)
    tld = match.group(4)
    print(f"Username: {username}")
    print(f"Lastname: {lastname}")
    print(f"Domain: {domain}")
    print(f"TLD: {tld}")
else:
    print("No match found.")
Save the file.
Now, run the script using the python command.
python basic_capture.py
You should see the following output:
Username: john
Lastname: doe
Domain: example
TLD: com
This output shows that the script successfully extracted the different parts of the email address using capture groups.
You can also access all captured groups as a tuple using the groups() method. Modify basic_capture.py by adding the following lines inside the if match: block, directly after the last print statement. Keep them indented so they run only when a match was found:
    all_groups = match.groups()
    print(f"All groups: {all_groups}")
Save the file and run the script again.
python basic_capture.py
The output will now include the tuple of all captured groups:
Username: john
Lastname: doe
Domain: example
TLD: com
All groups: ('john', 'doe', 'example', 'com')
This demonstrates how to use basic capture groups and access the captured data.
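Beyond group(n) with a single index, a match object offers a few related accessors. The sketch below reuses the same sample text to show group(0) for the whole match, group() with several indices, and span() for positions. The variable names whole_match, first_two, and domain_span are illustrative and not part of the lab files.

```python
import re

text = "Contact email: john.doe@example.com"
pattern = r"(\w+)\.(\w+)@(\w+)\.(\w+)"

match = re.search(pattern, text)
if match:
    # group(0), or group() with no argument, is the entire matched text
    whole_match = match.group(0)
    # Passing several indices returns a tuple of those groups
    first_two = match.group(1, 2)
    # span(n) gives the (start, end) offsets of group n within text
    domain_span = match.span(3)

    print(whole_match)                           # john.doe@example.com
    print(first_two)                             # ('john', 'doe')
    print(text[domain_span[0]:domain_span[1]])   # example
```

Note that group(0) is not included in the tuple returned by groups(), which starts at group 1.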
Named Capture Groups
Named capture groups provide a more readable way to access captured data by assigning a name to each group. The syntax for a named capture group is (?P<name>...).
Let's create a new Python script to demonstrate named capture groups.
Create a new file named named_capture.py in the ~/project directory.
touch ~/project/named_capture.py
Open named_capture.py in the WebIDE editor and add the following Python code:
import re

text = "Product: Laptop, Price: $999.99"
pattern = r"Product: (?P<product>\w+), Price: \$(?P<price>\d+\.\d+)"

match = re.search(pattern, text)
if match:
    product = match.group('product')
    price = match.group('price')
    print(f"Product: {product}, Price: ${price}")
else:
    print("No match found.")
Save the file.
Run the script using the python command.
python ~/project/named_capture.py
You should see the following output:
Product: Laptop, Price: $999.99
This output shows that the script successfully extracted the product name and price using named capture groups. You access the captured data by passing the group name as a string to the group() method.
Named capture groups make your regex patterns and the subsequent code more understandable, especially for complex patterns with many capture groups.
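When a pattern has several named groups, it can be convenient to collect them all at once. The following sketch, which reuses the pattern from named_capture.py, shows groupdict(), index access to a named group, and the subscript form available on match objects since Python 3.6. The variable names fields and same_product are illustrative.

```python
import re

text = "Product: Laptop, Price: $999.99"
pattern = r"Product: (?P<product>\w+), Price: \$(?P<price>\d+\.\d+)"

match = re.search(pattern, text)
if match:
    # groupdict() maps each group name to its captured string
    fields = match.groupdict()
    # Named groups are still numbered, so index access also works
    same_product = match.group(1)
    # Subscripting a match object is equivalent to calling group()
    price = float(match["price"])

    print(fields)         # {'product': 'Laptop', 'price': '999.99'}
    print(same_product)   # Laptop
    print(price)          # 999.99
```

Captured values are always strings, so the price has to be converted explicitly before doing arithmetic with it.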
Practical Capture Group Usage
Capture groups are widely used for data extraction from various text formats like log files, URLs, and structured data.
Let's create a script to parse a log entry using capture groups.
Create a new file named log_parser.py in the ~/project directory.
touch ~/project/log_parser.py
Open log_parser.py in the WebIDE editor and add the following Python code:
import re

log_entry = '2023-06-15 14:30:45 [ERROR] Database connection failed'
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[(\w+)\] (.+)'

match = re.match(pattern, log_entry)
if match:
    date = match.group(1)
    time = match.group(2)
    log_level = match.group(3)
    message = match.group(4)
    print(f"Date: {date}")
    print(f"Time: {time}")
    print(f"Level: {log_level}")
    print(f"Message: {message}")
else:
    print("No match found.")
Save the file.
Run the script using the python command.
python ~/project/log_parser.py
You should see the following output:
Date: 2023-06-15
Time: 14:30:45
Level: ERROR
Message: Database connection failed
This script successfully parsed the log entry and extracted the date, time, log level, and message using capture groups.
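Real log files contain many lines, so the same pattern is usually applied repeatedly. The sketch below is one possible way to do that with re.compile and finditer. The three-line log_text is made-up sample data in the same format as the entry above, and the names log_pattern, entries, and errors are illustrative.

```python
import re

# Hypothetical sample data in the same format as the single entry above
log_text = """2023-06-15 14:30:45 [ERROR] Database connection failed
2023-06-15 14:31:02 [INFO] Retrying connection
2023-06-15 14:31:05 [INFO] Connection restored"""

# Compile once; re.MULTILINE makes ^ and $ match at every line boundary
log_pattern = re.compile(
    r'^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[(\w+)\] (.+)$',
    re.MULTILINE,
)

entries = []
for m in log_pattern.finditer(log_text):
    # groups() returns the four captures in order, ready for unpacking
    date, time_of_day, level, message = m.groups()
    entries.append(
        {'date': date, 'time': time_of_day, 'level': level, 'message': message}
    )

# Filter the parsed entries by one of the captured fields
errors = [e['message'] for e in entries if e['level'] == 'ERROR']

print(len(entries))   # 3
print(errors)         # ['Database connection failed']
```

Because . does not match a newline by default, the final (.+) group stops at the end of each line.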
Another common use case is extracting information from URLs. Create a new file named url_parser.py in the ~/project directory.
touch ~/project/url_parser.py
Open url_parser.py and add the following code:
import re

def parse_url(url):
    pattern = r'(https?://)?([^/]+)(/.*)?'
    match = re.match(pattern, url)
    if match:
        protocol = match.group(1) or 'http://'
        domain = match.group(2)
        path = match.group(3) or '/'
        return {
            'protocol': protocol,
            'domain': domain,
            'path': path
        }
    return None

# Example usage
url = 'https://www.example.com/path/to/page'
parsed_url = parse_url(url)
if parsed_url:
    print(f"Protocol: {parsed_url['protocol']}")
    print(f"Domain: {parsed_url['domain']}")
    print(f"Path: {parsed_url['path']}")
else:
    print("Invalid URL format.")

url_no_protocol = 'example.org/another/path'
parsed_url_no_protocol = parse_url(url_no_protocol)
if parsed_url_no_protocol:
    print(f"\nProtocol: {parsed_url_no_protocol['protocol']}")
    print(f"Domain: {parsed_url_no_protocol['domain']}")
    print(f"Path: {parsed_url_no_protocol['path']}")
else:
    print("\nInvalid URL format.")
Save the file.
Run the script.
python ~/project/url_parser.py
The output will show the parsed components of the URLs:
Protocol: https://
Domain: www.example.com
Path: /path/to/page

Protocol: http://
Domain: example.org
Path: /another/path
These examples demonstrate the practical application of capture groups in extracting structured data from text.
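Capture groups are also useful for rewriting text, not only for reading it. The following sketch uses re.sub with group references in the replacement, applied to the log entry from log_parser.py. The date format DD/MM/YYYY and the helper name lowercase_level are arbitrary choices for illustration.

```python
import re

log_entry = '2023-06-15 14:30:45 [ERROR] Database connection failed'

# \g<n> in the replacement string refers to capture group n
date_pattern = r'(\d{4})-(\d{2})-(\d{2})'
reordered = re.sub(date_pattern, r'\g<3>/\g<2>/\g<1>', log_entry)
print(reordered)   # 15/06/2023 14:30:45 [ERROR] Database connection failed

# A function can also build the replacement from the match object
def lowercase_level(match):
    return '[' + match.group(1).lower() + ']'

lowered = re.sub(r'\[(\w+)\]', lowercase_level, log_entry)
print(lowered)     # 2023-06-15 14:30:45 [error] Database connection failed
```

The \g<n> form is used here instead of \1, \2, \3 because it stays unambiguous when a digit follows the reference.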
Advanced Capture Group Techniques
Beyond basic capture groups, Python's re module supports more advanced features such as nested capture groups, non-capturing groups, and lookarounds. This section covers the first two.
Nested Capture Groups
Capture groups can be nested within other capture groups to extract more granular information.
Create a new file named nested_capture.py in the ~/project directory.
touch ~/project/nested_capture.py
Open nested_capture.py and add the following code:
import re

def parse_complex_data(text):
    pattern = r'((\w+)\s(\w+))\s\[(\d+)\]'
    match = re.match(pattern, text)
    if match:
        full_name = match.group(1)
        first_name = match.group(2)
        last_name = match.group(3)
        id_number = match.group(4)
        return {
            'full_name': full_name,
            'first_name': first_name,
            'last_name': last_name,
            'id': id_number
        }
    return None

text = 'John Doe [12345]'
result = parse_complex_data(text)
if result:
    print(f"Full Name: {result['full_name']}")
    print(f"First Name: {result['first_name']}")
    print(f"Last Name: {result['last_name']}")
    print(f"ID: {result['id']}")
else:
    print("No match found.")
Save the file.
Run the script.
python ~/project/nested_capture.py
The output will show the extracted data, including the full name and its components:
Full Name: John Doe
First Name: John
Last Name: Doe
ID: 12345
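Here, ((\w+)\s(\w+)) is a nested capture group: the outer parentheses contain two inner groups. Groups are numbered by the position of their opening parenthesis, counting from left to right, so group(1) is the outer group and captures the entire "John Doe", group(2) captures "John", and group(3) captures "Doe". group(4) captures the ID.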
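To check how nested groups are numbered, it can help to print each group with its number and position. The sketch below does this for the same pattern and input. It relies on the groups attribute of a compiled pattern and on match.span(); the variable name numbered is illustrative.

```python
import re

pattern = re.compile(r'((\w+)\s(\w+))\s\[(\d+)\]')
match = pattern.match('John Doe [12345]')

# Pattern.groups is the total number of capture groups in the pattern
print(pattern.groups)   # 4

# Pair each captured value with its group number, starting at 1
numbered = list(enumerate(match.groups(), start=1))
for number, value in numbered:
    # span(number) shows where that group matched in the input
    print(number, value, match.span(number))
```

Groups 1 and 2 both start at offset 0, which shows that the outer group and its first inner group begin at the same character.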
Non-Capturing Groups
Sometimes you need to group parts of a pattern for applying quantifiers or alternatives, but you don't need to capture the content. Non-capturing groups (?:...) are used for this purpose.
Create a new file named non_capturing.py in the ~/project directory.
touch ~/project/non_capturing.py
Open non_capturing.py and add the following code:
import re

def extract_domain_info(url):
    # (?:...) creates a non-capturing group
    pattern = r'https?://(?:www\.)?([^/]+)'
    match = re.match(pattern, url)
    if match:
        domain = match.group(1)  # Only the domain is captured
        return domain
    return None

url1 = 'https://www.example.com/path'
domain1 = extract_domain_info(url1)
print(f"Domain from '{url1}': {domain1}")

url2 = 'http://example.org/another/path'
domain2 = extract_domain_info(url2)
print(f"Domain from '{url2}': {domain2}")
Save the file.
Run the script.
python ~/project/non_capturing.py
The output will show the extracted domain names:
Domain from 'https://www.example.com/path': example.com
Domain from 'http://example.org/another/path': example.org
In this example, (?:www\.)? matches "www." if it exists but does not capture it, so group(1) directly captures the domain name.
Using non-capturing groups can slightly improve performance and keeps the captured group indices cleaner when you only need to capture specific parts of a larger pattern.
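The effect on group indices is easiest to see with re.findall, whose return shape depends on how many capture groups the pattern has. The sketch below compares a capturing and a non-capturing version of the optional "www." prefix. The sample sentence is made up, and the character class is changed to [^/\s]+ (an adaptation of the pattern above) so that a match stops at whitespace inside running text.

```python
import re

text = 'Visit https://www.example.com or http://example.org today'

# Two capture groups: findall returns one tuple per match,
# with an empty string where the optional group did not participate
with_capture = re.findall(r'https?://(www\.)?([^/\s]+)', text)
print(with_capture)      # [('www.', 'example.com'), ('', 'example.org')]

# One capture group: findall returns plain strings
without_capture = re.findall(r'https?://(?:www\.)?([^/\s]+)', text)
print(without_capture)   # ['example.com', 'example.org']
```

With the non-capturing version, the domain is always the only group, so code that consumes the result does not need to skip over a placeholder.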
Summary
In this lab, you have learned how to use regex capture groups in Python. You started with basic capture groups, then explored named capture groups for better readability. You also practiced using capture groups for practical data extraction tasks like parsing log files and URLs. Finally, you were introduced to advanced techniques like nested and non-capturing groups. By mastering these concepts, you can effectively extract and manipulate specific parts of text data using regular expressions in Python.



