Advanced Capture Group Techniques
Beyond basic capture groups, Python's re module supports more advanced features such as nested capture groups, non-capturing groups, and lookarounds.
Nested Capture Groups
Capture groups can be nested within other capture groups to extract more granular information.
Create a new file named nested_capture.py in the ~/project directory.
touch ~/project/nested_capture.py
Open nested_capture.py and add the following code:
import re

def parse_complex_data(text):
    pattern = r'((\w+)\s(\w+))\s\[(\d+)\]'
    match = re.match(pattern, text)
    if match:
        full_name = match.group(1)
        first_name = match.group(2)
        last_name = match.group(3)
        id_number = match.group(4)
        return {
            'full_name': full_name,
            'first_name': first_name,
            'last_name': last_name,
            'id': id_number
        }
    return None

text = 'John Doe [12345]'
result = parse_complex_data(text)
if result:
    print(f"Full Name: {result['full_name']}")
    print(f"First Name: {result['first_name']}")
    print(f"Last Name: {result['last_name']}")
    print(f"ID: {result['id']}")
else:
    print("No match found.")
Save the file.
Run the script.
python ~/project/nested_capture.py
The output will show the extracted data, including the full name and its components:
Full Name: John Doe
First Name: John
Last Name: Doe
ID: 12345
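Here, ((\w+)\s(\w+)) is a capture group that contains two nested groups. Groups are numbered by the position of their opening parenthesis, from left to right, so group(1) captures the full name "John Doe", group(2) captures "John", group(3) captures "Doe", and group(4) captures the ID.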
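To see the numbering at a glance, the following minimal sketch reuses the same pattern and prints every group with its index. The sample variable is introduced only for illustration and is not part of nested_capture.py.

```python
import re

# Same pattern as in nested_capture.py; "sample" is an illustrative name.
pattern = r'((\w+)\s(\w+))\s\[(\d+)\]'
sample = 'John Doe [12345]'

match = re.match(pattern, sample)
if match:
    # group(0) is always the entire matched text.
    print(0, match.group(0))
    # groups() returns groups 1..N as a tuple, in opening-parenthesis order.
    for index, value in enumerate(match.groups(), start=1):
        print(index, value)
```

The outer group gets index 1 because its opening parenthesis comes first, even though it closes after groups 2 and 3.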
Non-Capturing Groups
Sometimes you need to group part of a pattern so that a quantifier or an alternation applies to it as a unit, but you do not need to capture the matched text. Non-capturing groups, written (?:...), serve this purpose.
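As a quick illustration of both uses, here is a small sketch. The sample strings and variable names are made up for this example.

```python
import re

# Quantifier applied to the whole group: one or more repetitions of "ab".
repeat = re.fullmatch(r'(?:ab)+', 'ababab')
print(repeat is not None)  # True
print(repeat.groups())     # () because nothing is captured

# Alternation scoped to the group: "cat" or "dog", each followed by "s".
animals = re.findall(r'\b(?:cat|dog)s\b', 'cats, dogs and birds')
print(animals)             # ['cats', 'dogs']
```

Because the pattern contains no capturing groups, re.findall returns the full matches rather than group contents.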
Create a new file named non_capturing.py in the ~/project directory.
touch ~/project/non_capturing.py
Open non_capturing.py and add the following code:
import re

def extract_domain_info(url):
    # (?:...) creates a non-capturing group
    pattern = r'https?://(?:www\.)?([^/]+)'
    match = re.match(pattern, url)
    if match:
        domain = match.group(1)  # Only the domain is captured
        return domain
    return None

url1 = 'https://www.example.com/path'
domain1 = extract_domain_info(url1)
print(f"Domain from '{url1}': {domain1}")

url2 = 'http://example.org/another/path'
domain2 = extract_domain_info(url2)
print(f"Domain from '{url2}': {domain2}")
Save the file.
Run the script.
python ~/project/non_capturing.py
The output will show the extracted domain names:
Domain from 'https://www.example.com/path': example.com
Domain from 'http://example.org/another/path': example.org
In this example, (?:www\.)? matches "www." if it exists but does not capture it, so group(1) directly captures the domain name.
Non-capturing groups keep the captured group indices cleaner when you only need specific parts of a larger pattern. Because the regex engine does not have to store the grouped text, they can also be marginally faster, although the difference is usually small.
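The effect on group indices is easiest to see side by side. This sketch compares the pattern from non_capturing.py with a hypothetical variant that captures the "www." prefix. It does not measure performance.

```python
import re

url = 'https://www.example.com/path'

# Hypothetical variant: the optional "www." prefix is captured as group 1.
capturing = re.match(r'https?://(www\.)?([^/]+)', url)
# Pattern from non_capturing.py: the prefix is grouped but not captured.
non_capturing = re.match(r'https?://(?:www\.)?([^/]+)', url)

print(capturing.groups())      # ('www.', 'example.com')
print(non_capturing.groups())  # ('example.com',)
```

With the capturing variant the domain moves to group(2), and group(1) is None for URLs without "www.", so code that reads the domain has to account for the extra group.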