How to perform text parsing with regex

Introduction

This tutorial covers text parsing with regular expressions in Python. You'll learn the core techniques for pattern matching, data extraction, and text manipulation with the re module, and see how to apply them to validation, log parsing, and text cleanup.

Regex Fundamentals

What Is a Regular Expression?

A regular expression (regex) is a sequence of characters that defines a search pattern. It gives you a concise, flexible way to match strings, parse text, and perform substitutions in code.

Basic Regex Components

1. Literal Characters

Literal characters match themselves exactly in a text.

import re

text = "Hello, LabEx!"
pattern = "Hello"
result = re.search(pattern, text)
print(result.group())  ## Output: Hello

2. Special Characters and Metacharacters

Metacharacter  Description                    Example
.              Matches any single character   a.c matches "abc", "adc"
^              Matches start of string        ^Hello matches "Hello world"
$              Matches end of string          world$ matches "Hello world"
*              Matches 0 or more repetitions  ab*c matches "ac", "abc", "abbc"
+              Matches 1 or more repetitions  ab+c matches "abc", "abbc"
?              Matches 0 or 1 repetition      colou?r matches "color", "colour"
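
The examples in the table can be checked directly. The following is a minimal sketch; the `cases` list and the `check` helper are illustrative names, and the "should not match" strings are added here for contrast rather than taken from the table.

```python
import re

# Each tuple: (pattern, strings that should match, strings that should not).
# The non-matching strings are illustrative additions.
cases = [
    (r"a.c", ["abc", "adc"], ["ac"]),
    (r"^Hello", ["Hello world"], ["Say Hello"]),
    (r"world$", ["Hello world"], ["world peace"]),
    (r"ab*c", ["ac", "abc", "abbc"], ["adc"]),
    (r"ab+c", ["abc", "abbc"], ["ac"]),
    (r"colou?r", ["color", "colour"], ["colouur"]),
]

def check(pattern, should_match, should_not_match):
    """Return True if the pattern behaves as expected on every string."""
    matches_ok = all(re.search(pattern, s) for s in should_match)
    rejects_ok = not any(re.search(pattern, s) for s in should_not_match)
    return matches_ok and rejects_ok

for pattern, good, bad in cases:
    print(f"{pattern!r}: {check(pattern, good, bad)}")
```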

Regex Workflow

graph TD
    A[Input Text] --> B[Regex Pattern]
    B --> C{Pattern Matching}
    C -->|Match Found| D[Extract/Manipulate Text]
    C -->|No Match| E[No Action]

Character Classes

Predefined Character Classes

  • \d: Matches any digit
  • \w: Matches any word character (letters, digits, and underscore)
  • \s: Matches any whitespace character (space, tab, newline)

import re

text = "LabEx 2023 Tutorial"
digit_pattern = r'\d+'
result = re.findall(digit_pattern, text)
print(result)  ## Output: ['2023']

Quantifiers

Quantifiers specify how many times a character or group should occur:

  • {n}: Exactly n times
  • {n,}: n or more times
  • {n,m}: Between n and m times
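
As a small illustration of the three forms above, the sketch below applies each quantifier to runs of digits. The sample string is made up for this example, and `\b` (word boundary) is used so that each run of digits is considered as a whole.

```python
import re

text = "1 22 333 4444 55555"  # illustrative sample data

# {3}: runs of exactly three digits
exactly_three = re.findall(r"\b\d{3}\b", text)

# {3,}: runs of three or more digits
three_or_more = re.findall(r"\b\d{3,}\b", text)

# {2,4}: runs of two to four digits
two_to_four = re.findall(r"\b\d{2,4}\b", text)

print(exactly_three)   # ['333']
print(three_or_more)   # ['333', '4444', '55555']
print(two_to_four)     # ['22', '333', '4444']
```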

Regex in Python

Python's re module provides comprehensive regex support:

import re

## Matching email pattern
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
email = "user@example.com"
if re.match(email_pattern, email):
    print("Valid email")

Best Practices

  1. Use raw strings (r'') for regex patterns
  2. Test patterns incrementally
  3. Be mindful of performance with complex patterns
  4. Use online regex testers for validation
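
To see why the first practice matters, consider `\b`. In an ordinary Python string it is the backspace character, while in a raw string it reaches the regex engine as a word-boundary assertion. The sample text below is made up for illustration.

```python
import re

text = "cat concatenate cat"  # illustrative sample data

# Without the r prefix, "\b" is a backspace character (ASCII 8),
# so this pattern looks for backspace + "cat" + backspace.
non_raw = re.findall("\bcat\b", text)

# With the r prefix, the regex engine receives \b and treats it
# as a word boundary.
raw = re.findall(r"\bcat\b", text)

print(non_raw)  # []
print(raw)      # ['cat', 'cat']
```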

Pattern Matching Techniques

re.search() vs re.match()

import re

text = "Welcome to LabEx Programming"

## search() finds pattern anywhere in string
search_result = re.search(r'LabEx', text)
print(search_result.group())  ## Output: LabEx

## match() finds pattern only at beginning
match_result = re.match(r'Welcome', text)
print(match_result.group())  ## Output: Welcome
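
Both functions return None when nothing matches, so calling .group() on the result without a check raises AttributeError. The sketch below reuses the same text to show the difference when the pattern is not at the start of the string.

```python
import re

text = "Welcome to LabEx Programming"

# match() only tries at position 0, so "LabEx" is not found.
result = re.match(r"LabEx", text)
if result is None:
    print("match(): no match at the start of the string")

# search() scans the whole string and reports where the match begins.
found = re.search(r"LabEx", text)
if found:
    print(f"search(): found {found.group()!r} at {found.span()}")
```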

Finding All Matches

re.findall() and re.finditer()

text = "Python 3.8, Python 3.9, Python 3.10"

## findall() returns all matched substrings
versions = re.findall(r'Python \d+\.\d+', text)
print(versions)  ## Output: ['Python 3.8', 'Python 3.9', 'Python 3.10']

## finditer() returns iterator of match objects
for match in re.finditer(r'Python (\d+\.\d+)', text):
    print(match.group(1))  ## Prints 3.8, 3.9, 3.10 (one per line)

Grouping and Capturing

Regex Capture Groups

log_entry = "2023-06-15 ERROR: Database connection failed"

pattern = r'(\d{4}-\d{2}-\d{2}) (\w+): (.+)'
match = re.match(pattern, log_entry)

if match:
    date = match.group(1)
    level = match.group(2)
    message = match.group(3)
    print(f"Date: {date}, Level: {level}, Message: {message}")
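
Numbered groups can become hard to follow as a pattern grows. One option is named groups, written `(?P<name>...)`, which the re module supports. The sketch below rewrites the same log pattern this way; the group names `date`, `level`, and `message` are chosen here for illustration.

```python
import re

log_entry = "2023-06-15 ERROR: Database connection failed"

# Same pattern as before, with each group given a name.
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>\w+): (?P<message>.+)"
match = re.match(pattern, log_entry)

# groupdict() maps each group name to its captured text.
fields = match.groupdict() if match else {}
print(fields)
```

Named groups can still be read by number, so `match.group(1)` and `match.group("date")` return the same text.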

Advanced Pattern Matching Techniques

Lookahead and Lookbehind

Technique            Syntax    Description
Positive Lookahead   (?=...)   Matches if followed by pattern
Negative Lookahead   (?!...)   Matches if not followed by pattern
Positive Lookbehind  (?<=...)  Matches if preceded by pattern
Negative Lookbehind  (?<!...)  Matches if not preceded by pattern

text = "price: $50, discount: $10"

## Find prices not preceded by 'discount:'
prices = re.findall(r'(?<!discount: )\$\d+', text)
print(prices)  ## Output: ['$50']
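
Lookaheads work the same way in the other direction. The sketch below uses made-up measurement data. The `\b` in the negative case is an added safeguard: without it, the engine could backtrack and match only part of a number.

```python
import re

text = "10 kg, 20 lb, 30 kg"  # illustrative sample data

# Positive lookahead: numbers that are followed by " kg".
# The unit itself is not part of the match.
kg_values = re.findall(r"\d+(?= kg)", text)

# Negative lookahead: whole numbers that are NOT followed by " kg".
# \b forces the match to end at the end of the number.
other_values = re.findall(r"\d+\b(?! kg)", text)

print(kg_values)     # ['10', '30']
print(other_values)  # ['20']
```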

Pattern Matching Workflow

graph TD
    A[Input Text] --> B[Regex Pattern]
    B --> C{Pattern Matching Method}
    C -->|search()| D[Find First Occurrence]
    C -->|match()| E[Match from Start]
    C -->|findall()| F[Find All Matches]
    C -->|finditer()| G[Iterate Through Matches]

Substitution Techniques

re.sub() and re.subn()

text = "Contact us at support@example.org or info@example.net"

## Replace email domains
anonymized = re.sub(r'@\w+\.\w+', '@example.com', text)
print(anonymized)
## Output: Contact us at support@example.com or info@example.com

## Count replacements with subn()
result, count = re.subn(r'@\w+\.\w+', '@example.com', text)
print(f"Replaced {count} occurrences")
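
The replacement passed to re.sub() does not have to be a fixed string. It can refer to captured groups with `\1`, `\2`, and so on, or it can be a function that receives each match object. The sketch below shows both forms on made-up input.

```python
import re

# Backreferences: reorder a YYYY-MM-DD date into DD/MM/YYYY.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "Released 2023-06-15")
print(reordered)  # Released 15/06/2023

# Function replacement: compute each replacement from the match.
def double_price(match):
    """Return the matched dollar amount doubled."""
    return f"${int(match.group(1)) * 2}"

doubled = re.sub(r"\$(\d+)", double_price, "price: $50, discount: $10")
print(doubled)  # price: $100, discount: $20
```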

Performance Considerations

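  1. Use specific patterns (for example, \d{4} rather than .*) so non-matching input is rejected early
  2. Compile patterns with re.compile() when they are reused, such as inside loops
  3. Avoid excessive backtracking, which nested quantifiers such as (a+)+ can cause
  4. Use non-capturing groups (?:...) when you don't need the captured text
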
## Compiled pattern for efficiency
compiled_pattern = re.compile(r'\d+')
text = "Numbers: 100, 200, 300"
matches = compiled_pattern.findall(text)
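
Non-capturing groups also change what re.findall() returns. When a pattern has one capturing group, findall() returns that group's text rather than the whole match. The sketch below contrasts the two forms on made-up input.

```python
import re

text = "abab cd ababab"  # illustrative sample data

# Capturing group: findall() returns the last text captured by the
# group for each match, not the full match.
captured = re.findall(r"(ab)+", text)

# Non-capturing group: findall() returns each full match.
full_matches = re.findall(r"(?:ab)+", text)

print(captured)      # ['ab', 'ab']
print(full_matches)  # ['abab', 'ababab']
```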

Practical Regex Applications

Data Validation

Email Validation

import re

def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

## Examples
emails = [
    'user@example.com',
    'invalid.email',
    'first.last@example.org'
]

for email in emails:
    print(f"{email}: {validate_email(email)}")
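
One subtlety with `^...$` validation is that `$` also matches just before a trailing newline, so input such as "user@example.com\n" passes. A stricter variant, sketched here, uses re.fullmatch(), which requires the entire string to match. The names `EMAIL_RE`, `ANCHORED`, and `validate_email_strict` are illustrative.

```python
import re

# Same character classes as the pattern above, without ^ and $,
# because fullmatch() already requires the whole string to match.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def validate_email_strict(email):
    """Return True only if the entire string looks like an email address."""
    return EMAIL_RE.fullmatch(email) is not None

ANCHORED = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

with_newline = "user@example.com\n"
print(re.match(ANCHORED, with_newline) is not None)  # True: $ allows the newline
print(validate_email_strict(with_newline))           # False
print(validate_email_strict("user@example.com"))     # True
```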

Password Strength Checker

def check_password_strength(password):
    patterns = [
        r'.{8,}',     ## Minimum 8 characters
        r'[A-Z]',     ## At least one uppercase
        r'[a-z]',     ## At least one lowercase
        r'\d',        ## At least one digit
        r'[!@#$%^&*]' ## At least one special character
    ]

    return all(re.search(pattern, password) for pattern in patterns)

passwords = ['weak', 'Strong1!', 'LabEx2023']
for pwd in passwords:
    print(f"{pwd}: {check_password_strength(pwd)}")

Log Parsing

Extract Log Information

import re

log_entries = [
    '2023-06-15 14:30:45 ERROR Database connection failed',
    '2023-06-15 15:45:22 INFO Server started successfully',
    '2023-06-16 09:12:33 WARNING Low disk space'
]

log_pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)'

for entry in log_entries:
    match = re.match(log_pattern, entry)
    if match:
        date, time, level, message = match.groups()
        print(f"Date: {date}, Time: {time}, Level: {level}, Message: {message}")

Data Extraction

Parsing CSV-like Strings

def parse_csv_like_string(data):
    pattern = r'"([^"]*)"'
    return re.findall(pattern, data)

csv_data = 'Name,Age,City\n"John Doe",30,"New York"\n"Jane Smith",25,"San Francisco"'
parsed_data = parse_csv_like_string(csv_data)
print(parsed_data)
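
The pattern above returns only the quoted fields, so values such as 30 and the header row are dropped. A possible extension, sketched below, uses alternation to capture either a quoted or an unquoted field on each line. `FIELD_RE` and `parse_row` are illustrative names. The sketch skips empty fields and does not handle escaped quotes; for real CSV files, Python's csv module is the more robust choice.

```python
import re

# Either a quoted field (group 1) or a run of characters that are
# neither commas nor newlines (group 2).
FIELD_RE = re.compile(r'"([^"]*)"|([^,\n]+)')

def parse_row(line):
    """Return the fields of one CSV-like line as a list of strings."""
    # findall() yields (quoted, unquoted) tuples; one of the two is empty.
    return [quoted or unquoted for quoted, unquoted in FIELD_RE.findall(line)]

csv_data = 'Name,Age,City\n"John Doe",30,"New York"\n"Jane Smith",25,"San Francisco"'
rows = [parse_row(line) for line in csv_data.splitlines()]
print(rows)
```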

Web Scraping Preprocessing

URL Extraction

def extract_urls(text):
    ## No space in the trailing class, so the match stops at the end of the URL
    url_pattern = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+[/\w.-]*'
    return re.findall(url_pattern, text)

sample_text = """
Check out these websites:
https://www.labex.io
http://example.com/page
Invalid: not a url
"""

urls = extract_urls(sample_text)
print(urls)

Text Transformation

Formatting Phone Numbers

def standardize_phone_number(phone):
    ## Remove non-digit characters
    digits = re.sub(r'\D', '', phone)

    ## Format to (XXX) XXX-XXXX
    if len(digits) == 10:
        return re.sub(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', digits)
    return phone

phone_numbers = [
    '123-456-7890',
    '(987) 654-3210',
    '1234567890'
]

for number in phone_numbers:
    print(f"{number} -> {standardize_phone_number(number)}")

Regex Application Workflow

graph TD
    A[Raw Data Input] --> B[Regex Pattern]
    B --> C{Pattern Matching}
    C -->|Match Found| D[Extract/Transform Data]
    C -->|No Match| E[Handle Exception]
    D --> F[Processed Data]

Performance and Best Practices

Technique       Recommendation
Compilation     Use re.compile() for repeated patterns
Specificity     Write precise patterns
Readability     Use verbose regex with re.VERBOSE flag
Error Handling  Always validate regex matches
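
As an illustration of the readability recommendation, the sketch below rewrites the log pattern from the Log Parsing section with re.VERBOSE and named groups. In verbose mode, whitespace in the pattern is ignored and `#` starts a comment, so a literal space is written as `[ ]`. `LOG_RE` and the group names are illustrative.

```python
import re

LOG_RE = re.compile(r"""
    (?P<date>\d{4}-\d{2}-\d{2})   # date, e.g. 2023-06-15
    [ ]                           # literal space separator
    (?P<time>\d{2}:\d{2}:\d{2})   # time, e.g. 14:30:45
    [ ]
    (?P<level>\w+)                # log level, e.g. ERROR
    [ ]
    (?P<message>.+)               # rest of the line
""", re.VERBOSE)

entry = "2023-06-15 14:30:45 ERROR Database connection failed"
match = LOG_RE.match(entry)

# Validate the match before using it, as the table recommends.
if match:
    print(match.groupdict())
else:
    print("Line did not match the expected log format")
```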

Complex Example: Log Analysis

def analyze_system_logs(log_file):
    error_pattern = r'(\d{4}-\d{2}-\d{2}) .*ERROR: (.+)'
    critical_errors = []

    with open(log_file, 'r') as file:
        for line in file:
            match = re.search(error_pattern, line)
            if match:
                date, error_message = match.groups()
                critical_errors.append((date, error_message))

    return critical_errors

## Hypothetical usage
logs = analyze_system_logs('/var/log/system.log')
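
Because the function above reads from a path that may not exist on your system, it can be easier to experiment with a variant that accepts any iterable of lines. The sketch below keeps the same pattern; `analyze_log_lines`, `ERROR_RE`, and the sample lines are made up for illustration.

```python
import re

# Same pattern as analyze_system_logs, compiled once for reuse.
ERROR_RE = re.compile(r"(\d{4}-\d{2}-\d{2}) .*ERROR: (.+)")

def analyze_log_lines(lines):
    """Return (date, message) tuples for every line containing 'ERROR: '."""
    errors = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            errors.append(match.groups())
    return errors

# Illustrative sample data; an open file object can be passed the same way.
sample_lines = [
    "2023-06-15 14:30:45 ERROR: Database connection failed\n",
    "2023-06-15 15:45:22 INFO: Server started successfully\n",
    "2023-06-16 09:12:33 ERROR: Disk quota exceeded\n",
]

print(analyze_log_lines(sample_lines))
```

Because `.` does not match a newline by default, the captured message excludes the trailing line ending.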

Summary

This tutorial covered regex fundamentals, the main matching functions in Python's re module, and practical applications such as validation, log parsing, data extraction, and text transformation. With these techniques, you can search, validate, and extract information from text data more precisely and with less code.