Practical Applications of Text Cleaning
Now that we have covered several methods for filtering non-alphanumeric characters, let's apply these techniques to real-world scenarios.
Cleaning User Input
User input often contains unexpected characters that must be removed before the input is used. Create a file named text_cleaning_app.py to demonstrate this:
# Text cleaning application for user input
import re


def clean_username(username):
    """Clean a username by removing everything except ASCII letters, digits, and underscores"""
    return re.sub(r'[^a-zA-Z0-9_]', '', username)


def clean_search_query(query):
    """Keep ASCII letters, digits, and whitespace; collapse whitespace runs to one space"""
    # First, remove every character that is neither alphanumeric nor whitespace
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', query)
    # Then, replace each run of whitespace with a single space
    cleaned = re.sub(r'\s+', ' ', cleaned)
    # Finally, strip leading and trailing whitespace
    return cleaned.strip()


# Simulate user input
usernames = [
    "john.doe",
    "user@example",
    "my username!",
    "admin_123"
]
search_queries = [
    "python programming",
    "how to filter?! special chars",
    "$ regex examples $",
    " string methods "
]

# Clean and display usernames
print("Username Cleaning:")
print("-" * 40)
for username in usernames:
    cleaned = clean_username(username)
    print(f"Original: {username}")
    print(f"Cleaned: {cleaned}")
    print("-" * 40)

# Clean and display search queries
print("\nSearch Query Cleaning:")
print("-" * 40)
for query in search_queries:
    cleaned = clean_search_query(query)
    print(f"Original: '{query}'")
    print(f"Cleaned: '{cleaned}'")
    print("-" * 40)
Run this file:
python3 /home/labex/project/text_cleaning_app.py
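As a quick check of what the script should print, the following sketch restates the two functions and asserts the cleaned value for each sample input. The final line goes beyond the original examples: the input `caf\u00e9_owner` is an assumption about data you might encounter, included to show that the ASCII-only character class also removes accented letters.

```python
import re


def clean_username(username):
    # Keep only ASCII letters, digits, and underscores
    return re.sub(r'[^a-zA-Z0-9_]', '', username)


def clean_search_query(query):
    # Drop everything except ASCII letters, digits, and whitespace,
    # then collapse whitespace runs and trim the ends
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', query)
    cleaned = re.sub(r'\s+', ' ', cleaned)
    return cleaned.strip()


expected_usernames = {
    "john.doe": "johndoe",
    "user@example": "userexample",
    "my username!": "myusername",
    "admin_123": "admin_123",
}
for raw, expected in expected_usernames.items():
    assert clean_username(raw) == expected

expected_queries = {
    "python programming": "python programming",
    "how to filter?! special chars": "how to filter special chars",
    "$ regex examples $": "regex examples",
    " string methods ": "string methods",
}
for raw, expected in expected_queries.items():
    assert clean_search_query(raw) == expected

# Hypothetical non-ASCII input: the accented letter is removed as well
print(clean_username("caf\u00e9_owner"))  # prints: caf_owner
```

If usernames in your application may contain non-ASCII letters, an ASCII-only pattern like this one is probably too strict and the character class would need to be widened.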
Processing File Data
Next, let's create a sample text file and clean it. First, create a file named sample_data.txt with the following content:
User1: john.doe@example.com (Active: Yes)
User2: jane_smith@example.com (Active: No)
User3: admin#123@system.org (Active: Yes)
Notes: Users should change their passwords every 90 days!
You can create this file with the WebIDE editor. Now create a file named file_cleaner.py to clean this data:
# File cleaning application
import re
import sys


def extract_emails(text):
    """Extract email addresses from text"""
    # Simple pattern for email extraction, not a full address validator.
    # Characters outside the first character class (such as '#') cut the match
    # short, so 'admin#123@system.org' is extracted as '123@system.org'.
    email_pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
    return re.findall(email_pattern, text)


def extract_usernames(text):
    """Extract the username part (before the '@') from email addresses"""
    emails = extract_emails(text)
    usernames = [email.split('@')[0] for email in emails]
    return usernames


def clean_usernames(usernames):
    """Clean usernames by removing non-alphanumeric characters"""
    return [re.sub(r'[^a-zA-Z0-9]', '', username) for username in usernames]


# Read the sample data file
try:
    with open('/home/labex/project/sample_data.txt', 'r', encoding='utf-8') as file:
        data = file.read()
except FileNotFoundError:
    print("Error: sample_data.txt file not found!")
    sys.exit(1)

# Process the data
print("File Cleaning Results:")
print("-" * 50)
print("Original data:")
print(data)
print("-" * 50)

# Extract emails
emails = extract_emails(data)
print(f"Extracted {len(emails)} email addresses:")
for email in emails:
    print(f" - {email}")

# Extract and clean usernames
usernames = extract_usernames(data)
cleaned_usernames = clean_usernames(usernames)
print("\nUsername extraction and cleaning:")
for i, (original, cleaned) in enumerate(zip(usernames, cleaned_usernames), start=1):
    print(f" - User {i}: {original} → {cleaned}")
print("-" * 50)
Run this file:
python3 /home/labex/project/file_cleaner.py
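To preview the extraction results without touching the file system, the sketch below applies the same pattern to the sample text held in a string. The names `EMAIL_PATTERN` and `sample` are introduced here for illustration only. One result is worth noticing: `#` is not part of the pattern's first character class, so the third address is matched only from `123` onward.

```python
import re

# Same simple pattern as in file_cleaner.py
EMAIL_PATTERN = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

# The contents of sample_data.txt as a string
sample = (
    "User1: john.doe@example.com (Active: Yes)\n"
    "User2: jane_smith@example.com (Active: No)\n"
    "User3: admin#123@system.org (Active: Yes)\n"
    "Notes: Users should change their passwords every 90 days!\n"
)

emails = re.findall(EMAIL_PATTERN, sample)
usernames = [email.split('@')[0] for email in emails]
cleaned = [re.sub(r'[^a-zA-Z0-9]', '', name) for name in usernames]

print(emails)     # ['john.doe@example.com', 'jane_smith@example.com', '123@system.org']
print(usernames)  # ['john.doe', 'jane_smith', '123']
print(cleaned)    # ['johndoe', 'janesmith', '123']
```

Whether `admin#123` should count as a valid local part depends on your data. If it should, one option is to add `#` to the first character class; the pattern would still be a rough extractor rather than a validator.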
Performance Comparison
Different filtering methods can have different performance characteristics. Create a file named performance_test.py to compare them:
# Performance comparison of different filtering methods
import re
import time
from string import ascii_letters, digits


def filter_with_loop(text):
    """Filter using a simple loop"""
    result = ""
    for char in text:
        if char.isalnum():
            result += char
    return result


def filter_with_comprehension(text):
    """Filter using a generator expression passed to str.join"""
    return ''.join(char for char in text if char.isalnum())


def filter_with_filter_function(text):
    """Filter using the built-in filter function"""
    return ''.join(filter(str.isalnum, text))


def filter_with_regex(text):
    """Filter using regular expressions (ASCII letters and digits only)"""
    return re.sub(r'[^a-zA-Z0-9]', '', text)


def filter_with_translate(text):
    """Filter using str.translate (removes only ASCII non-alphanumeric characters)"""
    # Build a translation table that deletes every ASCII character that is not a letter or digit
    allowed = ascii_letters + digits
    translation_table = str.maketrans('', '', ''.join(c for c in map(chr, range(128)) if c not in allowed))
    return text.translate(translation_table)


# Generate test data: the printable ASCII characters (codes 33-126), repeated 1000 times
test_data = "".join(chr(i) for i in range(33, 127)) * 1000

# Define the filtering methods to test
methods = [
    ("Simple Loop", filter_with_loop),
    ("Generator Expression", filter_with_comprehension),
    ("Filter Function", filter_with_filter_function),
    ("Regular Expression", filter_with_regex),
    ("String Translate", filter_with_translate)
]

print("Performance Comparison:")
print("-" * 60)
print(f"Test data length: {len(test_data)} characters")
print("-" * 60)
print(f"{'Method':<20} | {'Time (seconds)':<15} | {'Characters Removed':<20}")
print("-" * 60)

# Test each method once; perf_counter is the clock intended for short timings
for name, func in methods:
    start_time = time.perf_counter()
    result = func(test_data)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    chars_removed = len(test_data) - len(result)
    print(f"{name:<20} | {execution_time:<15.6f} | {chars_removed:<20}")
print("-" * 60)
Run this file:
python3 /home/labex/project/performance_test.py
The output shows which method ran fastest on this test input. Because each method is timed only once, the numbers vary between runs and machines, so treat them as a rough guide rather than a ranking. The differences matter mainly when you process large volumes of text.
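For steadier numbers, one option is the standard library's `timeit` module, which repeats each call many times. The sketch below is a minimal variant under stated assumptions: it covers only three of the five methods, the names `with_join`, `with_regex`, and `with_translate` are illustrative, and the regex and translation table are built once outside the functions, which differs from the script above. It also checks that the methods agree on the ASCII test data and shows where they diverge on non-ASCII input.

```python
import re
import timeit
from string import ascii_letters, digits

# Built once, not on every call (an assumption that differs from performance_test.py)
NON_ALNUM = re.compile(r'[^a-zA-Z0-9]')
DELETE_TABLE = str.maketrans('', '', ''.join(
    c for c in map(chr, range(128)) if c not in ascii_letters + digits))


def with_join(text):
    return ''.join(filter(str.isalnum, text))


def with_regex(text):
    return NON_ALNUM.sub('', text)


def with_translate(text):
    return text.translate(DELETE_TABLE)


test_data = "".join(chr(i) for i in range(33, 127)) * 1000

# The methods must agree on this input before their timings are comparable
assert with_join(test_data) == with_regex(test_data) == with_translate(test_data)

CALLS = 10
for func in (with_join, with_regex, with_translate):
    # Three rounds of CALLS calls each; the minimum is the least disturbed round
    best = min(timeit.repeat(lambda: func(test_data), number=CALLS, repeat=3))
    print(f"{func.__name__:<15} {best / CALLS:.6f} s per call")

# str.isalnum is Unicode-aware; the ASCII-only variants drop the accented letter
print(with_join("caf\u00e9 42"))   # prints: café42
print(with_regex("caf\u00e9 42"))  # prints: caf42
```

The last two lines matter when choosing a method: speed is only comparable between methods that produce the same result for your data.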