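How to Filter Non-Alphanumeric Characters from a Python String

Introduction

Working with strings is a fundamental skill in Python programming. You will often need to clean text data by removing special characters, punctuation, or other non-alphanumeric characters. This step matters in many kinds of applications, including data analysis, natural language processing, and web development.

This tutorial walks you through several ways to filter non-alphanumeric characters out of Python strings. By the end, you will be able to turn messy text into clean, structured data that is easier to process in your Python applications.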

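Python String Basics and Alphanumeric Characters

Before filtering non-alphanumeric characters, let's review what strings and alphanumeric characters are in Python.

What Is a Python String?

A string in Python is a sequence of characters enclosed in quotes. You can define a string with single quotes ('), double quotes ("), or triple quotes (''' or """).

Let's create a new Python file to experiment with strings. In the WebIDE, click the "New File" icon in the Explorer panel to create a new file in the /home/labex/project directory. Name the file string_basics.py.

Add the following code to the file: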

## Different ways to define strings in Python
string1 = 'Hello, World!'
string2 = "Python Programming"
string3 = '''This is a
multiline string.'''

## Display each string
print("String 1:", string1)
print("String 2:", string2)
print("String 3:", string3)

To run the file, open a terminal (if one is not already open) and execute the following command:

python3 /home/labex/project/string_basics.py

You should see output similar to the following:

String 1: Hello, World!
String 2: Python Programming
String 3: This is a
multiline string.

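What Are Alphanumeric Characters?

Alphanumeric characters include:

Letters (A-Z, a-z)
Digits (0-9)

Any other character (such as punctuation, spaces, or symbols) is considered non-alphanumeric.

Let's create another file that checks whether a character is alphanumeric. Create a new file named alphanumeric_check.py with the following content: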

## Check if characters are alphanumeric
test_string = "Hello123!@#"

print("Testing each character in:", test_string)
print("Character | Alphanumeric?")
print("-" * 24)

for char in test_string:
    is_alnum = char.isalnum()
    print(f"{char:^9} | {is_alnum}")

## Check entire strings
examples = ["ABC123", "Hello!", "12345", "a b c"]
print("\nChecking entire strings:")
for ex in examples:
    print(f"{ex:10} | {ex.isalnum()}")

Run the file:

python3 /home/labex/project/alphanumeric_check.py

The output shows which characters are alphanumeric and which are not:

Testing each character in: Hello123!@#
Character | Alphanumeric?
------------------------
    H     | True
    e     | True
    l     | True
    l     | True
    o     | True
    1     | True
    2     | True
    3     | True
    !     | False
    @     | False
    #     | False

Checking entire strings:
ABC123     | True
Hello!     | False
12345      | True
a b c      | False

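As you can see, isalnum() returns True for letters and digits and False for any other character. This is useful whenever we need to identify non-alphanumeric characters. Keep in mind that isalnum() is Unicode-aware: letters and digits from other scripts, such as é or 中, also count as alphanumeric.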
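
To illustrate what isalnum() accepts beyond A-Z, a-z, and 0-9, here is a minimal sketch that tests a few non-ASCII characters and shows one possible way to restrict the check to ASCII. The helper name is_ascii_alnum is hypothetical and is not part of the tutorial's files; str.isascii() assumes Python 3.7 or later.

```python
def is_ascii_alnum(char):
    """Hypothetical helper: True only for ASCII letters and digits."""
    return char.isascii() and char.isalnum()

# \u00e9 is e-acute, \u4e2d is a CJK character, \u00b2 is superscript two
samples = ["A", "7", "\u00e9", "\u4e2d", "\u00b2", "_", " "]

for char in samples:
    print(f"{char!r:6} | isalnum: {char.isalnum()!s:5} | ASCII only: {is_ascii_alnum(char)}")
```

The two columns differ only for the non-ASCII samples, which isalnum() accepts and the ASCII-only check rejects.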

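Filtering with String Methods

Python's built-in string methods and functions make it straightforward to filter out non-alphanumeric characters. In this step, we will explore them and write our own filtering functions.

Using a Generator Expression

A common way to filter characters is to pass a generator expression to str.join(). Let's create a new file named string_filter.py:

## Using a generator expression to filter non-alphanumeric characters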

def filter_alphanumeric(text):
    ## Keep only alphanumeric characters
    filtered_text = ''.join(char for char in text if char.isalnum())
    return filtered_text

## Test the function with different examples
test_strings = [
    "Hello, World!",
    "Python 3.10 is amazing!",
    "Email: user@example.com",
    "Phone: (123) 456-7890"
]

print("Original vs Filtered:")
print("-" * 40)

for text in test_strings:
    filtered = filter_alphanumeric(text)
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 40)

Run the file:

python3 /home/labex/project/string_filter.py

You should see output like this:

Original vs Filtered:
----------------------------------------
Original: Hello, World!
Filtered: HelloWorld
----------------------------------------
Original: Python 3.10 is amazing!
Filtered: Python310isamazing
----------------------------------------
Original: Email: user@example.com
Filtered: Emailuserexamplecom
----------------------------------------
Original: Phone: (123) 456-7890
Filtered: Phone1234567890
----------------------------------------

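The filter_alphanumeric() function iterates over each character in the string and keeps only those that pass the isalnum() check.

Using the filter() Function

Python's built-in filter() function offers another way to get the same result. Let's add this approach to our file: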

## Add to the string_filter.py file

def filter_alphanumeric_using_filter(text):
    ## Using the built-in filter() function
    filtered_text = ''.join(filter(str.isalnum, text))
    return filtered_text

print("\nUsing the filter() function:")
print("-" * 40)

for text in test_strings:
    filtered = filter_alphanumeric_using_filter(text)
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 40)

Open string_filter.py in the WebIDE and add the code above to the end of the file. Then run it again:

python3 /home/labex/project/string_filter.py

Both approaches produce the same results.
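
One way to confirm this without comparing output by eye is a small self-check. The sketch below redefines both functions so it runs on its own; the sample strings are illustrative and were chosen here, not taken from the lab.

```python
def filter_alphanumeric(text):
    # Generator expression version
    return ''.join(char for char in text if char.isalnum())

def filter_alphanumeric_using_filter(text):
    # Built-in filter() version
    return ''.join(filter(str.isalnum, text))

samples = ["Hello, World!", "Python 3.10 is amazing!", "Phone: (123) 456-7890", ""]

for text in samples:
    # Both functions apply the same str.isalnum test to each character
    assert filter_alphanumeric(text) == filter_alphanumeric_using_filter(text)

print("Both approaches agree on all samples")
```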

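Custom Filtering

Sometimes you want to keep some non-alphanumeric characters while removing the rest. Let's add a function that lets us specify additional characters to keep: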

## Add to the string_filter.py file

def custom_filter(text, keep_chars=""):
    ## Keep alphanumeric characters and any characters specified in keep_chars
    filtered_text = ''.join(char for char in text if char.isalnum() or char in keep_chars)
    return filtered_text

print("\nCustom filtering (keeping spaces and @):")
print("-" * 40)

for text in test_strings:
    filtered = custom_filter(text, keep_chars=" @")
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 40)

Add this code to the end of string_filter.py, then run it again:

python3 /home/labex/project/string_filter.py

This time the filtered results keep spaces and the @ symbol, which is useful when you need to preserve certain formatting or special characters.
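
For reference, here is a standalone sketch of the same idea with the results written out as comments. It differs from the version above in one assumed detail: keep_chars is converted to a set first, which only matters when the list of characters to keep is long.

```python
def custom_filter(text, keep_chars=""):
    # Convert once so each membership test is a set lookup
    keep = set(keep_chars)
    return ''.join(char for char in text if char.isalnum() or char in keep)

print(custom_filter("Email: user@example.com", keep_chars=" @"))  # Email user@examplecom
print(custom_filter("Phone: (123) 456-7890", keep_chars=" @"))    # Phone 123 4567890
print(custom_filter("Phone: (123) 456-7890", keep_chars="-"))     # Phone123456-7890
```

Note that the period in the email address is still removed, because only the space and @ were listed in keep_chars.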

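Text Cleaning with Regular Expressions

Regular expressions (regex) provide a powerful way to identify and process patterns in text. Python's re module provides functions for working with them.

A Basic Regular Expression for Character Filtering

Let's create a new file named regex_filter.py: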

## Using regular expressions to filter non-alphanumeric characters
import re

def filter_with_regex(text):
    ## Replace all non-alphanumeric characters with an empty string
    filtered_text = re.sub(r'[^a-zA-Z0-9]', '', text)
    return filtered_text

## Test the function with different examples
test_strings = [
    "Hello, World!",
    "Python 3.10 is amazing!",
    "Email: user@example.com",
    "Phone: (123) 456-7890"
]

print("Original vs Regex Filtered:")
print("-" * 40)

for text in test_strings:
    filtered = filter_with_regex(text)
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 40)

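The pattern [^a-zA-Z0-9] matches any character that is not an uppercase letter, a lowercase letter, or a digit. re.sub() replaces every match with an empty string. Unlike isalnum(), this pattern is ASCII-only, so accented or non-Latin letters are removed as well.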

Run the file:

python3 /home/labex/project/regex_filter.py

You should see output similar to the following:

Original vs Regex Filtered:
----------------------------------------
Original: Hello, World!
Filtered: HelloWorld
----------------------------------------
Original: Python 3.10 is amazing!
Filtered: Python310isamazing
----------------------------------------
Original: Email: user@example.com
Filtered: Emailuserexamplecom
----------------------------------------
Original: Phone: (123) 456-7890
Filtered: Phone1234567890
----------------------------------------
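
The pattern [^a-zA-Z0-9] only knows about ASCII letters and digits. If your text contains accented or non-Latin letters, one possible alternative is [\W_], which for str patterns uses Python's Unicode-aware definition of a word character. The sketch below compares the two; the constant names and the sample text are illustrative assumptions.

```python
import re

# Precompiled patterns (names are illustrative)
ASCII_ONLY = re.compile(r'[^a-zA-Z0-9]')
UNICODE_AWARE = re.compile(r'[\W_]')  # \W is "not a word character"; _ is added explicitly

# "Café_2024, naïve!" written with escapes so the accented letters are single code points
text = "Caf\u00e9_2024, na\u00efve!"

print(ASCII_ONLY.sub('', text))     # accented letters are removed
print(UNICODE_AWARE.sub('', text))  # accented letters are kept
```

Which behavior you want depends on the data: the ASCII pattern is predictable for identifiers, while the Unicode-aware pattern avoids mangling names and words in other languages.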

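Custom Patterns with Regular Expressions

Regular expressions allow more complex patterns and replacements. Let's add a function that accepts a custom pattern: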

## Add to the regex_filter.py file

def custom_regex_filter(text, pattern=r'[^a-zA-Z0-9]', replacement=''):
    ## Replace characters matching the pattern with the replacement
    filtered_text = re.sub(pattern, replacement, text)
    return filtered_text

print("\nCustom regex filtering (keeping spaces and some punctuation):")
print("-" * 60)

## Keep alphanumeric chars, spaces, and @.
custom_pattern = r'[^a-zA-Z0-9\s@\.]'

for text in test_strings:
    filtered = custom_regex_filter(text, pattern=custom_pattern)
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 60)

The pattern [^a-zA-Z0-9\s@\.] matches any character that is not alphanumeric, whitespace (\s), an @ symbol, or a period. Add this code to your regex_filter.py file and run it again:

python3 /home/labex/project/regex_filter.py
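
The replacement parameter of custom_regex_filter() does not have to be an empty string. As a sketch of what a non-empty replacement can do, the hypothetical make_slug() helper below (not part of the lab files) collapses each run of non-alphanumeric characters into a single hyphen.

```python
import re

def custom_regex_filter(text, pattern=r'[^a-zA-Z0-9]', replacement=''):
    # Same function as in regex_filter.py, repeated so this sketch runs on its own
    return re.sub(pattern, replacement, text)

def make_slug(text):
    """Hypothetical helper: turn text into a lowercase, hyphen-separated slug."""
    # The trailing + makes each run of unwanted characters a single match
    slug = custom_regex_filter(text, pattern=r'[^a-zA-Z0-9]+', replacement='-')
    return slug.strip('-').lower()

print(make_slug("Hello, World!"))            # hello-world
print(make_slug("Python 3.10 is amazing!"))  # python-3-10-is-amazing
```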

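Identifying Non-Alphanumeric Characters

Sometimes you may want to know which non-alphanumeric characters a string contains. Let's add a function that identifies them: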

## Add to the regex_filter.py file

def identify_non_alphanumeric(text):
    ## Find all non-alphanumeric characters in the text
    non_alphanumeric = re.findall(r'[^a-zA-Z0-9]', text)
    ## Return unique characters as a set
    return set(non_alphanumeric)

print("\nIdentifying non-alphanumeric characters:")
print("-" * 40)

for text in test_strings:
    characters = identify_non_alphanumeric(text)
    print(f"Text: {text}")
    print(f"Non-alphanumeric characters: {characters}")
    print("-" * 40)

Add this code to your regex_filter.py file and run it again:

python3 /home/labex/project/regex_filter.py

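The output shows which non-alphanumeric characters appear in each string, which helps you understand what needs to be filtered from your data.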
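
A set tells you which characters are present but not how often. If frequency matters, one option is collections.Counter. The function name count_non_alphanumeric below is hypothetical, and this sketch uses isalnum() rather than the regex, so non-ASCII letters are not counted as unwanted.

```python
from collections import Counter

def count_non_alphanumeric(text):
    """Hypothetical helper: count each non-alphanumeric character in text."""
    return Counter(char for char in text if not char.isalnum())

counts = count_non_alphanumeric("Phone: (123) 456-7890")

# most_common() lists characters from most to least frequent
for char, n in counts.most_common():
    print(f"{char!r}: {n}")
```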

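Real-World Text Cleaning Applications

Now that we have covered several ways to filter non-alphanumeric characters, let's apply these techniques to realistic scenarios.

Cleaning User Input

User input often contains unexpected characters that need to be cleaned. Let's create a file named text_cleaning_app.py to demonstrate this: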

## Text cleaning application for user input
import re

def clean_username(username):
    """Cleans a username by removing special characters and spaces"""
    return re.sub(r'[^a-zA-Z0-9_]', '', username)

def clean_search_query(query):
    """Preserves alphanumeric chars and spaces, replaces multiple spaces with one"""
    ## First, replace non-alphanumeric chars (except spaces) with empty string
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', query)
    ## Then, replace multiple spaces with a single space
    cleaned = re.sub(r'\s+', ' ', cleaned)
    ## Finally, strip leading and trailing spaces
    return cleaned.strip()

## Simulate user input
usernames = [
    "john.doe",
    "user@example",
    "my username!",
    "admin_123"
]

search_queries = [
    "python   programming",
    "how to filter?!  special chars",
    "$ regex      examples $",
    "   string methods   "
]

## Clean and display usernames
print("Username Cleaning:")
print("-" * 40)
for username in usernames:
    cleaned = clean_username(username)
    print(f"Original: {username}")
    print(f"Cleaned:  {cleaned}")
    print("-" * 40)

## Clean and display search queries
print("\nSearch Query Cleaning:")
print("-" * 40)
for query in search_queries:
    cleaned = clean_search_query(query)
    print(f"Original: '{query}'")
    print(f"Cleaned:  '{cleaned}'")
    print("-" * 40)

Run the file:

python3 /home/labex/project/text_cleaning_app.py
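
Cleaning silently changes what the user typed: clean_username() turns "john.doe" into "johndoe", for example. In some applications it may be preferable to reject such input instead. The sketch below shows one way to validate with re.fullmatch(); the name is_valid_username and the choice to reject rather than clean are assumptions, not part of the lab.

```python
import re

# Same allowed characters as clean_username(): letters, digits, underscore
USERNAME_RE = re.compile(r'[a-zA-Z0-9_]+')

def is_valid_username(username):
    """Hypothetical helper: True if every character is allowed and the name is not empty."""
    return USERNAME_RE.fullmatch(username) is not None

for name in ["john.doe", "user@example", "my username!", "admin_123"]:
    print(f"{name!r}: valid={is_valid_username(name)}")
```

Of the four sample usernames, only admin_123 passes, because it is the only one made up entirely of allowed characters.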

Processing File Data

Let's create a sample text file and clean it. First, create a file named sample_data.txt with the following content:

User1: john.doe@example.com (Active: Yes)
User2: jane_smith@example.com (Active: No)
User3: admin#123@system.org (Active: Yes)
Notes: Users should change their passwords every 90 days!

You can create this file with the WebIDE editor. Now let's create a file named file_cleaner.py to clean the data:

## File cleaning application
import re

def extract_emails(text):
    """Extract email addresses from text"""
    ## Simple regex for email extraction
    email_pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
    return re.findall(email_pattern, text)

def extract_usernames(text):
    """Extract the username part from email addresses"""
    emails = extract_emails(text)
    usernames = [email.split('@')[0] for email in emails]
    return usernames

def clean_usernames(usernames):
    """Clean usernames by removing non-alphanumeric characters"""
    return [re.sub(r'[^a-zA-Z0-9]', '', username) for username in usernames]

## Read the sample data file
try:
    with open('/home/labex/project/sample_data.txt', 'r') as file:
        data = file.read()
except FileNotFoundError:
    print("Error: sample_data.txt file not found!")
    raise SystemExit(1)

## Process the data
print("File Cleaning Results:")
print("-" * 50)
print("Original data:")
print(data)
print("-" * 50)

## Extract emails
emails = extract_emails(data)
print(f"Extracted {len(emails)} email addresses:")
for email in emails:
    print(f"  - {email}")

## Extract and clean usernames
usernames = extract_usernames(data)
cleaned_usernames = clean_usernames(usernames)

print("\nUsername extraction and cleaning:")
for i, (original, cleaned) in enumerate(zip(usernames, cleaned_usernames)):
    print(f"  - User {i+1}: {original} → {cleaned}")

print("-" * 50)

Run the file:

python3 /home/labex/project/file_cleaner.py
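
One detail worth checking in the output is the third user. The simple email pattern does not allow # in the local part, so for admin#123@system.org the match starts after the #. The sketch below isolates that line to show the effect:

```python
import re

# Same pattern as extract_emails() in file_cleaner.py
email_pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

line = "User3: admin#123@system.org (Active: Yes)"
emails = re.findall(email_pattern, line)
print(emails)  # ['123@system.org']

# The username is whatever precedes the @ in the extracted address
usernames = [email.split('@')[0] for email in emails]
print(usernames)  # ['123']
```

This is a reminder to inspect extraction results on real data, since a pattern that is too narrow can return a plausible but truncated value.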

Performance Comparison

Different filtering methods can have different performance characteristics. Let's create a file named performance_test.py to compare them:

## Performance comparison of different filtering methods
import re
import time

def filter_with_loop(text):
    """Filter using a simple loop"""
    result = ""
    for char in text:
        if char.isalnum():
            result += char
    return result

def filter_with_comprehension(text):
    """Filter using a generator expression"""
    return ''.join(char for char in text if char.isalnum())

def filter_with_filter_function(text):
    """Filter using the built-in filter function"""
    return ''.join(filter(str.isalnum, text))

def filter_with_regex(text):
    """Filter using regular expressions"""
    return re.sub(r'[^a-zA-Z0-9]', '', text)

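def filter_with_translate(text):
    """Filter using str.translate (removes ASCII non-alphanumerics only)"""
    ## Create a translation table that deletes every non-alphanumeric ASCII char;
    ## non-ASCII characters are not in the table and pass through unchanged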
    from string import ascii_letters, digits
    allowed = ascii_letters + digits
    translation_table = str.maketrans('', '', ''.join(c for c in map(chr, range(128)) if c not in allowed))
    return text.translate(translation_table)

## Generate test data (a string with a mix of alphanumeric and other characters)
test_data = "".join(chr(i) for i in range(33, 127)) * 1000  ## ASCII printable characters repeated

## Define the filtering methods to test
methods = [
    ("Simple Loop", filter_with_loop),
    ("Generator Expression", filter_with_comprehension),
    ("Filter Function", filter_with_filter_function),
    ("Regular Expression", filter_with_regex),
    ("String Translate", filter_with_translate)
]

print("Performance Comparison:")
print("-" * 60)
print(f"Test data length: {len(test_data)} characters")
print("-" * 60)
print(f"{'Method':<20} | {'Time (seconds)':<15} | {'Characters Removed':<20}")
print("-" * 60)

## Test each method
for name, func in methods:
    start_time = time.perf_counter()
    result = func(test_data)
    end_time = time.perf_counter()

    execution_time = end_time - start_time
    chars_removed = len(test_data) - len(result)

    print(f"{name:<20} | {execution_time:<15.6f} | {chars_removed:<20}")

print("-" * 60)

Run the file:

python3 /home/labex/project/performance_test.py

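The output shows how long each method took on this test data. A single timed run is noisy, so treat the numbers as a rough guide. Relative speed matters most when you process large volumes of text.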
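
For steadier numbers, the standard library's timeit module can repeat each measurement and report the best run. The following is a minimal sketch under assumed settings (a shorter test string, 20 calls per measurement, best of 5 repeats); it covers three of the methods and is not a replacement for performance_test.py.

```python
import re
import timeit

# Shorter test string than performance_test.py so the sketch finishes quickly
text = "".join(chr(i) for i in range(33, 127)) * 100
pattern = re.compile(r'[^a-zA-Z0-9]')

candidates = {
    "generator": lambda: ''.join(c for c in text if c.isalnum()),
    "filter()": lambda: ''.join(filter(str.isalnum, text)),
    "regex": lambda: pattern.sub('', text),
}

timings = {}
for name, func in candidates.items():
    # Each measurement calls func 20 times; keep the fastest of 5 measurements
    timings[name] = min(timeit.repeat(func, number=20, repeat=5))
    print(f"{name:<10} | {timings[name]:.6f} s for 20 calls")
```

Taking the minimum of several repeats reduces the influence of other processes on the machine. The ranking can still vary with Python version and input, so measure on data that resembles your own.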

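Summary

In this lab, you learned several ways to filter non-alphanumeric characters from Python strings:

String methods: Using Python's built-in string methods, such as isalnum(), to check and filter characters.
Generator expressions and the filter() function: Using generator expressions and the built-in filter() function to build clean strings.
Regular expressions: Using Python's re module for powerful pattern matching and replacement.
Practical applications: Applying these techniques to real scenarios such as cleaning user input, processing file data, and comparing performance.

These techniques underpin text processing tasks in many areas, including:

Data cleaning for data analysis and machine learning
Natural language processing
Web scraping and data extraction
User input validation in web applications

With these methods, you now have the skills to turn messy text data into a clean, structured format that is easy to analyze and process in your Python applications.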
