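How to Use re.findall() in Python to Find All Matching Substrings

Introduction

In this tutorial, we explore Python's re.findall() function, which returns every non-overlapping match of a pattern in a string. It is part of re, Python's built-in regular expression (regex) module, and is a staple of text-processing work.

By the end of this lab, you will be able to use re.findall() to extract patterns such as email addresses, URLs, and dates from text. These skills are useful in data analysis, web scraping, and text-processing applications.

Whether you are new to Python or want to improve your text-processing skills, this step-by-step guide gives you the practical knowledge to use regular expressions effectively in your Python projects.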

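Getting Started with re.findall()

In this first step, we learn what re.findall() does and how to use it for basic pattern matching.

Understanding Regular Expressions

A regular expression (regex) is a specially formatted string that describes a search pattern. Regular expressions are particularly useful when you need to:

Find specific character patterns in text
Validate text formats (such as email addresses)
Extract information from text
Replace text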
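
As a quick illustration of those four tasks, here is a minimal sketch. The sample sentence is invented for this example; re.search(), re.fullmatch(), re.findall(), and re.sub() are all standard functions in the re module.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "Order 66 shipped to bob@example.com on 2023-11-15."

# 1. Find a specific pattern: the first run of digits.
first_number = re.search(r"\d+", text).group()

# 2. Validate a format: the whole string must look like YYYY-MM-DD.
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2023-11-15") is not None

# 3. Extract information: every run of digits in the text.
all_numbers = re.findall(r"\d+", text)

# 4. Replace text: mask every digit with '#'.
masked = re.sub(r"\d", "#", text)

print(first_number)
print(is_date)
print(all_numbers)
print(masked)
```

The rest of this tutorial concentrates on the third task, extraction with re.findall().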

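The re Module in Python

Python ships with a built-in module named re for working with regular expressions. One of its most useful functions is re.findall().

Let's start with a simple Python script to see how re.findall() works.

Open a terminal and change to the project directory:

cd ~/project

Use the code editor to create a new Python file named basic_findall.py. In VSCode, click the Explorer icon (usually the first icon in the sidebar), then click the New File button and name the file basic_findall.py.
In basic_findall.py, write the following code: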

import re

# Sample text
text = "Python is amazing. Python is versatile. I love learning Python programming."

# Use re.findall() to find every occurrence of "Python"
matches = re.findall(r"Python", text)

# Print the results
print("Original text:")
print(text)
print("\nMatches found:", len(matches))
print("Matching substrings:", matches)

Save the file and run it from the terminal:

python3 ~/project/basic_findall.py

You should see output similar to the following:

Original text:
Python is amazing. Python is versatile. I love learning Python programming.

Matches found: 3
Matching substrings: ['Python', 'Python', 'Python']

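Code Breakdown

Let's look at what the code does:

We import the re module with import re
We define a sample text that contains the word "Python" several times
We call re.findall(r"Python", text) to find every occurrence of "Python" in the text
The r prefix marks a raw string, which keeps backslashes literal and is recommended for regular expressions
The function returns a list of all matching substrings
We print the results, which show that "Python" appears 3 times in the text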
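
To see why the raw-string prefix matters, consider the following sketch. The sample text is made up for this illustration. In a normal Python string, "\b" is the backspace character, so the regex engine never receives the word-boundary token.

```python
import re

# Hypothetical sample text for this illustration.
text = "cat concat cat."

# In a normal string, "\b" is a single backspace character.
normal_length = len("\b")
# In a raw string, r"\b" is two characters: a backslash and the letter b.
raw_length = len(r"\b")

# Raw string: the regex engine sees \b and treats it as a word boundary.
raw_matches = re.findall(r"\bcat\b", text)

# Normal string: the pattern contains literal backspaces, so nothing matches.
plain_matches = re.findall("\bcat\b", text)

print(normal_length, raw_length)
print(raw_matches)
print(plain_matches)
```

The raw pattern finds the two standalone occurrences of "cat" and skips "concat", while the non-raw pattern returns an empty list.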

Finding Different Patterns

Now let's try a different pattern. Create a new file named findall_words.py:

import re

text = "The rain in Spain falls mainly on the plain."

# Find all words that end with 'ain'
matches = re.findall(r"\w+ain\b", text)

print("Original text:")
print(text)
print("\nWords ending with 'ain':", matches)

Run the script:

python3 ~/project/findall_words.py

The output should be:

Original text:
The rain in Spain falls mainly on the plain.

Words ending with 'ain': ['rain', 'Spain', 'plain']

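In this example:

\w+ matches one or more word characters (letters, digits, or underscores)
ain matches the literal characters "ain"
\b is a word boundary, which ensures the match ends exactly where the word ends, so "mainly" is not matched

Experiment with these examples to get a feel for how re.findall() handles basic patterns.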
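
One way to see what the word boundary contributes is to run the same pattern with and without it. This is a small sketch that reuses the sentence from findall_words.py; the variable names are chosen only for this illustration.

```python
import re

text = "The rain in Spain falls mainly on the plain."

# With \b, the match must end at the end of a word.
with_boundary = re.findall(r"\w+ain\b", text)

# Without \b, the engine also accepts the "main" inside "mainly".
without_boundary = re.findall(r"\w+ain", text)

print(with_boundary)
print(without_boundary)
```

Dropping \b adds 'main' to the results, because "mainly" contains "ain" in the middle of the word.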

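Working with More Complex Patterns

In this step, we use re.findall() with more complex patterns and learn how character classes and quantifiers make search patterns flexible.

Finding Numbers in Text

First, let's write a script that extracts all numbers from a piece of text. Create a new file named extract_numbers.py: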

import re

text = "There are 42 apples, 15 oranges, and 123 bananas in the basket. The price is $9.99."

# Find all numbers (integers and decimals)
numbers = re.findall(r'\d+\.?\d*', text)

print("Original text:")
print(text)
print("\nNumbers found:", numbers)

# Find standalone runs of digits (note: \b also splits 9.99 into 9 and 99)
whole_numbers = re.findall(r'\b\d+\b', text)
print("Whole numbers only:", whole_numbers)

Run the script:

python3 ~/project/extract_numbers.py

You should see output similar to the following:

Original text:
There are 42 apples, 15 oranges, and 123 bananas in the basket. The price is $9.99.

Numbers found: ['42', '15', '123', '9.99']
Whole numbers only: ['42', '15', '123', '9', '99']

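Let's break down the patterns used:

\d+\.?\d* matches:
- \d+: one or more digits
- \.?: an optional decimal point
- \d*: zero or more digits after the decimal point
\b\d+\b matches:
- \b: a word boundary
- \d+: one or more digits
- \b: another word boundary, so the digits are not glued to letters or underscores. A decimal point also forms a boundary, which is why 9.99 yields both '9' and '99'.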
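
One caveat of \d+\.?\d* is that it also swallows a sentence-ending period that directly follows a number. The sketch below compares it with a slightly stricter alternative; the sample sentence and variable names are assumptions made for this illustration. The stricter pattern uses a non-capturing group, (?:...), so re.findall() still returns the whole match.

```python
import re

# Hypothetical sample text: "123." ends a sentence.
text = "Total: 123. Unit price is $9.99 for 42 items."

# The optional dot matches even when no digits follow it.
loose = re.findall(r"\d+\.?\d*", text)

# Require at least one digit after the dot; (?:...) does not capture.
strict = re.findall(r"\d+(?:\.\d+)?", text)

# The strict matches convert cleanly to floats.
values = [float(n) for n in strict]

print(loose)
print(strict)
print(values)
```

The loose pattern returns '123.' with the trailing period, while the strict one returns '123'.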

Finding Words of a Specific Length

Let's create a script that finds all four-letter words in a text. Create find_word_length.py:

import re

text = "The quick brown fox jumps over the lazy dog. A good day to code."

# Find all four-letter words
four_letter_words = re.findall(r'\b\w{4}\b', text)

print("Original text:")
print(text)
print("\nFour-letter words:", four_letter_words)

# Find all words with 3 to 5 letters
words_3_to_5 = re.findall(r'\b\w{3,5}\b', text)
print("Words with 3 to 5 letters:", words_3_to_5)

Run the script:

python3 ~/project/find_word_length.py

The output should be:

Original text:
The quick brown fox jumps over the lazy dog. A good day to code.

Four-letter words: ['over', 'lazy', 'good', 'code']
Words with 3 to 5 letters: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'good', 'day', 'code']

In these patterns:

\b\w{4}\b matches exactly 4 word characters surrounded by word boundaries
\b\w{3,5}\b matches 3 to 5 word characters surrounded by word boundaries
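
The same quantifier syntax supports open-ended ranges, and because \w also matches digits and underscores, a letters-only class is sometimes the better fit. The following sketch shows both; the second sample string is invented for this illustration.

```python
import re

text = "The quick brown fox jumps over the lazy dog. A good day to code."

# {5,} means "5 or more"; {1,2} means "1 or 2".
long_words = re.findall(r"\b\w{5,}\b", text)
short_words = re.findall(r"\b\w{1,2}\b", text)

# \w accepts digits and underscores; [A-Za-z] accepts ASCII letters only.
mixed = "code 2024 a_b1"  # hypothetical sample
word_chars = re.findall(r"\b\w{4}\b", mixed)
letters_only = re.findall(r"\b[A-Za-z]{4}\b", mixed)

print(long_words)
print(short_words)
print(word_chars)
print(letters_only)
```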

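Using Character Classes

A character class, written in square brackets, matches any single character from a specified set. Let's create character_classes.py: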

import re

text = "The temperature is 72°F or 22°C. Contact us at: info@example.com"

# Find runs of lowercase letters and digits (the text is lowercased first)
mixed_words = re.findall(r'\b[a-z0-9]+\b', text.lower())

print("Original text:")
print(text)
print("\nWords with letters and digits:", mixed_words)

# Find all email addresses
emails = re.findall(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', text)
print("Email addresses:", emails)

Run the script:

python3 ~/project/character_classes.py

The output should look similar to:

Original text:
The temperature is 72°F or 22°C. Contact us at: info@example.com

Words with letters and digits: ['the', 'temperature', 'is', '72', 'f', 'or', '22', 'c', 'contact', 'us', 'at', 'info', 'example', 'com']
Email addresses: ['info@example.com']

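These patterns show:

\b[a-z0-9]+\b: runs of lowercase letters and digits bounded by word boundaries. Because ° and @ are not in the class, 72°F and info@example.com are split into separate pieces.
The email pattern matches the common local-part@domain.tld shape. It is a practical approximation, not a full address validator.

Experiment with these examples to see how the different pattern components work together to build powerful search patterns.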
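
To keep each temperature reading in one piece, the degree sign has to appear in the pattern explicitly. The sketch below does that and also shows a negated class, [^...], which matches any character not listed. The variable names are chosen only for this illustration.

```python
import re

text = "The temperature is 72°F or 22°C. Contact us at: info@example.com"

# Digits, then a literal degree sign, then C or F.
temps = re.findall(r"[0-9]+°[CF]", text)

# Negated class: every character that is not a letter, digit, or whitespace.
symbols = re.findall(r"[^A-Za-z0-9\s]", text)

print(temps)
print(symbols)
```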

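Using Flags and Capturing Groups

In this step, we learn how flags modify the behavior of a regular expression and how capturing groups extract specific parts of a match.

Understanding Regex Flags

Flags change how the regex engine performs a search. Python's re module provides several flags, which can be passed to re.findall() as the optional third argument. Let's explore some common ones.

Create a new file named regex_flags.py: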

import re

text = """
Python is a great language.
PYTHON is versatile.
python is easy to learn.
"""

# Case-sensitive search (the default)
matches_case_sensitive = re.findall(r"python", text)

# Case-insensitive search with the re.IGNORECASE flag
matches_case_insensitive = re.findall(r"python", text, re.IGNORECASE)

print("Original text:")
print(text)
print("\nCase-sensitive matches:", matches_case_sensitive)
print("Case-insensitive matches:", matches_case_insensitive)

# Use the multiline flag
multiline_text = "First line\nSecond line\nThird line"
# Find lines that start with 'S'
starts_with_s = re.findall(r"^S.*", multiline_text, re.MULTILINE)
print("\nMultiline text:")
print(multiline_text)
print("\nLines starting with 'S':", starts_with_s)

Run the script:

python3 ~/project/regex_flags.py

The output should look similar to:

Original text:

Python is a great language.
PYTHON is versatile.
python is easy to learn.


Case-sensitive matches: ['python']
Case-insensitive matches: ['Python', 'PYTHON', 'python']

Multiline text:
First line
Second line
Third line

Lines starting with 'S': ['Second line']

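Common flags include:

re.IGNORECASE (or re.I): makes the pattern case-insensitive
re.MULTILINE (or re.M): makes ^ and $ match at the start and end of every line
re.DOTALL (or re.S): makes . match any character, including newlines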
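
The script above does not exercise re.DOTALL, and flags can also be combined with the | operator. Here is a minimal sketch of both; the sample text and the start/stop markers are invented for this illustration.

```python
import re

# Hypothetical sample: the first block spans two lines, the second is uppercase.
text = "start alpha\nbeta stop\nSTART gamma STOP"
pattern = r"start (.*?) stop"

# By default "." does not match a newline, so the two-line block is missed.
default = re.findall(pattern, text)

# With re.DOTALL, "." also matches the newline.
dotall = re.findall(pattern, text, re.DOTALL)

# Flags are combined with the bitwise OR operator.
combined = re.findall(pattern, text, re.DOTALL | re.IGNORECASE)

print(default)
print(dotall)
print(combined)
```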

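Using Capturing Groups

Capturing groups let you extract specific parts of the matched text. You create one by wrapping part of the regular expression in parentheses.

Create a file named capturing_groups.py: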

import re

# Sample text containing dates in several formats
text = "Important dates: 2023-11-15, 12/25/2023, and Jan 1, 2024."

# Extract dates in YYYY-MM-DD format
iso_dates = re.findall(r'(\d{4})-(\d{1,2})-(\d{1,2})', text)

# Extract dates in MM/DD/YYYY format
us_dates = re.findall(r'(\d{1,2})/(\d{1,2})/(\d{4})', text)

print("Original text:")
print(text)
print("\nISO dates (Year, Month, Day):", iso_dates)
print("US dates (Month, Day, Year):", us_dates)

# Use capturing groups to extract dates written with a month name
month_dates = re.findall(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2}),\s+(\d{4})', text)
print("Month name dates (Month, Day, Year):", month_dates)

Run the script:

python3 ~/project/capturing_groups.py

The output should be:

Original text:
Important dates: 2023-11-15, 12/25/2023, and Jan 1, 2024.

ISO dates (Year, Month, Day): [('2023', '11', '15')]
US dates (Month, Day, Year): [('12', '25', '2023')]
Month name dates (Month, Day, Year): [('Jan', '1', '2024')]

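In this example:

Each pair of parentheses () creates a capturing group
When a pattern has more than one capturing group, re.findall() returns a list of tuples, one per match, with the captured groups in order
This lets us extract and organize structured data from text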
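
The shape of the list that re.findall() returns depends on how many capturing groups the pattern contains. The sketch below runs four variants of the same date pattern; the sample text is made up for this illustration.

```python
import re

# Hypothetical sample text with two ISO dates.
text = "Important dates: 2023-11-15 and 2024-01-01."

# No groups: a list of full matches.
no_groups = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# One group: a list of that group's text only.
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", text)

# Several groups: a list of tuples.
many_groups = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)

# Non-capturing groups (?:...) group without capturing, so full matches return.
non_capturing = re.findall(r"(?:\d{4})-(?:\d{2})-(?:\d{2})", text)

print(no_groups)
print(one_group)
print(many_groups)
print(non_capturing)
```

Use (?:...) when parentheses are needed only for grouping or alternation and the full match is what you want back.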

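Practical Example: Parsing a Log File

Now let's apply what we have learned to a practical example. Suppose we have a log file with entries we want to parse. Create a file named log_parser.py: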

import re

# Sample log entries
logs = """
[2023-11-15 08:30:45] INFO: System started
[2023-11-15 08:35:12] WARNING: High memory usage (85%)
[2023-11-15 08:42:11] ERROR: Connection timeout
[2023-11-15 09:15:27] INFO: Backup completed
"""

# Extract the timestamp, level, and message from each log entry
log_pattern = r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.+)'
log_entries = re.findall(log_pattern, logs)

print("Original logs:")
print(logs)
print("\nParsed log entries (timestamp, level, message):")
for entry in log_entries:
    timestamp, level, message = entry
    print(f"Time: {timestamp} | Level: {level} | Message: {message}")

# Find the messages of all ERROR entries
error_logs = re.findall(r'\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\] ERROR: (.+)', logs)
print("\nError messages:", error_logs)

Run the script:

python3 ~/project/log_parser.py

The output should look similar to:

Original logs:

[2023-11-15 08:30:45] INFO: System started
[2023-11-15 08:35:12] WARNING: High memory usage (85%)
[2023-11-15 08:42:11] ERROR: Connection timeout
[2023-11-15 09:15:27] INFO: Backup completed


Parsed log entries (timestamp, level, message):
Time: 2023-11-15 08:30:45 | Level: INFO | Message: System started
Time: 2023-11-15 08:35:12 | Level: WARNING | Message: High memory usage (85%)
Time: 2023-11-15 08:42:11 | Level: ERROR | Message: Connection timeout
Time: 2023-11-15 09:15:27 | Level: INFO | Message: Backup completed

Error messages: ['Connection timeout']

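This example demonstrates:

Using capturing groups to extract structured information
Processing and displaying the captured information
Filtering for a specific type of log entry

Flags and capturing groups make regular expressions more powerful and flexible, allowing more precise and structured data extraction.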
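
Flags and capturing groups can also be used together. The following sketch counts log entries per level by capturing only the level and anchoring the pattern to the start of each line with re.MULTILINE. The use of collections.Counter and the variable names are choices made for this illustration, not part of log_parser.py.

```python
import re
from collections import Counter

logs = """
[2023-11-15 08:30:45] INFO: System started
[2023-11-15 08:35:12] WARNING: High memory usage (85%)
[2023-11-15 08:42:11] ERROR: Connection timeout
[2023-11-15 09:15:27] INFO: Backup completed
"""

# ^ matches at the start of every line because of re.MULTILINE.
# [^\]]+ skips the timestamp; the single group captures the level.
levels = re.findall(r"^\[[^\]]+\] (\w+):", logs, re.MULTILINE)

# Tally how often each level occurs.
level_counts = Counter(levels)

print(levels)
print(level_counts["INFO"], level_counts["WARNING"], level_counts["ERROR"])
```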

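Practical Applications of re.findall()

In this final step, we apply re.findall() to practical scenarios. We write code to extract email addresses and URLs and to perform data-cleaning tasks.

Extracting Email Addresses

Extracting email addresses is a common task in data mining, web scraping, and text analysis. Create a file named email_extractor.py: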

import re

# Sample text containing email addresses
text = """
Contact information:
- Support: support@example.com
- Sales: sales@example.com, international.sales@example.co.uk
- Technical team: tech.team@subdomain.example.org
Personal email: john.doe123@gmail.com
"""

# Extract all email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)

print("Original text:")
print(text)
print("\nExtracted email addresses:")
for i, email in enumerate(emails, 1):
    print(f"{i}. {email}")

# Extract email addresses from a specific domain
gmail_emails = re.findall(r'\b[A-Za-z0-9._%+-]+@gmail\.com\b', text)
print("\nGmail addresses:", gmail_emails)

Run the script:

python3 ~/project/email_extractor.py

The output should look similar to:

Original text:

Contact information:
- Support: support@example.com
- Sales: sales@example.com, international.sales@example.co.uk
- Technical team: tech.team@subdomain.example.org
Personal email: john.doe123@gmail.com


Extracted email addresses:
1. support@example.com
2. sales@example.com
3. international.sales@example.co.uk
4. tech.team@subdomain.example.org
5. john.doe123@gmail.com

Gmail addresses: ['john.doe123@gmail.com']
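
Real text often repeats an address with different capitalization. As a sketch of one possible follow-up step, the code below captures the user and domain parts separately and then removes duplicates. The sample text is invented, and lowercasing the whole address is an assumption of this sketch: domains are case-insensitive, but some mail systems treat the part before the @ as case-sensitive.

```python
import re

# Hypothetical sample text with a repeated address.
text = "Write to sales@example.com or Sales@Example.com; CC admin@example.org."

# Two capturing groups: the part before the @ and the domain.
pattern = r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
pairs = re.findall(pattern, text)

# Assumption: treat addresses as case-insensitive when removing duplicates.
unique_emails = sorted({f"{user}@{domain}".lower() for user, domain in pairs})

# Collect the distinct domains.
domains = sorted({domain.lower() for _, domain in pairs})

print(pairs)
print(unique_emails)
print(domains)
```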

Extracting URLs

Extracting URLs is useful for web scraping, link validation, and content analysis. Create a file named url_extractor.py:

import re

# Sample text containing various URLs
text = """
Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png
"""

# Extract all URLs
url_pattern = r'https?://[^\s]+'
urls = re.findall(url_pattern, text)

print("Original text:")
print(text)
print("\nExtracted URLs:")
for i, url in enumerate(urls, 1):
    print(f"{i}. {url}")

# Extract URLs from a specific domain
github_urls = re.findall(r'https?://github\.com/[^\s]+', text)
print("\nGitHub URLs:", github_urls)

# Extract image URLs; the non-capturing group (?:...) makes findall return the full URL
image_urls = re.findall(r'https?://[^\s]+\.(?:jpg|jpeg|png|gif)', text)
print("\nImage URLs:", image_urls)

Run the script:

python3 ~/project/url_extractor.py

The output should look similar to:

Original text:

Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png


Extracted URLs:
1. https://www.example.com
2. http://docs.example.org/guide
3. https://github.com/user/project
4. https://community.example.net/forum
5. https://images.example.com/logo.png

GitHub URLs: ['https://github.com/user/project']

Image URLs: ['https://images.example.com/logo.png']
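
Because [^\s]+ consumes everything up to the next whitespace, a URL that is followed directly by a comma or period picks up that punctuation. The sketch below shows one simple way to trim it with str.rstrip(). The sample sentence and the set of stripped characters are assumptions made for this illustration, and a URL that legitimately ends in one of those characters would be shortened too.

```python
import re

# Hypothetical sample text: each URL is followed by punctuation.
text = "See https://www.example.com/docs, or https://example.org/faq."

# The basic pattern keeps the trailing comma and period.
raw_urls = re.findall(r"https?://[^\s]+", text)

# Strip common sentence punctuation from the right-hand end.
clean_urls = [url.rstrip(".,;:!?)") for url in raw_urls]

print(raw_urls)
print(clean_urls)
```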

Data Cleaning with re.findall()

Let's create a script that cleans a messy dataset and extracts information from it. Create a file named data_cleaning.py:

import re

# Sample messy data
data = """
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023
"""

# Extract product information
product_pattern = r'Product: (.*?), Price: \$([\d.]+), SKU: ([A-Z0-9-]+)'
products = re.findall(product_pattern, data)

print("Original data:")
print(data)
print("\nExtracted and structured product information:")
print("Name\t\tPrice\t\tSKU")
print("-" * 50)
for product in products:
    name, price, sku = product
    print(f"{name}\t${price}\t{sku}")

# Calculate the total price
total_price = sum(float(price) for _, price, _ in products)
print(f"\nTotal price of all products: ${total_price:.2f}")

# Keep only products priced above $500
expensive_products = [name for name, price, _ in products if float(price) > 500]
print("\nExpensive products (>$500):", expensive_products)

Run the script:

python3 ~/project/data_cleaning.py

The output should look similar to:

Original data:

Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023


Extracted and structured product information:
Name		Price		SKU
--------------------------------------------------
Laptop X200	$899.99	LP-X200-2023
Smartphone S10+	$699.50	SP-S10P-2023
Tablet T7	$299.99	TB-T7-2023
Wireless Earbuds	$129.95	WE-PRO-2023

Total price of all products: $2029.43

Expensive products (>$500): ['Laptop X200', 'Smartphone S10+']
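
Tuples of strings work for a quick report, but later processing is often easier with typed, named fields. As a sketch of one possible next step, the code below turns each match into a dictionary with the price converted to a float. The shortened dataset and the key names are choices made for this illustration.

```python
import re

# A shortened version of the sample data, for illustration.
data = """
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
"""

pattern = r"Product: (.*?), Price: \$([\d.]+), SKU: ([A-Z0-9-]+)"

# Build one dictionary per match; the price becomes a float.
records = [
    {"name": name, "price": float(price), "sku": sku}
    for name, price, sku in re.findall(pattern, data)
]

# Named fields make follow-up queries easier to read.
cheapest = min(records, key=lambda record: record["price"])

print(records)
print(cheapest["name"])
```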

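Combining re.findall() with Other String Functions

Finally, let's see how to combine re.findall() with other Python built-ins such as int(), sum(), max(), and min() for more advanced text processing. Create a file named combined_processing.py: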

import re

# Sample text with mixed content
text = """
Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
"""

# Extract all Fahrenheit temperature readings
fahrenheit_pattern = r'(\d+)°F'
fahrenheit_temps = re.findall(fahrenheit_pattern, text)

# Convert to integers
fahrenheit_temps = [int(temp) for temp in fahrenheit_temps]

print("Original text:")
print(text)
print("\nFahrenheit temperatures:", fahrenheit_temps)

# Calculate the average temperature
avg_temp = sum(fahrenheit_temps) / len(fahrenheit_temps)
print(f"Average temperature: {avg_temp:.1f}°F")

# Extract city and temperature pairs
city_temp_pattern = r'- ([A-Za-z\s]+): (\d+)°F'
city_temps = re.findall(city_temp_pattern, text)

print("\nCity and temperature pairs:")
for city, temp in city_temps:
    print(f"{city}: {temp}°F")

# Find the hottest and coldest cities
hottest_city = max(city_temps, key=lambda x: int(x[1]))
coldest_city = min(city_temps, key=lambda x: int(x[1]))

print(f"\nHottest city: {hottest_city[0]} ({hottest_city[1]}°F)")
print(f"Coldest city: {coldest_city[0]} ({coldest_city[1]}°F)")

Run the script:

python3 ~/project/combined_processing.py

The output should look similar to:

Original text:

Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)


Fahrenheit temperatures: [72, 59, 80, 68]
Average temperature: 69.8°F

City and temperature pairs:
New York: 72°F
London: 59°F
Tokyo: 80°F
Sydney: 68°F

Hottest city: Tokyo (80°F)
Coldest city: London (59°F)

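These examples show how re.findall() combines with other Python features to solve practical text-processing problems. Extracting structured data from unstructured text is an important skill in data analysis, web scraping, and many other programming tasks.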
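
As one more sketch of the same idea, the code below captures the city, the Fahrenheit reading, and the Celsius reading in a single pass, then recomputes Celsius from Fahrenheit as a consistency check. The shortened sample text and the variable names are choices made for this illustration.

```python
import re

# A shortened version of the sample text, for illustration.
text = """
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
"""

# Three groups: city name, Fahrenheit value, Celsius value.
pattern = r"- ([A-Za-z ]+): (\d+)°F \(([\d.]+)°C\)"
readings = re.findall(pattern, text)

# Recompute Celsius from Fahrenheit and compare with the listed value.
converted = {}
consistent = {}
for city, fahrenheit, celsius in readings:
    converted[city] = round((int(fahrenheit) - 32) * 5 / 9, 1)
    consistent[city] = converted[city] == float(celsius)

print(readings)
print(converted)
print(consistent)
```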

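Summary

In this tutorial, you learned how to use Python's re.findall() function for pattern matching and text extraction. You gained practical knowledge in several key areas:

Basic pattern matching: you learned how to find simple substrings and use basic regular expression patterns to match specific text.
Complex patterns: you explored character classes, word boundaries, and quantifiers to build flexible search patterns.
Flags and capturing groups: you saw how flags such as re.IGNORECASE modify search behavior and how capturing groups extract structured data.
Practical applications: you applied these techniques to extracting email addresses and URLs, parsing log files, and cleaning data.

The skills from this lab are valuable for a wide range of text-processing tasks, including:

Data extraction and cleaning
Content analysis
Web scraping
Log file parsing
Data validation

With regular expressions and re.findall(), you now have a powerful tool for working with text data in your Python projects. As you keep practicing, you will become more proficient at writing efficient patterns for your specific text-processing needs.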
