re.findall()
的实际应用
在最后这一步,我们将探索 re.findall()
在实际场景中的应用。我们会编写代码来提取电子邮件地址、URL,并执行数据清理任务。
提取电子邮件地址
提取电子邮件地址是数据挖掘、网页抓取和文本分析中常见的任务。创建一个名为 email_extractor.py
的文件:
import re
## 包含电子邮件地址的示例文本
text = """
Contact information:
- Support: [email protected]
- Sales: [email protected], [email protected]
- Technical team: [email protected]
Personal email: [email protected]
"""
## 提取所有电子邮件地址
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
emails = re.findall(email_pattern, text)
print("Original text:")
print(text)
print("\nExtracted email addresses:")
for i, email in enumerate(emails, 1):
print(f"{i}. {email}")
## 提取特定域名的电子邮件地址
gmail_emails = re.findall(r'\b[A-Za-z0-9._%+-]+@gmail\.com\b', text)
print("\nGmail addresses:", gmail_emails)
运行脚本:
python3 ~/project/email_extractor.py
输出应该类似于:
Original text:
Contact information:
- Support: [email protected]
- Sales: [email protected], [email protected]
- Technical team: [email protected]
Personal email: [email protected]
Extracted email addresses:
1. [email protected]
2. [email protected]
3. [email protected]
4. [email protected]
5. [email protected]
Gmail addresses: ['[email protected]']
提取 URL
提取 URL 对于网页抓取、链接验证和内容分析很有用。创建一个名为 url_extractor.py
的文件:
import re
## 包含各种 URL 的示例文本
text = """
Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png
"""
## 提取所有 URL
url_pattern = r'https?://[^\s]+'
urls = re.findall(url_pattern, text)
print("Original text:")
print(text)
print("\nExtracted URLs:")
for i, url in enumerate(urls, 1):
print(f"{i}. {url}")
## 提取特定域名的 URL
github_urls = re.findall(r'https?://github\.com/[^\s]+', text)
print("\nGitHub URLs:", github_urls)
## 提取图片 URL
image_urls = re.findall(r'https?://[^\s]+\.(jpg|jpeg|png|gif)', text)
print("\nImage URLs:", image_urls)
运行脚本:
python3 ~/project/url_extractor.py
输出应该类似于:
Original text:
Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png
Extracted URLs:
1. https://www.example.com
2. http://docs.example.org/guide
3. https://github.com/user/project
4. https://community.example.net/forum
5. https://images.example.com/logo.png
GitHub URLs: ['https://github.com/user/project']
Image URLs: ['https://images.example.com/logo.png']
使用 re.findall()
进行数据清理
让我们创建一个脚本来清理并从杂乱的数据集中提取信息。创建一个名为 data_cleaning.py
的文件:
import re
## 示例杂乱数据
data = """
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023
"""
## 提取产品信息
product_pattern = r'Product: (.*?), Price: \$([\d.]+), SKU: ([A-Z0-9-]+)'
products = re.findall(product_pattern, data)
print("Original data:")
print(data)
print("\nExtracted and structured product information:")
print("Name\t\tPrice\t\tSKU")
print("-" * 50)
for product in products:
name, price, sku = product
print(f"{name}\t${price}\t{sku}")
## 计算总价格
total_price = sum(float(price) for _, price, _ in products)
print(f"\nTotal price of all products: ${total_price:.2f}")
## 仅提取价格超过 500 美元的产品
expensive_products = [name for name, price, _ in products if float(price) > 500]
print("\nExpensive products (>$500):", expensive_products)
运行脚本:
python3 ~/project/data_cleaning.py
输出应该类似于:
Original data:
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023
Extracted and structured product information:
Name Price SKU
--------------------------------------------------
Laptop X200 $899.99 LP-X200-2023
Smartphone S10+ $699.50 SP-S10P-2023
Tablet T7 $299.99 TB-T7-2023
Wireless Earbuds $129.95 WE-PRO-2023
Total price of all products: $2029.43
Expensive products (>$500): ['Laptop X200', 'Smartphone S10+']
将 re.findall()
与其他字符串函数结合使用
最后,让我们看看如何将 re.findall()
与其他字符串函数结合使用,以进行高级文本处理。创建一个名为 combined_processing.py
的文件:
import re
## 包含混合内容的示例文本
text = """
Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
"""
## 提取所有华氏温度读数
fahrenheit_pattern = r'(\d+)°F'
fahrenheit_temps = re.findall(fahrenheit_pattern, text)
## 转换为整数
fahrenheit_temps = [int(temp) for temp in fahrenheit_temps]
print("Original text:")
print(text)
print("\nFahrenheit temperatures:", fahrenheit_temps)
## 计算平均温度
avg_temp = sum(fahrenheit_temps) / len(fahrenheit_temps)
print(f"Average temperature: {avg_temp:.1f}°F")
## 提取城市和温度对
city_temp_pattern = r'- ([A-Za-z\s]+): (\d+)°F'
city_temps = re.findall(city_temp_pattern, text)
print("\nCity and temperature pairs:")
for city, temp in city_temps:
print(f"{city}: {temp}°F")
## 找出最热和最冷的城市
hottest_city = max(city_temps, key=lambda x: int(x[1]))
coldest_city = min(city_temps, key=lambda x: int(x[1]))
print(f"\nHottest city: {hottest_city[0]} ({hottest_city[1]}°F)")
print(f"Coldest city: {coldest_city[0]} ({coldest_city[1]}°F)")
运行脚本:
python3 ~/project/combined_processing.py
输出应该类似于:
Original text:
Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
Fahrenheit temperatures: [72, 59, 80, 68]
Average temperature: 69.8°F
City and temperature pairs:
New York: 72°F
London: 59°F
Tokyo: 80°F
Sydney: 68°F
Hottest city: Tokyo (80°F)
Coldest city: London (59°F)
这些示例展示了如何将 re.findall()
与其他 Python 功能结合起来,以解决实际的文本处理问题。从非结构化文本中提取结构化数据的能力,是数据分析、网页抓取和许多其他编程任务中一项重要的技能。