Practical Use Cases of re.findall()
In this final step, we will explore practical, real-world use cases of re.findall(). We will write code to extract email addresses and URLs and to perform data-cleaning tasks.
Extracting Email Addresses
Extracting email addresses is a common task in data mining, web scraping, and text analysis. Create a file named email_extractor.py:
import re
## Sample text with email addresses
text = """
Contact information:
- Support: support@example.com
- Sales: sales@example.com, international.sales@example.co.uk
- Technical team: tech.team@subdomain.example.org
Personal email: john.doe123@gmail.com
"""
## Extract all email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)
print("Original text:")
print(text)
print("\nExtracted email addresses:")
for i, email in enumerate(emails, 1):
    print(f"{i}. {email}")
## Extract specific domain emails
gmail_emails = re.findall(r'\b[A-Za-z0-9._%+-]+@gmail\.com\b', text)
print("\nGmail addresses:", gmail_emails)
Run the script:
python3 ~/project/email_extractor.py
The output should look similar to this:
Original text:
Contact information:
- Support: support@example.com
- Sales: sales@example.com, international.sales@example.co.uk
- Technical team: tech.team@subdomain.example.org
Personal email: john.doe123@gmail.com
Extracted email addresses:
1. support@example.com
2. sales@example.com
3. international.sales@example.co.uk
4. tech.team@subdomain.example.org
5. john.doe123@gmail.com
Gmail addresses: ['john.doe123@gmail.com']
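Going a step beyond the script above, capturing groups let you split each address into its local part and domain in a single pass. Note that when a pattern contains more than one group, re.findall() returns a tuple of the groups for each match. This is a minimal sketch using a short inline sample string rather than the lab's text variable:

```python
import re
from collections import defaultdict

text = "Contact support@example.com or sales@example.co.uk for help."

# Two capturing groups: local part and domain.
# re.findall() returns a (local, domain) tuple per match.
pattern = r'\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b'
parts = re.findall(pattern, text)
print(parts)  # [('support', 'example.com'), ('sales', 'example.co.uk')]

# Group the local parts by domain
by_domain = defaultdict(list)
for local, domain in parts:
    by_domain[domain].append(local)
print(dict(by_domain))  # {'example.com': ['support'], 'example.co.uk': ['sales']}
```

This pattern of extracting tuples and then grouping them is a common bridge between raw text and structured data.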
Extracting URLs
URL extraction is useful for web scraping, link validation, and content analysis. Create a file named url_extractor.py:
import re
## Sample text with various URLs
text = """
Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png
"""
## Extract all URLs
url_pattern = r'https?://[^\s]+'
urls = re.findall(url_pattern, text)
print("Original text:")
print(text)
print("\nExtracted URLs:")
for i, url in enumerate(urls, 1):
    print(f"{i}. {url}")
## Extract specific domain URLs
github_urls = re.findall(r'https?://github\.com/[^\s]+', text)
print("\nGitHub URLs:", github_urls)
## Extract image URLs (a non-capturing group keeps the full match)
image_urls = re.findall(r'https?://[^\s]+\.(?:jpg|jpeg|png|gif)', text)
print("\nImage URLs:", image_urls)
Run the script:
python3 ~/project/url_extractor.py
The output should look similar to this:
Original text:
Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png
Extracted URLs:
1. https://www.example.com
2. http://docs.example.org/guide
3. https://github.com/user/project
4. https://community.example.net/forum
5. https://images.example.com/logo.png
GitHub URLs: ['https://github.com/user/project']
Image URLs: ['https://images.example.com/logo.png']
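One gotcha worth demonstrating: when a pattern contains a capturing group, re.findall() returns only the group's text, not the whole match. To keep the full match while still using alternation, use a non-capturing group (?:...). A small sketch with inline sample URLs:

```python
import re

text = "Image: https://images.example.com/logo.png and https://cdn.example.com/pic.jpg"

# A capturing group makes re.findall() return only the group's text
with_group = re.findall(r'https?://[^\s]+\.(jpg|png)', text)
print(with_group)  # ['png', 'jpg']

# A non-capturing group (?:...) preserves the full match
full_urls = re.findall(r'https?://[^\s]+\.(?:jpg|png)', text)
print(full_urls)  # ['https://images.example.com/logo.png', 'https://cdn.example.com/pic.jpg']
```

Keeping this rule in mind saves a lot of debugging when a findall() result suddenly contains fragments instead of full matches.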
Data Cleaning with re.findall()
Let's create a script that cleans and extracts information from a messy dataset. Create a file named data_cleaning.py:
import re
## Sample messy data
data = """
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023
"""
## Extract product information
product_pattern = r'Product: (.*?), Price: \$([\d.]+), SKU: ([A-Z0-9-]+)'
products = re.findall(product_pattern, data)
print("Original data:")
print(data)
print("\nExtracted and structured product information:")
print("Name\t\tPrice\t\tSKU")
print("-" * 50)
for product in products:
    name, price, sku = product
    print(f"{name}\t${price}\t{sku}")
## Calculate total price
total_price = sum(float(price) for _, price, _ in products)
print(f"\nTotal price of all products: ${total_price:.2f}")
## Extract only products above $500
expensive_products = [name for name, price, _ in products if float(price) > 500]
print("\nExpensive products (>$500):", expensive_products)
Run the script:
python3 ~/project/data_cleaning.py
The output should look similar to this:
Original data:
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023
Extracted and structured product information:
Name Price SKU
--------------------------------------------------
Laptop X200 $899.99 LP-X200-2023
Smartphone S10+ $699.50 SP-S10P-2023
Tablet T7 $299.99 TB-T7-2023
Wireless Earbuds $129.95 WE-PRO-2023
Total price of all products: $2029.43
Expensive products (>$500): ['Laptop X200', 'Smartphone S10+']
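A variation worth sketching: named groups ((?P<name>...)) combined with re.finditer() yield self-documenting dictionaries instead of positional tuples, which is convenient when a record has many fields. The sample line below mirrors the lab's data format:

```python
import re

data = "Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023"

# Named groups document what each capture means;
# match.groupdict() turns each match into a dict.
pattern = r'Product: (?P<name>.*?), Price: \$(?P<price>[\d.]+), SKU: (?P<sku>[A-Z0-9-]+)'
records = [m.groupdict() for m in re.finditer(pattern, data)]
print(records)
# [{'name': 'Laptop X200', 'price': '899.99', 'sku': 'LP-X200-2023'}]
```

Dictionaries like these drop straight into json.dumps() or a pandas DataFrame, so this is often the cleaner choice for larger extraction jobs.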
Combining re.findall() with Other String Functions
Finally, let's look at how to combine re.findall() with other string functions for advanced text processing. Create a file named combined_processing.py:
import re
## Sample text with mixed content
text = """
Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
"""
## Extract all temperature readings in Fahrenheit
fahrenheit_pattern = r'(\d+)°F'
fahrenheit_temps = re.findall(fahrenheit_pattern, text)
## Convert to integers
fahrenheit_temps = [int(temp) for temp in fahrenheit_temps]
print("Original text:")
print(text)
print("\nFahrenheit temperatures:", fahrenheit_temps)
## Calculate average temperature
avg_temp = sum(fahrenheit_temps) / len(fahrenheit_temps)
print(f"Average temperature: {avg_temp:.1f}°F")
## Extract city and temperature pairs
city_temp_pattern = r'- ([A-Za-z\s]+): (\d+)°F'
city_temps = re.findall(city_temp_pattern, text)
print("\nCity and temperature pairs:")
for city, temp in city_temps:
    print(f"{city}: {temp}°F")
## Find the hottest and coldest cities
hottest_city = max(city_temps, key=lambda x: int(x[1]))
coldest_city = min(city_temps, key=lambda x: int(x[1]))
print(f"\nHottest city: {hottest_city[0]} ({hottest_city[1]}°F)")
print(f"Coldest city: {coldest_city[0]} ({coldest_city[1]}°F)")
Run the script:
python3 ~/project/combined_processing.py
The output should look similar to this:
Original text:
Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
Fahrenheit temperatures: [72, 59, 80, 68]
Average temperature: 69.8°F
City and temperature pairs:
New York: 72°F
London: 59°F
Tokyo: 80°F
Sydney: 68°F
Hottest city: Tokyo (80°F)
Coldest city: London (59°F)
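As a small extension of the same pipeline, the extracted Fahrenheit readings can be converted to Celsius with a list comprehension. The sample string here is a shortened stand-in for the lab's text variable:

```python
import re

text = "New York: 72°F, London: 59°F"

# Extract the numbers, then apply the F-to-C formula to each one
temps_f = [int(t) for t in re.findall(r'(\d+)°F', text)]
temps_c = [round((f - 32) * 5 / 9, 1) for f in temps_f]
print(temps_c)  # [22.2, 15.0]
```

This extract-then-compute pattern (findall, convert types, run arithmetic) is the core workflow behind most quick text-analysis scripts.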
These examples demonstrate how re.findall() can be combined with other Python features to solve real-world text processing problems. The ability to extract structured data from unstructured text is an essential skill for data analysis, web scraping, and many other programming tasks.