Real-World Applications of re.findall()
In this final step, we will explore practical, real-world applications of re.findall(). We will write code to extract email addresses, extract URLs, and perform data cleaning tasks.
Email extraction is a common task in data mining, web scraping, and text analysis. Create a file named email_extractor.py:
import re
## Sample text with email addresses
text = """
Contact information:
- Support: support@example.com
- Sales: sales@example.com, sales.team@example.org
- Technical team: tech@example.net
Personal email: john.doe@gmail.com
"""
## Extract all email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)
print("Original text:")
print(text)
print("\nExtracted email addresses:")
for i, email in enumerate(emails, 1):
print(f"{i}. {email}")
## Extract specific domain emails
gmail_emails = re.findall(r'\b[A-Za-z0-9._%+-]+@gmail\.com\b', text)
print("\nGmail addresses:", gmail_emails)
Run the script:
python3 ~/project/email_extractor.py
The output should be similar to:
Original text:
Contact information:
- Support: support@example.com
- Sales: sales@example.com, sales.team@example.org
- Technical team: tech@example.net
Personal email: john.doe@gmail.com
Extracted email addresses:
1. support@example.com
2. sales@example.com
3. sales.team@example.org
4. tech@example.net
5. john.doe@gmail.com
Gmail addresses: ['john.doe@gmail.com']
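As an optional extension of this step (not part of the lab files), the sketch below counts how many of the extracted addresses belong to each domain. Note that adding a capture group around the domain changes what re.findall() returns: you get only the captured domain, not the whole match. The sample text here is a hypothetical, shortened version of the one above.
import re
from collections import defaultdict
## Hypothetical shortened sample text
text = """
- Support: support@example.com
- Sales: sales@example.com, sales.team@example.org
Personal email: john.doe@gmail.com
"""
## The capture group makes findall() return only the domain part of each match
domain_pattern = r'\b[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b'
by_domain = defaultdict(int)
for domain in re.findall(domain_pattern, text):
    by_domain[domain.lower()] += 1
print(dict(by_domain))
## Expected: {'example.com': 2, 'example.org': 1, 'gmail.com': 1}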
URL extraction is useful for web scraping, link validation, and content analysis. Create a file named url_extractor.py:
import re
## Sample text with various URLs
text = """
Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png
"""
## Extract all URLs
url_pattern = r'https?://[^\s]+'
urls = re.findall(url_pattern, text)
print("Original text:")
print(text)
print("\nExtracted URLs:")
for i, url in enumerate(urls, 1):
print(f"{i}. {url}")
## Extract specific domain URLs
github_urls = re.findall(r'https?://github\.com/[^\s]+', text)
print("\nGitHub URLs:", github_urls)
## Extract image URLs (non-capturing group so findall returns the full URL)
image_urls = re.findall(r'https?://[^\s]+\.(?:jpg|jpeg|png|gif)', text)
print("\nImage URLs:", image_urls)
Run the script:
python3 ~/project/url_extractor.py
The output should be similar to:
Original text:
Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png
Extracted URLs:
1. https://www.example.com
2. http://docs.example.org/guide
3. https://github.com/user/project
4. https://community.example.net/forum
5. https://images.example.com/logo.png
GitHub URLs: ['https://github.com/user/project']
Image URLs: ['https://images.example.com/logo.png']
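One caveat with the greedy [^\s]+ part of the pattern: in running prose a URL is often followed by a comma or period, and the pattern will swallow that punctuation too. A minimal sketch of one way to post-process the matches (the sample string here is hypothetical):
import re
## Hypothetical text where URLs are followed by punctuation
messy = "See https://www.example.com, then read http://docs.example.org/guide."
raw_urls = re.findall(r'https?://[^\s]+', messy)
## Strip trailing punctuation that is unlikely to be part of the URL
clean_urls = [url.rstrip('.,;:!?)') for url in raw_urls]
print(raw_urls)    ## ['https://www.example.com,', 'http://docs.example.org/guide.']
print(clean_urls)  ## ['https://www.example.com', 'http://docs.example.org/guide']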
Data Cleaning with re.findall()
Let's create a script that cleans and extracts information from a messy dataset. Create a file named data_cleaning.py:
import re
## Sample messy data
data = """
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023
"""
## Extract product information
product_pattern = r'Product: (.*?), Price: \$([\d.]+), SKU: ([A-Z0-9-]+)'
products = re.findall(product_pattern, data)
print("Original data:")
print(data)
print("\nExtracted and structured product information:")
print("Name\t\tPrice\t\tSKU")
print("-" * 50)
for product in products:
    name, price, sku = product
    print(f"{name}\t${price}\t{sku}")
## Calculate total price
total_price = sum(float(price) for _, price, _ in products)
print(f"\nTotal price of all products: ${total_price:.2f}")
## Extract only products above $500
expensive_products = [name for name, price, _ in products if float(price) > 500]
print("\nExpensive products (>$500):", expensive_products)
Run the script:
python3 ~/project/data_cleaning.py
The output should be similar to:
Original data:
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023
Extracted and structured product information:
Name Price SKU
--------------------------------------------------
Laptop X200 $899.99 LP-X200-2023
Smartphone S10+ $699.50 SP-S10P-2023
Tablet T7 $299.99 TB-T7-2023
Wireless Earbuds $129.95 WE-PRO-2023
Total price of all products: $2029.43
Expensive products (>$500): ['Laptop X200', 'Smartphone S10+']
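Because re.findall() returns plain tuples, a natural next step is to turn each tuple into a dictionary so the fields carry names. A short sketch of that idea, reusing the same product_pattern on a single hypothetical line of data:
import re
data = "Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023"
product_pattern = r'Product: (.*?), Price: \$([\d.]+), SKU: ([A-Z0-9-]+)'
## Convert each matched tuple into a dictionary with named fields
products = [
    {"name": name, "price": float(price), "sku": sku}
    for name, price, sku in re.findall(product_pattern, data)
]
print(products)
## [{'name': 'Laptop X200', 'price': 899.99, 'sku': 'LP-X200-2023'}]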
Combining re.findall() with Other String Functions
Finally, let's see how we can combine re.findall() with other string functions for advanced text processing. Create a file named combined_processing.py:
import re
## Sample text with mixed content
text = """
Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
"""
## Extract all temperature readings in Fahrenheit
fahrenheit_pattern = r'(\d+)°F'
fahrenheit_temps = re.findall(fahrenheit_pattern, text)
## Convert to integers
fahrenheit_temps = [int(temp) for temp in fahrenheit_temps]
print("Original text:")
print(text)
print("\nFahrenheit temperatures:", fahrenheit_temps)
## Calculate average temperature
avg_temp = sum(fahrenheit_temps) / len(fahrenheit_temps)
print(f"Average temperature: {avg_temp:.1f}°F")
## Extract city and temperature pairs
city_temp_pattern = r'- ([A-Za-z\s]+): (\d+)°F'
city_temps = re.findall(city_temp_pattern, text)
print("\nCity and temperature pairs:")
for city, temp in city_temps:
print(f"{city}: {temp}°F")
## Find the hottest and coldest cities
hottest_city = max(city_temps, key=lambda x: int(x[1]))
coldest_city = min(city_temps, key=lambda x: int(x[1]))
print(f"\nHottest city: {hottest_city[0]} ({hottest_city[1]}°F)")
print(f"Coldest city: {coldest_city[0]} ({coldest_city[1]}°F)")
Run the script:
python3 ~/project/combined_processing.py
The output should be similar to:
Original text:
Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
Fahrenheit temperatures: [72, 59, 80, 68]
Average temperature: 69.8°F
City and temperature pairs:
New York: 72°F
London: 59°F
Tokyo: 80°F
Sydney: 68°F
Hottest city: Tokyo (80°F)
Coldest city: London (59°F)
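As a final optional sketch (not part of the lab output), the Fahrenheit values returned by re.findall() can be converted to Celsius with a list comprehension, which should roughly reproduce the values shown in parentheses in the sample text:
import re
text = "- New York: 72°F\n- London: 59°F\n- Tokyo: 80°F\n- Sydney: 68°F"
fahrenheit_temps = [int(t) for t in re.findall(r'(\d+)°F', text)]
## Standard Fahrenheit-to-Celsius conversion, rounded to one decimal place
celsius_temps = [round((f - 32) * 5 / 9, 1) for f in fahrenheit_temps]
print(celsius_temps)  ## [22.2, 15.0, 26.7, 20.0]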
These examples demonstrate how re.findall() can be combined with other Python features to solve real-world text processing problems. The ability to extract structured data from unstructured text is an essential skill for data analysis, web scraping, and many other programming tasks.