Python re.findall(): How to Find All Matching Substrings

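Introduction

In this tutorial, we explore Python's re.findall() function, a powerful tool for extracting matching substrings from text. The function is part of re, Python's built-in regular expression (regex) module, and is a staple of text processing work.

By the end of this lab, you will be able to use re.findall() to extract a variety of patterns from text, such as email addresses, phone numbers, and URLs. These skills are useful in data analysis, web scraping, and text processing applications.

Whether you are new to Python or want to improve your text processing skills, this step-by-step guide will give you practical knowledge for using regular expressions effectively in your Python projects.

Getting Started with re.findall()

In this first step, you will learn what re.findall() does and how to use it for basic pattern matching.

Understanding Regular Expressions

A regular expression (regex) is a special text string that describes a search pattern. Regular expressions are particularly useful for:

Finding specific character patterns in text
Validating text formats (such as email addresses)
Extracting information from text
Replacing text

Python's re Module

Python provides a built-in module named re for working with regular expressions. One of its most useful functions is re.findall().

Let's start by writing a simple Python script to see how re.findall() works.

First, open a terminal and change to your project directory.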

cd ~/project

Using your code editor, create a new Python file named basic_findall.py. In VSCode, click the "Explorer" icon (usually the first icon in the sidebar), click the "New File" button, and name the file basic_findall.py.
Add the following code to basic_findall.py.

import re

## Sample text
text = "Python is amazing. Python is versatile. I love learning Python programming."

## Using re.findall() to find all occurrences of "Python"
matches = re.findall(r"Python", text)

## Print the results
print("Original text:")
print(text)
print("\nMatches found:", len(matches))
print("Matching substrings:", matches)

Save the file and run it from the terminal.

python3 ~/project/basic_findall.py

You should see output like the following.

Original text:
Python is amazing. Python is versatile. I love learning Python programming.

Matches found: 3
Matching substrings: ['Python', 'Python', 'Python']

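Breaking Down the Code

Let's walk through what the code does.

import re imports the re module.
We defined a sample text in which the word "Python" appears several times.
re.findall(r"Python", text) searches the text for every occurrence of "Python".
The r before the string marks it as a raw string, which is recommended for regular expressions because backslashes reach the regex engine unchanged.
The function returned a list of all matching substrings.
We printed the results, which show that "Python" appears 3 times in the text.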
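
The pattern "Python" contains no backslashes, so the r prefix makes no visible difference in that script. The following minimal sketch, using a made-up sample string, shows a case where it does matter: in an ordinary string literal, Python turns \b into a backspace character before the regex engine sees it.

```python
import re

text = "cat concat cat"

# Raw string: \b reaches the regex engine as a word-boundary token.
raw_matches = re.findall(r"\bcat\b", text)

# Ordinary string: "\b" becomes a backspace character (0x08) first,
# so the pattern looks for backspace + "cat" + backspace and finds nothing.
plain_matches = re.findall("\bcat\b", text)

print(raw_matches)    # ['cat', 'cat']
print(plain_matches)  # []
```

Writing every pattern as a raw string avoids having to remember which escape sequences Python interprets on its own.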

Searching for Different Patterns

Next, let's search for a different pattern. Create a new file named findall_words.py.

import re

text = "The rain in Spain falls mainly on the plain."

## Find all words ending with 'ain'
matches = re.findall(r"\w+ain\b", text)

print("Original text:")
print(text)
print("\nWords ending with 'ain':", matches)

Run this script.

python3 ~/project/findall_words.py

The output looks like this.

Original text:
The rain in Spain falls mainly on the plain.

Words ending with 'ain': ['rain', 'Spain', 'plain']

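In this example:

\w+ matches one or more word characters (letters, digits, or underscores).
ain matches the literal characters "ain".
\b is a word boundary. It requires the match to end where a word ends, so only whole words ending in "ain" are returned.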
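
To see what the trailing \b contributes, it can help to run the same search with and without it. This short sketch reuses the sample sentence from the script above; the variable names are only for illustration.

```python
import re

text = "The rain in Spain falls mainly on the plain."

# With \b, the match must end at the end of a word.
with_boundary = re.findall(r"\w+ain\b", text)

# Without \b, "main" inside "mainly" also qualifies.
without_boundary = re.findall(r"\w+ain", text)

print(with_boundary)     # ['rain', 'Spain', 'plain']
print(without_boundary)  # ['rain', 'Spain', 'main', 'plain']
```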

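Experiment with these examples to get a feel for how re.findall() behaves with basic patterns.

Working with More Complex Patterns

In this step, we explore more complex patterns with re.findall() and learn how to use character classes and quantifiers to build flexible search patterns.

Finding Numbers in Text

First, let's write a script that extracts all the numbers from a text. Create a new file named extract_numbers.py.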

import re

text = "There are 42 apples, 15 oranges, and 123 bananas in the basket. The price is $9.99."

## Find all numbers (integers and decimals)
numbers = re.findall(r'\d+\.?\d*', text)

print("Original text:")
print(text)
print("\nNumbers found:", numbers)

## Finding only whole numbers
whole_numbers = re.findall(r'\b\d+\b', text)
print("Whole numbers only:", whole_numbers)

Run the script.

python3 ~/project/extract_numbers.py

You should see output like the following.

Original text:
There are 42 apples, 15 oranges, and 123 bananas in the basket. The price is $9.99.

Numbers found: ['42', '15', '123', '9.99']
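Whole numbers only: ['42', '15', '123', '9', '99']

Let's break down the patterns used.

\d+\.?\d* matches the following:
- \d+: one or more digits
- \.?: an optional decimal point
- \d*: zero or more digits after the decimal point
\b\d+\b matches the following:
- \b: a word boundary
- \d+: one or more digits
- \b: another word boundary, so the run of digits must be delimited on both sides

Note that the decimal point is not a word character, so it counts as a boundary. That is why \b\d+\b splits 9.99 into '9' and '99' rather than skipping it.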
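
One caveat about \d+\.?\d*: because the digits after the point are optional, a number followed by a sentence-ending period keeps that period. The sketch below, using a made-up sentence, compares it with a stricter alternative that only accepts a point when digits follow. The stricter pattern is a suggestion, not part of the original script.

```python
import re

text = "I counted 42. Then I paid 3.50 and 7 more."

# Original pattern: the optional point can swallow a sentence-ending period.
loose = re.findall(r"\d+\.?\d*", text)

# Stricter sketch: a point is matched only together with following digits.
strict = re.findall(r"\d+(?:\.\d+)?", text)

print(loose)   # ['42.', '3.50', '7']
print(strict)  # ['42', '3.50', '7']

# The strict results convert cleanly to numbers.
values = [float(n) for n in strict]
print(values)  # [42.0, 3.5, 7.0]
```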

Finding Words of a Specific Length

Let's write a script that finds every four-letter word in a text. Create find_word_length.py.

import re

text = "The quick brown fox jumps over the lazy dog. A good day to code."

## Find all 4-letter words
four_letter_words = re.findall(r'\b\w{4}\b', text)

print("Original text:")
print(text)
print("\nFour-letter words:", four_letter_words)

## Find all words between 3 and 5 letters
words_3_to_5 = re.findall(r'\b\w{3,5}\b', text)
print("Words with 3 to 5 letters:", words_3_to_5)

Run this script.

python3 ~/project/find_word_length.py

The output looks like this.

Original text:
The quick brown fox jumps over the lazy dog. A good day to code.

Four-letter words: ['over', 'lazy', 'good', 'code']
Words with 3 to 5 letters: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'good', 'day', 'code']

In these patterns:

\b\w{4}\b matches exactly four word characters surrounded by word boundaries.
\b\w{3,5}\b matches three to five word characters surrounded by word boundaries.
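
The {m,n} quantifier also accepts an open upper bound, and the word boundaries around it do real work. A brief sketch with a made-up sentence:

```python
import re

text = "Regular expressions make pattern matching concise."

# {7,} means "seven or more"; the upper bound is left open.
long_words = re.findall(r"\b\w{7,}\b", text)

# {1,4} means "between one and four".
short_words = re.findall(r"\b\w{1,4}\b", text)

# Without \b, \w{4} simply takes the first four characters of a longer word.
chunks = re.findall(r"\w{4}", "pattern")

print(long_words)   # ['Regular', 'expressions', 'pattern', 'matching', 'concise']
print(short_words)  # ['make']
print(chunks)       # ['patt']
```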

Using Character Classes

A character class lets you match any one character from a specific set. Let's create character_classes.py.

import re

text = "The temperature is 72°F or 22°C. Contact us at: info@example.com"

## Find runs of lowercase letters and digits (the text is lowercased first)
mixed_words = re.findall(r'\b[a-z0-9]+\b', text.lower())

print("Original text:")
print(text)
print("\nWords with letters and digits:", mixed_words)

## Find all email addresses
emails = re.findall(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', text)
print("Email addresses:", emails)

Run the script.

python3 ~/project/character_classes.py

The output looks like this.

Original text:
The temperature is 72°F or 22°C. Contact us at: info@example.com

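Words with letters and digits: ['the', 'temperature', 'is', '72', 'f', 'or', '22', 'c', 'contact', 'us', 'at', 'info', 'example', 'com']
Email addresses: ['info@example.com']

These patterns demonstrate the following:

\b[a-z0-9]+\b: matches runs of lowercase letters and digits. Characters outside the class, such as °, @, and ., end a run, which is why 72°f and info@example.com are split into separate pieces.
The email pattern matches the common form of an email address: a local part, @, a domain, and a top-level domain of at least two letters.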
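
The class [a-z0-9] accepts a letter or a digit at each position, so it cannot by itself require that a word contain both. One possible way to express that requirement is to add lookahead assertions, a technique not used elsewhere in this tutorial. The sample text and names below are hypothetical.

```python
import re

text = "order a1b2 shipped with 3 boxes to dock7"

# (?=[a-z0-9]*[a-z]) requires at least one letter in the word,
# (?=[a-z0-9]*\d)    requires at least one digit,
# and [a-z0-9]+ then consumes the word itself.
mixed = re.findall(r"\b(?=[a-z0-9]*[a-z])(?=[a-z0-9]*\d)[a-z0-9]+\b", text)

print(mixed)  # ['a1b2', 'dock7']
```

Plain words such as "order" and the bare number "3" are rejected because one of the two lookaheads fails.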

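Experiment with these examples to see how the individual pattern components work together to form powerful search patterns.

Using Flags and Capturing Groups

In this step, you will learn how to use flags to change how a regular expression behaves, and how to use capturing groups to extract specific parts of a matched pattern.

Understanding Flags in Regular Expressions

Flags change how the regex engine performs a search. Python's re module provides several flags that can be passed to re.findall() as an optional argument. Let's look at some common ones.

Create a new file named regex_flags.py.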

import re

text = """
Python is a great language.
PYTHON is versatile.
python is easy to learn.
"""

## Case-sensitive search (default)
matches_case_sensitive = re.findall(r"python", text)

## Case-insensitive search using re.IGNORECASE flag
matches_case_insensitive = re.findall(r"python", text, re.IGNORECASE)

print("Original text:")
print(text)
print("\nCase-sensitive matches:", matches_case_sensitive)
print("Case-insensitive matches:", matches_case_insensitive)

## Using the multiline flag
multiline_text = "First line\nSecond line\nThird line"
## Find lines starting with 'S'
starts_with_s = re.findall(r"^S.*", multiline_text, re.MULTILINE)
print("\nMultiline text:")
print(multiline_text)
print("\nLines starting with 'S':", starts_with_s)

Run the script.

python3 ~/project/regex_flags.py

The output looks like this.

Original text:

Python is a great language.
PYTHON is versatile.
python is easy to learn.


Case-sensitive matches: ['python']
Case-insensitive matches: ['Python', 'PYTHON', 'python']

Multiline text:
First line
Second line
Third line

Lines starting with 'S': ['Second line']

Common flags include:

re.IGNORECASE (or re.I): makes the pattern case-insensitive.
re.MULTILINE (or re.M): makes ^ and $ match at the start and end of each line.
re.DOTALL (or re.S): makes . match any character, including a newline.
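
The script above does not exercise re.DOTALL or show how to use more than one flag at once. The following sketch, with a made-up multi-line string, covers both; flags are combined with the bitwise OR operator.

```python
import re

text = "<note>\nLine one\nLine two\n</note>"

# Without re.DOTALL, '.' stops at newlines, so the two tags never pair up.
no_dotall = re.findall(r"<note>(.*?)</note>", text)

# With re.DOTALL, '.' also matches '\n' and the whole block is captured.
with_dotall = re.findall(r"<note>(.*?)</note>", text, re.DOTALL)

# Two flags at once: case-insensitive and line-by-line anchors.
combined = re.findall(r"^line \w+$", text, re.IGNORECASE | re.MULTILINE)

print(no_dotall)    # []
print(with_dotall)  # ['\nLine one\nLine two\n']
print(combined)     # ['Line one', 'Line two']
```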

Using Capturing Groups

Capturing groups let you extract specific parts of the matched text. You create one by enclosing part of the regular expression in parentheses.

Create a file named capturing_groups.py.

import re

## Sample text with dates in various formats
text = "Important dates: 2023-11-15, 12/25/2023, and Jan 1, 2024."

## Extract dates in YYYY-MM-DD format
iso_dates = re.findall(r'(\d{4})-(\d{1,2})-(\d{1,2})', text)

## Extract dates in MM/DD/YYYY format
us_dates = re.findall(r'(\d{1,2})/(\d{1,2})/(\d{4})', text)

print("Original text:")
print(text)
print("\nISO dates (Year, Month, Day):", iso_dates)
print("US dates (Month, Day, Year):", us_dates)

## Extract month names with capturing groups
month_dates = re.findall(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2}),\s+(\d{4})', text)
print("Month name dates (Month, Day, Year):", month_dates)

Run the script.

python3 ~/project/capturing_groups.py

The output looks like this.

Original text:
Important dates: 2023-11-15, 12/25/2023, and Jan 1, 2024.

ISO dates (Year, Month, Day): [('2023', '11', '15')]
US dates (Month, Day, Year): [('12', '25', '2023')]
Month name dates (Month, Day, Year): [('Jan', '1', '2024')]

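In this example:

Each pair of parentheses () creates a capturing group.
Because the pattern contains more than one group, re.findall() returns a list of tuples, each holding the captured groups of one match.
This lets you extract structured data from text and organize it.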
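
The shape of the list that re.findall() returns depends on how many capturing groups the pattern has, which is a frequent source of surprises. A compact sketch with a made-up string shows the cases side by side.

```python
import re

text = "2023-11-15 and 2024-01-01"

# No groups: a list of the full matches.
no_groups = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Exactly one group: a list of that group's text only.
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", text)

# Several groups: a list of tuples, one tuple per match.
many_groups = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)

# (?:...) groups without capturing, so full matches are returned again.
non_capturing = re.findall(r"(?:\d{4})-\d{2}-\d{2}", text)

print(no_groups)      # ['2023-11-15', '2024-01-01']
print(one_group)      # ['2023', '2024']
print(many_groups)    # [('2023', '11', '15'), ('2024', '01', '01')]
print(non_capturing)  # ['2023-11-15', '2024-01-01']
```

When parentheses are needed only for alternation or repetition, the non-capturing form (?:...) keeps the result a list of full matches.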

A Practical Example: Parsing a Log File

Now let's apply what we've learned to a practical example. Suppose you have a log file with entries you want to parse. Create a file named log_parser.py.

import re

## Sample log entries
logs = """
[2023-11-15 08:30:45] INFO: System started
[2023-11-15 08:35:12] WARNING: High memory usage (85%)
[2023-11-15 08:42:11] ERROR: Connection timeout
[2023-11-15 09:15:27] INFO: Backup completed
"""

## Extract timestamp, level, and message from log entries
log_pattern = r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.+)'
log_entries = re.findall(log_pattern, logs)

print("Original logs:")
print(logs)
print("\nParsed log entries (timestamp, level, message):")
for entry in log_entries:
    timestamp, level, message = entry
    print(f"Time: {timestamp} | Level: {level} | Message: {message}")

## Find all ERROR logs
error_logs = re.findall(r'\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\] ERROR: (.+)', logs)
print("\nError messages:", error_logs)

Run the script.

python3 ~/project/log_parser.py

The output looks like this.

Original logs:

[2023-11-15 08:30:45] INFO: System started
[2023-11-15 08:35:12] WARNING: High memory usage (85%)
[2023-11-15 08:42:11] ERROR: Connection timeout
[2023-11-15 09:15:27] INFO: Backup completed


Parsed log entries (timestamp, level, message):
Time: 2023-11-15 08:30:45 | Level: INFO | Message: System started
Time: 2023-11-15 08:35:12 | Level: WARNING | Message: High memory usage (85%)
Time: 2023-11-15 08:42:11 | Level: ERROR | Message: Connection timeout
Time: 2023-11-15 09:15:27 | Level: INFO | Message: Backup completed

Error messages: ['Connection timeout']

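This example demonstrates:

Extracting structured information with capturing groups
Processing and displaying the captured information
Filtering for a specific type of log entry

Flags and capturing groups add power and flexibility to regular expressions, enabling more precise and structured data extraction.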
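
Once the tuples are extracted, they can be reshaped for further processing. As one possible follow-up, this sketch turns each tuple into a dictionary and counts entries per level with collections.Counter. The dictionary keys are names chosen here for illustration.

```python
import re
from collections import Counter

logs = """
[2023-11-15 08:30:45] INFO: System started
[2023-11-15 08:35:12] WARNING: High memory usage (85%)
[2023-11-15 08:42:11] ERROR: Connection timeout
[2023-11-15 09:15:27] INFO: Backup completed
"""

log_pattern = r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.+)'

# One dictionary per log line, built from the three captured groups.
entries = [
    {"timestamp": ts, "level": level, "message": message}
    for ts, level, message in re.findall(log_pattern, logs)
]

# Count how many entries each level has.
level_counts = Counter(entry["level"] for entry in entries)

print(entries[2]["message"])  # Connection timeout
print(level_counts["INFO"])   # 2
```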

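Real-World Applications of re.findall()

In this final step, we explore practical, real-world applications of re.findall(). You will write code that extracts email addresses and URLs and performs data cleaning tasks.

Extracting Email Addresses

Email extraction is a common task in data mining, web scraping, and text analysis. Create a file named email_extractor.py.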

import re

## Sample text with email addresses
text = """
Contact information:
- Support: support@example.com
- Sales: sales@example.com, international.sales@example.co.uk
- Technical team: tech.team@subdomain.example.org
Personal email: john.doe123@gmail.com
"""

## Extract all email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)

print("Original text:")
print(text)
print("\nExtracted email addresses:")
for i, email in enumerate(emails, 1):
    print(f"{i}. {email}")

## Extract specific domain emails
gmail_emails = re.findall(r'\b[A-Za-z0-9._%+-]+@gmail\.com\b', text)
print("\nGmail addresses:", gmail_emails)

Run the script.

python3 ~/project/email_extractor.py

The output looks like this.

Original text:

Contact information:
- Support: support@example.com
- Sales: sales@example.com, international.sales@example.co.uk
- Technical team: tech.team@subdomain.example.org
Personal email: john.doe123@gmail.com


Extracted email addresses:
1. support@example.com
2. sales@example.com
3. international.sales@example.co.uk
4. tech.team@subdomain.example.org
5. john.doe123@gmail.com

Gmail addresses: ['john.doe123@gmail.com']
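
Real text often repeats an address or varies its capitalization. A possible post-processing step, sketched here with a made-up string, lowercases the matches, removes duplicates, and lists the distinct domains. Treating addresses as case-insensitive is an assumption that holds for most mail providers but is not guaranteed for the part before the @.

```python
import re

text = "Write to Support@Example.com or support@example.com; sales@example.co.uk also works."

email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)

# Lowercase and de-duplicate with a set, then sort for stable output.
unique_emails = sorted({email.lower() for email in emails})

# Everything after the first '@' is the domain.
domains = sorted({email.split("@", 1)[1] for email in unique_emails})

print(emails)         # ['Support@Example.com', 'support@example.com', 'sales@example.co.uk']
print(unique_emails)  # ['sales@example.co.uk', 'support@example.com']
print(domains)        # ['example.co.uk', 'example.com']
```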

Extracting URLs

URL extraction is useful for web scraping, link validation, and content analysis. Create a file named url_extractor.py.

import re

## Sample text with various URLs
text = """
Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png
"""

## Extract all URLs
url_pattern = r'https?://[^\s]+'
urls = re.findall(url_pattern, text)

print("Original text:")
print(text)
print("\nExtracted URLs:")
for i, url in enumerate(urls, 1):
    print(f"{i}. {url}")

## Extract specific domain URLs
github_urls = re.findall(r'https?://github\.com/[^\s]+', text)
print("\nGitHub URLs:", github_urls)

## Extract image URLs (non-capturing group, so findall returns the full URL, not just the extension)
image_urls = re.findall(r'https?://[^\s]+\.(?:jpg|jpeg|png|gif)', text)
print("\nImage URLs:", image_urls)

Run the script.

python3 ~/project/url_extractor.py

The output looks like this.

Original text:

Visit our website at https://www.example.com
Documentation: http://docs.example.org/guide
Repository: https://github.com/user/project
Forum: https://community.example.net/forum
Image: https://images.example.com/logo.png


Extracted URLs:
1. https://www.example.com
2. http://docs.example.org/guide
3. https://github.com/user/project
4. https://community.example.net/forum
5. https://images.example.com/logo.png

GitHub URLs: ['https://github.com/user/project']

Image URLs: ['https://images.example.com/logo.png']
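
The pattern https?://[^\s]+ takes everything up to the next whitespace, so in running prose it also picks up punctuation that follows a link. One simple remedy, sketched below with a made-up sentence, is to strip trailing punctuation afterwards. The set of characters to strip is a judgment call, since a URL can legitimately end in some of them.

```python
import re

text = "See https://www.example.com/docs. Also try (http://docs.example.org/guide), thanks!"

raw_urls = re.findall(r"https?://[^\s]+", text)

# Remove punctuation that most likely belongs to the sentence, not the URL.
clean_urls = [url.rstrip(".,;:!?)") for url in raw_urls]

print(raw_urls)    # ['https://www.example.com/docs.', 'http://docs.example.org/guide),']
print(clean_urls)  # ['https://www.example.com/docs', 'http://docs.example.org/guide']
```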

Data Cleaning with re.findall()

Let's write a script that cleans up a messy dataset and extracts information from it. Create a file named data_cleaning.py.

import re

## Sample messy data
data = """
Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023
"""

## Extract product information
product_pattern = r'Product: (.*?), Price: \$([\d.]+), SKU: ([A-Z0-9-]+)'
products = re.findall(product_pattern, data)

print("Original data:")
print(data)
print("\nExtracted and structured product information:")
print("Name\t\tPrice\t\tSKU")
print("-" * 50)
for product in products:
    name, price, sku = product
    print(f"{name}\t${price}\t{sku}")

## Calculate total price
total_price = sum(float(price) for _, price, _ in products)
print(f"\nTotal price of all products: ${total_price:.2f}")

## Extract only products above $500
expensive_products = [name for name, price, _ in products if float(price) > 500]
print("\nExpensive products (>$500):", expensive_products)

Run the script.

python3 ~/project/data_cleaning.py

The output looks like this.

Original data:

Product: Laptop X200, Price: $899.99, SKU: LP-X200-2023
Product: Smartphone S10+, Price: $699.50, SKU: SP-S10P-2023
Product: Tablet T7, Price: $299.99, SKU: TB-T7-2023
Product: Wireless Earbuds, Price: $129.95, SKU: WE-PRO-2023


Extracted and structured product information:
Name		Price		SKU
--------------------------------------------------
Laptop X200	$899.99	LP-X200-2023
Smartphone S10+	$699.50	SP-S10P-2023
Tablet T7	$299.99	TB-T7-2023
Wireless Earbuds	$129.95	WE-PRO-2023

Total price of all products: $2029.43

Expensive products (>$500): ['Laptop X200', 'Smartphone S10+']
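
The product pattern uses the lazy quantifier .*? for the name. The difference from the greedy .* does not show when each record sits on its own line, but it becomes visible as soon as two records share a line. A sketch with a made-up single-line record:

```python
import re

line = "Product: Pen, Price: $1.50; Product: Ink, Price: $4.00"

# Lazy: the name stops at the first ", Price:" that lets the match succeed.
lazy = re.findall(r"Product: (.*?), Price: \$([\d.]+)", line)

# Greedy: the name runs as far as possible, up to the last ", Price:".
greedy = re.findall(r"Product: (.*), Price: \$([\d.]+)", line)

print(lazy)    # [('Pen', '1.50'), ('Ink', '4.00')]
print(greedy)  # [('Pen, Price: $1.50; Product: Ink', '4.00')]
```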

Combining re.findall() with Other String Functions

Finally, let's see how to combine re.findall() with other string functions for more advanced text processing. Create a file named combined_processing.py.

import re

## Sample text with mixed content
text = """
Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
"""

## Extract all temperature readings in Fahrenheit
fahrenheit_pattern = r'(\d+)°F'
fahrenheit_temps = re.findall(fahrenheit_pattern, text)

## Convert to integers
fahrenheit_temps = [int(temp) for temp in fahrenheit_temps]

print("Original text:")
print(text)
print("\nFahrenheit temperatures:", fahrenheit_temps)

## Calculate average temperature
avg_temp = sum(fahrenheit_temps) / len(fahrenheit_temps)
print(f"Average temperature: {avg_temp:.1f}°F")

## Extract city and temperature pairs
city_temp_pattern = r'- ([A-Za-z\s]+): (\d+)°F'
city_temps = re.findall(city_temp_pattern, text)

print("\nCity and temperature pairs:")
for city, temp in city_temps:
    print(f"{city}: {temp}°F")

## Find the hottest and coldest cities
hottest_city = max(city_temps, key=lambda x: int(x[1]))
coldest_city = min(city_temps, key=lambda x: int(x[1]))

print(f"\nHottest city: {hottest_city[0]} ({hottest_city[1]}°F)")
print(f"Coldest city: {coldest_city[0]} ({coldest_city[1]}°F)")

Run the script.

python3 ~/project/combined_processing.py

The output looks like this.

Original text:

Temperature readings:
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)


Fahrenheit temperatures: [72, 59, 80, 68]
Average temperature: 69.8°F

City and temperature pairs:
New York: 72°F
London: 59°F
Tokyo: 80°F
Sydney: 68°F

Hottest city: Tokyo (80°F)
Coldest city: London (59°F)
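
The same readings also list Celsius values, which allows a small consistency check. As an illustrative extension, this sketch captures both numbers from each line, converts Fahrenheit with the standard formula C = (F - 32) × 5 / 9, and compares the result with the listed value. The 0.05 tolerance is an assumption based on the values being rounded to one decimal place.

```python
import re

text = """
- New York: 72°F (22.2°C)
- London: 59°F (15.0°C)
- Tokyo: 80°F (26.7°C)
- Sydney: 68°F (20.0°C)
"""

# Two groups: the Fahrenheit integer and the Celsius decimal in parentheses.
pairs = re.findall(r"(\d+)°F \((\d+\.\d+)°C\)", text)

checks = []
for fahrenheit, celsius in pairs:
    converted = (int(fahrenheit) - 32) * 5 / 9
    # The listed value is rounded to one decimal, so allow a small tolerance.
    checks.append(abs(converted - float(celsius)) < 0.05)
    print(f"{fahrenheit}°F = {converted:.1f}°C (listed: {celsius}°C)")

print("All readings consistent:", all(checks))
```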

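These examples show how re.findall() can be combined with other Python features to solve real-world text processing problems. The ability to extract structured data from unstructured text is an essential skill for data analysis, web scraping, and many other programming tasks.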

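Summary

In this tutorial, you learned how to use re.findall(), a powerful function for matching and extracting text patterns in Python. You gained practical knowledge in several key areas:

Basic pattern matching - You learned how to find simple substrings and how to use basic regular expression patterns to match specific text.
Complex patterns - You explored more complex patterns involving character classes, word boundaries, and quantifiers to build flexible search patterns.
Flags and capturing groups - You saw how flags such as re.IGNORECASE change search behavior and how capturing groups extract structured data.
Real-world applications - You applied this knowledge to practical scenarios such as extracting email addresses and URLs, parsing log files, and cleaning data.

The skills developed in this lab apply to a wide range of text processing tasks, including:

Data extraction and cleaning
Content analysis
Web scraping
Log file parsing
Data validation

With regular expressions and re.findall(), you now have a powerful tool for handling text data in your Python projects. With continued practice, you will become proficient at writing efficient patterns tailored to your own text processing needs.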
