Python requests でのレスポンスコンテンツの解析方法 - データ解析と自動化

はじめに

Python の requests ライブラリは、Web サービスや API と対話するための強力なツールです。このチュートリアルでは、Python を使用して HTTP リクエストを送信し、レスポンスデータを解析する方法を学びます。この実験（Lab）の終わりには、さまざまな種類の API レスポンスから貴重な情報を抽出できるようになり、データ駆動型のアプリケーションを構築し、Web インタラクションを自動化できるようになります。

Requests ライブラリのインストールと基本的なリクエストの作成

この最初のステップでは、Python の requests ライブラリをインストールし、最初の HTTP リクエストを実行して、公開 API からデータを取得します。

Requests のインストール

requests ライブラリは、pip (Python のパッケージインストーラー) を使用してインストールする必要があるサードパーティパッケージです。まず、インストールから始めましょう。

pip install requests

requests が正常にインストールされたことを確認する出力が表示されるはずです。

最初の HTTP リクエストの作成

次に、シンプルな HTTP リクエストを作成するための Python ファイルを作成しましょう。WebIDE で、/home/labex/project ディレクトリに basic_request.py という新しいファイルを作成します。

ファイルに次のコードを追加します。

import requests

## Make a GET request to a public API
response = requests.get("https://jsonplaceholder.typicode.com/todos/1")

## Print the status code
print(f"Status code: {response.status_code}")

## Print the raw response content
print("\nRaw response content:")
print(response.text)

## Print the response headers
print("\nResponse headers:")
for header, value in response.headers.items():
    print(f"{header}: {value}")

このコードは、サンプル API エンドポイントに対して GET リクエストを行い、レスポンスに関する情報を出力します。

レスポンスオブジェクトの理解

コードを実行して、どのような情報が返されるか見てみましょう。ターミナルで、以下を実行します。

python basic_request.py

次のような出力が表示されるはずです。

Status code: 200

Raw response content:
{
  "userId": 1,
  "id": 1,
  "title": "delectus aut autem",
  "completed": false
}

Response headers:
Date: Mon, 01 Jan 2023 12:00:00 GMT
Content-Type: application/json; charset=utf-8
...

レスポンスオブジェクトには、いくつかの重要な属性が含まれています。

status_code: HTTP ステータスコード (200 は成功を意味します)
text: 文字列としてのレスポンスコンテンツ
headers: レスポンスヘッダーの辞書

Web リクエストを扱う際、これらの属性はサーバーのレスポンスを理解し、適切に処理するのに役立ちます。

HTTP ステータスコード

HTTP ステータスコードは、リクエストが成功したか失敗したかを示します。

2xx (200 など): 成功
3xx (301 など): リダイレクト
4xx (404 など): クライアントエラー
5xx (500 など): サーバーエラー

成功したレスポンスを確認するために、コードを変更しましょう。check_status.py という新しいファイルを作成し、次の内容を記述します。

import requests

try:
    ## Make a GET request to a valid URL
    response = requests.get("https://jsonplaceholder.typicode.com/todos/1")

    ## Check if the request was successful
    if response.status_code == 200:
        print("Request successful!")
    else:
        print(f"Request failed with status code: {response.status_code}")

    ## Try an invalid URL
    invalid_response = requests.get("https://jsonplaceholder.typicode.com/invalid")
    print(f"Invalid URL status code: {invalid_response.status_code}")

except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

このコードを実行して、異なる URL が異なるステータスコードを返す様子を確認します。

python check_status.py

有効な URL がステータスコード 200 を返し、無効な URL がステータスコード 404 を返すことがわかるはずです。

JSON レスポンスデータの解析

多くの最新の API は、JSON (JavaScript Object Notation) 形式でデータを返します。このステップでは、JSON レスポンスを解析し、Python でデータを使用する方法を学びます。

JSON の理解

JSON は、人間が読み書きしやすく、機械が解析して生成しやすい軽量なデータ交換形式です。Python の辞書に似た、キーと値のペアに基づいています。

JSON オブジェクトの例を次に示します。

{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com",
  "is_active": true,
  "hobbies": ["reading", "swimming", "cycling"]
}

JSON レスポンスの解析

requests ライブラリを使用すると、.json() メソッドを使用して JSON レスポンスを簡単に解析できます。parse_json.py という新しいファイルを作成し、次のコードを追加しましょう。

import requests

## Make a request to a GitHub API endpoint that returns JSON data
response = requests.get("https://api.github.com/users/python")

## Check if the request was successful
if response.status_code == 200:
    ## Parse the JSON response
    data = response.json()

    ## Print the parsed data
    print("Parsed JSON data:")
    print(f"Username: {data['login']}")
    print(f"Name: {data.get('name', 'Not provided')}")
    print(f"Followers: {data['followers']}")
    print(f"Public repositories: {data['public_repos']}")

    ## Print the type to verify it's a Python dictionary
    print(f"\nType of parsed data: {type(data)}")

    ## Access nested data
    print("\nAccessing specific elements:")
    print(f"Avatar URL: {data['avatar_url']}")
else:
    print(f"Request failed with status code: {response.status_code}")

このスクリプトを実行して、JSON データがどのように Python 辞書に解析されるかを確認します。

python parse_json.py

GitHub ユーザーに関する情報 (ユーザー名、フォロワー数、リポジトリ数など) が表示されるはずです。

データのリストの操作

多くの API はオブジェクトのリストを返します。この種のレスポンスを処理する方法を見てみましょう。json_list.py というファイルを作成し、次の内容を記述します。

import requests

## Make a request to an API that returns a list of posts
response = requests.get("https://jsonplaceholder.typicode.com/posts")

## Check if the request was successful
if response.status_code == 200:
    ## Parse the JSON response (this will be a list of posts)
    posts = response.json()

    ## Print the total number of posts
    print(f"Total posts: {len(posts)}")

    ## Print details of the first 3 posts
    print("\nFirst 3 posts:")
    for i, post in enumerate(posts[:3], 1):
        print(f"\nPost #{i}")
        print(f"User ID: {post['userId']}")
        print(f"Post ID: {post['id']}")
        print(f"Title: {post['title']}")
        print(f"Body: {post['body'][:50]}...")  ## Print just the beginning of the body
else:
    print(f"Request failed with status code: {response.status_code}")

このスクリプトを実行して、JSON オブジェクトのリストを処理する方法を確認します。

python json_list.py

最初の 3 つの投稿に関する情報 (タイトルや本文の冒頭など) が表示されるはずです。

JSON 解析のエラー処理

場合によっては、レスポンスに有効な JSON データが含まれていないことがあります。これを適切に処理する方法を見てみましょう。json_error.py というファイルを作成し、このコードを記述します。

import requests
import json

def get_and_parse_json(url):
    try:
        ## Make the request
        response = requests.get(url)

        ## Check if the request was successful
        response.raise_for_status()

        ## Try to parse the JSON
        try:
            data = response.json()
            return data
        except json.JSONDecodeError:
            print(f"Response from {url} is not valid JSON")
            print(f"Raw response: {response.text[:100]}...")  ## Print part of the raw response
            return None

    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e}")
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")

    return None

## Test with a valid JSON endpoint
json_data = get_and_parse_json("https://jsonplaceholder.typicode.com/posts/1")
if json_data:
    print("\nValid JSON response:")
    print(f"Title: {json_data['title']}")

## Test with a non-JSON endpoint
html_data = get_and_parse_json("https://www.example.com")
if html_data:
    print("\nThis should not print as example.com returns HTML, not JSON")
else:
    print("\nAs expected, could not parse HTML as JSON")

このスクリプトを実行して、さまざまな種類のレスポンスを処理する方法を確認します。

python json_error.py

コードが有効な JSON レスポンスと非 JSON レスポンスの両方を正常に処理することがわかるはずです。

BeautifulSoup を使用した HTML コンテンツの解析

Web データを使用する場合、HTML レスポンスに遭遇することがよくあります。HTML の解析には、Python の BeautifulSoup ライブラリが優れたツールです。このステップでは、HTML レスポンスから情報を抽出する方法を学びます。

BeautifulSoup のインストール

まず、BeautifulSoup とその HTML パーサーをインストールしましょう。

pip install beautifulsoup4

基本的な HTML 解析

parse_html.py というファイルを作成して、Web ページを取得して解析しましょう。

import requests
from bs4 import BeautifulSoup

## Make a request to a webpage
url = "https://www.example.com"
response = requests.get(url)

## Check if the request was successful
if response.status_code == 200:
    ## Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    ## Extract the page title
    title = soup.title.text
    print(f"Page title: {title}")

    ## Extract all paragraphs
    paragraphs = soup.find_all('p')
    print(f"\nNumber of paragraphs: {len(paragraphs)}")

    ## Print the text of the first paragraph
    if paragraphs:
        print(f"\nFirst paragraph text: {paragraphs[0].text.strip()}")

    ## Extract all links
    links = soup.find_all('a')
    print(f"\nNumber of links: {len(links)}")

    ## Print the href attribute of the first link
    if links:
        print(f"First link href: {links[0].get('href')}")

else:
    print(f"Request failed with status code: {response.status_code}")

このスクリプトを実行して、HTML ページから基本的な情報を抽出する方法を確認します。

python parse_html.py

ページタイトル、段落数、最初の段落のテキスト、リンク数、最初のリンクの URL を示す出力が表示されるはずです。

特定の要素の検索

次に、CSS セレクターを使用して特定の要素を検索する方法を見てみましょう。html_selectors.py というファイルを作成します。

import requests
from bs4 import BeautifulSoup

## Make a request to a webpage with more complex structure
url = "https://quotes.toscrape.com/"
response = requests.get(url)

## Check if the request was successful
if response.status_code == 200:
    ## Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    ## Find all quote elements
    quote_elements = soup.select('.quote')
    print(f"Number of quotes found: {len(quote_elements)}")

    ## Process the first 3 quotes
    print("\nFirst 3 quotes:")
    for i, quote_element in enumerate(quote_elements[:3], 1):
        ## Extract the quote text
        text = quote_element.select_one('.text').text

        ## Extract the author
        author = quote_element.select_one('.author').text

        ## Extract the tags
        tags = [tag.text for tag in quote_element.select('.tag')]

        ## Print the information
        print(f"\nQuote #{i}")
        print(f"Text: {text}")
        print(f"Author: {author}")
        print(f"Tags: {', '.join(tags)}")

else:
    print(f"Request failed with status code: {response.status_code}")

このスクリプトを実行して、CSS セレクターを使用して特定の要素を抽出する方法を確認します。

python html_selectors.py

最初の 3 つの引用に関する情報 (引用テキスト、著者、タグなど) を示す出力が表示されるはずです。

シンプルな Web スクレーパーの構築

すべてを組み合わせて、Web ページから構造化データを抽出するシンプルな Web スクレーパーを構築しましょう。quotes_scraper.py というファイルを作成します。

import requests
from bs4 import BeautifulSoup
import json
import os

def scrape_quotes_page(url):
    ## Make a request to the webpage
    response = requests.get(url)

    ## Check if the request was successful
    if response.status_code != 200:
        print(f"Request failed with status code: {response.status_code}")
        return None

    ## Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    ## Extract all quotes
    quotes = []
    for quote_element in soup.select('.quote'):
        ## Extract the quote text
        text = quote_element.select_one('.text').text.strip('"')

        ## Extract the author
        author = quote_element.select_one('.author').text

        ## Extract the tags
        tags = [tag.text for tag in quote_element.select('.tag')]

        ## Add the quote to our list
        quotes.append({
            'text': text,
            'author': author,
            'tags': tags
        })

    ## Check if there's a next page
    next_page = soup.select_one('.next a')
    next_page_url = None
    if next_page:
        next_page_url = 'https://quotes.toscrape.com' + next_page['href']

    return {
        'quotes': quotes,
        'next_page': next_page_url
    }

## Scrape the first page
result = scrape_quotes_page('https://quotes.toscrape.com/')

if result:
    ## Print information about the quotes found
    quotes = result['quotes']
    print(f"Found {len(quotes)} quotes on the first page")

    ## Print the first 2 quotes
    print("\nFirst 2 quotes:")
    for i, quote in enumerate(quotes[:2], 1):
        print(f"\nQuote #{i}")
        print(f"Text: {quote['text']}")
        print(f"Author: {quote['author']}")
        print(f"Tags: {', '.join(quote['tags'])}")

    ## Save the quotes to a JSON file
    output_dir = '/home/labex/project'
    with open(os.path.join(output_dir, 'quotes.json'), 'w') as f:
        json.dump(quotes, f, indent=2)

    print(f"\nSaved {len(quotes)} quotes to {output_dir}/quotes.json")

    ## Print information about the next page
    if result['next_page']:
        print(f"\nNext page URL: {result['next_page']}")
    else:
        print("\nNo next page available")

このスクリプトを実行して、Web サイトから引用をスクレイピングします。

python quotes_scraper.py

最初のページで見つかった引用に関する情報を示す出力が表示され、引用は quotes.json という JSON ファイルに保存されます。

JSON ファイルをチェックして、構造化データを確認します。

cat quotes.json

ファイルには、テキスト、著者、およびタグのプロパティを持つ引用オブジェクトの JSON 配列が含まれているはずです。

バイナリレスポンスコンテンツの操作

これまでは、JSON や HTML などのテキストベースのレスポンスに焦点を当ててきました。しかし、requests ライブラリは、画像、PDF、その他のファイルなどのバイナリコンテンツも処理できます。このステップでは、バイナリコンテンツをダウンロードして処理する方法を学びます。

画像のダウンロード

まず、画像をダウンロードすることから始めましょう。download_image.py というファイルを作成します。

import requests
import os

## URL of an image to download
image_url = "https://httpbin.org/image/jpeg"

## Make a request to get the image
response = requests.get(image_url)

## Check if the request was successful
if response.status_code == 200:
    ## Get the content type
    content_type = response.headers.get('Content-Type', '')
    print(f"Content-Type: {content_type}")

    ## Check if the content is an image
    if 'image' in content_type:
        ## Create a directory to save the image if it doesn't exist
        output_dir = '/home/labex/project/downloads'
        os.makedirs(output_dir, exist_ok=True)

        ## Save the image to a file
        image_path = os.path.join(output_dir, 'sample_image.jpg')
        with open(image_path, 'wb') as f:
            f.write(response.content)

        ## Print information about the saved image
        print(f"Image saved to: {image_path}")
        print(f"Image size: {len(response.content)} bytes")
    else:
        print("The response does not contain an image")
else:
    print(f"Request failed with status code: {response.status_code}")

このスクリプトを実行して、画像をダウンロードします。

python download_image.py

画像がダウンロードされ、/home/labex/project/downloads/sample_image.jpg に保存されたことを確認する出力が表示されるはずです。

プログレスバー付きのファイルのダウンロード

大きなファイルをダウンロードする場合、プログレスインジケーターを表示すると便利です。ダウンロードの進行状況を示すスクリプトを作成しましょう。download_with_progress.py というファイルを作成します。

import requests
import os
import sys

def download_file(url, filename):
    ## Make a request to get the file
    ## Stream the response to handle large files efficiently
    response = requests.get(url, stream=True)

    ## Check if the request was successful
    if response.status_code != 200:
        print(f"Request failed with status code: {response.status_code}")
        return False

    ## Get the total file size if available
    total_size = int(response.headers.get('Content-Length', 0))
    if total_size:
        print(f"Total file size: {total_size/1024:.2f} KB")
    else:
        print("Content-Length header not found. Unable to determine file size.")

    ## Create a directory to save the file if it doesn't exist
    os.makedirs(os.path.dirname(filename), exist_ok=True)

    ## Download the file in chunks and show progress
    print(f"Downloading {url} to {filename}...")

    ## Initialize variables for progress tracking
    downloaded = 0
    chunk_size = 8192  ## 8 KB chunks

    ## Open the file for writing
    with open(filename, 'wb') as f:
        ## Iterate through the response chunks
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  ## Filter out keep-alive chunks
                f.write(chunk)
                downloaded += len(chunk)

                ## Calculate and display progress
                if total_size:
                    percent = downloaded * 100 / total_size
                    sys.stdout.write(f"\rProgress: {percent:.1f}% ({downloaded/1024:.1f} KB)")
                    sys.stdout.flush()
                else:
                    sys.stdout.write(f"\rDownloaded: {downloaded/1024:.1f} KB")
                    sys.stdout.flush()

    ## Print a newline to ensure the next output starts on a new line
    print()

    return True

## URL of a file to download
file_url = "https://speed.hetzner.de/100MB.bin"

## Path where the file will be saved
output_path = '/home/labex/project/downloads/test_file.bin'

## Download the file
success = download_file(file_url, output_path)

if success:
    ## Get file stats
    file_size = os.path.getsize(output_path)
    print(f"\nDownload complete!")
    print(f"File saved to: {output_path}")
    print(f"File size: {file_size/1024/1024:.2f} MB")
else:
    print("\nDownload failed.")

このスクリプトを実行して、プログレスバーでファイルのダウンロードを行います。

python download_with_progress.py

ファイルのダウンロード中にプログレスバーが更新されるのがわかります。これは 100MB のファイルをダウンロードするため、接続速度によっては時間がかかる場合があります。

ダウンロードをキャンセルするには、Ctrl+C を押します。

レスポンスヘッダーとメタデータの操作

ファイルをダウンロードする場合、レスポンスヘッダーには多くの場合、有用なメタデータが含まれています。レスポンスヘッダーを詳細に調べるスクリプトを作成しましょう。response_headers.py というファイルを作成します。

import requests

def check_url(url):
    print(f"\nChecking URL: {url}")

    try:
        ## Make a HEAD request first to get headers without downloading the full content
        head_response = requests.head(url)

        print(f"HEAD request status code: {head_response.status_code}")

        if head_response.status_code == 200:
            ## Print all headers
            print("\nResponse headers:")
            for header, value in head_response.headers.items():
                print(f"  {header}: {value}")

            ## Extract content type and size
            content_type = head_response.headers.get('Content-Type', 'Unknown')
            content_length = head_response.headers.get('Content-Length', 'Unknown')

            print(f"\nContent Type: {content_type}")

            if content_length != 'Unknown':
                size_kb = int(content_length) / 1024
                size_mb = size_kb / 1024

                if size_mb >= 1:
                    print(f"Content Size: {size_mb:.2f} MB")
                else:
                    print(f"Content Size: {size_kb:.2f} KB")
            else:
                print("Content Size: Unknown")

            ## Check if the server supports range requests
            accept_ranges = head_response.headers.get('Accept-Ranges', 'none')
            print(f"Supports range requests: {'Yes' if accept_ranges != 'none' else 'No'}")

        else:
            print(f"HEAD request failed with status code: {head_response.status_code}")

    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")

## Check a few different URLs
check_url("https://httpbin.org/image/jpeg")
check_url("https://speed.hetzner.de/100MB.bin")
check_url("https://example.com")

このスクリプトを実行して、レスポンスヘッダーに関する詳細情報を確認します。

python response_headers.py

画像、バイナリファイル、HTML ページなど、さまざまな種類のコンテンツのヘッダーを示す出力が表示されます。

レスポンスヘッダーを理解することは、次のような多くの Web 開発タスクにとって重要です。

ダウンロード前にファイルの種類とサイズを決定する
範囲リクエストを使用した再開可能なダウンロードの実装
キャッシュポリシーと有効期限の確認
リダイレクトと認証の処理

まとめ

この実験では、Python の requests ライブラリを使用して、Web サービスや API と対話する方法を学びました。これで、以下のスキルを習得しました。

HTTP リクエストを行い、レスポンスステータスコードとエラーを処理する
API レスポンスから JSON データを解析する
BeautifulSoup を使用して HTML コンテンツから情報を抽出する
画像やファイルなどのバイナリコンテンツをダウンロードして処理する
レスポンスヘッダーとメタデータを操作する

これらのスキルは、Web スクレイピング、API 統合、データ収集、自動化など、多くの Python アプリケーションの基盤となります。Web サービスと対話し、Web サイトから有用な情報を抽出し、さまざまな種類の Web コンテンツを処理するアプリケーションを構築できるようになりました。

学習を続けるには、以下を検討するとよいでしょう。

保護された API にアクセスするための認証方法
特定のヘッダーまたはリクエスト形式を必要とする、より複雑な API の操作
データの収集と分析を行う完全な Web スクレイピングプロジェクトの構築
複数の API と統合する Python アプリケーションの作成

Web サイトをスクレイピングしたり、API を使用したりする場合は、利用規約を確認し、ブロックされないようにレート制限を遵守することが重要です。