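How to Parse Response Content from Python requests Calls - Web Data Extraction and Automation

Introduction

The Python requests library is a powerful tool for interacting with web services and APIs. In this tutorial, you will learn how to send HTTP requests with Python and parse the response data. By the end of this lab, you will be able to extract useful information from different types of API responses, which lets you build data-driven applications and automate web interactions.

Installing the Requests Library and Making a Basic Request

In this first step, you will install the Python requests library and make your first HTTP request to retrieve data from a public API.

Installing Requests

The requests library is a third-party package that must be installed with pip, Python's package installer. Let's start by installing it.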

pip install requests

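You should see output confirming that requests was installed successfully.

Making Your First HTTP Request

Now let's create a Python file to make a simple HTTP request. In the WebIDE, create a new file named basic_request.py in the /home/labex/project directory.

Add the following code to the file.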

import requests

## Make a GET request to a public API
response = requests.get("https://jsonplaceholder.typicode.com/todos/1")

## Print the status code
print(f"Status code: {response.status_code}")

## Print the raw response content
print("\nRaw response content:")
print(response.text)

## Print the response headers
print("\nResponse headers:")
for header, value in response.headers.items():
    print(f"{header}: {value}")

This code sends a GET request to a sample API endpoint and prints information about the response.

Understanding the Response Object

Let's run the code to see what information comes back. In the terminal, run:

python basic_request.py

You should see output similar to the following.

Status code: 200

Raw response content:
{
  "userId": 1,
  "id": 1,
  "title": "delectus aut autem",
  "completed": false
}

Response headers:
Date: Sun, 01 Jan 2023 12:00:00 GMT
Content-Type: application/json; charset=utf-8
...

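The response object has several important attributes.

status_code: the HTTP status code (200 means success)
text: the response body decoded as a string
headers: a dictionary-like object of response headers (lookups are case-insensitive)

When you work with web requests, these attributes help you understand the server's response and handle it appropriately.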
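
As a small illustration of reading a header value, the sketch below splits a Content-Type string like the one in the sample output into its media type and charset. It uses only the standard library's email.message.Message. The helper name parse_content_type is hypothetical and is not part of requests.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media_type, charset).

    parse_content_type is a hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() returns the lowercased "type/subtype" part;
    # get_content_charset() returns the charset parameter, or None if absent.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("application/json; charset=utf-8")
print(media_type)
print(charset)
```

In a real script, the input would typically come from response.headers.get('Content-Type', '').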

HTTP Status Codes

HTTP status codes indicate whether a request succeeded or failed.

2xx (such as 200): Success
3xx (such as 301): Redirection
4xx (such as 404): Client error
5xx (such as 500): Server error
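
The ranges above can be expressed as a small helper. This is a minimal sketch that uses only the standard library. describe_status is a hypothetical name chosen for illustration, and http.HTTPStatus supplies the standard reason phrase for each code.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label for the class an HTTP status code belongs to."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```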

Let's modify the code to check for a successful response. Create a new file named check_status.py with the following content.

import requests

try:
    ## Make a GET request to a valid URL
    response = requests.get("https://jsonplaceholder.typicode.com/todos/1")

    ## Check if the request was successful
    if response.status_code == 200:
        print("Request successful!")
    else:
        print(f"Request failed with status code: {response.status_code}")

    ## Try an invalid URL
    invalid_response = requests.get("https://jsonplaceholder.typicode.com/invalid")
    print(f"Invalid URL status code: {invalid_response.status_code}")

except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

Run this code to see how different URLs return different status codes.

python check_status.py

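You should see that the valid URL returns status code 200 and the invalid URL returns status code 404. Note that a 404 response does not raise an exception by itself; the except block only handles failures such as connection errors or timeouts.

Parsing JSON Response Data

Many modern APIs return data in JSON (JavaScript Object Notation) format. In this step, you will learn how to parse JSON responses and work with the data in Python.

Understanding JSON

JSON is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is built on key-value pairs, similar to a Python dictionary.

Here is an example of a JSON object.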

{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com",
  "is_active": true,
  "hobbies": ["reading", "swimming", "cycling"]
}
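
To make the mapping to Python types concrete, the sketch below decodes the same object with the standard library's json module. The variable names raw and data are chosen for illustration.

```python
import json

raw = """
{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com",
  "is_active": true,
  "hobbies": ["reading", "swimming", "cycling"]
}
"""

data = json.loads(raw)

# JSON object -> dict, number -> int, true -> True, array -> list
print(type(data).__name__)
print(data["is_active"])
print(data["hobbies"][0])
```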

Parsing a JSON Response

The requests library makes it easy to parse JSON responses with the .json() method. Let's create a new file named parse_json.py and add the following code.

import requests

## Make a request to a GitHub API endpoint that returns JSON data
response = requests.get("https://api.github.com/users/python")

## Check if the request was successful
if response.status_code == 200:
    ## Parse the JSON response
    data = response.json()

    ## Print the parsed data
    print("Parsed JSON data:")
    print(f"Username: {data['login']}")
    print(f"Name: {data.get('name', 'Not provided')}")
    print(f"Followers: {data['followers']}")
    print(f"Public repositories: {data['public_repos']}")

    ## Print the type to verify it's a Python dictionary
    print(f"\nType of parsed data: {type(data)}")

    ## Access another top-level field
    print("\nAccessing specific elements:")
    print(f"Avatar URL: {data['avatar_url']}")
else:
    print(f"Request failed with status code: {response.status_code}")

Run this script to see how the JSON data is parsed into a Python dictionary.

python parse_json.py

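You should see output showing information about the GitHub user, including the username, follower count, and number of public repositories.

Working with Lists of Data

Many APIs return a list of objects. Let's look at how to handle this kind of response. Create a file named json_list.py with the following content.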

import requests

## Make a request to an API that returns a list of posts
response = requests.get("https://jsonplaceholder.typicode.com/posts")

## Check if the request was successful
if response.status_code == 200:
    ## Parse the JSON response (this will be a list of posts)
    posts = response.json()

    ## Print the total number of posts
    print(f"Total posts: {len(posts)}")

    ## Print details of the first 3 posts
    print("\nFirst 3 posts:")
    for i, post in enumerate(posts[:3], 1):
        print(f"\nPost #{i}")
        print(f"User ID: {post['userId']}")
        print(f"Post ID: {post['id']}")
        print(f"Title: {post['title']}")
        print(f"Body: {post['body'][:50]}...")  ## Print just the beginning of the body
else:
    print(f"Request failed with status code: {response.status_code}")

Run this script to see how to process a list of JSON objects.

python json_list.py

You should see information about the first three posts, including the title and the beginning of each body.
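
Once the response is a Python list of dictionaries, ordinary list and dictionary tools apply. The sketch below works on a small hand-written sample shaped like the posts list. The sample data and the helper names count_posts_per_user and titles_for_user are assumptions for illustration, not part of the API.

```python
from collections import Counter

# Hypothetical sample shaped like the list returned by response.json()
posts = [
    {"userId": 1, "id": 1, "title": "first post", "body": "body one"},
    {"userId": 1, "id": 2, "title": "second post", "body": "body two"},
    {"userId": 2, "id": 3, "title": "third post", "body": "body three"},
]


def count_posts_per_user(posts):
    """Count how many posts each userId has."""
    return Counter(post["userId"] for post in posts)


def titles_for_user(posts, user_id):
    """Return the titles of all posts written by one user."""
    return [post["title"] for post in posts if post["userId"] == user_id]


print(count_posts_per_user(posts))
print(titles_for_user(posts, 1))
```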

Error Handling with JSON Parsing

Sometimes a response does not contain valid JSON. Let's look at how to handle that gracefully. Create a file named json_error.py with this code.

import requests

def get_and_parse_json(url):
    try:
        ## Make the request
        response = requests.get(url)

        ## Check if the request was successful
        response.raise_for_status()

        ## Try to parse the JSON
        try:
            data = response.json()
            return data
        except ValueError:  ## response.json() raises a ValueError subclass on invalid JSON
            print(f"Response from {url} is not valid JSON")
            print(f"Raw response: {response.text[:100]}...")  ## Print part of the raw response
            return None

    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e}")
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")

    return None

## Test with a valid JSON endpoint
json_data = get_and_parse_json("https://jsonplaceholder.typicode.com/posts/1")
if json_data:
    print("\nValid JSON response:")
    print(f"Title: {json_data['title']}")

## Test with a non-JSON endpoint
html_data = get_and_parse_json("https://www.example.com")
if html_data:
    print("\nThis should not print as example.com returns HTML, not JSON")
else:
    print("\nAs expected, could not parse HTML as JSON")

Run this script to see how it handles different types of responses.

python json_error.py

You should see that the code handles both the valid JSON response and the non-JSON response without crashing.
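
The same guard can be tried without a network call. This minimal sketch applies the standard library's json.loads to a string, standing in for a response body. parse_json_text is a hypothetical helper name.

```python
import json


def parse_json_text(text):
    """Return the decoded JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # msg, lineno and colno describe where decoding failed
        print(f"Not valid JSON: {err.msg} (line {err.lineno}, column {err.colno})")
        return None


good = parse_json_text('{"title": "hello"}')
bad = parse_json_text("<!doctype html><html></html>")
print(good)
print(bad)
```

One design caveat: returning None for failures cannot be told apart from a body that is the JSON literal null, so code that needs that distinction may prefer to let the exception propagate.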

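Parsing HTML Content with BeautifulSoup

When working with web data, you will often encounter HTML responses. Python's BeautifulSoup library is a good tool for parsing HTML. In this step, you will learn how to extract information from HTML responses.

Installing BeautifulSoup

First, let's install BeautifulSoup. The examples below use html.parser, the HTML parser built into Python's standard library, so no separate parser package is required.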

pip install beautifulsoup4

Basic HTML Parsing

Let's create a file named parse_html.py that fetches a web page and parses it.

import requests
from bs4 import BeautifulSoup

## Make a request to a webpage
url = "https://www.example.com"
response = requests.get(url)

## Check if the request was successful
if response.status_code == 200:
    ## Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    ## Extract the page title
    title = soup.title.text
    print(f"Page title: {title}")

    ## Extract all paragraphs
    paragraphs = soup.find_all('p')
    print(f"\nNumber of paragraphs: {len(paragraphs)}")

    ## Print the text of the first paragraph
    if paragraphs:
        print(f"\nFirst paragraph text: {paragraphs[0].text.strip()}")

    ## Extract all links
    links = soup.find_all('a')
    print(f"\nNumber of links: {len(links)}")

    ## Print the href attribute of the first link
    if links:
        print(f"First link href: {links[0].get('href')}")

else:
    print(f"Request failed with status code: {response.status_code}")

Run this script to see how to extract basic information from an HTML page.

python parse_html.py

You should see output showing the page title, the number of paragraphs, the text of the first paragraph, the number of links, and the URL of the first link.
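
For comparison, the same kind of extraction can be sketched with only the standard library's html.parser.HTMLParser, which is an event-based parser rather than BeautifulSoup's tree API. This is not how the lab's scripts work. The class name LinkAndTitleParser and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collect the <title> text and every <a href> value from a document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it
        if self._in_title:
            self.title += data


sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p>First paragraph.</p>
    <a href="https://example.org/more">More</a>
  </body>
</html>
"""

parser = LinkAndTitleParser()
parser.feed(sample_html)
print(parser.title)
print(parser.links)
```

The event-based style needs explicit state such as _in_title, which is one reason a tree-based library like BeautifulSoup tends to be more convenient for scraping.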

Finding Specific Elements

Now let's look at how to find specific elements with CSS selectors. Create a file named html_selectors.py.

import requests
from bs4 import BeautifulSoup

## Make a request to a webpage with more complex structure
url = "https://quotes.toscrape.com/"
response = requests.get(url)

## Check if the request was successful
if response.status_code == 200:
    ## Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    ## Find all quote elements
    quote_elements = soup.select('.quote')
    print(f"Number of quotes found: {len(quote_elements)}")

    ## Process the first 3 quotes
    print("\nFirst 3 quotes:")
    for i, quote_element in enumerate(quote_elements[:3], 1):
        ## Extract the quote text
        text = quote_element.select_one('.text').text

        ## Extract the author
        author = quote_element.select_one('.author').text

        ## Extract the tags
        tags = [tag.text for tag in quote_element.select('.tag')]

        ## Print the information
        print(f"\nQuote #{i}")
        print(f"Text: {text}")
        print(f"Author: {author}")
        print(f"Tags: {', '.join(tags)}")

else:
    print(f"Request failed with status code: {response.status_code}")

Run this script to see how to extract specific elements with CSS selectors.

python html_selectors.py

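You should see output showing information about the first three quotes, including the quote text, author, and tags.

Building a Simple Web Scraper

Let's put everything together to build a simple web scraper that extracts structured data from a web page. Create a file named quotes_scraper.py.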

import requests
from bs4 import BeautifulSoup
import json
import os

def scrape_quotes_page(url):
    ## Make a request to the webpage
    response = requests.get(url)

    ## Check if the request was successful
    if response.status_code != 200:
        print(f"Request failed with status code: {response.status_code}")
        return None

    ## Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    ## Extract all quotes
    quotes = []
    for quote_element in soup.select('.quote'):
        ## Extract the quote text and strip the surrounding quotation marks (curly or straight)
        text = quote_element.select_one('.text').text.strip('“”"')

        ## Extract the author
        author = quote_element.select_one('.author').text

        ## Extract the tags
        tags = [tag.text for tag in quote_element.select('.tag')]

        ## Add the quote to our list
        quotes.append({
            'text': text,
            'author': author,
            'tags': tags
        })

    ## Check if there's a next page
    next_page = soup.select_one('.next a')
    next_page_url = None
    if next_page:
        next_page_url = 'https://quotes.toscrape.com' + next_page['href']

    return {
        'quotes': quotes,
        'next_page': next_page_url
    }

## Scrape the first page
result = scrape_quotes_page('https://quotes.toscrape.com/')

if result:
    ## Print information about the quotes found
    quotes = result['quotes']
    print(f"Found {len(quotes)} quotes on the first page")

    ## Print the first 2 quotes
    print("\nFirst 2 quotes:")
    for i, quote in enumerate(quotes[:2], 1):
        print(f"\nQuote #{i}")
        print(f"Text: {quote['text']}")
        print(f"Author: {quote['author']}")
        print(f"Tags: {', '.join(quote['tags'])}")

    ## Save the quotes to a JSON file
    output_dir = '/home/labex/project'
    with open(os.path.join(output_dir, 'quotes.json'), 'w') as f:
        json.dump(quotes, f, indent=2)

    print(f"\nSaved {len(quotes)} quotes to {output_dir}/quotes.json")

    ## Print information about the next page
    if result['next_page']:
        print(f"\nNext page URL: {result['next_page']}")
    else:
        print("\nNo next page available")

Run this script to scrape quotes from the website.

python quotes_scraper.py

You should see output showing information about the quotes found on the first page, and the quotes are saved to a JSON file named quotes.json.

To view the structured data, inspect the JSON file.

cat quotes.json

The file should contain a JSON array of quote objects, each with text, author, and tags properties.
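
A saved file like this can be read back with json.load for later processing. The sketch below performs a round trip in a temporary directory. The sample quotes are hypothetical stand-ins with the same three keys, and ensure_ascii=False is an optional choice that keeps non-ASCII characters readable in the file.

```python
import json
import os
import tempfile

# Hypothetical sample shaped like the scraper's output
quotes = [
    {"text": "An example quote.", "author": "Example Author", "tags": ["sample", "demo"]},
    {"text": "Another example.", "author": "Second Author", "tags": []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "quotes.json")

    # Write the list as a JSON array
    with open(path, "w", encoding="utf-8") as f:
        json.dump(quotes, f, indent=2, ensure_ascii=False)

    # Read it back into Python objects
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

authors = sorted({quote["author"] for quote in loaded})
print(len(loaded))
print(authors)
```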

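Working with Binary Response Content

So far, we have focused on text-based responses such as JSON and HTML. However, the requests library can also handle binary content such as images, PDFs, and other files. In this step, you will learn how to download and process binary content.

Downloading an Image

Let's start by downloading an image. Create a file named download_image.py.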

import requests
import os

## URL of an image to download
image_url = "https://httpbin.org/image/jpeg"

## Make a request to get the image
response = requests.get(image_url)

## Check if the request was successful
if response.status_code == 200:
    ## Get the content type
    content_type = response.headers.get('Content-Type', '')
    print(f"Content-Type: {content_type}")

    ## Check if the content is an image
    if 'image' in content_type:
        ## Create a directory to save the image if it doesn't exist
        output_dir = '/home/labex/project/downloads'
        os.makedirs(output_dir, exist_ok=True)

        ## Save the image to a file
        image_path = os.path.join(output_dir, 'sample_image.jpg')
        with open(image_path, 'wb') as f:
            f.write(response.content)

        ## Print information about the saved image
        print(f"Image saved to: {image_path}")
        print(f"Image size: {len(response.content)} bytes")
    else:
        print("The response does not contain an image")
else:
    print(f"Request failed with status code: {response.status_code}")

Run this script to download the image.

python download_image.py

You should see output confirming that the image was downloaded and saved to /home/labex/project/downloads/sample_image.jpg.
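
The script trusts the Content-Type header to decide whether the body is an image. As an optional extra check, the first bytes of the content can be compared against well-known file signatures. This is a minimal sketch, and the helper names looks_like_jpeg and looks_like_png are hypothetical.

```python
def looks_like_jpeg(data):
    """Return True if the bytes start with the JPEG start-of-image marker."""
    return data[:3] == b"\xff\xd8\xff"


def looks_like_png(data):
    """Return True if the bytes start with the 8-byte PNG signature."""
    return data[:8] == b"\x89PNG\r\n\x1a\n"


# Stand-ins for response.content; only the leading bytes matter here
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16
fake_html = b"<!doctype html><html></html>"

print(looks_like_jpeg(fake_jpeg))
print(looks_like_jpeg(fake_html))
```

In the download script, a check like looks_like_jpeg(response.content) could run before the file is written.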

Downloading a File with Progress Display

When downloading large files, it can be useful to show a progress indicator. Let's create a script that displays download progress. Create a file named download_with_progress.py.

import requests
import os
import sys

def download_file(url, filename):
    ## Make a request to get the file
    ## Stream the response to handle large files efficiently
    response = requests.get(url, stream=True)

    ## Check if the request was successful
    if response.status_code != 200:
        print(f"Request failed with status code: {response.status_code}")
        return False

    ## Get the total file size if available
    total_size = int(response.headers.get('Content-Length', 0))
    if total_size:
        print(f"Total file size: {total_size/1024:.2f} KB")
    else:
        print("Content-Length header not found. Unable to determine file size.")

    ## Create a directory to save the file if it doesn't exist
    os.makedirs(os.path.dirname(filename), exist_ok=True)

    ## Download the file in chunks and show progress
    print(f"Downloading {url} to {filename}...")

    ## Initialize variables for progress tracking
    downloaded = 0
    chunk_size = 8192  ## 8 KB chunks

    ## Open the file for writing
    with open(filename, 'wb') as f:
        ## Iterate through the response chunks
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  ## Filter out keep-alive chunks
                f.write(chunk)
                downloaded += len(chunk)

                ## Calculate and display progress
                if total_size:
                    percent = downloaded * 100 / total_size
                    sys.stdout.write(f"\rProgress: {percent:.1f}% ({downloaded/1024:.1f} KB)")
                    sys.stdout.flush()
                else:
                    sys.stdout.write(f"\rDownloaded: {downloaded/1024:.1f} KB")
                    sys.stdout.flush()

    ## Print a newline to ensure the next output starts on a new line
    print()

    return True

## URL of a file to download
file_url = "https://speed.hetzner.de/100MB.bin"

## Path where the file will be saved
output_path = '/home/labex/project/downloads/test_file.bin'

## Download the file
success = download_file(file_url, output_path)

if success:
    ## Get file stats
    file_size = os.path.getsize(output_path)
    print(f"\nDownload complete!")
    print(f"File saved to: {output_path}")
    print(f"File size: {file_size/1024/1024:.2f} MB")
else:
    print("\nDownload failed.")

Run this script to download a file with progress display.

python download_with_progress.py

You should see the progress line update as the file downloads. This script downloads a 100 MB file, so it may take a while depending on your connection speed.

To cancel the download, press Ctrl+C.
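
The chunk-and-count logic can be tried offline. In this sketch an in-memory io.BytesIO object stands in for the streamed response, so no large download is needed. copy_in_chunks is a hypothetical helper, and it assumes total_size is greater than zero.

```python
import io


def copy_in_chunks(source, destination, total_size, chunk_size=8192):
    """Copy source to destination chunk by chunk, yielding percent complete.

    Assumes total_size > 0 (for example, a known Content-Length).
    """
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        copied += len(chunk)
        yield copied * 100 / total_size


payload = b"x" * 20000  # stand-in for a downloaded file body
source = io.BytesIO(payload)
destination = io.BytesIO()

progress = [round(p, 1) for p in copy_in_chunks(source, destination, len(payload))]
print(progress)
```

With a 20000-byte payload and 8192-byte chunks, the loop runs three times, and the final progress value reaches 100 percent.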

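Working with Response Headers and Metadata

When you download files, the response headers often contain useful metadata. Let's create a script that inspects response headers in detail. Create a file named response_headers.py.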

import requests

def check_url(url):
    print(f"\nChecking URL: {url}")

    try:
        ## Make a HEAD request first to get headers without downloading the full content
        head_response = requests.head(url)

        print(f"HEAD request status code: {head_response.status_code}")

        if head_response.status_code == 200:
            ## Print all headers
            print("\nResponse headers:")
            for header, value in head_response.headers.items():
                print(f"  {header}: {value}")

            ## Extract content type and size
            content_type = head_response.headers.get('Content-Type', 'Unknown')
            content_length = head_response.headers.get('Content-Length', 'Unknown')

            print(f"\nContent Type: {content_type}")

            if content_length != 'Unknown':
                size_kb = int(content_length) / 1024
                size_mb = size_kb / 1024

                if size_mb >= 1:
                    print(f"Content Size: {size_mb:.2f} MB")
                else:
                    print(f"Content Size: {size_kb:.2f} KB")
            else:
                print("Content Size: Unknown")

            ## Check if the server supports range requests
            accept_ranges = head_response.headers.get('Accept-Ranges', 'none')
            print(f"Supports range requests: {'Yes' if accept_ranges != 'none' else 'No'}")

        else:
            print(f"HEAD request failed with status code: {head_response.status_code}")

    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")

## Check a few different URLs
check_url("https://httpbin.org/image/jpeg")
check_url("https://speed.hetzner.de/100MB.bin")
check_url("https://example.com")

Run this script to see detailed information about the response headers.

python response_headers.py

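You should see output showing headers for different types of content, including an image, a binary file, and an HTML page.

Understanding response headers is important for many web development tasks, such as:

Determining the file type and size before downloading
Implementing resumable downloads with range requests
Checking caching policies and expiration dates
Handling redirects and authentication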
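
As one example from this list, a resumable download starts by asking the server only for the bytes that are not yet on disk. The sketch below builds the Range request header from the size of a partial file. build_resume_headers is a hypothetical helper, and the temporary file stands in for an interrupted download.

```python
import os
import tempfile


def build_resume_headers(partial_path):
    """Return request headers asking for the bytes not yet on disk."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    if offset == 0:
        # Nothing saved yet, so request the whole file
        return {}
    # "bytes=N-" requests everything from byte offset N to the end
    return {"Range": f"bytes={offset}-"}


with tempfile.TemporaryDirectory() as tmp_dir:
    partial_path = os.path.join(tmp_dir, "test_file.bin.part")

    # Simulate an interrupted download that saved the first 1024 bytes
    with open(partial_path, "wb") as f:
        f.write(b"\x00" * 1024)

    headers = build_resume_headers(partial_path)

print(headers)
```

These headers could then be passed as the headers argument to requests.get, with the file opened in append mode ('ab'). A server that honors the range replies with status 206 (Partial Content); a 200 reply means it sent the whole file, so the download should start over.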

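Summary

In this lab, you learned how to use the Python requests library to interact with web services and APIs. You now have the following skills:

Making HTTP requests and handling response status codes and errors.
Parsing JSON data from API responses.
Extracting information from HTML content with BeautifulSoup.
Downloading and processing binary content such as images and files.
Working with response headers and metadata.

These skills form the foundation of many Python applications, including web scraping, API integration, data collection, and automation. You can now build applications that interact with web services, extract useful information from websites, and handle different types of web content.

To continue learning, you could explore:

Authentication methods for accessing protected APIs
Working with more complex APIs that require specific headers or request formats
Building a complete web scraping project that collects and analyzes data
Creating Python applications that integrate with multiple APIs

When scraping websites or using APIs, check the terms of service and respect rate limits so that you do not get blocked.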