Best Practices for Extracting Nested JSON Data in Python


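Introduction

Navigating nested JSON objects and extracting data from them is a common task in data processing. Whether you work with APIs, configuration files, or data stores, knowing how to pull values out of complex JSON structures efficiently is an essential skill for any Python developer.

In this lab, you will learn practical techniques for extracting values from nested JSON objects in Python. You will explore several approaches, from basic indexing to more robust methods that handle missing keys gracefully. By the end of the lab, you will have hands-on experience with best practices for handling nested JSON data in Python projects.

Creating and Understanding Nested JSON Objects

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is human-readable and easy for machines to parse. In Python, parsed JSON objects become dictionaries and JSON arrays become lists.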
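To make the mapping between JSON and Python types concrete, here is a minimal sketch that parses a small document with json.loads and records the Python type of each value. The document text and the variable names are made up for illustration and are not part of the lab files.

```python
import json

# A small JSON document that uses every JSON value type (illustrative data).
text = '{"name": "Ada", "age": 36, "score": 9.5, "tags": ["a", "b"], "active": true, "manager": null}'

parsed = json.loads(text)

# Record the name of the Python type the parser produced for each value.
type_names = {key: type(value).__name__ for key, value in parsed.items()}

print(type(parsed).__name__)  # dict
for key, type_name in type_names.items():
    print(f"{key}: {type_name}")
```

JSON objects become dict, arrays become list, strings become str, numbers become int or float, true and false become bool, and null becomes None.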

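Let's start by creating a sample nested JSON object to use throughout the lab.

Creating a Sample JSON File

  1. Open the WebIDE and create a new file named sample.json in the /home/labex/project directory.

  2. Copy the following JSON content into the file: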

{
  "person": {
    "name": "John Doe",
    "age": 35,
    "contact": {
      "email": "john.doe@example.com",
      "phone": "555-123-4567"
    },
    "address": {
      "street": "123 Main St",
      "city": "Anytown",
      "state": "CA",
      "zip": "12345"
    },
    "hobbies": ["reading", "hiking", "photography"],
    "employment": {
      "company": "Tech Solutions Inc.",
      "position": "Software Developer",
      "years": 5,
      "projects": [
        {
          "name": "Project Alpha",
          "status": "completed"
        },
        {
          "name": "Project Beta",
          "status": "in-progress"
        }
      ]
    }
  }
}
  3. Save the file.

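Understanding the JSON Structure

This JSON object represents a person with various attributes. The structure includes:

  • Simple key-value pairs (name, age)
  • Nested objects (contact, address, employment)
  • An array (hobbies)
  • An array of objects (projects)

Understanding the structure of your JSON data is the first step toward extracting values from it effectively. Let's visualize the structure: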

person
 ├── name
 ├── age
 ├── contact
 │   ├── email
 │   └── phone
 ├── address
 │   ├── street
 │   ├── city
 │   ├── state
 │   └── zip
 ├── hobbies [array]
 └── employment
     ├── company
     ├── position
     ├── years
     └── projects [array of objects]
         ├── name
         └── status
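A tree like the one above can also be produced programmatically when the structure is not known in advance. The following is a minimal sketch; list_paths is a hypothetical helper (not part of the lab files), and the sample dictionary is a trimmed-down copy of the lab data.

```python
def list_paths(node, prefix=""):
    """Return the path of every leaf value in a parsed JSON tree."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            paths.extend(list_paths(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            paths.extend(list_paths(item, f"{prefix}[{index}]"))
    else:
        # A leaf value (string, number, boolean, or None) ends the path.
        paths.append(prefix)
    return paths


sample = {
    "person": {
        "name": "John Doe",
        "contact": {"email": "john.doe@example.com"},
        "hobbies": ["reading", "hiking"],
        "employment": {"projects": [{"name": "Project Alpha"}]},
    }
}

for path in list_paths(sample):
    print(path)
```

Note that empty dictionaries and empty lists contribute no paths in this sketch, because only leaf values are reported.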

Loading JSON in Python

Now let's create a Python script that loads this JSON file. Create a new file named json_basics.py in the same directory:

import json

## Load JSON from file
with open('sample.json', 'r') as file:
    data = json.load(file)

## Verify data loaded successfully
print("JSON data loaded successfully!")
print(f"Type of data: {type(data)}")
print(f"Keys at the top level: {list(data.keys())}")
print(f"Keys in the person object: {list(data['person'].keys())}")

Run the script with the following command:

python3 json_basics.py

You should see output similar to the following:

JSON data loaded successfully!
Type of data: <class 'dict'>
Keys at the top level: ['person']
Keys in the person object: ['name', 'age', 'contact', 'address', 'hobbies', 'employment']

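This confirms that the JSON file was loaded correctly as a Python dictionary. In the next step, we will explore different ways to extract values from this nested structure.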
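Loading can itself fail if the file is missing or does not contain valid JSON. The sketch below shows one way to handle both cases; load_json_file is a hypothetical helper, and the temporary files exist only so that the example does not touch sample.json.

```python
import json
import os
import tempfile


def load_json_file(path, default=None):
    """Return the parsed JSON document at path, or default if it cannot be read."""
    try:
        with open(path, "r", encoding="utf-8") as file:
            return json.load(file)
    except FileNotFoundError:
        print(f"File not found: {path}")
    except json.JSONDecodeError as error:
        print(f"Invalid JSON in {path}: {error}")
    return default


# Demonstrate with temporary files (illustrative names only).
with tempfile.TemporaryDirectory() as folder:
    good_path = os.path.join(folder, "good.json")
    bad_path = os.path.join(folder, "bad.json")
    with open(good_path, "w", encoding="utf-8") as file:
        file.write('{"person": {"name": "John Doe"}}')
    with open(bad_path, "w", encoding="utf-8") as file:
        file.write('{"person": ')  # truncated document

    good = load_json_file(good_path, default={})
    bad = load_json_file(bad_path, default={})
    missing = load_json_file(os.path.join(folder, "missing.json"), default={})

print(good)
print(bad, missing)
```

Whether to fall back to a default or let the exception propagate depends on the application; silently continuing with an empty dictionary is only appropriate when missing data is an expected condition.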

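Basic Methods for Accessing JSON Data

Now that the JSON data is loaded, let's explore the basic ways to access values in a nested JSON object.

Direct Indexing with Square Brackets

The most straightforward way to access a value in a nested JSON object is to use square brackets with the appropriate keys. Create a new file named direct_access.py: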

import json

## Load JSON from file
with open('sample.json', 'r') as file:
    data = json.load(file)

## Access simple values
name = data["person"]["name"]
age = data["person"]["age"]
print(f"Name: {name}, Age: {age}")

## Access nested values
email = data["person"]["contact"]["email"]
city = data["person"]["address"]["city"]
print(f"Email: {email}, City: {city}")

## Access array values
first_hobby = data["person"]["hobbies"][0]
all_hobbies = data["person"]["hobbies"]
print(f"First hobby: {first_hobby}")
print(f"All hobbies: {all_hobbies}")

## Access values in arrays of objects
first_project_name = data["person"]["employment"]["projects"][0]["name"]
first_project_status = data["person"]["employment"]["projects"][0]["status"]
print(f"First project: {first_project_name}, Status: {first_project_status}")

Run the script:

python3 direct_access.py

You should see output similar to the following:

Name: John Doe, Age: 35
Email: john.doe@example.com, City: Anytown
First hobby: reading
All hobbies: ['reading', 'hiking', 'photography']
First project: Project Alpha, Status: completed

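The Problem with Direct Indexing

Direct indexing works well when you know the exact structure of your JSON data and are certain that every key exists. If a key is missing, however, it raises a KeyError exception.

Let's demonstrate the problem by creating a file named error_demo.py: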

import json

## Load JSON from file
with open('sample.json', 'r') as file:
    data = json.load(file)

try:
    ## Try to access a key that doesn't exist
    occupation = data["person"]["occupation"]
    print(f"Occupation: {occupation}")
except KeyError as e:
    print(f"Error occurred: KeyError - {e}")

## Now let's try a nested key that doesn't exist
try:
    salary = data["person"]["employment"]["salary"]
    print(f"Salary: {salary}")
except KeyError as e:
    print(f"Error occurred: KeyError - {e}")

Run the script:

python3 error_demo.py

You should see output similar to the following:

Error occurred: KeyError - 'occupation'
Error occurred: KeyError - 'salary'

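As you can see, direct indexing raises an exception when a key does not exist. In real-world applications, especially when working with external APIs or user-generated data, the structure of JSON objects can vary. In the next step, we will explore safer ways to access nested JSON data that handle missing keys gracefully.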
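KeyError is not the only exception direct indexing can raise. The sketch below uses a small inline dictionary (illustrative data, including a hypothetical "manager" field set to null) to show the three failure modes that typically appear with nested JSON: a missing key, a list index out of range, and indexing into None.

```python
data = {"person": {"name": "John Doe", "hobbies": ["reading"], "manager": None}}


def describe_failure(accessor):
    """Call accessor() and return the name of the exception raised, or 'ok'."""
    try:
        accessor()
        return "ok"
    except (KeyError, IndexError, TypeError) as exc:
        return type(exc).__name__


results = {
    "missing key": describe_failure(lambda: data["person"]["occupation"]),
    "index out of range": describe_failure(lambda: data["person"]["hobbies"][5]),
    "indexing into null": describe_failure(lambda: data["person"]["manager"]["name"]),
    "valid path": describe_failure(lambda: data["person"]["name"]),
}

for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

A try/except around direct indexing therefore needs to catch all three exception types if the data may contain short lists or null values.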

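Safe Methods for Accessing Nested JSON Data

In real-world applications, you often have to handle keys that may be missing or JSON structures that may change. Let's explore safer ways to access nested JSON data.

Using the get() Method

The dictionary get() method lets you supply a default value that is returned when a key is not found, which avoids a KeyError exception. Create a file named safe_access.py: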

import json

## Load JSON from file
with open('sample.json', 'r') as file:
    data = json.load(file)

## Using get() for safer access
name = data.get("person", {}).get("name", "Unknown")
## If the "occupation" key doesn't exist, "Not specified" will be returned
occupation = data.get("person", {}).get("occupation", "Not specified")

print(f"Name: {name}")
print(f"Occupation: {occupation}")

## Accessing deeply nested values with get()
company = data.get("person", {}).get("employment", {}).get("company", "Unknown")
salary = data.get("person", {}).get("employment", {}).get("salary", "Not specified")

print(f"Company: {company}")
print(f"Salary: {salary}")

## Providing default values for arrays
hobbies = data.get("person", {}).get("hobbies", [])
skills = data.get("person", {}).get("skills", ["No skills listed"])

print(f"Hobbies: {', '.join(hobbies)}")
print(f"Skills: {', '.join(skills)}")

Run the script:

python3 safe_access.py

You should see output similar to the following:

Name: John Doe
Occupation: Not specified
Company: Tech Solutions Inc.
Salary: Not specified
Hobbies: reading, hiking, photography
Skills: No skills listed

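Notice that we received no errors even though keys such as "occupation" and "skills" do not exist in our JSON. Instead, we got the default values we specified.

Chaining get() for Deeply Nested Data

With deeply nested JSON structures, chaining several get() calls can become verbose and hard to read. Let's write a more readable version that uses intermediate variables. Create a file named chained_get.py: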

import json

## Load JSON from file
with open('sample.json', 'r') as file:
    data = json.load(file)

## Step-by-step approach for deeply nested values
person = data.get("person", {})
employment = person.get("employment", {})
projects = employment.get("projects", [])

## Get the first project or an empty dict if projects list is empty
first_project = projects[0] if projects else {}
project_name = first_project.get("name", "No project")
project_status = first_project.get("status", "Unknown")

print(f"Project: {project_name}, Status: {project_status}")

## Let's try a non-existent path
person = data.get("person", {})
education = person.get("education", {})
degree = education.get("degree", "Not specified")

print(f"Degree: {degree}")

Run the script:

python3 chained_get.py

You should see output similar to the following:

Project: Project Alpha, Status: completed
Degree: Not specified

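This approach is more readable because each nesting level is broken out into its own variable. It is also safe against missing keys because a default is supplied at every level. Note that a default only applies when the key is absent; a key that is present with a JSON null value still yields None.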
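One caveat of get() is worth illustrating: the default is used only when the key is absent, not when the key is present with a null value. The sketch below uses a made-up payload in which "employment" is null and shows the common "or {}" idiom as one possible workaround.

```python
# Illustrative payload: the key exists, but its value is JSON null.
payload = {"person": {"name": "John Doe", "employment": None}}

person = payload.get("person", {})

# The default {} is NOT used here, because the key is present.
employment = person.get("employment", {})
print(employment)  # None

# Calling employment.get("company") at this point would raise AttributeError.
# Falling back with "or" replaces None (and any other falsy value) with {}.
safe_employment = person.get("employment") or {}
company = safe_employment.get("company", "Unknown")
print(company)  # Unknown
```

Because "or" also replaces other falsy values such as 0, an empty string, or an empty list, this idiom is best reserved for levels that are expected to hold a dictionary or a list.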

Using defaultdict

Another approach is Python's defaultdict (from the collections module), which automatically supplies a default value for missing keys. Create a file named default_dict.py:

import json
from collections import defaultdict

## Function to create a nested defaultdict
def nested_defaultdict():
    return defaultdict(nested_defaultdict)

## Load JSON from file
with open('sample.json', 'r') as file:
    regular_data = json.load(file)

## Convert to defaultdict
def dict_to_defaultdict(d):
    if not isinstance(d, dict):
        return d
    result = defaultdict(nested_defaultdict)
    for k, v in d.items():
        if isinstance(v, dict):
            result[k] = dict_to_defaultdict(v)
        elif isinstance(v, list):
            result[k] = [dict_to_defaultdict(item) if isinstance(item, dict) else item for item in v]
        else:
            result[k] = v
    return result

## Convert our data to defaultdict
data = dict_to_defaultdict(regular_data)

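## Now we can access keys that don't exist without errors
print(f"Name: {data['person']['name']}")
print(f"Occupation: {data['person']['occupation']}")  ## This key doesn't exist!
print(f"Education: {data['person']['education']['degree']}")  ## Deeply nested non-existent path

## defaultdict returns another defaultdict for missing keys, which might not be what you want
print(f"Type of data['person']['occupation']: {type(data['person']['occupation'])}")

## To get a specific default value, you would use get() even with defaultdict.
## Reading a missing key with [] inserts it, so 'occupation' now exists in the data;
## get() is therefore demonstrated on a key that has not been touched yet.
nickname = data['person'].get('nickname', 'Not specified')
print(f"Nickname with default: {nickname}")

Run the script:

python3 default_dict.py

You should see output similar to the following:

Name: John Doe
Occupation: defaultdict(<function nested_defaultdict at 0x...>, {})
Education: defaultdict(<function nested_defaultdict at 0x...>, {})
Type of data['person']['occupation']: <class 'collections.defaultdict'>
Nickname with default: Not specified

Although defaultdict prevents KeyError exceptions, it returns another defaultdict for a missing key, which may not be the default value you want. Reading a missing key with square brackets also inserts that key into the dictionary, so a later get() call on the same key returns the empty defaultdict rather than your default. This is why the get() method on a regular dictionary is usually the better way to supply specific default values.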
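The insertion side effect of defaultdict is easy to overlook, so here is a minimal, self-contained sketch of it. The record and its keys are illustrative only.

```python
from collections import defaultdict


def nested_defaultdict():
    """Create a defaultdict whose missing keys are themselves nested defaultdicts."""
    return defaultdict(nested_defaultdict)


record = nested_defaultdict()
record["name"] = "John Doe"

before = "occupation" in record   # the key is absent at this point
_ = record["occupation"]          # reading with [] creates the key
after = "occupation" in record    # the key now exists

# get() finds the freshly inserted empty defaultdict, not the fallback string.
fallback = record.get("occupation", "Not specified")

# get() on an untouched key returns the fallback and does not insert anything.
lookup_only = record.get("salary", "Not specified")

print(before, after)
print(type(fallback).__name__)
print(lookup_only, "salary" in record)
```

If the converted data is later serialized with json.dump, keys created by such accidental reads will appear in the output as empty objects.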

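In the next step, we will explore how to create a utility function that extracts values from deeply nested JSON safely and flexibly.

Creating Utility Functions for JSON Value Extraction

Now that we have explored different ways to access nested JSON data, let's create a utility function that makes it easier to extract values from complex nested structures. The function combines the safety of the get() method with the flexibility to handle different kinds of data.

A Path-Based Extraction Function

Create a new file named json_extractor.py: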

import json
from typing import Any, Dict, List

def extract_value(data: Dict, path: List[str], default: Any = None) -> Any:
    """
    Safely extract a value from a nested dictionary using a path list.

    Args:
        data: The dictionary to extract value from
        path: A list of keys representing the path to the value
        default: The default value to return if the path doesn't exist

    Returns:
        The value at the specified path or the default value if not found
    """
    current = data
    for key in path:
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current

## Load JSON from file
with open('sample.json', 'r') as file:
    data = json.load(file)

## Basic usage examples
name = extract_value(data, ["person", "name"], "Unknown")
age = extract_value(data, ["person", "age"], 0)
print(f"Name: {name}, Age: {age}")

## Extracting values that don't exist
occupation = extract_value(data, ["person", "occupation"], "Not specified")
print(f"Occupation: {occupation}")

## Extracting deeply nested values
email = extract_value(data, ["person", "contact", "email"], "No email")
phone = extract_value(data, ["person", "contact", "phone"], "No phone")
print(f"Email: {email}, Phone: {phone}")

## Extracting from arrays
hobbies = extract_value(data, ["person", "hobbies"], [])
## Check the type and emptiness once before indexing
first_hobby = hobbies[0] if isinstance(hobbies, list) and hobbies else "No hobbies"
print(f"First hobby: {first_hobby}")

## Extracting from arrays of objects
projects = extract_value(data, ["person", "employment", "projects"], [])
if isinstance(projects, list) and projects:
    first_project_name = extract_value(projects[0], ["name"], "Unknown project")
    first_project_status = extract_value(projects[0], ["status"], "Unknown status")
    print(f"First project: {first_project_name}, Status: {first_project_status}")
else:
    print("No projects found")

Run the script:

python3 json_extractor.py

You should see output similar to the following:

Name: John Doe, Age: 35
Occupation: Not specified
Email: john.doe@example.com, Phone: 555-123-4567
First hobby: reading
First project: Project Alpha, Status: completed
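The path-list approach handles dictionaries only, which is why the array examples above need extra checks. One possible extension is to allow integer steps in the path for list positions. The sketch below is an assumption about how that could look; extract_path is a hypothetical name and is not part of the lab files.

```python
from typing import Any, Sequence, Union


def extract_path(data: Any, path: Sequence[Union[str, int]], default: Any = None) -> Any:
    """Follow string keys through dicts and integer positions through lists."""
    current = data
    for step in path:
        if isinstance(current, dict) and step in current:
            current = current[step]
        elif (
            isinstance(current, list)
            and isinstance(step, int)
            and not isinstance(step, bool)  # bool is a subclass of int
            and -len(current) <= step < len(current)
        ):
            current = current[step]
        else:
            return default
    return current


# Trimmed-down copy of the lab data for illustration.
data = {
    "person": {
        "hobbies": ["reading", "hiking"],
        "employment": {"projects": [{"name": "Project Alpha"}]},
    }
}

print(extract_path(data, ["person", "hobbies", 0], "No hobbies"))
print(extract_path(data, ["person", "employment", "projects", 0, "name"], "No project"))
print(extract_path(data, ["person", "hobbies", 5], "No such hobby"))
```

With integer steps in the path, the caller no longer needs separate isinstance and length checks before indexing into a list.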

An Enhanced JSON Extractor with Path Strings

Let's enhance the extractor to support paths written in dot notation, which makes it easier to use. Create a file named enhanced_extractor.py:

import json
from typing import Any, Dict

def get_nested_value(data: Dict, path_string: str, default: Any = None) -> Any:
    """
    Safely extract a value from a nested dictionary using a dot-separated path string.

    Args:
        data: The dictionary to extract value from
        path_string: A dot-separated string representing the path to the value
        default: The default value to return if the path doesn't exist

    Returns:
        The value at the specified path or the default value if not found
    """
    ## Convert the path string to a list of keys
    path = path_string.split(".")

    ## Start with the full dictionary
    current = data

    ## Follow the path
    for key in path:
        ## Handle list indexing with [n] notation
        if key.endswith("]") and "[" in key:
            list_key, index_str = key.split("[", 1)
            try:
                index = int(index_str[:-1])  ## Remove the closing bracket and convert to int
            except ValueError:
                ## Malformed or chained index such as items[x] or items[0][1]
                return default

            ## Get the list
            if list_key:  ## If there's a key before the bracket
                if not isinstance(current, dict) or list_key not in current:
                    return default
                current = current[list_key]

            ## Get the item at the specified index; negative indexes count from the end
            if not isinstance(current, list) or not -len(current) <= index < len(current):
                return default
            current = current[index]
        else:
            ## Regular dictionary key
            if not isinstance(current, dict) or key not in current:
                return default
            current = current[key]

    return current

## Run the demo only when this file is executed directly, so that importing
## get_nested_value from another script does not trigger the code below
if __name__ == "__main__":
    ## Load JSON from file
    with open('sample.json', 'r') as file:
        data = json.load(file)

    ## Test the enhanced extractor
    print("Basic access:")
    print(f"Name: {get_nested_value(data, 'person.name', 'Unknown')}")
    print(f"Age: {get_nested_value(data, 'person.age', 0)}")
    print(f"Occupation: {get_nested_value(data, 'person.occupation', 'Not specified')}")

    print("\nNested access:")
    print(f"Email: {get_nested_value(data, 'person.contact.email', 'No email')}")
    print(f"City: {get_nested_value(data, 'person.address.city', 'Unknown city')}")

    print("\nArray access:")
    print(f"First hobby: {get_nested_value(data, 'person.hobbies[0]', 'No hobbies')}")
    print(f"Second hobby: {get_nested_value(data, 'person.hobbies[1]', 'No second hobby')}")
    print(f"Non-existent hobby: {get_nested_value(data, 'person.hobbies[10]', 'No such hobby')}")

    print("\nComplex access:")
    print(f"Company: {get_nested_value(data, 'person.employment.company', 'Unknown company')}")
    print(f"First project name: {get_nested_value(data, 'person.employment.projects[0].name', 'No project')}")
    print(f"Second project status: {get_nested_value(data, 'person.employment.projects[1].status', 'Unknown status')}")
    print(f"Non-existent project: {get_nested_value(data, 'person.employment.projects[2].name', 'No such project')}")
    print(f"Education: {get_nested_value(data, 'person.education.degree', 'No education info')}")

Run the script:

python3 enhanced_extractor.py

You should see output similar to the following:

Basic access:
Name: John Doe
Age: 35
Occupation: Not specified

Nested access:
Email: john.doe@example.com
City: Anytown

Array access:
First hobby: reading
Second hobby: hiking
Non-existent hobby: No such hobby

Complex access:
Company: Tech Solutions Inc.
First project name: Project Alpha
Second project status: in-progress
Non-existent project: No such project
Education: No education info
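Path strings such as grid[1][0], with more than one index on the same segment, are awkward to handle by splitting on brackets. One alternative is to tokenize the path with a regular expression first. The sketch below is an assumption about how that might be done; parse_path and get_by_path are hypothetical names, and the tokenizer does not validate malformed paths.

```python
import re
from typing import Any, List, Union

# Either a bracketed integer index, or a run of characters that is a key.
_TOKEN = re.compile(r"\[(-?\d+)\]|([^.\[\]]+)")


def parse_path(path_string: str) -> List[Union[str, int]]:
    """Split a path such as 'a.b[0][1].c' into keys (str) and indexes (int)."""
    tokens: List[Union[str, int]] = []
    for index, key in _TOKEN.findall(path_string):
        tokens.append(int(index) if index else key)
    return tokens


def get_by_path(data: Any, path_string: str, default: Any = None) -> Any:
    """Walk the parsed tokens, returning default when a step cannot be followed."""
    current = data
    for token in parse_path(path_string):
        if isinstance(token, int):
            if not isinstance(current, list) or not -len(current) <= token < len(current):
                return default
        elif not isinstance(current, dict) or token not in current:
            return default
        current = current[token]
    return current


table = {"grid": [[1, 2], [3, 4]], "hobbies": ["reading", "hiking"]}

print(parse_path("person.employment.projects[0].name"))
print(get_by_path(table, "grid[1][0]", "n/a"))
print(get_by_path(table, "grid[5][0]", "n/a"))
print(get_by_path(table, "hobbies[-1]", "n/a"))
```

Neither this sketch nor the splitting approach can address keys that themselves contain dots or brackets; a list-based path is the safer choice for such data.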

A Practical Application

Now let's apply the enhanced JSON extractor to a more complex, realistic scenario. Create a file named practical_example.py:

import json
from typing import Dict

## Import our enhanced extractor function
from enhanced_extractor import get_nested_value

## Create a more complex JSON structure for reporting
report_data = {
    "company": "Global Analytics Ltd.",
    "report_date": "2023-11-01",
    "departments": [
        {
            "name": "Engineering",
            "manager": "Alice Johnson",
            "employee_count": 45,
            "projects": [
                {"id": "E001", "name": "API Gateway", "status": "completed", "budget": 125000},
                {"id": "E002", "name": "Mobile App", "status": "in-progress", "budget": 200000}
            ]
        },
        {
            "name": "Marketing",
            "manager": "Bob Smith",
            "employee_count": 28,
            "projects": [
                {"id": "M001", "name": "Q4 Campaign", "status": "planning", "budget": 75000}
            ]
        },
        {
            "name": "Customer Support",
            "manager": "Carol Williams",
            "employee_count": 32,
            "projects": []
        }
    ],
    "financial": {
        "current_quarter": {
            "revenue": 2500000,
            "expenses": 1800000,
            "profit_margin": 0.28
        },
        "previous_quarter": {
            "revenue": 2300000,
            "expenses": 1750000,
            "profit_margin": 0.24
        }
    }
}

## Save this data to a JSON file for demonstration
with open('report.json', 'w') as file:
    json.dump(report_data, file, indent=2)

print("Report data saved to report.json")

## Now let's extract useful information from this report
def generate_summary(data: Dict) -> str:
    """Generate a summary of the company report"""

    company = get_nested_value(data, "company", "Unknown Company")
    report_date = get_nested_value(data, "report_date", "Unknown Date")

    ## Financial summary
    current_revenue = get_nested_value(data, "financial.current_quarter.revenue", 0)
    previous_revenue = get_nested_value(data, "financial.previous_quarter.revenue", 0)
    revenue_change = current_revenue - previous_revenue
    revenue_change_percent = (revenue_change / previous_revenue * 100) if previous_revenue > 0 else 0

    ## Department summary
    departments = get_nested_value(data, "departments", [])
    total_employees = sum(get_nested_value(dept, "employee_count", 0) for dept in departments)

    ## Project counts
    total_projects = sum(len(get_nested_value(dept, "projects", [])) for dept in departments)
    completed_projects = sum(
        1 for dept in departments
        for proj in get_nested_value(dept, "projects", [])
        if get_nested_value(proj, "status", "") == "completed"
    )

    ## Generate summary text
    summary = f"Company Report Summary for {company} as of {report_date}\n"
    summary += "=" * 50 + "\n\n"

    summary += "Financial Overview:\n"
    summary += f"- Current Quarter Revenue: ${current_revenue:,}\n"
    summary += f"- Revenue Change: ${revenue_change:,} ({revenue_change_percent:.1f}%)\n\n"

    summary += "Operational Overview:\n"
    summary += f"- Total Departments: {len(departments)}\n"
    summary += f"- Total Employees: {total_employees}\n"
    summary += f"- Total Projects: {total_projects}\n"
    summary += f"- Completed Projects: {completed_projects}\n\n"

    summary += "Department Details:\n"
    for i, dept in enumerate(departments):
        dept_name = get_nested_value(dept, "name", f"Department {i+1}")
        manager = get_nested_value(dept, "manager", "No manager")
        employees = get_nested_value(dept, "employee_count", 0)
        projects = get_nested_value(dept, "projects", [])

        summary += f"- {dept_name} (Manager: {manager})\n"
        summary += f"  * Employees: {employees}\n"
        summary += f"  * Projects: {len(projects)}\n"

        if projects:
            for proj in projects:
                proj_name = get_nested_value(proj, "name", "Unnamed Project")
                proj_status = get_nested_value(proj, "status", "unknown")
                proj_budget = get_nested_value(proj, "budget", 0)

                summary += f"    - {proj_name} (Status: {proj_status}, Budget: ${proj_budget:,})\n"
        else:
            summary += "    - No active projects\n"

        summary += "\n"

    return summary

## Generate and display the summary
summary = generate_summary(report_data)
print("\nGenerated Report Summary:")
print(summary)

## Save the summary to a file
with open('report_summary.txt', 'w') as file:
    file.write(summary)

print("Summary saved to report_summary.txt")

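Run the script:

python3 practical_example.py

You should see a message confirming that the report data was saved, followed by a detailed summary of the company report. If enhanced_extractor.py still runs its test calls at module level, their output is printed first, because importing a module executes its top-level code.

Check the output file:

cat report_summary.txt

This practical example shows how the JSON extractor utility can be used to build robust reporting tools that handle missing data gracefully. The get_nested_value function lets us safely extract values from complex nested structures without worrying about KeyError exceptions or TypeError exceptions caused by indexing into None.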

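Best Practices Summary

Based on the techniques explored in this lab, here are the best practices for extracting values from nested JSON objects:

  1. Use the get() method instead of direct indexing so that missing keys fall back to default values.
  2. Create utility functions for common JSON extraction patterns to avoid duplicated code.
  3. Handle missing paths gracefully by supplying sensible default values.
  4. Type-check values before processing them to avoid errors (for example, check that a value is a list before indexing into it).
  5. Break complex paths into separate variables to improve readability.
  6. Use path strings with dot notation for more intuitive access to nested values.
  7. Document your extraction code to make clear what you are looking for in the JSON structure.

By following these best practices, you can write more robust and maintainable code for working with nested JSON objects in Python.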
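The type-checking practice from the list above can be wrapped in a small helper. This is a minimal sketch under the assumption that a value of the wrong type should be treated like a missing one; get_typed is a hypothetical name and the record is illustrative data.

```python
from typing import Any


def get_typed(mapping: Any, key: str, expected_type: type, default: Any) -> Any:
    """Return mapping[key] only if it exists and has the expected type."""
    if not isinstance(mapping, dict):
        return default
    value = mapping.get(key, default)
    return value if isinstance(value, expected_type) else default


# Illustrative record in which two fields have unexpected types.
record = {"hobbies": "reading", "age": "35", "years": 5}

hobbies = get_typed(record, "hobbies", list, [])  # a string, so the default is used
age = get_typed(record, "age", int, 0)            # a string, so the default is used
years = get_typed(record, "years", int, 0)        # a real int, returned as-is

print(hobbies, age, years)
```

Keep in mind that bool is a subclass of int in Python, so a JSON true would pass an int check in this sketch.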

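Summary

In this lab, you learned practical techniques for extracting values from nested JSON objects in Python. You began by understanding the structure of nested JSON and how to load JSON data in Python. You then explored several ways to access nested data, from basic indexing to more robust methods.

Key takeaways from this lab:

  1. Understanding JSON structure: Recognizing the hierarchy of a nested JSON object is essential for accessing its values effectively.

  2. Basic access methods: Direct indexing with square brackets works when you are certain of the structure of your JSON data.

  3. Safe access techniques: Using the get() method to supply default values for missing keys makes your code more robust when the data structure is uncertain.

  4. Utility functions: Creating dedicated functions for JSON extraction can significantly simplify your code and make it easier to maintain.

  5. Path-based access: Path strings with dot notation provide an intuitive way to reach deeply nested values.

  6. Practical application: Applying these techniques to realistic scenarios helps you build robust data-processing tools.

By following these best practices, you can write more resilient code that handles the complexity of nested JSON data gracefully, even when the structure is irregular or values are missing.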