Kubernetes Interview Questions and Answers

Introduction

Welcome to this comprehensive guide designed to equip you with the knowledge and confidence needed to excel in Kubernetes interviews. Whether you're just starting your journey with container orchestration or are a seasoned professional looking to deepen your expertise, this document provides a structured approach to mastering Kubernetes concepts. We've meticulously curated a wide array of questions, spanning from fundamental principles and advanced architectural considerations to practical troubleshooting, scenario-based challenges, and role-specific inquiries for developers, administrators, and DevOps engineers. Prepare to enhance your understanding, refine your problem-solving skills, and confidently navigate any Kubernetes interview.

Kubernetes Fundamentals and Core Concepts

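What is Kubernetes, and why is it used?

Answer:

Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. It is used to handle the complexity of running applications in production while providing high availability, scalability, and efficient resource utilization.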

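Explain the difference between a Pod and a container in Kubernetes.

Answer:

A container is a lightweight, executable software package that includes everything needed to run an application. A Pod is the smallest deployable unit in Kubernetes; it wraps one or more containers together with storage resources, a unique network IP, and options that control how the containers run. All containers in a Pod share the same network namespace and can communicate over localhost.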

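What is a Node in Kubernetes?

Answer:

A Node is a worker machine in Kubernetes, either virtual or physical. Each Node runs the components needed to host Pods: the kubelet (the node agent that takes instructions from the control plane), kube-proxy (the network proxy), and a container runtime (for example containerd or CRI-O; since Kubernetes 1.24, Docker Engine needs the cri-dockerd adapter).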

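Describe the main components of the Kubernetes control plane (Master Node).

Answer:

The control plane consists of kube-apiserver (exposes the Kubernetes API), etcd (a consistent, highly available key-value store for cluster data), kube-scheduler (watches for newly created Pods and assigns them to Nodes), and kube-controller-manager (runs controller processes such as the Node, Replication, Endpoint, and Service Account controllers).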

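What is a Deployment in Kubernetes, and why is it used?

Answer:

A Deployment is a higher-level abstraction that manages the desired state of Pods and ReplicaSets. It provides declarative updates for them: you define how many replicas of the application should run and how rolling updates and rollbacks to earlier revisions are performed.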
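As a rough illustration of what "declarative" means here, the sketch below builds a minimal Deployment manifest as a Python dictionary and prints it as JSON, which kubectl apply -f accepts alongside YAML. The function name, the application name web, and the image tag nginx:1.25 are placeholders chosen for the example, not values from this guide.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int = 3) -> dict:
    """Build a minimal apps/v1 Deployment as a plain dictionary."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels in the Pod template below.
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, used only for illustration.
    print(json.dumps(deployment_manifest("web", "nginx:1.25"), indent=2))
```

Changing replicas or image in the manifest and re-applying it is what triggers the Deployment controller to scale or roll out a new ReplicaSet.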

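How does Kubernetes handle Pod networking?

Answer:

Kubernetes assigns every Pod its own IP address. All containers in a Pod share that IP and can reach each other over localhost. Pods on different Nodes communicate through the Pod network provided by a CNI (Container Network Interface) plugin, which is often, but not always, implemented as an overlay. kube-proxy maintains the network rules on each Node that route Service traffic to backend Pods and balance load across them.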

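What is a Service in Kubernetes, and what types are there?

Answer:

A Service is an abstraction that exposes an application running on a set of Pods as a network service. It gives that set of Pods a stable IP address and DNS name. Common types are ClusterIP (reachable only inside the cluster), NodePort (exposes the Service on a static port on every Node's IP), LoadBalancer (exposes the Service externally through a cloud provider's load balancer), and ExternalName (maps the Service to an external DNS name).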
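How a Service finds its Pods can be sketched in a few lines. The example below is a simplified model of equality-based label selection, not Kubernetes code; the Pod names and labels are invented, and Services without a selector are not modelled.

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """Return True when every key/value pair of the selector appears in the Pod labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Hypothetical Pods and their labels.
pods = {
    "web-1": {"app": "web", "tier": "frontend"},
    "web-2": {"app": "web", "tier": "frontend"},
    "db-1": {"app": "db"},
}

selector = {"app": "web"}
backends = sorted(name for name, labels in pods.items() if selector_matches(selector, labels))
print(backends)  # ['web-1', 'web-2']
```

A Pod may carry extra labels and still match; a single mismatched value is enough to leave it out of the Service's endpoints.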

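Explain the role of a ReplicaSet.

Answer:

A ReplicaSet ensures that a specified number of Pod replicas are running at any given time. Its main purpose is to keep a set of Pods stable and available. Although you can use ReplicaSets directly, they are usually managed by a Deployment, which adds higher-level features such as rolling updates.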

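What is `kubectl`, and what are its main functions?

Answer:

kubectl is the command-line tool for interacting with a Kubernetes cluster. It lets users run commands against the cluster, deploy applications, inspect and manage cluster resources, and view logs. It does all of this by talking to the Kubernetes API Server.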

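What is the role of `etcd` in Kubernetes?

Answer:

etcd is a distributed, consistent, and highly available key-value store that Kubernetes uses to hold all cluster data: configuration, state, metadata, and the desired state of the cluster. It acts as the single source of truth for the cluster.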

Advanced Kubernetes Topics and Architecture

Explain the concept of a Kubernetes Operator and provide an example of when you would use one.

Answer:

A Kubernetes Operator is a method of packaging, deploying, and managing a Kubernetes-native application. It extends the Kubernetes API, typically by pairing Custom Resource Definitions with a custom controller, to create, configure, and manage instances of complex applications. You would use an Operator for stateful applications such as databases (e.g., Cassandra, MySQL) to automate tasks like backups, upgrades, and scaling.

Describe the purpose of a Custom Resource Definition (CRD) in Kubernetes.

Answer:

A Custom Resource Definition (CRD) allows you to define your own custom resources in Kubernetes, extending the Kubernetes API. This enables you to store and retrieve structured data that Kubernetes can manage. CRDs are fundamental for building Operators and defining application-specific objects.
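To make this concrete, here is a minimal sketch of a CRD manifest built as a Python dictionary. The Backup kind, the example.com group, and the schedule and retentionDays fields are hypothetical; only the overall apiextensions.k8s.io/v1 structure reflects what Kubernetes expects.

```python
import json


def crd_manifest(group: str, kind: str, plural: str) -> dict:
    """Build a namespaced CustomResourceDefinition with a small validation schema."""
    return {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "CustomResourceDefinition",
        # The CRD name must have the form "<plural>.<group>".
        "metadata": {"name": f"{plural}.{group}"},
        "spec": {
            "group": group,
            "scope": "Namespaced",
            "names": {"kind": kind, "plural": plural, "singular": kind.lower()},
            "versions": [
                {
                    "name": "v1",
                    "served": True,   # exposed through the API
                    "storage": True,  # the version persisted in etcd
                    "schema": {
                        "openAPIV3Schema": {
                            "type": "object",
                            "properties": {
                                "spec": {
                                    "type": "object",
                                    # Hypothetical fields for a backup resource.
                                    "properties": {
                                        "schedule": {"type": "string"},
                                        "retentionDays": {"type": "integer", "minimum": 1},
                                    },
                                    "required": ["schedule"],
                                }
                            },
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(crd_manifest("example.com", "Backup", "backups"), indent=2))
```

Once such a definition is applied, the API server serves the new resource type, and an Operator's controller can watch those objects and act on them.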

How does the Kubernetes API Server handle authentication and authorization for requests?

Answer:

The API Server handles authentication through various methods like client certificates, bearer tokens, or service account tokens. After authentication, authorization is performed using modules like RBAC (Role-Based Access Control), Node authorization, or ABAC (Attribute-Based Access Control). RBAC is the most common, defining roles with permissions and binding them to users or service accounts.

What is the difference between a DaemonSet and a Deployment in Kubernetes?

Answer:

A Deployment manages a set of identical pods, ensuring a desired number of replicas are running across the cluster, typically for stateless applications. A DaemonSet ensures that all (or some) nodes run a copy of a pod, useful for cluster-level services like log collectors (e.g., Fluentd) or monitoring agents (e.g., Node Exporter) that need to run on every node.

Explain the concept of Pod Security Policies (PSPs) and why they were deprecated.

Answer:

Pod Security Policies (PSPs) were enforced by an admission controller that applied security standards to pods and containers. They let cluster administrators control security-sensitive settings such as privileged mode, host network access, and allowed volume types. PSPs were deprecated in Kubernetes 1.21 and removed in 1.25, largely because their authorization model was confusing and easy to misconfigure. The replacements are the built-in Pod Security Admission (PSA) controller, which enforces the Pod Security Standards per namespace, and policy engines such as OPA Gatekeeper for more flexible, fine-grained rules.

How do you achieve high availability for the Kubernetes control plane?

Answer:

High availability for the control plane is achieved by running multiple instances of the API Server, etcd, Controller Manager, and Scheduler. etcd runs as a quorum-based cluster with an odd number of members (e.g., 3 or 5). The API Servers are all active behind a load balancer that distributes traffic and provides failover, while the Controller Manager and Scheduler use leader election so that only one instance of each is active at a time.
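The preference for 3 or 5 etcd members follows from majority-quorum arithmetic. The short sketch below is illustrative only (the function names are made up for this example) and shows why an even member count adds no fault tolerance.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={failures_tolerated(n)}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, which is why odd sizes are the usual choice.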

What is a mutating admission webhook and how can it be used?

Answer:

A mutating admission webhook is an HTTP callback that the API server invokes during admission, allowing the webhook to modify an object before it is persisted. It can inject sidecar containers, add labels/annotations, or set default values for fields. For example, it can automatically inject an istio-proxy sidecar into pods for service mesh integration.
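The response a mutating webhook sends back is an AdmissionReview object carrying a base64-encoded JSON Patch. The sketch below shows only that response-building step, with no HTTP server or TLS; the function name and the injected label are hypothetical choices for the example.

```python
import base64
import json


def mutate(review: dict) -> dict:
    """Build an AdmissionReview response that adds a label to the incoming object."""
    request = review["request"]
    labels = request["object"].get("metadata", {}).get("labels")
    if labels is None:
        # JSON Patch "add" needs the parent to exist, so create the whole labels map.
        patch = [{"op": "add", "path": "/metadata/labels", "value": {"injected": "true"}}]
    else:
        patch = [{"op": "add", "path": "/metadata/labels/injected", "value": "true"}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # must echo the UID of the request
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode("utf-8")).decode("ascii"),
        },
    }


if __name__ == "__main__":
    sample = {"request": {"uid": "abc-123", "object": {"metadata": {"name": "demo"}}}}
    print(json.dumps(mutate(sample), indent=2))
```

A sidecar injector works the same way, except that its patch appends a container to /spec/containers instead of adding a label.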

Describe the role of etcd in a Kubernetes cluster.

Answer:

etcd serves as Kubernetes' consistent and highly available key-value store. It stores all cluster data, including configuration, state, and metadata for all Kubernetes objects (pods, deployments, services, etc.). It's critical for the cluster's operation, and its availability directly impacts the cluster's health.

How does Kubernetes handle network policy enforcement?

Answer:

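Kubernetes Network Policies are specifications that define how groups of pods are allowed to communicate with each other and with external endpoints. They are implemented by a network plugin (CNI) that supports NetworkPolicy, such as Calico, Cilium, or Weave Net. The plugin translates these policies into packet-filtering rules on each node (for example iptables rules or eBPF programs). If the installed plugin does not support NetworkPolicy, the policies are accepted by the API but have no effect.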

What are Taints and Tolerations, and how are they used for pod scheduling?

Answer:

Taints are applied to nodes, marking them as 'unsuitable' for certain pods unless those pods have matching Tolerations. Tolerations are applied to pods, allowing them to be scheduled on tainted nodes. This mechanism is used to reserve nodes for specific workloads (e.g., GPU nodes) or to evict pods from unhealthy nodes.
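The matching rule between a toleration and a taint can be sketched as follows. This is a simplified model of the scheduler's check written for illustration, not the actual implementation; the gpu taint is an invented example, and PreferNoSchedule is treated as a soft preference that never blocks scheduling.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True when a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False  # an empty effect in the toleration matches every effect
    key = toleration.get("key", "")
    if key and key != taint.get("key"):
        return False  # an empty key (with operator Exists) matches every key
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A Pod can land on a node only if every hard taint is tolerated."""
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in hard)


# Hypothetical taint reserving a node for GPU workloads.
gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
gpu_toleration = {"key": "gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"}

print(schedulable([], [gpu_taint]))                # False
print(schedulable([gpu_toleration], [gpu_taint]))  # True
```

Note that a toleration only permits scheduling onto the tainted node; steering the Pod there still requires a node selector or node affinity.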

Scenario-Based and Design Questions

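Your application Pods are restarting frequently. How do you troubleshoot this in Kubernetes?

Answer:

I would start with kubectl describe pod <pod-name> to check events, status, and the last termination reason. Next, I would run kubectl logs <pod-name> to look for errors in the application logs. Finally, I would run kubectl logs <pod-name> -p to read the logs of the previous container instance and find out why it crashed.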

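You need to deploy a new version of an application with zero downtime. How do you achieve this in Kubernetes?

Answer:

I would use the RollingUpdate strategy for the Deployment. Kubernetes then replaces old Pods gradually, so a minimum number of Pods is always available. Health checks (readiness probes) are essential, because they ensure that new Pods are ready before traffic is routed to them.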

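Describe a scenario where you would use a StatefulSet instead of a Deployment.

Answer:

I would use a StatefulSet for applications that need stable, unique network identifiers, stable persistent storage, and ordered, graceful deployment, scaling, and deletion. Examples are databases (such as PostgreSQL) and distributed systems (such as Apache Kafka), where each replica needs its own persistent volume and a predictable hostname.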
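The "predictable hostname" follows a fixed pattern: each Pod is named after the StatefulSet plus an ordinal, and a headless Service gives it a stable DNS record. The sketch below assumes the default cluster domain cluster.local; the StatefulSet, Service, and namespace names are made up.

```python
def statefulset_pod_dns(name: str, service: str, namespace: str, replicas: int,
                        cluster_domain: str = "cluster.local") -> list:
    """List the stable DNS names of a StatefulSet's Pods behind a headless Service."""
    return [
        f"{name}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# Hypothetical three-replica Kafka StatefulSet in the "data" namespace.
for host in statefulset_pod_dns("kafka", "kafka-headless", "data", 3):
    print(host)
```

Because kafka-0 keeps its name and its volume across rescheduling, peers can be configured with these addresses ahead of time.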

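Your Kubernetes cluster is running short of resources (CPU/memory). What steps would you take to diagnose and mitigate the problem?

Answer:

First, I would use kubectl top nodes and kubectl top pods (both rely on the Metrics Server) to identify the heaviest consumers. Then I would review the Pods' resource requests and limits to make sure they are set sensibly. Mitigations include optimizing the application's resource usage, adding nodes to the cluster, and adjusting resource requests and limits.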

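How do you securely expose a web application running in Kubernetes to the internet?

Answer:

I would put the application behind a Kubernetes Service; type LoadBalancer or NodePort makes it reachable from outside the cluster, while ClusterIP is sufficient when an Ingress sits in front of it. For secure HTTP/HTTPS access, I would deploy an Ingress controller (for example NGINX Ingress) and define Ingress resources with TLS termination, usually integrated with cert-manager for automatic certificate provisioning.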

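You need to run a one-off batch job that processes data and then exits. Which Kubernetes object would you use?

Answer:

I would use a Kubernetes Job. A Job ensures that a specified number of Pods complete their task successfully. For recurring tasks I would use a CronJob, which creates Job objects on a defined schedule.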

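Design a high-availability strategy for a critical microservice in Kubernetes.

Answer:

I would run the microservice as a Deployment with multiple replicas (for example three or more) and use Pod anti-affinity rules to spread them across different nodes. I would implement robust readiness and liveness probes. For data persistence, I would use a distributed database or a StatefulSet with persistent volumes. Finally, I would set appropriate resource requests and limits and enable autoscaling.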

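How would you handle an application's sensitive information in Kubernetes, such as API keys or database credentials?

Answer:

I would store sensitive information in Kubernetes Secrets. Secrets can be mounted into Pods as files or exposed as environment variables. For stronger security, I would integrate with an external secret management system such as HashiCorp Vault or a cloud provider's KMS service.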

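Your application needs to access a database that runs outside the Kubernetes cluster. How do you configure this securely?

Answer:

I would represent the external database inside the cluster with either a Service of type ExternalName or a Service without a selector whose Endpoints are defined manually. Pods can then resolve the database through a Kubernetes Service name. I would use network policies to restrict egress traffic to the database's IP and port, and manage the credentials through Kubernetes Secrets.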

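You notice that the application's response times increase under heavy load. How do you scale the application with Kubernetes to handle this?

Answer:

I would set up a Horizontal Pod Autoscaler (HPA) for the Deployment and configure it to scale on CPU utilization or on custom metrics such as requests per second. It then adds Pod replicas automatically as demand grows. I would also make sure the underlying cluster has enough node capacity, or deploy the Cluster Autoscaler.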
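The scaling decision itself is simple arithmetic: the HPA multiplies the current replica count by the ratio of the observed metric to the target and rounds up. The sketch below is a simplified model of that formula; the function name, the replica bounds, and the 10% tolerance are assumptions for illustration, and readiness handling and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Always stay within the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% CPU against a 60% target scale out to 6.
print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
```

With the same 60% target, an average of 62% falls inside the tolerance band and leaves the replica count unchanged.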

Role-Specific Questions (Developer, Administrator, DevOps)

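Developer: How do you troubleshoot a Pod that is stuck in the Pending state?

Answer:

I would first run kubectl describe pod <pod-name> and look at the events, which usually point to the cause: insufficient resources (CPU/memory), node affinity or taint problems, or unbound Persistent Volume Claims. Next, I would run kubectl describe node <node-name> to check the node's conditions and available resources.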

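Developer: You need to deploy a new version of your application. Which deployment approach in Kubernetes best minimizes downtime?

Answer:

I would use the RollingUpdate strategy for the Deployment. It gradually replaces old Pods with new ones and so keeps the application continuously available. I would also define readiness probes so that new Pods receive traffic only once they are healthy.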

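Administrator: A user reports that they cannot reach a service running in the cluster. What steps would you take to diagnose the problem?

Answer:

I would first run kubectl describe service <service-name> to verify the Service's configuration and confirm that it has ready endpoints. Then I would check the health of the Pods backing the Service (kubectl get pods -o wide) and read their logs for application errors. Network policies or firewall rules can also be the cause.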

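Administrator: How do you ensure that only authorized users can access specific resources in a Kubernetes cluster?

Answer:

I would implement Role-Based Access Control (RBAC). That means defining Roles (permissions within a namespace) or ClusterRoles (cluster-wide permissions) and then binding them to users, groups, or service accounts with RoleBindings or ClusterRoleBindings.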
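RBAC rules are purely additive: a request is allowed if any rule bound to the subject covers its API group, resource, and verb, and there are no deny rules. The sketch below is a simplified model of that check; it ignores resourceNames, non-resource URLs, and how bindings are resolved, and the pod_reader rule set is an invented example.

```python
def rule_allows(rule: dict, api_group: str, resource: str, verb: str) -> bool:
    """Check one RBAC policy rule; "*" acts as a wildcard in each list."""
    def covers(allowed: list, value: str) -> bool:
        return "*" in allowed or value in allowed

    return (covers(rule["apiGroups"], api_group)
            and covers(rule["resources"], resource)
            and covers(rule["verbs"], verb))


def is_allowed(rules: list, api_group: str, resource: str, verb: str) -> bool:
    """Permissions are additive: any matching rule grants access."""
    return any(rule_allows(rule, api_group, resource, verb) for rule in rules)


# Hypothetical read-only Role for Pods; "" is the core API group.
pod_reader = [
    {"apiGroups": [""], "resources": ["pods", "pods/log"], "verbs": ["get", "list", "watch"]},
]

print(is_allowed(pod_reader, "", "pods", "list"))    # True
print(is_allowed(pod_reader, "", "pods", "delete"))  # False
```

On a real cluster, kubectl auth can-i answers the same question for a given user or service account.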

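Administrator: Describe a scenario where you would use a NetworkPolicy.

Answer:

I would use a NetworkPolicy to control traffic between Pods, or between Pods and external endpoints. Examples are isolating a database Pod so that only specific application Pods can connect to it, or restricting egress traffic from a development namespace.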
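The database isolation case could look roughly like the manifest below, built as a Python dictionary and printed as JSON. The policy name, the app=db and app=api labels, and port 5432 are assumptions made for the example.

```python
import json


def db_ingress_policy(namespace: str) -> dict:
    """Allow ingress to app=db Pods only from app=api Pods on TCP 5432."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "db-allow-api", "namespace": namespace},
        "spec": {
            # Pods this policy applies to.
            "podSelector": {"matchLabels": {"app": "db"}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Only Pods with this label, in the same namespace, may connect.
                    "from": [{"podSelector": {"matchLabels": {"app": "api"}}}],
                    "ports": [{"protocol": "TCP", "port": 5432}],
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(db_ingress_policy("prod"), indent=2))
```

Once a Pod is selected by any ingress policy, all ingress traffic not explicitly allowed by some policy is dropped, so this single rule is enough to isolate the database.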

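DevOps: How do you securely manage Secrets (for example API keys and database credentials) in Kubernetes?

Answer:

Kubernetes Secrets only base64-encode their values, which is an encoding and not encryption. For real security I would integrate with an external secret management solution such as HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault. These solutions can inject secrets directly into Pods or mount them dynamically through a CSI driver, which also avoids storing sensitive data in Git.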
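The following sketch shows why base64 offers no protection: anyone who can read the Secret object can reverse the encoding without a key. The helper names and the sample password are invented for the example.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears under `data` in a Secret manifest."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; no key or password is involved."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = encode_secret_value("s3cr3t")  # hypothetical password
print(encoded)
print(decode_secret_value(encoded))  # the original value comes straight back
```

This is the reason Secret manifests should not be committed to Git in plain form and why read access to Secrets needs to be restricted with RBAC.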

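DevOps: Explain the purpose of a Helm chart and how it benefits a CI/CD pipeline.

Answer:

Helm is a package manager for Kubernetes, and a Helm chart is its package format: it bundles all the Kubernetes resources an application needs (Deployments, Services, ConfigMaps, and so on) into a single, versioned unit. In CI/CD, charts enable consistent, repeatable deployments across environments, make upgrades and rollbacks straightforward, and allow configuration to be parameterized.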

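DevOps: How would you implement continuous deployment for a microservices application in Kubernetes?

Answer:

I would use a GitOps approach with a tool such as Argo CD or Flux. After code is merged and tested, the CI pipeline builds the container image and updates the image tag in the Kubernetes manifests stored in a Git repository. The GitOps controller then detects the change in Git and applies it to the cluster automatically, keeping the cluster in sync with the desired state.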

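DevOps: Which key metrics would you monitor for a Kubernetes cluster and its applications?

Answer:

For the cluster, I would monitor node resource utilization (CPU, memory, disk), API server latency, and etcd health. For applications, the key metrics are Pod CPU and memory usage, request rate, error rate, latency, and application-specific business metrics. Prometheus and Grafana are commonly used tools for this.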

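DevOps: Describe how you would handle persistent storage for a stateful application in Kubernetes.

Answer:

I would use PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). A PVC requests storage, and it is satisfied by a PV that is either created in advance by an administrator or provisioned dynamically through a StorageClass. This abstracts the underlying storage infrastructure: applications request storage without knowing its details, and the data persists even when a Pod is rescheduled.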

Troubleshooting and Debugging Kubernetes

Your pod is stuck in 'Pending' state. What are the common reasons and how would you troubleshoot?

Answer:

Common reasons include insufficient resources (CPU/memory), node taints/tolerations, or persistent volume issues. I'd use kubectl describe pod <pod-name> to check events for scheduling failures, resource requests, and volume binding status.

A pod is in 'CrashLoopBackOff' state. What does this indicate and how do you debug it?

Answer:

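This indicates the container inside the pod is repeatedly starting and crashing, and that the kubelet is waiting an increasing back-off delay before each restart. I'd first check kubectl logs <pod-name> for application errors, adding --previous to read the output of the crashed instance. If logs aren't helpful, I'd use kubectl describe pod <pod-name> to look for an OOMKilled termination reason or liveness probe failures; readiness probe failures take a pod out of service but do not restart it.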
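The "BackOff" part of the status refers to the delay the kubelet inserts between restarts, which doubles each time up to a cap. The sketch below reproduces the documented sequence (10s, 20s, 40s, and so on, capped at five minutes); the function name and parameters are illustrative.

```python
def restart_delays(restarts: int, base_seconds: int = 10, cap_seconds: int = 300) -> list:
    """Back-off delay before each restart: doubles every time, capped at five minutes."""
    return [min(base_seconds * 2 ** attempt, cap_seconds) for attempt in range(restarts)]


print(restart_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This explains why a crash-looping pod can sit for minutes between attempts, so the logs of the previous instance are usually more useful than waiting for the next crash.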

How do you check the logs of a specific container within a multi-container pod?

Answer:

You can specify the container name using the -c flag with kubectl logs. For example: kubectl logs <pod-name> -c <container-name>. This allows isolating logs from a particular service.

A service is not reachable from outside the cluster. What steps would you take to diagnose this?

Answer:

I'd check the service type (e.g., NodePort, LoadBalancer) and its external IP/port. Then, I'd verify firewall rules, security groups, and network policies. Finally, I'd check if the service's selectors correctly match the pod labels and if the pods are running and healthy.

You suspect a network policy is blocking traffic to your application. How would you confirm this?

Answer:

I'd use kubectl get networkpolicy -n <namespace> to list the policies in the namespace and kubectl describe networkpolicy <policy-name> to understand their rules. Then, I'd check the pod's labels and namespace to see which policies select it. Running a connectivity test from a debug pod (for example with the netshoot image) confirms whether traffic is actually blocked.

How do you get a shell into a running container for debugging purposes?

Answer:

You can use kubectl exec -it <pod-name> -- /bin/bash (or /bin/sh if bash isn't available). This allows you to inspect the container's filesystem, run commands, and diagnose issues directly within its environment.

What are common causes for 'ImagePullBackOff' and how do you troubleshoot them?

Answer:

Common causes include incorrect image name/tag, private registry authentication issues, or network connectivity problems to the registry. I'd check kubectl describe pod <pod-name> for image pull errors and verify image names, registry credentials (secrets), and network access.

Your application is experiencing high latency, but the pods appear healthy. What could be the issue?

Answer:

This could indicate resource contention (CPU throttling), inefficient application code, or issues with external dependencies. I'd check resource utilization metrics (CPU/memory) for the pods, review application logs for slow queries, and inspect network latency to external services.

How would you debug a liveness or readiness probe failure?

Answer:

I'd check kubectl describe pod <pod-name> for probe failure events and the specific command/path being used. Then, I'd use kubectl logs <pod-name> to see if the application is crashing or not responding to the probe's endpoint. Executing the probe command manually inside the container can also help.

A node is in 'NotReady' state. What are the typical reasons and how do you investigate?

Answer:

Typical reasons include kubelet not running, network issues preventing communication with the control plane, or insufficient node resources. I'd SSH into the node, check systemctl status kubelet, review kubelet logs (journalctl -u kubelet), and verify network connectivity to the API server.

Practical Kubernetes Challenges

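You have a Pod that keeps crashing. How do you troubleshoot it?

Answer:

I would first run kubectl describe pod <pod-name> to check its events and status. Then I would run kubectl logs <pod-name> to read the application logs. If the Pod is in a crash loop, I would run kubectl logs --previous <pod-name> to read the logs of the last terminated container.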

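A Deployment's Pods are stuck in the Pending state. What are the common causes, and how do you diagnose them?

Answer:

Common causes include insufficient resources (CPU/memory), node taints without matching tolerations, and node selector or affinity rules that no node satisfies. I would run kubectl describe pod <pod-name> to read the scheduling events and kubectl get events --field-selector involvedObject.kind=Node to check for node-level problems.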

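How do you expose a stateless application running in a Deployment to external traffic?

Answer:

I would create a Service of type LoadBalancer or NodePort to expose the Deployment. For more advanced routing and SSL termination I would use an Ingress resource, which requires an Ingress Controller.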

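You need to perform a zero-downtime rolling update of a Deployment. How does Kubernetes handle this, and what are the key considerations?

Answer:

Deployments perform rolling updates by default. Kubernetes creates new Pods and terminates old ones incrementally, within the bounds set by maxSurge (how many Pods may exist above the desired count) and maxUnavailable (how many may be unavailable during the update). Key considerations are correct readiness probes, sufficient resource headroom, and testing the new version before rolling it out fully.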
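How maxSurge and maxUnavailable translate into Pod counts can be worked out by hand. The sketch below assumes the documented rounding rules (a percentage for maxSurge rounds up, one for maxUnavailable rounds down); the function names are made up and edge cases such as both values resolving to zero are not handled.

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string such as "25%" into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        fraction = int(value[:-1]) * replicas / 100
        return math.ceil(fraction) if round_up else math.floor(fraction)
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> dict:
    """Upper bound on total Pods and lower bound on available Pods during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))  # {'max_total_pods': 13, 'min_available_pods': 8}
```

Setting max_surge=1 and max_unavailable=0 trades rollout speed for capacity: the Deployment never drops below the desired replica count, but the cluster needs room for one extra Pod.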

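Describe a scenario where you would use a ConfigMap instead of a Secret.

Answer:

I would use a ConfigMap for non-sensitive configuration data, such as application environment variables or configuration files. I would use a Secret for sensitive data such as API keys, database credentials, or TLS certificates. Note that Secrets are only base64-encoded and are stored unencrypted in etcd by default; encryption at rest has to be enabled explicitly, and access should be restricted with RBAC.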

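How do you ensure that a Pod runs only on nodes with specific hardware (for example a GPU)?

Answer:

I would use a node selector or node affinity. A node selector is simpler and matches labels exactly (nodeSelector: {gpu: 'true'}). Node affinity is more flexible and supports requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution rules; only the required form guarantees placement, while the preferred form is a soft preference.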

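A Service is not routing traffic to its Pods. What steps would you take to debug this?

Answer:

First, run kubectl describe service <service-name> to verify that its selector matches the Pod labels. Then run kubectl get endpoints <service-name> to see whether any Pod IPs are listed. Finally, make sure the Pods are healthy and passing their readiness probes.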

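You need to run a one-off task in the cluster, such as a database migration. Which Kubernetes resource would you use?

Answer:

I would use a Kubernetes Job. A Job creates one or more Pods and ensures that a specified number of them terminate successfully. For scheduled tasks I would use a CronJob.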

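Explain the purpose of a PersistentVolume (PV) and a PersistentVolumeClaim (PVC).

Answer:

A PersistentVolume (PV) is a piece of storage in the cluster, provisioned by an administrator or dynamically through a StorageClass. A PersistentVolumeClaim (PVC) is a user's request for storage. A PVC binds to a suitable PV, giving Pods persistent storage whose lifecycle is independent of any individual Pod.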

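How do you scale a Deployment manually and automatically?

Answer:

To scale manually, I would run kubectl scale deployment <deployment-name> --replicas=<number>. To scale automatically, I would use a Horizontal Pod Autoscaler (HPA), which adjusts the number of Pods in a Deployment or ReplicaSet based on observed CPU utilization or other custom metrics.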

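Summary

A Kubernetes interview can be challenging but rewarding. This document has covered common questions and answers across fundamentals, architecture, scenarios, and troubleshooting. Thorough preparation matters: understanding the core concepts, their practical application, and typical troubleshooting scenarios will noticeably improve your performance.

Beyond the interview, working with Kubernetes is a process of continuous learning. The ecosystem evolves quickly, so staying curious, trying new features, and engaging with the community will keep your skills current.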
