K8s Summaries - Pods
Workloads in Kubernetes
A workload is an application running on Kubernetes inside a set of pods. A Pod represents a set of running containers on your cluster.
- Pods have a defined lifecycle. If a pod fails, you would need to create a new Pod to recover.
- To manage pods, you can use workload resources that manage a set of pods on your behalf.
Workload Resources
Kubernetes provides several built-in workload resources:
- Deployment and ReplicaSet: Good for managing a stateless application workload, where any Pod is interchangeable and can be replaced if needed (a sample manifest follows this list).
- StatefulSet: Lets you run related Pods that track state. For example, if your workload records data persistently, you can run a StatefulSet that matches each Pod with a PersistentVolume.
- DaemonSet: Defines Pods that provide node-local facilities. These might be fundamental to the operation of your cluster, such as a networking helper tool, or be part of an add-on. Every time you add a node to your cluster that matches the specification in a DaemonSet, the control plane schedules a Pod for that DaemonSet onto the new node.
- Job and CronJob: Define tasks that run to completion and then stop. Jobs represent one-off tasks, whereas CronJobs recur according to a schedule.
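For the stateless case, a minimal Deployment manifest might look like the following sketch; the name, labels, and image are placeholders, not anything prescribed by Kubernetes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical name
spec:
  replicas: 3                # desired number of interchangeable Pods
  selector:
    matchLabels:
      app: web
  template:                  # the Pod template the controller manages on your behalf
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25  # example image
          ports:
            - containerPort: 80
```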
In the wider Kubernetes ecosystem, you can find third-party workload resources that provide additional behaviors. Using a custom resource definition, you can add in a third-party workload resource if you want a specific behavior that's not part of Kubernetes' core.
Supporting Concepts
- Garbage collection: Tidies up objects from your cluster after their owning resource has been removed.
- The time-to-live after finished controller: Removes Jobs once a defined time has passed since they completed.
Further Resources
Once your application is running, you might want to make it available on the internet as a Service or, for web applications only, using an Ingress.
Pods: Summary
- A Pod is the smallest deployable unit in Kubernetes. It can contain one or more containers sharing storage and network resources.
- Pods can also contain init containers that run during Pod startup and ephemeral containers for debugging, if supported by the cluster.
- A Pod's shared context is a set of Linux namespaces, cgroups, and potentially other facets of isolation, the same things that isolate an individual container.
- Pods can be created directly but are usually created using workload resources such as Deployment or Job.
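For reference, a minimal single-container Pod manifest looks like the sketch below; the name, image, and command are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-pod            # hypothetical name
spec:
  containers:
    - name: hello
      image: busybox:1.36    # example image
      command: ["sh", "-c", "echo hello; sleep 3600"]
```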
Types of Pods
- Single Container Pods: Most common use-case where a Pod wraps around a single container.
- Multi-container Pods: These Pods encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources. They form a single cohesive unit of service.
Pod Management
- Each Pod runs a single instance of a given application, with replication handled by multiple Pods.
- Pods support multiple cooperating processes (as containers) that form a cohesive unit of service.
- Workload resources can be used to manage multiple Pods.
- Controllers for workload resources create Pods from a pod template and manage those Pods.
Pod Lifecycle
- Pods are considered relatively ephemeral, disposable entities. They remain on their node until they finish execution, the Pod object is deleted, the Pod is evicted for lack of resources, or the node fails.
- Restarting a container in a Pod is different from restarting a Pod. A Pod persists until it is deleted.
Pod Networking and Storage
- Each Pod is assigned a unique IP address. All containers in a Pod share the network namespace, including the IP address and network ports.
- A Pod can specify a set of shared storage volumes. All containers in the Pod can access the shared volumes, allowing those containers to share data.
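As a sketch of two co-located containers sharing a volume, the manifest below mounts one emptyDir volume into both containers; the names, image, and paths are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-pod
spec:
  volumes:
    - name: shared-data           # Pod-level volume visible to both containers
      emptyDir: {}
  containers:
    - name: writer
      image: busybox:1.36
      command: ["sh", "-c", "echo data > /out/file; sleep 3600"]
      volumeMounts:
        - name: shared-data
          mountPath: /out
    - name: reader
      image: busybox:1.36
      command: ["sh", "-c", "sleep 5; cat /in/file; sleep 3600"]
      volumeMounts:
        - name: shared-data
          mountPath: /in
```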
Privileged Mode for Containers
- Containers in a Pod can run in privileged mode to use operating system administrative capabilities.
Static Pods
- Managed directly by the kubelet daemon on a specific node, without the API server observing them. Useful for running a self-hosted control plane.
Container Probes
- Probes are diagnostics performed periodically by the kubelet on a container, used to check the health of the container.
Operating System Support
- As of Kubernetes v1.25, only Linux and Windows are supported. The .spec.os.name field indicates the OS on which you want the Pod to run.
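Declaring the OS is a single field on the Pod spec, as in this sketch (name, image, and command are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: linux-only-pod
spec:
  os:
    name: linux              # or "windows"; indicates the OS the Pod is intended for
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
```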
Pod Templates
- Controllers for workload resources (like Deployments, Jobs, and DaemonSets) create Pods using a Pod template.
- Pod templates are specifications for creating Pods and are part of the desired state of the workload resource.
- Modifying the Pod template or switching to a new Pod template does not directly affect existing Pods. Instead, the workload resource controller creates new Pods based on the updated template.
- Each workload resource has its own rules for handling changes to the Pod template.
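For example, a Job manifest embeds a Pod template under .spec.template, and the Job controller creates Pods from it; the name, image, and command below are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job
spec:
  template:                       # the Pod template; the Job controller creates Pods from it
    spec:
      containers:
        - name: hello
          image: busybox:1.36
          command: ["sh", "-c", "echo 'Hello from the template'"]
      restartPolicy: OnFailure    # Jobs require OnFailure or Never, not Always
```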
Pod Update and Replacement
- If the Pod template changes, the controller creates new Pods based on the updated template.
- Kubernetes allows you to update some fields of a running Pod, in place. However, limitations exist:
- Most metadata about a Pod is immutable.
- If the `metadata.deletionTimestamp` is set, no new entry can be added to the `metadata.finalizers` list.
- Pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, or `spec.tolerations`.
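As a sketch of an allowed in-place update, the image of a container in a bare Pod could be changed with a patch file like the one below, applied with `kubectl patch pod <pod-name> --patch-file image-patch.yaml`; the container name and image tag are hypothetical:

```yaml
# image-patch.yaml: strategic merge patch touching only the mutable image field.
spec:
  containers:
    - name: app             # must match an existing container name in the Pod
      image: nginx:1.25     # new image; changes to other spec fields would be rejected
```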
Resource Sharing and Communication
- Pods can specify a set of shared storage volumes, accessible by all containers in the Pod.
- Each Pod is assigned a unique IP address. Every container in a Pod shares the network namespace, including the IP address and network ports.
- Containers within the Pod can communicate with each other using localhost. They must coordinate their use of shared network resources when communicating with entities outside the Pod.
- Containers in a Pod can also communicate with each other using standard inter-process communications like SystemV semaphores or POSIX shared memory.
- Containers in different Pods have distinct IP addresses and can not communicate by OS-level IPC without special configuration.
- Containers that want to interact with a container running in a different Pod can use IP networking to communicate.
Privileged Mode for Containers
- Any container in a pod can run in privileged mode to access operating system administrative capabilities.
- This is available for both Windows and Linux.
Linux Privileged Containers
- On Linux, containers can enable privileged mode using the `privileged` flag on the security context of the container spec.
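A minimal sketch of a privileged Linux container follows; the name and image are placeholders, and privileged mode should be used sparingly:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: privileged-demo
spec:
  containers:
    - name: admin-tool
      image: busybox:1.36
      command: ["sleep", "3600"]
      securityContext:
        privileged: true     # grants the container broad host-level capabilities
```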
Windows Privileged Containers
- As of Kubernetes v1.26, you can create a Windows HostProcess pod by setting the `windowsOptions.hostProcess` flag on the security context of the pod spec.
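A sketch of a Windows HostProcess Pod, assuming a Windows node; the image and user account are examples:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostprocess-demo
spec:
  os:
    name: windows
  nodeSelector:
    kubernetes.io/os: windows
  hostNetwork: true                         # HostProcess Pods must use the host network
  securityContext:
    windowsOptions:
      hostProcess: true                     # run directly on the host
      runAsUserName: 'NT AUTHORITY\SYSTEM'  # example host account
  containers:
    - name: hostprocess
      image: mcr.microsoft.com/windows/nanoserver:ltsc2022   # example image
      command: ["powershell.exe", "-Command", "Start-Sleep 3600"]
```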
Static Pods
- Static Pods are managed directly by the kubelet daemon on a specific node, not observed by the API server.
- Static Pods are always bound to one Kubelet on a specific node.
- The kubelet automatically tries to create a mirror Pod on the Kubernetes API server for each static Pod.
- The mirror Pods for static Pods running on a node are visible on the API server, but cannot be controlled from there.
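A static Pod is defined by placing an ordinary Pod manifest in the kubelet's static Pod directory; the path shown is a common default, but the actual location depends on the kubelet's staticPodPath configuration, and the name and image are placeholders:

```yaml
# e.g. /etc/kubernetes/manifests/static-web.yaml (path depends on kubelet configuration)
apiVersion: v1
kind: Pod
metadata:
  name: static-web           # the mirror Pod's name is suffixed with the node hostname
spec:
  containers:
    - name: web
      image: nginx:1.25      # example image
      ports:
        - containerPort: 80
```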
Container Probes
- A probe is a diagnostic performed periodically by the kubelet on a container.
- The kubelet can perform diagnostics through different actions:
- ExecAction (performed with the help of the container runtime)
- TCPSocketAction (checked directly by the kubelet)
- HTTPGetAction (checked directly by the kubelet)
Kubernetes Pod Lifecycle
Pod Lifetime
- Pods are ephemeral entities, created once, assigned a unique ID (UID), and scheduled to a node until terminated or deleted.
- If a Node fails, Pods on the node are scheduled for deletion after a timeout period.
- A Pod isn't rescheduled to a different node if it fails. Instead, a new Pod can replace it with a different UID.
- Objects tied to a Pod's lifetime, like a volume, are also destroyed when the Pod is deleted.
Pod Phases
- A Pod moves through several phases in its lifecycle:
- Pending: Accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run; this includes time spent waiting to be scheduled and downloading container images.
- Running: Bound to a node, all containers have been created, and at least one container is still running.
- Succeeded: All containers in the Pod terminated successfully, will not be restarted.
- Failed: All containers in the Pod terminated, with at least one failure (either non-zero exit status or system termination).
- Unknown: Due to an error, the state of the Pod couldn't be obtained.
Pod and Container States
- Each container in a Pod has three possible states: Waiting, Running, and Terminated.
- A container's lifecycle can be influenced by hooks that trigger events at certain points.
Waiting
- If a container is not Running or Terminated, it's in the Waiting state. Reasons for Waiting might include pulling the container image or applying Secret data.
Running
- A Running container is executing without issues. If a postStart hook was configured, it's already executed and finished.
Terminated
- A Terminated container either ran to completion or failed. If a preStop hook was configured, it runs before the container enters the Terminated state.
Container Restart Policy
- A Pod's spec contains a `restartPolicy` field. Values include "Always" (default), "OnFailure", and "Never".
- `restartPolicy` only refers to restarts of the containers by the kubelet on the same node.
- After containers exit, the kubelet restarts them with an exponential back-off delay, capped at five minutes.
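In a manifest, the policy is a single field on the Pod spec, as in this sketch (name, image, and command are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: retry-on-failure
spec:
  restartPolicy: OnFailure            # restart containers only when they exit non-zero
  containers:
    - name: task
      image: busybox:1.36
      command: ["sh", "-c", "exit 1"] # fails, so the kubelet restarts it with back-off
```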
Pod Conditions
- A Pod has a PodStatus, which has an array of PodConditions.
- A Pod's conditions include PodScheduled, ContainersReady, Initialized, Ready, and PodHasNetwork (alpha feature); the kubelet manages most of these, while the scheduler sets PodScheduled.
Pod Readiness
- Pod readiness is a feature that allows your application to provide additional signals to the PodStatus.
- It uses `readinessGates` in the Pod's spec to list extra conditions that the kubelet checks for Pod readiness.
- The status of these conditions is extracted from the `status.conditions` field of the Pod.
- If the condition isn't found, it defaults to "False".
- The condition names should conform to the Kubernetes label key format.
- Custom conditions and readiness of all containers are required for a Pod to be considered ready.
- If all containers are ready, but at least one custom condition is missing or False, the kubelet sets the Pod's condition to ContainersReady.
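A sketch of a Pod using a readiness gate; the condition type shown is hypothetical and would be set to True by an external controller patching the Pod's status:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gated-pod
spec:
  readinessGates:
    - conditionType: "example.com/feature-ready"   # hypothetical custom condition
  containers:
    - name: app
      image: nginx:1.25                            # example image
```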
Pod Network Readiness
- After being scheduled on a node and admitted by the Kubelet, a Pod needs to have volumes mounted and network configuration set up.
- The `PodHasNetworkCondition` feature gate allows the kubelet to report the Pod's network initialization status.
- The `PodHasNetwork` condition can be set to False in early or later stages of the Pod's lifecycle under certain circumstances.
- This condition is set to True after successful sandbox creation and network configuration.
Pod Scheduling Readiness
- This feature is currently in alpha and additional information can be found in the Pod Scheduling Readiness documentation.
Container Probes
- Probes are diagnostics performed periodically by the kubelet on a container.
- There are four types of probe check mechanisms: `exec`, `grpc`, `httpGet`, and `tcpSocket`.
- Each probe has three possible outcomes: Success, Failure, Unknown.
- The kubelet can perform and react to three kinds of probes on running containers: `livenessProbe`, `readinessProbe`, and `startupProbe`.
Liveness Probe
- It indicates whether the container is running.
- If it fails, the kubelet kills the container and the container follows its restart policy.
- Default state is Success if a liveness probe is not provided.
Readiness Probe
- It indicates whether the container is ready to respond to requests.
- If it fails, the endpoints controller removes the Pod's IP address from all matching Services' endpoints.
- The default state of readiness before the initial delay is Failure. If a readiness probe is not provided, the default state is Success.
Startup Probe
- It indicates whether the application within the container has started.
- All other probes are disabled if a startup probe is provided, until it succeeds.
- If it fails, the kubelet kills the container and the container follows its restart policy.
- Default state is Success if a startup probe is not provided.
When to Use Probes
- Liveness probe: Useful if you want your container to be killed and restarted if a probe fails.
- Readiness probe: Useful to start sending traffic to a Pod only when a probe succeeds, or for taking the container down for maintenance.
- Startup probe: Useful for Pods that have containers that take a long time to start up.
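A sketch combining the three probe kinds on one container; the image, paths, ports, and thresholds are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probed-app
spec:
  containers:
    - name: app
      image: nginx:1.25            # example image serving HTTP on port 80
      ports:
        - containerPort: 80
      startupProbe:                # gives a slow-starting app up to 30 * 5s to come up
        httpGet:
          path: /
          port: 80
        failureThreshold: 30
        periodSeconds: 5
      livenessProbe:               # restart the container if this keeps failing
        httpGet:
          path: /
          port: 80
        periodSeconds: 10
      readinessProbe:              # remove the Pod from Service endpoints while failing
        httpGet:
          path: /
          port: 80
        periodSeconds: 5
```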
Note
- When a Pod is deleted, it puts itself into an unready state regardless of whether a readiness probe exists.
- It stays in this state while waiting for the containers in the Pod to stop.
Kubernetes Pod Termination Process
Pods in Kubernetes represent processes running on nodes. It's crucial to allow these processes to terminate gracefully instead of abruptly stopping them with a KILL signal. The key steps involved in this process are:
- Requesting Deletion: When you request a Pod's deletion, the cluster records and starts tracking the grace period before the Pod can be forcefully killed.
- Graceful Shutdown: The container runtime typically sends a TERM signal to each container's main process. If the grace period expires and processes are still running, a KILL signal is sent, and the Pod is deleted from the API Server.
- Process Interruption Handling: If the kubelet or container runtime's management service restarts during process termination, the cluster retries from the start with the full original grace period.
Deletion Flow Example:
- Request Deletion: Delete a specific Pod using the `kubectl` tool with a default grace period of 30 seconds.
- Pod Status Update: The API server updates the Pod with a "dead" status and grace period. If checked with `kubectl describe`, the Pod shows up as "Terminating".
- Local Shutdown Process: The kubelet on the node starts the local Pod shutdown process once it detects the Pod as terminating.
- PreStop Hook: If a preStop hook is defined in any of the Pod's containers, the kubelet runs it. If it's still running after the grace period, the kubelet requests a grace period extension of 2 seconds.
- TERM Signal: The kubelet triggers the container runtime to send a TERM signal to process 1 in each container.
- Service Interruption: The control plane evaluates whether to remove the shutting-down Pod from EndpointSlice (and Endpoints) objects. ReplicaSets and other resources no longer treat the shutting-down Pod as a valid, in-service replica.
- Forcible Shutdown: When the grace period expires, the kubelet triggers forcible shutdown. Any remaining processes in any container in the Pod receive SIGKILL. The kubelet also cleans up a hidden pause container if the container runtime uses one.
- Pod Transition: The kubelet transitions the Pod into a terminal phase (Failed or Succeeded) depending on the end state of its containers.
- Forcible Removal: The kubelet triggers forcible removal of the Pod object from the API server by setting the grace period to 0 for immediate deletion.
- Deletion: The API server deletes the Pod's API object, rendering it invisible from any client.
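A sketch of a Pod that participates in graceful termination through a preStop hook and an extended grace period; the name, image, command, and timings are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo
spec:
  terminationGracePeriodSeconds: 60          # overrides the 30-second default
  containers:
    - name: app
      image: nginx:1.25                      # example image
      lifecycle:
        preStop:                             # runs before the TERM signal reaches the container
          exec:
            command: ["sh", "-c", "sleep 10"]   # e.g. time to drain in-flight requests
```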
Forced Pod Termination
Forced deletions can be potentially disruptive. By default, all deletions are graceful within 30 seconds. However, you can override this default with the `--grace-period=<seconds>` option in the `kubectl delete` command.
Setting the grace period to 0 forcibly and immediately deletes the Pod from the API server and triggers immediate cleanup by the kubelet on the node.
Garbage Collection of Pods
For failed Pods, their API objects remain in the cluster's API until explicitly removed. The Pod garbage collector (PodGC) in the control plane cleans up terminated Pods when the number of Pods exceeds the configured threshold. PodGC also cleans up Pods that are orphan Pods, unscheduled terminating Pods, or terminating Pods bound to a non-ready node tainted with `node.kubernetes.io/out-of-service`.