K8s Summary - Nodes
Communication between Nodes and the Control Plane
- Kubernetes follows a "hub-and-spoke" API pattern where all API usage from nodes (or the pods they run) terminates at the API server. The API server is designed to listen for remote connections on a secure HTTPS port (typically 443) with client authentication and authorization enabled.
- Nodes must have the cluster's public root certificate, along with valid client credentials, so that they can connect securely to the API server.
- Pods can connect securely to the API server using a service account, which allows Kubernetes to inject the public root certificate and a valid bearer token into the pod when it is instantiated.
- Control plane components also communicate with the API server over the secure port, so the connections from nodes, and from the pods running on them, to the control plane are secure by default.
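As a quick check of the credentials described above (the pod name mypod is hypothetical), the service account's root certificate and bearer token are visible inside any running pod at a well-known mount path:

kubectl exec mypod -- ls /var/run/secrets/kubernetes.io/serviceaccount/
# expected entries: ca.crt  namespace  token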
Control Plane to Node Communication
There are two primary communication paths from the control plane to the nodes:
- From the API server to the kubelet process which runs on each node.
- From the API server to any node, pod, or service through the API server's proxy functionality.
- The connections from the API server to the kubelet are used for fetching logs for pods, attaching to running pods, and providing the kubelet's port-forwarding functionality.
- These connections terminate at the kubelet's HTTPS endpoint, but the API server does not verify the kubelet's serving certificate by default, which makes the connection subject to man-in-the-middle attacks and unsafe to run over untrusted or public networks.
- To secure this, use the --kubelet-certificate-authority flag to provide the API server with a root certificate bundle to verify the kubelet's serving certificate. Alternatively, use SSH tunneling between the API server and kubelet to avoid connecting over an untrusted or public network.
- The connections from the API server to a node, pod, or service default to plain HTTP and are neither authenticated nor encrypted. They can run over a secure HTTPS connection by prefixing https: to the node, pod, or service name in the API URL, but they will not validate the certificate provided by the HTTPS endpoint nor provide client credentials.
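As a hedged sketch of the two options above (the certificate path, pod name, port, and healthz path are assumptions, not defaults):

kube-apiserver --kubelet-certificate-authority=/etc/kubernetes/pki/kubelet-ca.crt ...   # verify the kubelet's serving certificate
kubectl get --raw "/api/v1/namespaces/default/pods/https:mypod:8443/proxy/healthz"      # proxy to a pod over HTTPS via the https: prefix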
SSH Tunnels
- Kubernetes supports SSH tunnels to protect the control plane to nodes communication paths. However, SSH tunnels are currently deprecated.
Konnectivity Service
- Introduced in Kubernetes v1.18 (beta), the Konnectivity service provides a TCP level proxy for the control plane to cluster communication, replacing SSH tunnels. The service comprises the Konnectivity server in the control plane network and the Konnectivity agents in the nodes network. The Konnectivity agents initiate and maintain connections to the Konnectivity server, and all control plane to nodes traffic goes through these connections after enabling the Konnectivity service.
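As a rough sketch of how this is enabled (the file path is an assumption), the API server is started with an egress selector configuration that routes cluster-bound traffic through the Konnectivity server:

kube-apiserver ... --egress-selector-config-file=/etc/kubernetes/egress-selector-configuration.yaml
# the referenced file is an EgressSelectorConfiguration that points the API server at the
# Konnectivity server, typically over a local Unix domain socket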
Controllers
- Controllers in Kubernetes are control loops that observe the state of the cluster and make necessary changes to bring the current state closer to the desired state.
Controller Pattern
- A controller tracks at least one Kubernetes resource type. The spec field in these objects represents the desired state.
- Controllers make the current state come closer to the desired state. They can either perform the action themselves or send messages to the API server that create useful side effects.
Control via API Server
- Built-in controllers, like the Job controller, manage state by interacting with the cluster API server.
- Job is a resource that runs a Pod or several Pods to carry out a task and then stop.
- When the Job controller sees a new task, it ensures that the right number of Pods are running to get the work done. It doesn't run any Pods or containers itself but tells the API server to create or remove Pods.
- Once a Job is done, the Job controller updates that Job object to mark it Finished.
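A small way to watch this pattern in action (the Job name and image are arbitrary examples): create a Job and observe the Pods the Job controller asks the API server to create.

kubectl create job hello --image=busybox -- echo "hello"
kubectl get pods -l job-name=hello   # Pods created on the Job's behalf
kubectl get job hello                # COMPLETIONS shows 1/1 once the controller marks the Job finished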
Direct Control
- Some controllers make changes to things outside the cluster. For example, a controller that ensures there are enough Nodes in your cluster needs to set up new Nodes when needed.
- Such controllers get their desired state from the API server and then communicate directly with an external system to align the current state.
Desired versus Current State
- Kubernetes can handle constant change. It doesn't matter if the overall state is stable or not as long as the controllers for your cluster are running and able to make useful changes.
Design
- Kubernetes uses many controllers, each managing a particular aspect of cluster state.
- It's better to have simple controllers rather than a monolithic set of interlinked control loops. Controllers can fail, and Kubernetes is designed to handle that.
- There can be several controllers that create or update the same kind of object. Kubernetes controllers make sure they only pay attention to the resources linked to their controlling resource.
Ways of Running Controllers
- Kubernetes comes with a set of built-in controllers that run inside the kube-controller-manager (see the example after this list).
- You can also find controllers that run outside the control plane to extend Kubernetes.
- You can run your own controller as a set of Pods, or externally to Kubernetes, depending on what that particular controller does.
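For example, on a kubeadm-provisioned cluster (the component label is a kubeadm convention, not a guarantee), the built-in controllers appear as a single control plane pod:

kubectl -n kube-system get pods -l component=kube-controller-manager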
Leases
- Kubernetes uses Lease objects in the coordination.k8s.io API Group to lock shared resources and coordinate activity in distributed systems.
- Node heartbeats: Each Node has a matching Lease object in the kube-node-lease namespace. The kubelet updates the spec.renewTime field in this Lease object to communicate its heartbeat to the Kubernetes API server.
- Leader election: Control plane components like kube-controller-manager and kube-scheduler use Leases to ensure that only one instance of a component is running at a time in HA configurations.
- API server identity: From Kubernetes v1.26, each kube-apiserver uses Leases to publish its identity. This allows clients to discover how many instances of kube-apiserver are operating the control plane.
- API server identity leases can be inspected in the kube-system namespace using kubectl, as shown after this list.
- API server identity leases are named using a SHA256 hash based on the OS hostname as seen by that API server. Each kube-apiserver should be configured to use a unique hostname within the cluster.
- API server identity leases from kube-apiservers that no longer exist are garbage collected by new kube-apiservers after 1 hour.
- You can disable API server identity leases by disabling the APIServerIdentity feature gate.
- Custom workloads can define their own use of Leases, for example, to elect a leader in a custom controller. The Lease name should be obviously linked to the product or component.
- Care should be taken to avoid name collisions for Leases when multiple instances of a component could be deployed.
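The Leases mentioned above can be listed with kubectl (replace <node-name> with one of your nodes):

kubectl -n kube-node-lease get leases                      # one Lease per node, renewed by its kubelet as a heartbeat
kubectl -n kube-system get leases                          # includes leader-election and kube-apiserver identity Leases
kubectl -n kube-node-lease get lease <node-name> -o yaml   # spec.renewTime records the most recent heartbeat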
Cloud Controller Manager
- The cloud-controller-manager is a Kubernetes control plane component that adds cloud-specific control logic. It allows you to link your cluster into your cloud provider's API, and separates components that interact with the cloud platform from those that only interact with the cluster.
- By decoupling the interoperability logic between Kubernetes and the underlying cloud infrastructure, the cloud controller manager (CCM) enables cloud providers to release features independently of the main Kubernetes project.
- CCM is structured using a plugin mechanism that allows different cloud providers to integrate their platforms with Kubernetes.
- The cloud controller manager runs as a replicated set of processes in the control plane (usually containers in Pods). It can also be run as a Kubernetes addon.
CCM includes the following controllers:
- Node Controller: Updates Node objects when new servers are created in your cloud infrastructure. It annotates and labels the Node object with cloud-specific information, obtains the node's hostname and network addresses, and verifies the node's health.
- Route Controller: Configures routes in the cloud to ensure that containers on different nodes in the Kubernetes cluster can communicate with each other. It might also allocate blocks of IP addresses for the Pod network.
- Service Controller: Interacts with your cloud provider's APIs to set up load balancers and other infrastructure components when you declare a Service resource that requires them.
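As a concrete illustration of the service controller (the Deployment name and image are arbitrary examples), expose a Deployment as a LoadBalancer Service and watch the cloud load balancer appear:

kubectl create deployment web --image=nginx
kubectl expose deployment web --type=LoadBalancer --port=80
kubectl get service web --watch   # EXTERNAL-IP stays <pending> until the service controller provisions a load balancer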
The access required by CCM on various API objects includes:
- Node Controller: Full access to read and modify Node objects.
- Route Controller: Get access to Node objects.
- Service Controller: List and watch access to Services, plus patch and update access to modify them. It also requires create, list, get, watch, and update access to set up Endpoints resources for the Services.
- Others: CCM requires access to create Event objects and ServiceAccounts for secure operation.
cgroup v2
- What are cgroups? Control groups (cgroups) are a Linux feature that limits and allocates resources (such as CPU time, system memory, network bandwidth, or combinations of these) among user-defined groups of processes.
cgroup v2: cgroup v2 is the new generation of the cgroup API. It is a unified control system with enhanced resource management capabilities. cgroup v2 offers several improvements over cgroup v1:
- Unified hierarchy design in API
- Safer sub-tree delegation to containers
- New features like Pressure Stall Information
- Enhanced resource allocation management and isolation
- Unified accounting for different types of memory allocations
- Accounting for non-immediate resource changes
- Kubernetes and cgroup v2: cgroup v2 support is generally available since Kubernetes v1.25 and is used for enhanced resource management and isolation when the node runs cgroup v2. Some features, like MemoryQoS, rely exclusively on cgroup v2 primitives.
Using cgroup v2: It's recommended to use a Linux distribution that enables and uses cgroup v2 by default. To use cgroup v2, you need:
- An OS distribution that enables cgroup v2
- Linux Kernel version 5.8 or later
- A container runtime that supports cgroup v2 (e.g., containerd v1.4+ or cri-o v1.20+)
- The kubelet and the container runtime configured to use the systemd cgroup driver
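A few hedged spot checks for the requirements above (the containerd and kubelet config paths are common defaults, not guarantees):

uname -r                                          # kernel should be 5.8 or newer
grep SystemdCgroup /etc/containerd/config.toml    # containerd: expect SystemdCgroup = true
grep cgroupDriver /var/lib/kubelet/config.yaml    # kubelet: expect cgroupDriver: systemd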
Linux Distribution cgroup v2 support: Here are some Linux distributions that support cgroup v2:
- Container Optimized OS (since M97)
- Ubuntu (since 21.10, 22.04+ recommended)
- Debian GNU/Linux (since Debian 11 bullseye)
- Fedora (since 31)
- Arch Linux (since April 2021)
- RHEL and RHEL-like distributions (since 9)
- Migrating to cgroup v2: To migrate to cgroup v2, ensure that you meet the requirements, then upgrade to a kernel version that enables cgroup v2 by default. The kubelet will automatically detect cgroup v2 and no additional configuration is required.
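If your distribution still defaults to cgroup v1, one common approach is to switch the kernel to the unified hierarchy and reboot (shown here with grubby; adapt the step to your bootloader):

grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
# reboot, then re-check with the stat command described below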
- Identifying the cgroup version: To check which cgroup version your distribution uses, run stat -fc %T /sys/fs/cgroup/ on the node. The output is cgroup2fs for cgroup v2 and tmpfs for cgroup v1.
- Compatibility: cgroup v2 uses a different API than cgroup v1, so applications that directly access the cgroup file system need to be updated to support cgroup v2. This includes third-party monitoring and security agents, standalone cAdvisor, and Java applications.
Container Runtime Interface (CRI)
- Container Runtime Interface (CRI): CRI is a plugin interface that allows the kubelet to use a variety of container runtimes without needing to recompile the cluster components. Every Node in the cluster requires a working container runtime for the kubelet to launch Pods and their containers.
- Communication: CRI is the main protocol for communication between the kubelet and the Container Runtime. This communication is defined by a gRPC protocol.
- The API: As of Kubernetes v1.23 (stable), the kubelet acts as a client when connecting to the container runtime via gRPC. The runtime and image service endpoints must be available in the container runtime, and can be configured separately within the kubelet using the --image-service-endpoint and --container-runtime-endpoint command-line flags.
- CRI versions: For Kubernetes v1.27, the kubelet prefers to use CRI v1. If a container runtime does not support v1 of the CRI, the kubelet tries to negotiate any older supported version. CRI v1alpha2 is supported but considered deprecated. If the kubelet cannot negotiate a supported CRI version, it does not register as a node.
- Upgrading: When upgrading Kubernetes, the kubelet attempts to automatically select the latest CRI version upon restart. If this fails, fallback occurs as mentioned above. If a gRPC re-dial is required due to an upgrade of the container runtime, the runtime must also support the initially selected version or the re-dial is expected to fail, necessitating a kubelet restart.
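A brief sketch of what this looks like in practice (the containerd socket path is a common default, not universal):

kubelet --container-runtime-endpoint=unix:///run/containerd/containerd.sock ...
crictl --runtime-endpoint unix:///run/containerd/containerd.sock version   # confirms the CRI endpoint the kubelet would use responds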
K8s Garbage Collection
Garbage Collection: Kubernetes uses various mechanisms to clean up cluster resources such as terminated pods, completed jobs, objects without owner references, unused containers and images, dynamically provisioned PersistentVolumes with a StorageClass reclaim policy of Delete, stale or expired CertificateSigningRequests (CSRs), deleted nodes, and Node Lease objects.
- Owners and Dependents: Many Kubernetes objects are linked through owner references. Owner references tell the control plane which objects are dependent on others. Kubernetes uses owner references to ensure related resources are cleaned up before deleting an object. Cross-namespace owner references are disallowed by design.
Cascading Deletion: When an object is deleted, Kubernetes can automatically delete the object's dependents, known as cascading deletion. Two types of cascading deletion exist:
- Foreground Cascading Deletion: The owner object first enters a deletion in progress state, then dependents are deleted, and finally, the owner object is deleted.
- Background Cascading Deletion: The owner object is deleted immediately, and the dependents are cleaned up in the background.
- Orphaned Dependents: If you tell Kubernetes to orphan a deleted owner's dependents, the dependents left behind are called orphan objects. By default, Kubernetes deletes dependents rather than orphaning them.
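kubectl exposes these deletion behaviors through the --cascade flag (the Deployment name is just an example):

kubectl delete deployment nginx --cascade=foreground   # wait for dependents (ReplicaSets, Pods) to be deleted first
kubectl delete deployment nginx --cascade=background   # the default: delete the owner now, clean up dependents later
kubectl delete deployment nginx --cascade=orphan       # leave the dependents behind as orphans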
- Garbage Collection of Unused Containers and Images: The kubelet performs garbage collection on unused images every five minutes and on unused containers every minute. Configuration options for this garbage collection can be set using the KubeletConfiguration resource type.
- Container Image Lifecycle: Kubernetes manages the lifecycle of all images through its image manager, part of the kubelet, with the cooperation of cadvisor. Disk usage limits guide garbage collection decisions.
- Container Garbage Collection: The kubelet garbage collects unused containers based on certain variables like MinAge, MaxPerPodContainer, and MaxContainers.
- Configuring Garbage Collection: Garbage collection of resources can be configured by tuning options specific to the controllers managing those resources.
To delve deeper into garbage collection in Kubernetes, you can explore topics such as configuring cascading deletion of Kubernetes objects, configuring cleanup of finished Jobs, learning more about ownership of Kubernetes objects, Kubernetes finalizers, and the TTL controller that cleans up finished Jobs.