Technology Blog: MainConcept

Tuning a Kubernetes cluster for running video encoders

Written by Jens Schneider & Max Bläser | Nov 20, 2024

Optimizing a Kubernetes cluster for running video encoders—or other long-running, computationally demanding workloads—can boost encoding speed by up to 10%. This can be achieved by fine-tuning Kubernetes CPU Manager policies and using features such as CPU pinning and node-level NUMA topology alignment. Furthermore, parallelizing video encoding and correctly configuring threading will optimize the overall throughput of the cluster. In this article, we explore how adjusting these settings impacts performance and provide practical guidelines that can also be applied to your setup.

Introduction

Running video encoders on Kubernetes—whether on a managed cloud service such as Amazon Web Services (AWS), Microsoft Azure or Google Cloud Platform (GCP), or in a private cloud—has become a common approach to scaling video transcoding workflows. As we demonstrated in our previous blog post, setting up a Dask cluster on Kubernetes, combined with an encoder binary, is a straightforward way to run encoder comparisons, VOD transcoding or any other parallelized computation.

Kubernetes is a versatile orchestration platform capable of managing various workloads, with one of its main applications being microservices and web applications that can easily scale using features such as the Horizontal Pod Autoscaler. To efficiently utilize the underlying hardware, Kubernetes relies on resource requests and limits. These control the amount of CPU and memory allocated to each container, and additionally influence where the workload is scheduled within the cluster. However, for specific workloads such as video encoders, the behaviour can differ significantly compared to running directly on bare metal.

Video encoders are unique in that they are typically long-running, multi-threaded, NUMA-aware and memory-intensive. Fortunately, Kubernetes provides advanced configuration options through its CPU Manager, which can be tuned for such demanding applications. In this article, we assume you already have a Kubernetes cluster running and have access to modify the Kubelet. If not, we highly recommend having a look at k3s, which is a fully compliant and lightweight Kubernetes distribution.

The basics of Kubernetes Resource Management

Kubernetes manages and allocates computing resources for containers in a cluster using resource requests and limits:

  • Requests specify the minimum amount of resources guaranteed for a container
  • Limits specify the maximum resources a container can use

Kubernetes schedules containers based on their requests and enforces limits to prevent resource contention.

In Kubernetes, CPU resources are measured in millicores, where 1 CPU core equals 1000 millicores (1000m). One CPU unit corresponds to a logical CPU core or virtual core, depending on whether the node is a physical machine or a virtual machine. The use of millicores allows Kubernetes to support fractional CPU requests and limits.

But how can an application use, for example, 500 millicores (or 0.5 CPUs)? Kubernetes employs a time-sharing model for CPU usage based on the Linux kernel's Completely Fair Scheduler (CFS) and control groups (cgroups). While the specific mechanics of this model are not crucial for this discussion, you can refer to the article "Making Sense of Kubernetes CPU Requests and Limits" hyperlinked at the end of this article for more details. It must also be noted that CFS will be replaced by Earliest Eligible Virtual Deadline First (EEVDF) scheduling, starting from version 6.6 of the Linux kernel. We do not know how this change will impact Kubernetes in the long run (ideally not at all, since Kubernetes operates entirely in user land) but if it does have an impact, send us an email or a comment.

The key takeaway is that in Kubernetes, workloads are throttled according to their resource limits. This throttling can lead to undesirable behaviour, particularly for long-running, multi-threaded applications such as video encoders. By default, when a container is throttled (i.e., its threads are temporarily suspended), those threads can be rescheduled on any available CPU once they resume. This causes frequent thread migration between CPUs and additional context switches, which hurts cache locality and overall performance.
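To make the time-sharing model more tangible, the following sketch shows how a millicore limit translates into a CFS quota and how throttling can be observed from inside a container. This is only an illustration; the cgroup paths assume cgroup v2.

# Rough sketch: how a Kubernetes CPU limit maps to a CFS quota (cgroup v2 assumed)
CFS_PERIOD_US = 100_000  # default CFS period of 100 ms

def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time in microseconds the cgroup may consume per period."""
    return int(limit_millicores / 1000 * period_us)

print(cfs_quota_us(500))   # 50000 us -> 50 ms of CPU time per 100 ms period
print(cfs_quota_us(6000))  # 600000 us -> six full cores' worth per period

# Inside a container, the enforced limit and the throttling counters
# (nr_throttled, throttled_usec) are exposed by the kernel:
for path in ("/sys/fs/cgroup/cpu.max", "/sys/fs/cgroup/cpu.stat"):
    try:
        with open(path) as f:
            print(path, "->", f.read().strip())
    except FileNotFoundError:
        pass  # not running under cgroup v2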

In the next section, we will explore optimizations for Kubernetes, such as CPU pinning and NUMA topology alignment, which address these issues and improve throughput for workloads such as video encoders. After that, we will examine how to best utilize the hardware and maximize parallelized encoding performance.

Optimizing Kubernetes CPU management

As mentioned above, a fundamental problem with the default behaviour of the Kubernetes CPU Manager is that stopped threads may be scheduled on different CPUs once they are resumed. To make matters worse, threads of the same process could even be scheduled on CPUs in different NUMA nodes. A NUMA node (Non-Uniform Memory Access node) refers to a memory and processor architecture where memory is divided into separate regions, each closely associated with a specific group of processors. In a NUMA system, memory access time varies depending on whether the memory is local (attached to the same node as the CPU accessing it) or remote (attached to a different node).

Fortunately, the Kubernetes CPU Manager can be tuned for specific workloads and hardware architectures. For our setup using nodes composed of 2x32-core AMD EPYC 7513 CPUs with 128 logical cores, we found the AMD EPYC 7003 Kubernetes Tuning Guide to be invaluable for finding better settings, which we will briefly discuss in the following sections. Note that a similar CPU Pinning and Isolation in Kubernetes* Technology Guide exists for Intel architectures.

CPU Manager policies

The default (or none) policy of the CPU Manager provides no CPU affinity (assigning or pinning a process to one or more specific CPUs) beyond what the OS scheduler does automatically. With this setting, limits on CPU usage for Guaranteed pods and Burstable pods are enforced using the CFS quota.

Setting the policy to static allows containers in Guaranteed pods with integer CPU requests exclusive access to CPUs on the Kubernetes node. This exclusivity is enforced using the cpuset cgroup controller. The CPU Manager policy is set with the --cpu-manager-policy Kubelet flag or the cpuManagerPolicy field in the KubeletConfiguration.

The number of exclusively allocatable CPUs is equal to the total number of CPUs of the node minus any CPU reservations made by the Kubelet via --kube-reserved or --system-reserved. The CPU reservation list can also be specified explicitly with the Kubelet's --reserved-cpus option, which takes precedence over the reservations made via --kube-reserved and --system-reserved. CPUs reserved by these options are taken, in integer quantity, from the initial shared pool in ascending order by physical core ID. This shared pool is the set of CPUs on which any containers in BestEffort and Burstable pods run. Containers in Guaranteed pods with fractional CPU requests also run on CPUs in the shared pool. Only containers that are both part of a Guaranteed pod and have integer CPU requests are assigned exclusive CPUs.
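As a point of reference, only a container whose pod has Guaranteed QoS (requests equal to limits) and whose CPU value is an integer receives exclusive CPUs under the static policy. Expressed as the resources stanza we later patch into our Dask worker spec (the values here are just examples):

# Eligible for exclusive CPUs under the static policy: Guaranteed QoS and an
# integer CPU value. A fractional value such as "5500m" (or requests != limits)
# keeps the container in the shared pool.
guaranteed_resources = {
    "requests": {"cpu": "6", "memory": "8G"},
    "limits":   {"cpu": "6", "memory": "8G"},
}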

A closer look at our setup

For our AMD nodes, we chose to reserve 2 logical CPUs (1 physical core) for the Kubelet. This reservation may be larger or smaller, depending on your needs. Note that when specifying the reserved CPUs—at least for AMD EPYC processors—logical cores on the same physical core do not have consecutive IDs. You can easily inspect the layout of your system, for example using lstopo --of ascii. For a k3s cluster, the Kubelet can now be configured either by appending CPU Manager configuration options during installation or by specifying a configuration YAML at /etc/rancher/k3s/config.yaml.
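If you prefer a scriptable check over lstopo, the SMT sibling pairs can also be read directly from sysfs. The following is a minimal sketch using the standard Linux topology files; run it directly on the node:

# Collect the SMT sibling pairs so the reserved-cpus list can be chosen correctly
from pathlib import Path

siblings = set()
for cpu_dir in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
    sibling_file = cpu_dir / "topology" / "thread_siblings_list"
    if sibling_file.exists():
        siblings.add(sibling_file.read_text().strip())

for pair in sorted(siblings, key=lambda s: int(s.split(",")[0].split("-")[0])):
    print(pair)

On our nodes this prints pairs such as 0,64 and 63,127, which is why the reserved CPU list in the configuration below is "63,127" rather than two consecutive IDs.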

When installing a k3s node:

TOKEN=... # insert k3s token 
URL=... # URL of control plane node 
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.3+k3s1 K3S_URL=${URL} K3S_TOKEN=${TOKEN} sh -s - --kubelet-arg=cpu-manager-policy=static --kubelet-arg=kube-reserved=cpu=2000m,memory=8G --kubelet-arg=reserved-cpus="63,127"

Alternatively, specify a configuration YAML at /etc/rancher/k3s/config.yaml as follows:

kubelet-arg: 
    - "cpu-manager-policy=static" 
    - "kube-reserved=cpu=2000m,memory=8G" 
    - "reserved-cpus=63,127" 

Validation of the settings

To evaluate the impact of the CPU Manager changes, we saturated a Kubernetes node with 100 encoding jobs (each using 6 threads) running the MainConcept HEVC/H.265 Video Encoder at performance level 10 and measured the frames-per-second (FPS) encoding speed. The box plot on the left in the following image visualizes the results:

As shown, the average encoding speed improves significantly when switching from the none to the static CPU Manager policy. However, there are notable outliers in the data where the encoding speed is either much lower or higher than the average. To investigate further, we analyzed the encoding processes with the psutil Python library. The bar plot on the right shows the actual CPUs used by the encoding processes.
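A minimal sketch of such an analysis with psutil might look like the following. The process name sample_enc_hevc is an assumption here; per-thread placement can be derived analogously from the tasks under /proc/<pid>/task.

import time
from collections import defaultdict

import psutil

observed_cpus = defaultdict(set)  # pid -> set of logical CPUs seen

# cpu_num() reports the CPU a process last ran on, so repeated sampling
# approximates the set of CPUs an encoding actually used. With the static
# policy, proc.cpu_affinity() additionally shows the pinned cpuset.
for _ in range(120):
    for proc in psutil.process_iter(["pid", "name"]):
        if proc.info["name"] == "sample_enc_hevc":  # assumed encoder process name
            try:
                observed_cpus[proc.info["pid"]].add(proc.cpu_num())
            except psutil.NoSuchProcess:
                pass
    time.sleep(0.5)

for pid, cpus in observed_cpus.items():
    print(pid, sorted(cpus))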

With the none policy, as expected, the 6 threads of each encoder were scheduled randomly across all available logical cores, shown as [0...126]. For the static policy, we highlighted only the two most extreme outliers at both ends of the performance spectrum in the bar plot. In the worst-performing case, the threads were assigned to logical cores [0, 1, 62, 64, 65, 126]. By inspecting the topology of the system using the output of lstopo, it becomes clear that (0, 64), (1, 65) and (62, 126) each correspond to a single physical core. However, (0, 64) and (1, 65) are on the first NUMA node, while (62, 126) is on the second NUMA node.

In contrast, the best-performing encoding had its threads scheduled to [56, 57, 58, 120, 121, 122], which mapped to the three physical cores (56, 120), (57, 121) and (58, 122). These cores are not only on the same NUMA node, but also share the same L3 cache, contributing to improved performance.

Introducing NUMA-aware scheduling

While the average performance improved, we still wanted to eliminate outliers and achieve more consistent encoding speeds. One way to do this was by preventing scheduling across multiple NUMA nodes. The Kubernetes CPU Manager provides additional settings to enable this:

TOKEN=... # insert k3s token
URL=... # URL of control plane node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.3+k3s1 K3S_URL=${URL} K3S_TOKEN=${TOKEN} sh -s - --kubelet-arg=cpu-manager-policy=static --kubelet-arg=kube-reserved=cpu=2000m,memory=8G --kubelet-arg=reserved-cpus="63,127" --kubelet-arg=cpu-manager-policy-options=full-pcpus-only=true --kubelet-arg=topology-manager-policy=restricted

Or, equivalently, in the configuration YAML at /etc/rancher/k3s/config.yaml:

kubelet-arg:
    - "cpu-manager-policy=static"
    - "kube-reserved=cpu=2000m,memory=8G"
    - "reserved-cpus=63,127"
    - "cpu-manager-policy-options=full-pcpus-only=true"
    - "topology-manager-policy=restricted"

If the full-pcpus-only policy option is specified, the static policy will always allocate full physical cores. By default (without this option), the static policy allocates CPUs using a topology-aware best-fit allocation. In addition, the Topology Manager aligns CPU and other resource allocations with the NUMA topology of the node. It supports four different allocation policies, which can be set via the Kubelet flag --topology-manager-policy:

  • none (default)
  • best-effort
  • restricted
  • single-numa-node

When using the restricted topology management policy, the Kubelet calls a Hint Provider for every container to discover the resource availability. Using this information, the Topology Manager stores the preferred NUMA node affinity for that container. If the affinity is not preferred, the Topology Manager will reject the pod from the node and the pod will reach a Terminated state with a pod admission failure. The single-numa-node option is even stricter and will cause the Topology Manager to determine if a single NUMA node affinity is possible. If this is not possible, the Topology Manager will also reject the pod from the node with a pod admission failure.
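Once a pod has been admitted, it is easy to verify from inside the container which CPUs were assigned and whether they all belong to the same NUMA node. A small sketch using standard cgroup v2 and sysfs paths:

import os
from pathlib import Path

# CPUs this container may run on (with the static policy, this is the pinned set)
allowed = sorted(os.sched_getaffinity(0))
print("assigned CPUs:", allowed)

# The cpuset as enforced by the cgroup (cgroup v2 path)
cpuset = Path("/sys/fs/cgroup/cpuset.cpus.effective")
if cpuset.exists():
    print("cpuset.cpus.effective:", cpuset.read_text().strip())

# Map each assigned CPU to its NUMA node via the node* entry in sysfs
for cpu in allowed:
    nodes = [p.name for p in Path(f"/sys/devices/system/cpu/cpu{cpu}").glob("node*")]
    print(f"cpu{cpu} ->", nodes[0] if nodes else "unknown")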

Running the encoding experiment again with the additional NUMA-aware scheduling, we get the following result:

 

The average encoding speed is still much higher compared to using the none policy. Although there are still outliers at the lower end, the extreme ones have been eliminated compared to using the static policy alone, and a much more consistent encoding speed can be measured.

Kubernetes resources and video encodings

Now that we have covered the basics of Kubernetes resource management and tuned the cluster for our specific workloads, let us go into more detail. When using a Kubernetes cluster for offline video encoding, the primary goal is usually to maximize throughput, i.e., to encode the most content per computing hour. This is particularly important when encoding assets for a VOD service, where meeting real-time performance for individual assets is not critical. If faster offline encoding of a single asset is needed, it can easily be split into multiple chunks that are encoded in parallel, with the bitstreams concatenated afterwards. By adjusting chunk sizes, this method allows for near-unlimited acceleration.
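As a rough sketch of the chunked approach (assuming ffmpeg is available, and using placeholder file names, chunk lengths and encoder settings), the chunks can be encoded independently and then joined with the ffmpeg concat demuxer. In practice, the chunking must be frame-accurate and GOP-aligned; this only illustrates the workflow:

import subprocess

SOURCE = "source.ts"   # placeholder input
CHUNK_SECONDS = 60     # placeholder chunk length
NUM_CHUNKS = 10        # placeholder: source duration / CHUNK_SECONDS

# 1) Encode the chunks; in practice these commands are distributed across the cluster
for i in range(NUM_CHUNKS):
    subprocess.run([
        "ffmpeg", "-ss", str(i * CHUNK_SECONDS), "-t", str(CHUNK_SECONDS),
        "-i", SOURCE, "-c:v", "libx265", "-threads", "8", f"chunk_{i:03d}.ts",
    ], check=True)

# 2) Concatenate the encoded chunks without re-encoding
with open("chunks.txt", "w") as f:
    for i in range(NUM_CHUNKS):
        f.write(f"file 'chunk_{i:03d}.ts'\n")
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "chunks.txt", "-c", "copy", "output.ts"],
    check=True,
)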

To achieve high throughput for video encoding on Kubernetes, we can take two main steps:

  • Parallelizing encodings
  • Limiting resources per encoding instance

This leads us to three key questions:

  • How many parallel encoding pods should be run per node?
  • How much of the node's resources should be assigned to each encoding pod?
  • Should the number of threads per encoder be optimized?

Remarks on parallelization

The core assumption here is that an encoder's performance, measured in frames-per-second, does not scale linearly with the number of CPU cores (or threads) it has available. While a well-designed encoder can theoretically scale its performance with an increasing number of cores or threads, practical limits exist. More precisely, synchronization between threads introduces overhead, or there may simply not be enough work to parallelize. Therefore, it is often more efficient to run multiple encodings in parallel instead of focusing on maximizing the performance of a single encoding on one machine.

The optimal settings for parallel encodings may, of course, be constrained by the underlying hardware. For instance, if a Kubernetes cluster consists of nodes with 48 logical cores, running 4 encoding pods with a limit of 10 cores each would leave some computing power unused (assuming no other workloads are running on the node). Conversely, running 48 pods, each with a 1-core CPU limit and 8 GB of RAM (e.g. for UHD video encoding), would require a total of 384 GB of memory, which only nodes with substantial memory resources could satisfy. Furthermore, inefficiencies due to I/O bottlenecks could arise, for example when reading from or writing to a network share.
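To make this trade-off concrete, here is a toy calculation with a purely hypothetical, sub-linear scaling curve. The exponent is made up for illustration; real numbers have to be measured, as we do below:

# Hypothetical example: per-encoder FPS scales sub-linearly with the core count
def encoder_fps(cores: int, base_fps: float = 10.0, exponent: float = 0.75) -> float:
    return base_fps * cores ** exponent  # assumed scaling, for illustration only

NODE_CORES = 48  # the example node from the text

for cpu_limit in (2, 4, 6, 8, 12, 48):
    workers = NODE_CORES // cpu_limit
    total_fps = workers * encoder_fps(cpu_limit)
    print(f"{workers:2d} workers x {cpu_limit:2d} cores -> {total_fps:6.1f} FPS total")

Under such a curve, the smallest CPU limit wins on total throughput, while the memory and I/O constraints described above put a practical lower bound on how small the limit can be.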

Configuring the optimal number of threads

Encoders are typically very greedy in terms of resources: they detect the number of CPU cores of the system and scale their thread pools accordingly, trying to make use of as many resources as possible. This is most often the "auto" setting of the encoder and it is fine for many use cases, such as encoding a video after editing on a desktop machine. However, when running encoders on Kubernetes with shared resources and potential interference, things change. For example, even if a container is limited to 8 cores by CPU pinning, the encoder inside the container will detect the total number of cores on the host system (which could be much higher) and scale its thread pool based on that, unaware of the Kubernetes-imposed limits.
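The mismatch is easy to demonstrate from inside a container. A minimal sketch; the last line simply mirrors what a container-aware "auto" mode would ideally do:

import os

# What a naive "auto" mode typically bases its thread pool on:
print("os.cpu_count():", os.cpu_count())  # logical CPUs of the *host*

# What this container is actually allowed to run on (reflects CPU pinning):
allowed = os.sched_getaffinity(0)
print("allowed CPUs:", sorted(allowed), "->", len(allowed), "cores")

# A container-aware thread count for the encoder (hypothetical default):
num_threads = len(allowed)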

This raises an important question: could leaving encoders in "auto" threading mode, while running many encoders in parallel, lead to overcommitment of system resources and unnecessary overhead? Fortunately, most encoders allow for manual threading specification:

ffmpeg -i source.ts -threads 8 ... # works with x264, x265
ffmpeg -i source.ts -c:v omx_enc_hevc -omx_core libomxil_core.so -omx_name OMX.MainConcept.enc_hevc.video -omx_param "num_threads=8" ... # for the MC HEVC FFmpeg Plugin
sample_enc_hevc -num_threads 8 ... # for the MC HEVC Sample Encoder

That is why we designed an experiment to investigate this using the MainConcept HEVC/H.265 Video Encoder. On our Kubernetes cluster, we started a large set of parallel encodings that completely used the capacity of a single node, varied the num_threads settings and measured the resulting FPS for every encoder. Each encoder ran in a container in a pod with a CPU limit of 6 (which was an educated guess for a reasonable trade-off between compute and I/O). To completely saturate the cluster and ensure repeatability, we ran a total of 100 encodings per tested num_threads setting. The result of this experiment is visualized in the following box plot:

 

As can be deduced from the results, the auto setting (num_threads=0) is suboptimal in terms of encoding performance. It is also unsurprising that using fewer threads than the number of assigned cores (num_threads < cpu_limit) did not deliver good performance, because the available computation time simply goes unused. The highest performance in this experiment was achieved when the number of threads was equal to the number of CPU cores. Interestingly, the average performance initially dropped slightly when using more threads than cores (16, 32), then rose again and leveled off for the largest thread counts evaluated (up to 512). The box plot also shows that the FPS measurements are quite noisy and often have significant outliers. With these results in mind, we can continue and optimize for maximum throughput by varying the number of parallel encodings and the CPU limits.

Optimal number of parallel encodings and CPU limit

In this experiment, we tried to find which CPU limit, and hence which number of parallel encodings, maximizes the overall throughput. In the previous experiment, we simply made an educated guess of 6 CPUs (equivalent to 21 parallel encodings on a 128-core Kubernetes node). But what is the optimum: more encodings at slower speed, or fewer encodings running faster? To determine the answer, we can quickly run the required simulations by successively launching multiple Dask clusters, each with different CPU limits and numbers of workers, such as:

from dask.distributed import Client, Future
from dask_kubernetes.operator import make_cluster_spec, KubeCluster
from datetime import datetime, timezone
import subprocess
clusters = [
    {"num_workers":64, "cpu_limit":2},
    {"num_workers":32, "cpu_limit":4},
    {"num_workers":20, "cpu_limit":6},
    {"num_workers":16, "cpu_limit":8},
    {"num_workers":12, "cpu_limit":10},
    {"num_workers":10, "cpu_limit":12},
]
spec = make_cluster_spec(
    name="optimize-resources", image="ghcr.io/dask/dask"
)
spec["spec"]["worker"]["spec"]["containers"][0]["args"] += [
    "--nworkers",
    "1",
    "--nthreads",
    "1",
    "--memory-limit",
    "0",
]

def simulation_run(
    simulation_params: dict
):
    # run the encoding ...
    enc_cmd = [
        "./sample_enc_hevc", 
        "-I420", "-w", "1920", "-h", "1080", "-f", "30.0", "-v", "test.yuv", "-perf", "10"
    ]
    if simulation_params.get("with_io"):
        enc_cmd += ["-o", "test.hevc", "-preview", "rec.yuv"]
    else:
        enc_cmd += ["-o", "/dev/null"]
    started = datetime.now(timezone.utc)
    subprocess.run(enc_cmd)
    finished = datetime.now(timezone.utc)
    simulation_result = {
        # assuming the test sequence contains 600 frames
        "fps": 600 / (finished - started).total_seconds(),
        "started": started,
        "finished": finished
    }
    # dict.update() returns None, so merge first and return the dict itself
    simulation_result.update(simulation_params)
    return simulation_result
    
results = []
for cluster in clusters:
    spec["spec"]["worker"]["replicas"] = cluster.get("num_workers")
    spec["spec"]["worker"]["spec"]["containers"][0]["resources"] = {
        "requests": {"memory": "4G", "cpu": f"{cluster.get('cpu_limit')}"},
        "limits":   {"memory": "4G", "cpu": f"{cluster.get('cpu_limit')}"},
    }

    cluster_object = KubeCluster(custom_cluster_spec=spec, shutdown_on_close=True)
    client = cluster_object.get_client()
    # wait until the requested workers have registered before counting them
    client.wait_for_workers(cluster.get("num_workers"))
    num_active_workers = len(client.scheduler_info()['workers'])

    for with_io in [True, False]:
        futures = []
        # launch enough encodings to keep the cluster busy
        for encoding_run_idx in range(5*num_active_workers):
            future = client.submit(
                simulation_run, 
                {
                    "encoding_run_idx": encoding_run_idx,
                    "cpu_limit": cluster.get('cpu_limit'),
                    "num_active_workers": num_active_workers,
                    "with_io": with_io
                }
            )
            futures.append(future)
        results.append(client.gather(futures))
    client.close()
    cluster_object.close()

As can be seen in the code, we can also simulate the impact of I/O by letting the encoder write out a bitstream and a reconstructed raw video. Note that writing out a reconstructed video is not required when simply performing VOD transcodings. From the collected results, we then plot the per-core performance in terms of achievable FPS throughput:

The result clearly indicates that for maximizing throughput, it makes sense to use many workers (e.g. 63) and a small CPU limit (e.g. 2 cores). Since the performance decreases monotonically with increasing parallelization and the impact of I/O is constant, it can be concluded that I/O and memory bandwidth did not create a bottleneck. Note that this assumption might not hold when highly parallelizing 4K or 8K encodings. Scaling the results up to the 2x32-core AMD EPYC machine we used, we can encode a maximum of approximately 580 frames per second in HD resolution using the MainConcept HEVC/H.265 Video Encoder's performance level 10. If less parallelization is chosen (to encode a single asset or chunk faster), for example using 8 logical cores and threads per encoder, we can still encode about 510 frames per second in HD resolution.
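For completeness, one way to aggregate the gathered results into this per-core view (using pandas, which is not part of the original listing) is sketched here:

import pandas as pd

# Flatten the per-cluster result batches into one table
df = pd.DataFrame([record for batch in results for record in batch])

summary = (
    df.groupby(["num_active_workers", "cpu_limit", "with_io"])["fps"]
      .mean()
      .reset_index()
)
# Throughput of the whole node and normalized per allocated core
summary["total_fps"] = summary["fps"] * summary["num_active_workers"]
summary["fps_per_core"] = summary["fps"] / summary["cpu_limit"]
print(summary.sort_values("fps_per_core", ascending=False))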

Conclusion and summary

In this article, we have demonstrated several ways to maximize the utilization of a Kubernetes cluster and to smartly balance CPU limits and threading to maximize throughput for video encoding:

  • By pinning encoding threads to CPUs and ensuring NUMA-aware scheduling, the average encoding performance can be increased by 10%. This optimization also potentially applies to other long-running and computationally demanding workloads.
  • Explicitly specifying the number of threads per encoder avoids oversubscription and boosts speed by 20%.
  • High parallelization maximizes the overall throughput in frames per second, which is desirable for offline encoding.

Further reading

k8s resource management

  • Making Sense of Kubernetes CPU Requests and Limits

Tuning guides

  • AMD EPYC 7003 Kubernetes Tuning Guide
  • CPU Pinning and Isolation in Kubernetes* Technology Guide (Intel)