Constant Target Quality: Encoding Driven by Perceptual Fidelity

Jens Schneider & Max Bläser · Dec 8, 2025 · 13 min read

Introduction

Delivering high-quality video at the lowest possible bitrate remains a core challenge for both Video-on-Demand (VOD) platforms and live streaming services. As audiences grow and expectations of flawless streaming quality rise, video encoding workflows must also be increasingly efficient. This is where rate-control strategies such as Variable Bitrate (VBR), Constant Rate Factor (CRF), Content-aware Encoding (CAE) and Per-title/Per-shot optimization play a crucial role. From VBR, a fundamental low-level encoding concept, all the way to Per-title/Per-shot encoding, which relates to bitrate-resolution ladders in adaptive bitrate streaming, these terms are sometimes confused with one another and may require additional context. In fact, these technologies can be thought of as built on top of one another, which is visualized in Figure 1.

Figure 1: Rate-control strategies built on top of one another.

Let us explore each approach in more detail.

Traditional broadcast delivery relies on fixed or constant bitrates (CBR), due to constraints in the underlying transmission systems (e.g. fixed bandwidth). However, applying this approach to adaptive bitrate streaming can lead to avoidable inefficiencies, as the constraints on the transmission system and end-user devices are generally more relaxed. VBR makes use of this by allowing the encoder to dynamically adjust the bitrate based on scene complexity, while still trying to hit an average target bitrate. CRF, a commonly used quality-based mode in modern encoders, inverts the problem by targeting consistent, perceptual quality instead, resulting in even more bitrate variability. However, the quality that results from CRF encoding strongly depends on the content itself: there is no single CRF value that delivers the same quality across all potential video sequences.
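To make the difference concrete, the sketch below requests both modes from an encoder. Note that ffmpeg with libx265 is used here purely as an illustrative stand-in, and the file names and values are hypothetical; most modern encoders expose an equivalent pair of modes.

import subprocess

SRC = "input.mp4"   # hypothetical source asset

# VBR: steer toward an average bitrate; quality floats with content complexity.
subprocess.run(["ffmpeg", "-y", "-i", SRC, "-c:v", "libx265",
                "-b:v", "5000k", "vbr_out.mp4"], check=True)

# CRF: hold a roughly constant quality level instead; now the bitrate floats.
subprocess.run(["ffmpeg", "-y", "-i", SRC, "-c:v", "libx265",
                "-crf", "23", "crf_out.mp4"], check=True)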

Building on these foundations, content-aware encoding leverages additional analytics (potentially powered by machine learning) to better understand the characteristics of each video asset. This enables more intelligent decisions regarding bitrate and codec parameters. 

Per-title encoding expands on this idea by tailoring encoding ladders to the specific content. Instead of using pre-determined bitrate-resolution rungs, both the number of rungs and each bitrate-resolution combination may be determined dynamically. Per-shot encoding refines this further by optimizing bitrate and encoding settings at the level of individual scenes or shots, ensuring that both simple and complex segments receive the appropriate level of compression. The adaptation of ABR ladders to end-user devices and viewing scenarios (e.g. TV vs. mobile), sometimes termed context-aware encoding, is beyond the scope of this article.

Together, these techniques form the backbone of modern video delivery strategies. They reduce storage and CDN costs and ensure that viewers experience the best possible quality on any device and under any network condition.

The leap from CRF to content-aware encoding

VBR typically works by dynamically adjusting the quantization parameter (QP) for each frame, macroblock or coding block to achieve the desired bitrate within a specified window of frames or buffer size. This approach may inadvertently result in variable perceptual quality, because bitrate alone does not directly correlate with human-perceived video quality, especially at the bitrates typically targeted for content delivery. To address this issue, many encoders provide a Constant Rate Factor (CRF) rate-control mode, which aims to maintain more consistent visual quality by also dynamically adjusting the QP based on motion and scene complexity.

However, because video content can vary dramatically in complexity over time or across an entire library, the optimal CRF value required to achieve uniform quality can differ significantly between videos or even within the same video. This variability makes it very challenging in real-world applications to preselect a CRF or QP value that consistently delivers the desired quality level without under- or over-allocating bitrate. In VOD and live-streaming scenarios, the highest rung/rendition is typically encoded to reach an average VMAF score between 90 and 95, indicating very high quality. Overshooting this quality provides no visual benefit and essentially wastes bitrate. This happens easily if, for example, a fixed CRF value is predetermined based on the most difficult-to-encode assets in a video library: applying this CRF to simpler content will most likely overshoot the desired quality and spend more bitrate than required.

Strategies that address these issues are exactly what content-aware encoding is about, and a variety of commercial solutions exist alongside academic publications.

The approaches can be roughly categorized into one of the following groups:

  • Brute-force or iterative encoding at various target bitrates or rate factors, where the actual encoder is treated as a black box. Quality is computed post-encoding and adaptation is performed on a per-scene or per-chunk basis. This approach is suitable for VOD-style encoding. It can be easily implemented with scripting languages, and the overall problem can be solved in reasonable time through parallelization and large amounts of computational resources. However, the heavy reliance on repeated full encodes makes the approach costly at scale (a minimal sketch of this style of search follows the list below).
  • Pre-analysis is applied to the source video with the goal of determining optimal encoding configurations. Conceptually, this pre-analysis is based on features calculated on the input video that are either derived from classical signal theory or inferred by a machine learning algorithm. If machine learning approaches are utilized, a GPU might be required for reasonable inference speed. With suitable performance, live-encoding scenarios can therefore also be targeted. However, considerable engineering effort and video data are required to replicate such an approach, and the underlying encoder must expose an interface that allows for more control or on-the-fly changes of encoding parameters.
  • Multiple encoding passes are utilized. In the first pass, an analysis or pre-encoding is performed, which gives insight into the complexity of the video and into which parts of it the most bitrate must be allocated. Using the data from the first pass, a second pass performs the actual encoding, typically without repeating an expensive motion estimation. To the best of our knowledge, commercial encoders do not utilize a perceptual quality metric like VMAF as part of multi-stage encoding because it is prohibitively expensive. In general, multipass approaches can work well for VOD scenarios if the additional encoding complexity is deemed acceptable to the user. Naturally, this is not a fitting approach for low-delay, live-encoding applications.
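As promised above, here is a minimal sketch of the first, brute-force category: an outer search loop that treats the encoder as a black box. ffmpeg with libx265 and its libvmaf filter serve as stand-ins for the encoder and metric; the file names, CRF bounds and bisection strategy are illustrative assumptions, not any particular product's implementation.

import json
import subprocess
import tempfile

def encode_crf(src: str, crf: int, out: str) -> None:
    # Black-box encode at a given CRF (ffmpeg/libx265 as a stand-in encoder).
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx265", "-crf", str(crf), out],
        check=True,
    )

def mean_vmaf(reference: str, encoded: str) -> float:
    # Mean VMAF of the encode against the source via ffmpeg's libvmaf filter
    # (JSON log layout as in recent libvmaf releases).
    with tempfile.NamedTemporaryFile(suffix=".json") as log:
        subprocess.run(
            ["ffmpeg", "-i", encoded, "-i", reference, "-lavfi",
             f"libvmaf=log_fmt=json:log_path={log.name}", "-f", "null", "-"],
            check=True,
        )
        with open(log.name) as f:
            return json.load(f)["pooled_metrics"]["vmaf"]["mean"]

def search_crf(src: str, target: float = 92.0, lo: int = 18, hi: int = 40) -> int:
    # Bisect for the highest (cheapest) CRF whose mean VMAF still meets the
    # target, assuming VMAF decreases monotonically with CRF.
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        encode_crf(src, mid, "probe.mp4")
        if mean_vmaf(src, "probe.mp4") >= target:
            best, lo = mid, mid + 1    # target met: try a higher CRF
        else:
            hi = mid - 1               # target missed: back off
    return best

Run per scene or per chunk and in parallel, this is exactly the "large amounts of computational resources" trade-off described above.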

In summary, existing approaches solve the problem of content-aware encoding at consistent perceptual quality by expending substantially more computational and/or engineering resources. This overhead is often not feasible for companies with short-lived or time-critical content. Consequently, video engineers often resort to simple constant rate factor-based encoding as a “poor-man’s” solution for content-aware encoding.

Our journey

At MainConcept, we considered how we could solve the problem of content-aware encoding in a simple and practical way: one that does not treat the encoder as a black box, but rather performs constant-quality encoding as just another rate-control option, tightly integrated into a single-pass design. We also wanted the encoding configuration to be as simple as specifying a target quality level, applicable to all content types, with all other optimization working under the hood. This calls for two components: a way to measure perceptual quality and an algorithm that performs the adaptation.

Regarding the first component, when trying to optimize with a certain objective in mind (in this case perceptual video quality), the most important thing is the capability to precisely measure the desired property. So why not simply measure VMAF scores during the encoding process? At first glance this sounds like a good solution, because VMAF is the de facto, industry-proven standard for measuring perceptual video quality. Unfortunately, VMAF is computationally expensive and thus very slow, which makes it simply impractical for an actual encoder targeting live-encoding speeds of 60 fps and beyond. Excellent CUDA acceleration of VMAF exists, but it requires the availability of specific GPUs. Additionally, VMAF, whether computed on CPU or GPU, only processes frames in presentation order, due to the motion feature that is part of the VMAF score computation. This property also renders VMAF unsuitable for traditional rate-control approaches, where the coding order of frames often does not align with the presentation order in order to achieve higher compression efficiency.

As a first step to solving this problem, MainConcept Codec SDK 16.0 introduced VMAF-E within the vScore quality metrics suite. vScore is a suite of advanced video quality metrics that enables our encoders to measure quality while encoding. Within vScore is VMAF-E, a fast VMAF proxy that estimates the actual VMAF scores with high precision, available with our HEVC encoder. In fact, VMAF-E is fast enough to be utilized within the rate control loop of the encoding process and can be computed in coding order.

Figure 2: Principal design of the CTQ control mechanism.

Related blog: vScore: Fast and Easy Video Quality Measurement for MainConcept HEVC Encoding

CTQ design and implementation

Having the ability to measure perceptual quality turns the problem of reaching and maintaining a constant target quality into a control problem, in principle very similar to other control problems such as maintaining the temperature in a heating system or altitude and position control in aeronautics. The classical solution to these problems is a proportional-integral-derivative (PID) controller. A PID controller constantly compares the desired target value with the actual value of the system and automatically applies corrective actions to the system's control variable to bring both values closer together. This works well if the system to be controlled behaves linearly and is noise-free, meaning that a correction applied to the system results in a proportional change. For example, in a heating system, a change in valve position should result in a proportional change in temperature. Naturally, this change cannot occur instantly but rather with a delay.

Applying this technique to the problem of constant quality encoding, our control variable at hand is essentially the QP and the response of the system is the resulting visual quality measured by VMAF-E. The principal design of the control mechanism is shown in Figure 2. 

Unfortunately, the relationship between QP and VMAF is highly non-linear, noisy and content-dependent, which makes the control problem more complex. However, in the context of video encoding, we can use a trick that physical control problems cannot: if the corrective term applied to the control variable does not result in the desired change, we can simply try again with another corrective term, i.e. perform a re-encoding. This, of course, does not come for free and reduces encoding speed, but importantly, it is a highly localized, frame-level, two-pass operation rather than a full two-pass encode of the entire video.
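To illustrate the idea, here is a deliberately simplified sketch of such a controller applied to QP, including the bounded re-encode step. The gains, clamps, tolerance and retry rule are hypothetical, chosen for readability; this is a conceptual sketch, not MainConcept's actual rate-control implementation.

from dataclasses import dataclass

@dataclass
class QualityController:
    # Illustrative PID-style controller steering QP toward a VMAF-E target.
    target: float            # desired quality score, e.g. 92.0
    kp: float = 0.20         # proportional gain
    ki: float = 0.02         # integral gain
    kd: float = 0.05         # derivative gain
    integral: float = 0.0
    prev_error: float = 0.0

    def next_qp(self, qp: float, measured: float) -> float:
        error = self.target - measured            # positive => quality too low
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        correction = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Lowering QP raises quality, so QP moves against the error term,
        # clamped to the HEVC QP range of 0..51.
        return min(51.0, max(0.0, qp - correction))

def encode_frame_ctq(frame, qp, ctrl, encode, measure, tol=3.0, max_retries=1):
    # Encode one frame; re-encode (at most max_retries times) if the measured
    # quality misses the target by more than tol -- the localized, frame-level
    # two-pass operation described above.
    for _ in range(max_retries + 1):
        bitstream = encode(frame, qp)             # hypothetical encoder call
        score = measure(frame, bitstream)         # hypothetical VMAF-E proxy
        if abs(ctrl.target - score) <= tol:
            break
        qp = ctrl.next_qp(qp, score)              # corrective term, then retry
    return bitstream, qp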

Constant quality vs. rate-distortion efficiency

Figure 3 demonstrates the level of quality consistency CTQ (with re-encodings) can achieve for a video comprising multiple scenes. While a fixed CRF encoding can achieve the same average perceptual quality, e.g. a VMAF-E score of approximately 92 over the entire sequence, its quality varies far more than CTQ's.

Figure 3: Per-frame quality of a fixed CRF encoding vs. CTQ (with re-encodings) over a multi-scene sequence.

In general, the threshold at which humans can perceive a just noticeable difference (JND) between two compressed videos is difficult to pin down in terms of VMAF; recommendations range from 2 to 6 VMAF points. Studies show that the JND-VMAF relationship is content-dependent, so no universal rule of thumb exists. While this variability complicates the construction of VMAF-optimal bitrate ladders in ABR streaming, CTQ only requires a conservative upper bound on the amount of perceptual quality variation that remains imperceptible to viewers. Assuming that the distribution of per-frame quality is Gaussian when encoding in CTQ mode, the standard deviation σ provides a quantitative estimate of CTQ's accuracy. For example, if σ < 1 JND, 68.3% of all frames will stay below the just noticeable threshold. If 3σ < 1 JND, then even 99.7% of all frames will have the same, indistinguishable perceptual quality. For the example shown in Figure 3, we measure 3σ (CRF) = 10.34 and 3σ (CTQ) = 2.65, meaning that very tight quality control can be achieved, even if 1 JND is assumed to be 3 VMAF points. However, there is a cost: the CRF encoding requires 5104 kbit/s compared to 6076 kbit/s for CTQ in this example, an increase in bitrate of roughly 20%. Reduced quality variance is traded for rate-distortion efficiency. Which trade-off is acceptable is ultimately the decision of the user or the application, or is governed by other constraints.
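Given per-frame scores, this accuracy check is easy to reproduce. A small sketch, assuming the per-frame VMAF-E values have already been collected from an encoding run and taking 1 JND = 3 VMAF points as above:

import statistics

def ctq_accuracy(per_frame_vmaf: list[float], jnd: float = 3.0) -> None:
    # Report how tightly per-frame quality clusters relative to one JND,
    # assuming the scores are roughly Gaussian around their mean.
    mean = statistics.mean(per_frame_vmaf)
    sigma = statistics.stdev(per_frame_vmaf)
    within = sum(abs(s - mean) <= jnd for s in per_frame_vmaf) / len(per_frame_vmaf)
    print(f"3*sigma = {3 * sigma:.2f} VMAF points; "
          f"{100 * within:.1f}% of frames within 1 JND ({jnd:g} points) of the mean")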

Rate-factor driven constant quality encoding

Our analysis shows that allowing some degree of quality variance is good for rate-distortion efficiency. In fact, we believe that mildly steering the rate factor, disabling re-encodings and thereby allowing quality to fluctuate within reasonable limits is a good default for CTQ and suitable for typical VOD and live-streaming use cases. This approach combines the benefits customers expect from CRF encoding with the ability to hit an average VMAF score for any type of content.

This feature set is exactly what the first release of CTQ in HEVC provides. Below is a command line example for the MainConcept sample HEVC encoder that highlights the simplicity of CTQ usage.

$ ./sample_enc_hevc -I420 -w 1920 -h 1080 -v testfile.yuv -lf license.lic -o test.hevc -target_quality 92 -bit_rate_mode 6 -quality_metric 8

This command line example will generate a compressed bitstream at VMAF quality 92 for any type of content. Of course, it is also possible to cap the bitrate at a predefined value using hypothetical reference decoder (HRD) conformance, which is frequently used to control cost or provide better bitrate accuracy in VOD-style encodings. The benefit of CTQ encoding becomes much clearer when looking at an entire library of video assets.

Figure 4: Distribution of average VMAF scores and bitrates across a library of about 50 assets for fixed CRF vs. CTQ encoding.

Figure 4 shows both the distribution of resulting average VMAF scores and the respective bitrates for a library of about 50 video assets. Fixed CRF encoding clearly suffers from its one-size-fits-all nature: a single CRF used for encoding the entire library may lead to a huge spread in resulting perceptual quality. CTQ, on the other hand, delivers consistent perceptual quality, and bitrate savings can be realized across the entire library. For example, if the entire library is encoded with a conservatively chosen CRF=28, the resulting average bitrate equates to 10991 kbit/s, with a large spread in resulting VMAF scores and significant quality outliers. A CTQ encoding targeting 92 VMAF, by contrast, delivers exactly this quality with a small margin of error and a single outlier. The resulting average bitrate is 8932 kbit/s, an almost 20% bitrate saving over fixed CRF encoding. It is important to note that the bitrate savings depend strongly on the composition of the library. For a very homogeneous library, meaning assets with very similar characteristics, CTQ would not provide a rate-saving benefit over a well-chosen, predetermined CRF. It would still be advantageous, however, by removing the tedious tasks of sampling the library for representative assets, running multiple trial encodings and taking VMAF measurements.
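Because the same target-quality setting applies to every asset, batch-encoding such a library reduces to a simple loop. A minimal sketch built around the sample encoder invocation shown above; the paths and raw-video parameters are placeholders, and the comment labels are our reading of the example's flags:

import pathlib
import subprocess

# Flags copied from the command-line example above.
for asset in sorted(pathlib.Path("library").glob("*.yuv")):
    subprocess.run([
        "./sample_enc_hevc", "-I420", "-w", "1920", "-h", "1080",
        "-v", str(asset), "-lf", "license.lic",
        "-o", asset.with_suffix(".hevc").name,
        "-target_quality", "92",     # desired average VMAF
        "-bit_rate_mode", "6",       # CTQ rate-control mode
        "-quality_metric", "8",      # steer using VMAF-E
    ], check=True)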

Summary

CTQ is MainConcept’s answer to the problem of content-aware encoding, focusing on delivering constant quality at a variable bitrate. It is a single-pass design, capable of running at live-encoding speeds. CTQ eliminates the need for trial encodings to determine optimal bitrates or rate factors that deliver the desired quality. 

Initially released for our HEVC/H.265 encoder and built on top of our highly efficient constant rate factor encoding mode, CTQ is also planned for release for the workhorse of the streaming industry, AVC/H.264, in the near future.

Jens Schneider & Max Bläser
Senior Video Coding Research and Development Engineers

Max focuses on applying machine learning algorithms to encoders and video applications. He previously worked for Germany’s largest private streaming and broadcast company, where he co-developed large-scale transcoding systems for VOD streaming. He received his Dipl.-Ing. in electrical and communications engineering and his Dr.-Ing. degree from RWTH Aachen University. During his doctoral studies at the Institut für Nachrichtentechnik (IENT), Max performed research in the area of video coding and actively contributed to the standardization of Versatile Video Coding (VVC).

Jens received his Dr.-Ing. degree in communications engineering from RWTH Aachen University in 2021. His research focused on the link between machine learning and low-level coding tools in combination with higher level video coding concepts such as dynamic resolution coding. Jens joined MainConcept after working as a Software Engineer in the cloud native landscape. In his role at MainConcept, he is currently working on machine learning-based encoder optimizations and cloud native simulation setups.
