Delivering high-quality video at the lowest possible bitrate remains a core challenge for both Video-on-Demand (VOD) platforms and live streaming services. As audiences grow and expectations of flawless streaming quality rise, video encoding workflows must also be increasingly efficient. This is where rate-control strategies such as Variable Bitrate (VBR), Constant Rate Factor (CRF), Content-aware Encoding (CAE) and Per-title/Per-shot optimization play a crucial role. From VBR, a fundamental low-level encoding concept, all the way to Per-title/Per-shot encoding, which relates to bitrate-resolution ladders in adaptive bitrate streaming, these terms are sometimes confused with one another and may require additional context. In fact, these technologies can be thought of as built on top of one another, which is visualized in Figure 1.
Let us explore each approach in more detail.
Traditional broadcast delivery relies on fixed, constant bitrates (CBR) due to constraints in the underlying transmission systems (e.g. fixed channel bandwidth). Applying this approach to adaptive bitrate streaming, however, leads to avoidable inefficiencies, as the constraints on the transmission system and end-user devices are generally more relaxed. VBR exploits this by allowing the encoder to dynamically adjust the bitrate based on scene complexity, while still trying to hit an average target bitrate. CRF, a quality-based mode commonly found in modern encoders, inverts the problem by targeting consistent perceptual quality instead, resulting in even more bitrate variability. The quality delivered by a given CRF value, however, depends strongly on the video content: there is no single CRF value that yields the same quality across all potential video sequences.
Building on these foundations, content-aware encoding leverages additional analytics (potentially powered by machine learning) to better understand the characteristics of each video asset. This enables more intelligent decisions regarding bitrate and codec parameters.
Per-title encoding expands on this idea by tailoring encoding ladders to the specific content. Instead of using predetermined bitrate-resolution rungs, both the number of rungs and each bitrate-resolution combination may be determined dynamically. Per-shot encoding adds a further refinement by optimizing bitrate and encoding settings at the level of individual scenes or shots, ensuring that both simple and complex segments receive the appropriate level of compression. The adaptation of ABR ladders to end-user devices and viewing scenarios (e.g. TV vs. mobile), sometimes termed context-aware encoding, is beyond the scope of this article.
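To make the distinction concrete, the following sketch contrasts a fixed "one-size-fits-all" ladder with per-title ladders whose rung count and bitrate-resolution pairs follow the content. All file names and numbers here are invented for illustration; they are not measurements.

```python
# Hypothetical ladders for illustration only; every number is invented.
# A rung is a (width, height, bitrate_kbps) tuple.
FIXED_LADDER = [
    (1920, 1080, 6000),
    (1280, 720, 3500),
    (854, 480, 1800),
    (640, 360, 900),
]

per_title = {
    # A low-motion talking-head clip can get by with fewer, cheaper rungs...
    "interview.mp4": [(1920, 1080, 2800), (1280, 720, 1400), (640, 360, 500)],
    # ...while a high-motion sports clip needs more bits and an extra rung.
    "football.mp4": [(1920, 1080, 8000), (1280, 720, 4500),
                     (960, 540, 2500), (640, 360, 1100)],
}
```

Per-shot optimization would take this one step further and pick parameters per scene within each asset rather than per asset.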
Together, these techniques form the backbone of modern video delivery strategies. They reduce storage and CDN costs and ensure that viewers experience the best possible quality on any device or network condition.
VBR typically works by dynamically adjusting the quantization parameter (QP) for each frame, macroblock or coding block to achieve the desired bitrate within a specified window of frames or buffer size. This can inadvertently result in variable perceptual quality, because bitrate alone does not correlate directly with human-perceived video quality, especially at the bitrates typically targeted for content delivery. To address this issue, many encoders provide a Constant Rate Factor (CRF) rate-control mode, which aims to maintain more consistent visual quality by likewise dynamically adjusting the QP, but based on motion and scene complexity.
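The windowed QP adjustment described above can be sketched as a toy control loop. This is purely illustrative, not MainConcept's rate control: the rate model, constants and window size are all hypothetical, and real encoders use far more sophisticated models.

```python
# Toy sketch of buffer-windowed VBR rate control (illustrative only).

def toy_frame_bits(complexity, qp):
    """Toy rate model: frame size drops roughly exponentially as QP rises."""
    return complexity * 2 ** (-(qp - 22) / 6.0)

def vbr_encode(complexities, target_bits_per_frame, window=30):
    """Nudge QP per frame so the rolling average bitrate tracks the target."""
    qp, recent, qps = 26, [], []
    for c in complexities:
        recent.append(toy_frame_bits(c, qp))
        if len(recent) > window:
            recent.pop(0)
        avg = sum(recent) / len(recent)
        if avg > 1.05 * target_bits_per_frame:      # too many bits: coarser QP
            qp = min(qp + 1, 51)
        elif avg < 0.95 * target_bits_per_frame:    # too few bits: finer QP
            qp = max(qp - 1, 0)
        qps.append(qp)
    return qps

# A complex scene followed by a simple one: QP climbs, then falls back.
qps = vbr_encode([400_000] * 60 + [100_000] * 60, target_bits_per_frame=150_000)
```

Note that the loop steers bitrate, not quality: the simple scene ends up at a much lower QP (higher quality) than the complex one, which is exactly the perceptual inconsistency CRF-style modes try to avoid.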
However, because video content can vary dramatically in complexity over time and across an entire library, the CRF value required to achieve uniform quality can differ significantly between videos, or even within the same video. This variability makes it very challenging in real-world applications to preselect a CRF or QP value that consistently delivers the desired quality level without under- or overallocating bitrate. In VOD and live-streaming scenarios, the highest rung/rendition is typically encoded to reach an average VMAF score between 90 and 95, indicating very high quality. Overshooting this quality provides no visual benefit and essentially wastes bitrate. This happens frequently if, for example, a fixed CRF value was predetermined based on the most difficult-to-encode assets in a video library: applying that CRF to simpler content will most likely overshoot the desired quality and spend more bitrate than required.
Strategies that address these issues are exactly what content-aware encoding is about; a variety of commercial solutions exist, alongside numerous academic publications.
The approaches can be roughly categorized into one of the following groups:
In summary, existing approaches achieve content-aware encoding at consistent perceptual quality by expending considerably more computational and/or engineering resources. This is often not feasible for companies dealing with short-lived or time-critical content. Consequently, video engineers often resort to simple constant rate factor encoding as a “poor man’s” content-aware encoding.
At MainConcept, we considered how we could solve the problem of content-aware encoding in a simple and practical way. First, we did not want to treat the encoder as a black box, but rather perform constant-quality encoding as just another rate-control option, tightly integrated into a single-pass design. Second, we wanted the encoding configuration to be as simple as specifying a target quality level, applicable to all content types, with all other optimization working under the hood. This means that two methods are needed for content-aware encoding: a way to measure perceptual quality and an algorithm that performs the adaptation.
Regarding the first method: when optimizing toward a certain objective, in this case perceptual video quality, the most important capability is being able to precisely measure the desired property. So why not simply measure VMAF scores during the encoding process? At first glance this sounds like a good solution, because VMAF is the de facto, industry-proven standard for measuring perceptual video quality. Unfortunately, VMAF is computationally expensive and thus very slow, which makes it impractical for an actual encoder targeting live-encoding speeds of 60 fps and beyond. Excellent CUDA acceleration of VMAF exists, but it requires the availability of specific GPUs. Additionally, VMAF, whether computed on CPU or GPU, only processes frames in presentation order, due to the motion feature that is part of the VMAF score computation. This property also renders VMAF unsuitable for traditional rate-control approaches, where the coding order of frames often differs from the presentation order in order to achieve higher compression efficiency.
As a first step to solving this problem, MainConcept Codec SDK 16.0 introduced VMAF-E within the vScore quality metrics suite. vScore is a suite of advanced video quality metrics that enables our encoders to measure quality while encoding. Within vScore is VMAF-E, a fast VMAF proxy that estimates the actual VMAF scores with high precision, available with our HEVC encoder. In fact, VMAF-E is fast enough to be utilized within the rate control loop of the encoding process and can be computed in coding order.
Blog: vScore: Fast and Easy Video Quality Measurement for MainConcept HEVC Encoding
Having the ability to measure perceptual quality makes the problem of reaching and maintaining a constant target quality a control problem, in principle very similar to other control problems such as maintaining the temperature in a heating system or altitude and position control in aeronautics. The classical solution to these problems is a proportional-integral-derivative controller (PID controller). A PID controller constantly compares the desired target value with the actual value of the system and automatically applies corrective actions to the system, specifically to the control variable of the system, to more closely align both values. This works well if the system to be controlled behaves linearly and is noise free, meaning that the correction applied to the system results in a proportional change. For example, with a heating system, this means that a change in valve position should result in some proportional change in temperature. Naturally, this change cannot occur instantly but rather with a delay.
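The PID idea can be captured in a few lines. Below is a generic, textbook PID controller driving a toy first-order plant (a room temperature slowly following a heating valve); it is an illustration of the control principle, not MainConcept's implementation, and the gains are hypothetical.

```python
# Textbook PID controller: compares a measurement against a setpoint and
# emits a corrective control signal from the error's proportional,
# integral and derivative terms.

class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy plant: the temperature reacts to the control signal with a small
# gain, i.e. with the delayed, proportional response described above.
pid = PID(kp=0.6, ki=0.05, kd=0.05, setpoint=21.0)
temp = 15.0
for _ in range(300):
    temp += 0.1 * pid.update(temp)   # temperature converges toward 21.0
```

For a roughly linear, low-noise plant like this, the loop settles at the setpoint; the difficulty discussed next is that the QP-to-quality "plant" of a video encoder is neither linear nor noise-free.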
Applying this technique to the problem of constant quality encoding, our control variable at hand is essentially the QP and the response of the system is the resulting visual quality measured by VMAF-E. The principal design of the control mechanism is shown in Figure 2.
Unfortunately, the relationship between QP and VMAF is highly non-linear, noisy and content-dependent, which makes the control problem more complex. In the context of video encoding, however, we can use a trick unavailable to physical control problems: if the corrective term applied to the control variable does not produce the desired change, we can simply try again with another corrective term, i.e. perform a re-encoding. This does not come for free and reduces encoding speed, but importantly, it is a highly localized, frame-level two-pass operation rather than a full two-pass encode of the entire video.
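The frame-level "try again" idea can be sketched as follows. The quality model, correction step, tolerance and re-encode limit are all hypothetical stand-ins; a real encoder would measure quality (e.g. with VMAF-E) on the actual reconstructed frame.

```python
# Sketch of localized, frame-level re-encoding (illustrative model only,
# not MainConcept's algorithm).

def toy_quality(qp, complexity):
    """Deterministic toy model: quality falls as QP rises, faster for complex frames."""
    return max(0.0, 100.0 - (qp - 10) * (1.0 + complexity))

def encode_frame(qp, complexity, target, tol=2.0, max_reencodes=3):
    """Encode once, then re-encode with a corrected QP until within tolerance."""
    encodes = 1
    quality = toy_quality(qp, complexity)
    while abs(quality - target) > tol and encodes <= max_reencodes:
        qp += round((quality - target) / 1.5)   # crude corrective step
        quality = toy_quality(qp, complexity)
        encodes += 1
    return qp, quality, encodes

qp, quality, encodes = encode_frame(qp=26, complexity=0.5, target=92)
# Lands within the tolerance after a single re-encode in this toy example.
```

The cost model is visible directly in the return value: `encodes` counts how often the frame had to be compressed, which is exactly the speed penalty the article refers to.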
Figure 3 demonstrates the level of quality consistency CTQ (with re-encodings) can achieve for a video composed of multiple scenes. A fixed CRF encoding can reach the same average perceptual quality, e.g. a VMAF-E score of approximately 92 over the entire sequence, but exhibits much larger quality variations than CTQ.
In general, the threshold at which humans perceive a just noticeable difference (JND) between two compressed videos is difficult to pin down in terms of VMAF; recommendations range from 2 to 6 VMAF points. Studies show that the JND-VMAF relationship is content-dependent, so no universal rule of thumb exists. While this variability complicates the construction of VMAF-optimal bitrate ladders in ABR streaming, CTQ only requires a conservative upper bound on the amount of perceptual quality variation that remains imperceptible to viewers. Assuming that the distribution of per-frame quality is Gaussian when encoding in CTQ mode, the standard deviation σ provides a quantitative estimate of CTQ’s accuracy: for σ < 1 JND, 68.3% of all frames will stay below the just noticeable threshold, and for 3σ < 1 JND, even 99.7% of all frames will have the same, indistinguishable perceptual quality. For the example shown in Figure 3, we measure 3σ(CRF) = 10.34 and 3σ(CTQ) = 2.65, meaning that very tight quality control can be achieved even if 1 JND is assumed to be 3 VMAF points. This comes at a cost, however: the CRF encoding requires 5104 kbit/s compared to 6076 kbit/s for CTQ in this example, a bitrate increase of roughly 19%. Lower quality variance is traded for rate-distortion efficiency, and which trade-off is acceptable is ultimately up to the user or the application, or governed by other constraints.
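The Gaussian percentages above follow directly from the normal distribution: the fraction of frames whose quality deviates from the target by less than one JND is erf(JND / (σ·√2)). A quick check of the quoted figures:

```python
import math

def frames_within_jnd(sigma, jnd):
    """Fraction of frames within `jnd` VMAF points of the target, assuming
    per-frame quality deviations are Gaussian with std deviation `sigma`."""
    return math.erf(jnd / (sigma * math.sqrt(2)))

# sigma = 1 JND -> 68.3% of frames within one JND of the target:
print(round(frames_within_jnd(1.0, 1.0), 3))   # 0.683

# Measured 3σ(CTQ) = 2.65, i.e. σ ≈ 0.88; with 1 JND assumed to be
# 3 VMAF points, well over 99.7% of frames stay within one JND:
print(frames_within_jnd(2.65 / 3, 3.0))
```

The same function evaluated with σ derived from 3σ(CRF) = 10.34 gives a far smaller fraction, which is the quantitative version of the spread visible in Figure 3.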
Our analysis shows that allowing some degree of quality variance benefits rate-distortion efficiency. In fact, we believe that mild steering of the rate factor, with re-encodings disabled and quality therefore allowed to fluctuate within reasonable limits, is a good default for CTQ and suitable for typical VOD and live-streaming use cases. This approach combines the benefits customers expect from CRF encoding with the ability to hit an average VMAF score for any type of content.
This feature set is exactly what the first release of CTQ in HEVC provides. Below is a command line example for the MainConcept sample HEVC encoder that highlights the simplicity of CTQ usage.
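For illustration, such an invocation could look like the following; note that the binary name and all option names here are hypothetical, and the actual flags of the MainConcept sample HEVC encoder may differ (consult the Codec SDK documentation):

```shell
# Hypothetical binary and flag names, for illustration only.
sample_enc_hevc -I input.yuv -w 1920 -h 1080 -fps 50 \
                -rc ctq -target_quality 92 \
                -O output.hevc
```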
This command line example will generate a compressed bitstream at VMAF quality 92 for any type of content. Of course, it is also possible to cap the bitrate at a predefined value using hypothetical reference decoder (HRD) conformance, which is frequently used to control cost or provide better bitrate accuracy in VOD-style encodings. The benefit of CTQ encoding becomes much clearer when looking at an entire library of video assets.
Figure 4 shows both the distribution of resulting average VMAF scores and the respective bitrates for a library of about 50 video assets. It is clear that fixed CRF encoding suffers from a “one-size-fits-all” approach: a single CRF used for encoding the entire library may lead to a huge spread in resulting perceptual quality. CTQ, on the other hand, delivers consistent perceptual quality, and bitrate savings can now be realized over the entire library. For example, if the entire library is encoded with a conservatively chosen CRF of 28, the resulting average bitrate is 10991 kbit/s, with a large spread in VMAF scores and significant quality outliers. A CTQ encoding targeting 92 VMAF instead delivers exactly this quality with a small margin of error and a single outlier, at an average bitrate of 8932 kbit/s: an almost 20% bitrate saving over fixed CRF encoding. It is important to note that the bitrate savings are highly dependent on the composition of the library. For a very homogeneous library, in which assets have very similar characteristics, CTQ would not provide a rate-saving benefit over a well-chosen, predetermined CRF. It would still be advantageous, however, by removing the tedious tasks of sampling the library for representative assets, running multiple trial encodings and taking VMAF measurements.
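The quoted saving follows from the two library-level averages:

```python
# Library-level averages quoted above (kbit/s).
crf_kbps = 10991   # fixed CRF=28 over the whole library
ctq_kbps = 8932    # CTQ targeting 92 VMAF

saving = (crf_kbps - ctq_kbps) / crf_kbps
print(f"{saving:.1%}")   # 18.7%, i.e. "almost 20%"
```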
CTQ is MainConcept’s answer to the problem of content-aware encoding, focusing on delivering constant quality at a variable bitrate. It is a single-pass design, capable of running at live-encoding speeds. CTQ eliminates the need for trial encodings to determine optimal bitrates or rate factors that deliver the desired quality.
CTQ was initially released for our HEVC/H.265 encoder, built on top of our highly efficient constant rate factor encoding mode. We also plan to release it for the workhorse of the streaming industry, AVC/H.264, in the near future.