Jens Schneider & Max Bläser · Apr 30, 2026 · 10 min read

Constant Target Quality Encoding for Segment-Based VOD: Now Available for AVC/H.264


In our previous post, we introduced Constant Target Quality (CTQ) encoding, a new rate control mode in the MainConcept HEVC/H.265 Video Encoder, which allows you to specify a target VMAF quality level and have the encoder automatically hit the target in a single pass. We explained why achieving consistent perceptual quality across a large video library with diverse content is difficult with traditional CRF encoding and how our VMAF-E proxy metric, together with a PID-style control loop inside the encoder, solves this problem at live-encoding speeds.

Since then, we have further optimized CTQ:

CTQ is now available for AVC/H.264, the workhorse of the video industry, with the release of MainConcept Codec SDK 16.2. Its feature set is identical to HEVC: Just set a target quality, start the encoder and you are ready to go!
We have further improved the CTQ algorithm for both codecs, lowering the quality variance and narrowing the rate-distortion efficiency gap relative to CRF encoding. We have also added the option to re-encode dedicated frames for even tighter quality constraints.

In this post, we will highlight the most important use case in which CTQ pays off: segment-based Video-on-Demand (VOD) encoding with AVC/H.264. We will also highlight how to use the vScore logging feature to measure per-frame quality directly from the encoder and share results of large-scale Dask experiments comparing CTQ with CRF for a full-length broadcast sequence.

For more information, check out:

This video where Jens explains in depth how to use vScore logging for the HEVC/H.265 sample encoder
Our previous blog posts: Constant Target Quality: Encoding Driven by Perceptual Fidelity and Streamlining Encoder evaluation using Dask & Python with MainConcept Codecs

Segment-based encoding

For distribution via HLS or DASH (with CMAF), the encoded bitstream is chopped into short segments of typically 2-4 seconds, so that an adaptive bitrate (ABR) player can switch between renditions of different resolutions and bitrates at segment boundaries. This means that every segment must start with an IDR/IRAP frame for it to be independently decodable. A modern transcoding workflow, whether on-premises or in the cloud, can use this property and, therefore, encode all segments in parallel for maximum throughput.
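The alignment between segments, IDR positions and parallel chunks comes down to simple frame arithmetic. The following is a minimal sketch of that arithmetic; the helper names `segment_starts` and `chunk_is_segment_aligned` are hypothetical, and the frame rate, durations and frame count are example values only.

```python
def segment_starts(num_frames: int, fps: float, segment_seconds: float) -> list[int]:
    """Frame indices at which a new segment (and therefore an IDR frame) starts."""
    segment_frames = int(segment_seconds * fps)
    return list(range(0, num_frames, segment_frames))


def chunk_is_segment_aligned(chunk_frames: int, segment_frames: int) -> bool:
    """A chunk keeps the segment grid intact only if it spans whole segments."""
    return chunk_frames % segment_frames == 0


fps = 25.0
starts = segment_starts(num_frames=1000, fps=fps, segment_seconds=4.0)
print(len(starts), starts[:3])

# A 60-second chunk at 25 fps holds exactly fifteen 4-second segments
print(chunk_is_segment_aligned(int(60.0 * fps), int(4.0 * fps)))
```

If the chunk length is not a multiple of the segment length, chunks encoded in parallel would start in the middle of a segment and the IDR grid would drift.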

Combining segmented encoding with content adaptivity means that every segment should be encoded at exactly the right bitrate to meet a desired quality level while also considering other constraints, such as maximum and minimum average bitrate. This is a complex challenge. Existing solutions address it either through brute-force or iterative approaches, which encode every segment repeatedly until optimal parameters have been found, or through smart pre-analysis.

CTQ works differently. Since perceptual quality is measured for every frame during encoding and this information is constantly fed back into the rate control to hit the desired average quality, there is no need for repeated encoding. Given long enough segments, which is the case for typical ABR scenarios, CTQ will also be able to hit the desired target quality for every individual segment. For good rate-distortion performance, a certain level of quality variation within each segment is desirable and typically not noticeable for human viewers. For very specific use cases that demand even tighter perceptual quality control on every frame, we provide the option to re-encode dedicated frames.
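To illustrate the general idea of feeding measured quality back into rate control, here is a deliberately simplified toy simulation. It is not the CTQ algorithm: the linear quality model, the single `gain` parameter and the function name `simulate_quality_feedback` are all assumptions made for this sketch.

```python
def simulate_quality_feedback(complexities, target=90.0, gain=0.8, qp_start=40.0):
    """Toy single-pass loop: encode a frame, measure its quality, correct the QP."""
    qp = qp_start
    qualities = []
    for c in complexities:
        # Toy quality model (assumption): quality drops linearly with QP,
        # and drops faster for more complex content (larger c).
        quality = max(0.0, min(100.0, 100.0 - c * (qp - 10.0)))
        qualities.append(quality)
        # Feedback step: quality above target -> raise QP (spend fewer bits),
        # quality below target -> lower QP (spend more bits).
        error = quality - target
        qp = max(10.0, min(51.0, qp + gain * error))
    return qualities


# 50 "easy" frames followed by 50 "complex" frames (illustrative values)
qualities = simulate_quality_feedback([0.5] * 50 + [1.0] * 50)
print(f"first frame: {qualities[0]:.1f}, frame 49: {qualities[49]:.1f}, "
      f"frame 50: {qualities[50]:.1f}, last frame: {qualities[-1]:.1f}")
```

In this toy model, quality dips when the content becomes more complex and then returns to the target within a few frames, without any second encoding pass. The gain is chosen so that the loop converges for the complexities used here.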

CTQ encoding respects all of the usual VOD constraints: IDR positions can be pinned to segment boundaries, I-frames can be inserted at scene changes and the bitrate can be capped via HRD/CPB conformance.

A simple segment-based VOD encoding recipe with CTQ

Let’s look at a basic CTQ encoding example in Python using a segmented VOD asset. The code below is a stripped-down version of the experiment we are running internally on our Kubernetes cluster via Dask. The cluster setup, MongoDB result storage and other internals have been stripped out so we can see the raw encoding logic. The AVC/H.264 encoder referenced here is plugged directly into FFmpeg.

import os
import subprocess
import re
import pandas as pd

# Input description 
sequence = { 
    "width": 1920, 
    "height": 1080, 
    "fps": 25.0, 
    "path": "/path/to/asset.mxf", 
    "num_frames": 74250, 
}

# 4 second VOD segments, aligned with IDR boundaries 
target_segment_length_seconds = 4.0 
segment_frames = int(target_segment_length_seconds * sequence["fps"]) 

# One chunk will be one minute of video 
# We encode the sequence in independent chunks so they can be computed on a cluster in parallel 

chunk_frames = int(60.0 * sequence["fps"]) 
chunks = [ 
    {"chunk_id": idx, "chunk_start": x} 
    for idx, x in enumerate(range(0, sequence["num_frames"], chunk_frames)) 
] 


# CTQ configuration for AVC/H.264, MainConcept Codec SDK 16.2 
params_ctq = { 
    "bit_rate_mode": "5",  # Enable CTQ rate control 
    "quality_metric": "8",  # Enable VMAF-E quality measurements 
    "target_quality": "90",  # VMAF-E quality target 
    # Optionally, also control the allowed per-frame quality deviation 
    # "max_quality_threshold": 12.0, 
    "bit_rate": "0",  # Disable fixed bitrate 
    # VOD-style GOP settings 
    "idr_interval": f"{segment_frames}",  # IDR every 4.0s, assuming 25 fps progressive video 
    "min_idr_interval": "1", 
    "fixed_i_position": "1",  # IDRs at multiples of IDR interval 
    "vcsd_mode": "1",  # Enable scene change detection 
    "idr_frequency": "1",  # Every I-picture will be an IDR 
    # Optionally: Cap the bitrate with HRD conformance 
    # "hrd_maintain": "1",                 # Enable HRD 
    # "vbv_buffer_units": "1",             # All VBV units in bits 
    # "max_bit_rate": "10000000",          # 10Mbit/s cap 
    # "bit_rate_buffer_size": "20000000",  # 2.0 second buffer 
} 

vscore_pattern = re.compile( 
    r"Picture number:\s*(\d+).*?" 
    r"VMAF score:\s*([\d.]+).*?" 
    r"VMAF-E score:\s*([\d.]+)", 
    re.DOTALL, 
) 

def encode_chunk(sequence: dict, params: dict, chunk: dict) -> dict: 
    """Encode one chunk of the sequence and return the per-frame VMAF-E scores.""" 

    config_str = ":".join(f"{k}={v}" for k, v in params.items())  

    # FFmpeg reads the source, deinterlaces, converts to planar YUV and 
    # pipes the raw frames into the MainConcept AVC/H.264 Video Encoder via  
    # the OMX plug-in 
    p1 = subprocess.Popen( 
        [ 
            "ffmpeg", 
            "-ss", f"{chunk['chunk_start'] / sequence['fps']}", 
            "-i", sequence["path"], 
            "-vf", "bwdif=mode=send_frame:parity=auto:deint=all", 
            "-frames:v", f"{chunk_frames}", 
            "-pix_fmt", "yuv420p", 
            "-c:v", "omx_enc_avc", 
            "-omx_core", "./libomxil_core.so", 
            "-omx_name", "OMX.MainConcept.enc_avc.video", 
            "-omx_param", f"preset=139:qualityinfofile=vscore_{chunk['chunk_id']}.txt:[AVC Settings]{config_str}", 
            # The encoded stream is discarded; only the logs are needed here 
            "-f", "null", 
            os.devnull, 
        ], 
        stdout=subprocess.PIPE, 
        stderr=subprocess.PIPE, 
    ) 

    # Wait for FFmpeg to finish and collect the encoder log from stderr 
    _, stderr_bytes = p1.communicate() 
    encoder_log = stderr_bytes.decode() 
    bitrate = float(re.search(r"Bits/sec:\s*avg\s*=\s*(\d+)", encoder_log).group(1)) 
    with open(f"vscore_{chunk['chunk_id']}.txt", "r") as fr: 
        vscore_log = fr.read() 
    # Read the per-frame quality from the vScore log file 
    df_per_frame = ( 
        pd.DataFrame( 
            vscore_pattern.findall(vscore_log), columns=["poc", "vmaf", "vmaf_e"] 
        ) 
        .astype({"poc": int, "vmaf": float, "vmaf_e": float}) 
        .sort_values("poc") 
        .reset_index(drop=True) 
    )

    df_per_frame["poc"] += chunk["chunk_id"] * chunk_frames 

    return { 
        "chunk": chunk["chunk_id"], 
        "bitrate": bitrate, 
        "per_frame": df_per_frame, 
    } 

# In production, this could be computed on the cluster of your choice 
# (Dask, Ray, Celery etc.) 

results = [encode_chunk(sequence, params_ctq, chunk) for chunk in chunks]

The actual configuration for CTQ is straightforward: target_quality = 90.0 sets the desired average VMAF-E quality and, optionally, max_quality_threshold = 12.0 sets a threshold that determines when frames should be re-encoded. If the measured per-frame quality deviates from the target quality by more than the threshold, in either direction, the frame will be re-encoded with a modified QP until the quality is within the acceptance range. However, because we try to prevent excessive bitrate spending on individual frames for rate-distortion efficiency reasons, it cannot be guaranteed that the desired quality will be met every time, even with re-encodings.
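One way to read the threshold rule is as a symmetric acceptance band around the target. The sketch below applies that reading to a list of logged per-frame scores, for example to see which frames in a vScore log fall outside the band. It is only an interpretation for analysis purposes, not the encoder's internal decision logic, and the function name `frames_outside_band` is hypothetical.

```python
def frames_outside_band(scores, target, threshold):
    """Return (frame_index, score) pairs whose deviation from the target exceeds the threshold."""
    return [(i, s) for i, s in enumerate(scores) if abs(s - target) > threshold]


# Illustrative per-frame VMAF-E scores, not measured data
scores = [91.2, 89.5, 76.0, 90.1, 99.0]
print(frames_outside_band(scores, target=90.0, threshold=12.0))
```

Only the third frame deviates by more than 12 points here, so it is the only one reported.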

Experimental setup

To validate CTQ for AVC/H.264 under production-grade VOD conditions, we selected a real-world ~50-minute broadcast asset, available as a 50 Mbit/s XDCAM HD source, as our test sequence. The source had to be deinterlaced on the fly via FFmpeg before being fed into the MainConcept AVC/H.264 Video Encoder. This test sequence is typical of TV content, with a mix of static and dynamic scenes, varying complexity, graphics overlays, hard cuts and end credits.

To gain a better understanding of the performance of CTQ, we compared it with the CRF mode of the encoder, which is optimized for RD efficiency. We set up a Dask cluster with 80 workers and encoded the source material with the following two configurations:

CRF (bit_rate_mode = 3) with rate factors ranging from 16 to 44 in steps of 4
CTQ (bit_rate_mode = 5) with target quality ranging from 70 to 94 in steps of 4
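The two parameter sweeps can be written down as a small configuration grid. This is a sketch rather than our actual experiment code: the helper name `experiment_grid` is hypothetical, and the use of `rate_factor` as the configuration key for the CRF value is an assumption based on the naming used in this post.

```python
def experiment_grid(base_params: dict) -> list[dict]:
    """Build one encoder configuration per CRF rate factor and per CTQ target."""
    grid = []
    for rate_factor in range(16, 45, 4):      # 16, 20, ..., 44
        grid.append({**base_params, "bit_rate_mode": "3", "rate_factor": str(rate_factor)})
    for target in range(70, 95, 4):           # 70, 74, ..., 94
        grid.append({**base_params, "bit_rate_mode": "5", "target_quality": str(target)})
    return grid


# Shared settings: VMAF logging (quality_metric = 12) and a 4 s IDR interval at 25 fps
base = {"quality_metric": "12", "idr_interval": "100"}
configs = experiment_grid(base)
print(len(configs))
```

Each configuration is then applied to every chunk of the asset, which is what makes a cluster worthwhile for this kind of sweep.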

Both modes used an identical GOP structure (4-second IDR interval) and identical encoder presets, and VMAF-E was measured while encoding. Additionally, we recorded the actual VMAF scores by specifying quality_metric = 12. This would be prohibitively expensive in an actual production deployment due to the large computational overhead of VMAF on CPUs, but it allowed us to retrieve the per-frame VMAF-E and VMAF measurements for each encoded chunk and cross-check both. In a production deployment, measuring only VMAF-E at encoding time by specifying quality_metric = 8 is the lightweight approach.

Rate-distortion efficiency

Figure 1 – Rate-distortion efficiency

Figure 1 shows the rate-distortion plot for the test sequence in terms of average bitrate and video quality measured as VMAF-E scores. Across the relevant quality range of 85 to 95, the CTQ curve sits on top of the CRF curve, indicating that, for this particular asset, CTQ achieves essentially identical compression efficiency. Notably, the CRF-mode rate-distortion curve is very flat for VMAF-E values above 96. This means that choosing a CRF that is too low can easily result in overspending bitrate.
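Each point on such a curve combines the results of all chunks of one encode. A minimal sketch of that aggregation is shown below, assuming each chunk is given as a pair of average bitrate and per-frame scores; the function name `rd_point` is hypothetical. The averages are weighted by frame count because the last chunk of an asset is usually shorter than the others (74250 frames split into 1500-frame chunks leaves a final chunk of 750 frames).

```python
def rd_point(chunks):
    """Combine (avg_bitrate_bps, per_frame_scores) pairs into one (bitrate, quality) point."""
    total_frames = sum(len(scores) for _, scores in chunks)
    # At a constant frame rate, weighting by frame count equals weighting by duration
    bitrate = sum(bps * len(scores) for bps, scores in chunks) / total_frames
    quality = sum(sum(scores) for _, scores in chunks) / total_frames
    return bitrate, quality


# Illustrative values: a full chunk and a shorter final chunk
chunks_example = [(4_000_000.0, [90.0] * 4), (2_000_000.0, [88.0] * 2)]
bitrate, quality = rd_point(chunks_example)
print(f"{bitrate / 1e6:.2f} Mbit/s at VMAF-E {quality:.2f}")
```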

Per-segment quality

Figure 1 does not tell us the whole underlying story. In fact, just looking at the average quality tells us nothing about how quality varies across the entire asset. The real benefit of CTQ is not visible in the average RD-curve. Figure 2 provides us this insight, as it shows the average VMAF-E quality (top) and corresponding actual VMAF quality (bottom) per 4.0s segment over the whole ~50-min asset for CTQ target_quality = 90 compared to fixed CRF rate_factor = 32.

The differences are clear: With fixed CRF (blue), the VMAF-E quality (top plot) can vary significantly and noticeable outliers can occur. CTQ (orange), in contrast, maintains near-constant VMAF-E quality across all segments, with minor deviations where the content is of very high complexity. When looking at the actual VMAF scores, a similar conclusion can be drawn. Although the quality measured across segments is noisier, the deviation in average quality is within ±3.0 VMAF points for most parts of the sequence.


Figure 2 – Per-segment quality

For ABR use cases, this is exactly the desired property. The top rendition of your ladder is the most important one and, therefore, supposed to have excellent, uniform visual quality. CTQ delivers this by design.
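Per-segment averages like those in Figure 2 can be derived from the per-frame scores by grouping frames into fixed-length segments. The following is a small sketch with illustrative numbers; the function name `per_segment_mean` is hypothetical, and the input is assumed to be a plain list of scores in display order (for example, one column of the per-frame table converted to a list).

```python
from statistics import mean


def per_segment_mean(scores, segment_frames):
    """Average per-frame scores over consecutive segments of segment_frames frames."""
    return [
        mean(scores[start:start + segment_frames])
        for start in range(0, len(scores), segment_frames)
    ]


# Illustrative scores with a tiny segment length of 4 frames
frame_scores = [90.0, 92.0, 88.0, 90.0, 80.0, 82.0, 84.0, 86.0, 95.0]
segment_scores = per_segment_mean(frame_scores, segment_frames=4)
print(segment_scores, max(segment_scores) - min(segment_scores))
```

The spread between the best and the worst segment is a simple indicator of how uniform the quality is across the asset.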

Validating VMAF-E accuracy

Since CTQ steers the bitrate based on VMAF-E, our fast VMAF proxy, a natural question is how close VMAF-E is to true VMAF, as reported by end-user quality measurement tools.

Figure 3 answers this directly. For every frame, we compute the VMAF-E error as vmaf_e_error = vmaf_e - vmaf and plot the resulting histogram. The mean absolute error is about 1.67 across all 74250 frames, with the error distribution centered near zero. In other words, our CTQ encoding with a VMAF-E target quality of 90 will, on average, deviate from an independent VMAF measurement by less than 2 VMAF points, which is below the suggested just-noticeable-difference (JND) range of 2-6 points.


Figure 3 – Validating VMAF-E accuracy
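The error statistics behind such a histogram are straightforward to compute from the two per-frame score series. The sketch below uses a handful of made-up scores, not our measurements; the function name `vmaf_e_error_stats` and the bin width of one point are assumptions for illustration.

```python
import math
from collections import Counter
from statistics import mean


def vmaf_e_error_stats(vmaf_e, vmaf, bin_width=1.0):
    """Return mean absolute error, mean signed error and a binned error histogram."""
    errors = [e - v for e, v in zip(vmaf_e, vmaf)]
    mae = mean(abs(x) for x in errors)
    bias = mean(errors)
    # Each key is the lower edge of a bin that is bin_width wide
    histogram = Counter(math.floor(x / bin_width) * bin_width for x in errors)
    return mae, bias, dict(sorted(histogram.items()))


# Illustrative values only
mae, bias, histogram = vmaf_e_error_stats(
    vmaf_e=[90.0, 91.5, 88.0, 92.0],
    vmaf=[89.0, 93.0, 88.5, 90.0],
)
print(f"MAE={mae:.2f}, bias={bias:.2f}, histogram={histogram}")
```

A mean signed error close to zero indicates that the proxy neither systematically overestimates nor underestimates VMAF, while the mean absolute error describes the typical per-frame deviation.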

Overall workflow

To summarize, for a segment-based VOD pipeline, encoding the top-rung rendition of an ABR ladder, the workflow with CTQ is now very short:

Define a target VMAF quality. 90-92 is a good default for the top rung.
Set bit_rate_mode = 5 (AVC) or bit_rate_mode = 6 (HEVC) and quality_metric = 8 to enable VMAF-E.
Configure your usual IDR and GOP settings. Optionally, also configure HRD/CPB conformance for bitrate capping and tune the maximum quality deviation.
Encode.
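The steps above can be condensed into a small helper that builds the settings string. This is a sketch under assumptions: the helper name `ctq_params` is hypothetical, the parameter names are taken from the AVC/H.264 example earlier in this post, and applying the same names to HEVC/H.265 (apart from the bit_rate_mode value) is an assumption that should be checked against the SDK documentation.

```python
def ctq_params(codec, target_quality, fps, segment_seconds=4.0,
               max_bit_rate=None, buffer_seconds=2.0):
    """Build a CTQ settings dictionary for a segment-based VOD encode."""
    bit_rate_mode = {"avc": "5", "hevc": "6"}[codec]
    params = {
        "bit_rate_mode": bit_rate_mode,     # enable CTQ rate control
        "quality_metric": "8",              # enable VMAF-E measurements
        "target_quality": str(target_quality),
        "bit_rate": "0",                    # disable fixed bitrate
        "idr_interval": str(int(segment_seconds * fps)),
        "fixed_i_position": "1",            # IDRs at multiples of the IDR interval
    }
    if max_bit_rate is not None:
        # Optional bitrate cap via HRD conformance, all units in bits
        params.update({
            "hrd_maintain": "1",
            "vbv_buffer_units": "1",
            "max_bit_rate": str(max_bit_rate),
            "bit_rate_buffer_size": str(int(max_bit_rate * buffer_seconds)),
        })
    return params


capped = ctq_params("avc", 90, fps=25.0, max_bit_rate=10_000_000)
print(":".join(f"{key}={value}" for key, value in capped.items()))
```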

This approach requires no trial encodings, no per-title search, no content classification and no multiple encoding passes. The encoder measures quality frame by frame and steers towards the target in a single pass. These properties are crucial when the encoding scenario requires a large number of encodings daily and computational resources become a bottleneck for multi-pass encoding. The exact same recipe can also be used for our MainConcept HEVC/H.265 Video Encoder.

Summary

With MainConcept Codec SDK 16.2, CTQ is now available for AVC/H.264 with improved algorithmic performance. Our experiments using a real-world asset demonstrate that CTQ is essentially rate-distortion neutral compared to CRF within the typical VOD bitrate operating range, while delivering consistent perceptual quality in every segment. Per-frame quality logs, including regular VMAF measurements, can be easily captured from our sample encoders and FFmpeg plug-ins to verify the result.

Jens Schneider & Max Bläser

Senior Video Coding Research and Development Engineers

Max focuses on applying machine learning algorithms to encoders and video applications. He previously worked for Germany’s largest private streaming and broadcast company where he co-developed large-scale transcoding systems for VOD streaming. He received the Dipl.-Ing. in electrical and communications engineering and Dr.-Ing. degrees from RWTH Aachen University. During his doctoral studies at the Institut für Nachrichtentechnik (IENT), Max performed research in the area of video coding and actively contributed to the standardization of Versatile Video Coding (VVC).

Jens received his Dr.-Ing. degree in communications engineering from RWTH Aachen University in 2021. His research focused on the link between machine learning and low-level coding tools in combination with higher level video coding concepts such as dynamic resolution coding. Jens joined MainConcept after working as a Software Engineer in the cloud native landscape. In his role at MainConcept, he is currently working on machine learning-based encoder optimizations and cloud native simulation setups.
