Igor Krivtsov, Konstantin Fedorov & Konstantin Shatlov · Nov 12, 2024 · 9 min read

VVC/H.266 encoding acceleration

Introduction

Versatile Video Coding (VVC) is a video coding standard that was finalized in 2020. It provides better compression and a much wider toolset than previous standards, at the cost of higher computational complexity. Codec developers therefore have to find ways to exploit the capabilities of the hardware while still delivering the compression and quality gains the new standard promises. To make this feasible in practice, the standard incorporates parallelization capabilities.

MainConcept has developed a high-performance, standard-compliant VVC (H.266) encoder and decoder. This implementation is currently the most advanced on the market and gives customers real-time video encoding at resolutions up to 8K.

Since computational complexity imposes limitations on an implementation, the ability to use parallel computing is a must-have. Below is an overview of base parallel coding techniques provided by the VVC standard.

Standard capabilities

MainConcept has developed a VVC codec for various purposes, with real-time encoding being one of the most challenging, as it requires adapting algorithms to multi-core systems.

All encoding algorithms are based on techniques that eliminate redundancies between frames and within a single frame. Algorithms that exploit and reduce similarities within a single frame are called intra coding tools, while algorithms that exploit temporal similarities between multiple pictures are called inter coding tools. As a successor to previous MPEG standards like HEVC/H.265 and AVC/H.264, VVC uses a block-based coding approach. These coding blocks are called CTUs (Coding Tree Units). Among the many characteristics of a CTU, its geometric size is of particular interest in the context of parallelism. Since the standard defines the basic dependencies at the CTU level, the CTU can be considered the atomic element of parallel processing. The VVC standard defines three possible CTU sizes: 32x32, 64x64 and 128x128. Accordingly, the CTU dimensions characterize the area that is processed in parallel.

Fig. 1. CTU and its neighboring CTUs, which need to be processed beforehand

All block-based coding algorithms rely on adjacent, already-encoded CTUs (see Fig. 1) to encode a CTU efficiently, which introduces dependencies between CTUs. Within the constraints imposed by these dependencies and by raster-order processing, there is a processing pattern that satisfies all of them and makes it practical to move from raster-order encoding to multithreading. This approach is called wavefront parallel encoding.

Wavefront parallel encoding

Taking into account the sequential nature of block-based encoding and the dependencies of nearby CTUs, it is possible to encode CTU rows in parallel. This implies compliance with the following rule: the upper-right CTU from the row above must already be encoded and its coded data must be available. Wavefront parallel encoding can effectively use multiple threads. Fig. 2 illustrates a possible wavefront parallel encoding, taking CTU dependencies into account.

Theoretically, this allows the encoding of each CTU row to be launched in a separate thread.

Fig. 2. CTU dependencies

However, given the horizontal and vertical dimensions of the coding area, only a limited number of threads can be used effectively. Moreover, the degree of parallelism varies noticeably during the encoding process. The number of active worker threads is illustrated in Fig. 3.

Fig. 3. Number of utilized threads during encoding of picture

Fig. 4. Number of threads that can be used during the encoding over time for Fig. 3

The start and end of the picture are the least parallelizable parts of a picture. As illustrated in Fig. 4, half of the image is processed with the highest thread count and accounts for one-third of all picture encoding time. The other parts of the image (the start and the end) are limited by waiting for necessary data input and account for two-thirds of the encoding time. This is due to the limitations imposed by dependencies between the CTUs as the thread cannot immediately start encoding; it must wait for the necessary data before encoding starts.
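The ramp-up and ramp-down described above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions that are not taken from the article: every CTU takes exactly one time step, one thread is available per CTU row, and a CTU may start once its left neighbor and the upper-right CTU of the row above are finished. The function name and the 8x4 grid are hypothetical.

```python
from collections import Counter


def wavefront_active_threads(width_ctus, height_ctus):
    """Return how many CTUs are being encoded at each time step.

    Assumptions (illustrative only): unit encoding time per CTU, one thread
    per CTU row, and the left / upper-right dependency rule.
    """
    finish = [[0] * width_ctus for _ in range(height_ctus)]
    active = Counter()
    for r in range(height_ctus):
        for c in range(width_ctus):
            left_done = finish[r][c - 1] if c > 0 else 0
            if r > 0:
                # The last column has no upper-right neighbor, so it falls
                # back to the CTU directly above.
                up_right_done = finish[r - 1][min(c + 1, width_ctus - 1)]
            else:
                up_right_done = 0
            start = max(left_done, up_right_done)
            finish[r][c] = start + 1
            active[start] += 1
    total_steps = finish[-1][-1]
    return [active[t] for t in range(total_steps)]


profile = wavefront_active_threads(width_ctus=8, height_ctus=4)
print(profile)  # [1, 1, 2, 2, 3, 3, 4, 4, 3, 3, 2, 2, 1, 1]
```

In this toy grid the peak of four threads is reached only in the middle of the picture, while the first and last steps run on a single thread.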

It is also worth noting that the CTU size plays a role in determining thread utilization. Considering the horizontal and vertical dimensions of an encoding region, the maximum number of threads running at the same time can be expressed as:

THREAD_NUMBER = MIN(WIDTH_IN_CTU / 2, HEIGHT_IN_CTU)

where WIDTH_IN_CTU and HEIGHT_IN_CTU are the dimensions of the region in CTUs. A smaller CTU size therefore allows more threads to be used per coding area.
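As a quick check of the thread-count expression above, the following sketch evaluates it for a 1920x1080 picture and the three CTU sizes. The function name is hypothetical, and two details are assumptions on our part: partial CTUs at the picture edges are counted as full CTUs, and the division by two is rounded up.

```python
def max_wavefront_threads(width_px, height_px, ctu_size):
    """Upper bound on simultaneously active wavefront threads.

    Assumptions: edge CTUs that are only partially covered by the picture
    still count, and WIDTH / 2 is rounded up.
    """
    width_in_ctu = -(-width_px // ctu_size)    # ceiling division
    height_in_ctu = -(-height_px // ctu_size)
    return min((width_in_ctu + 1) // 2, height_in_ctu)


for ctu_size in (32, 64, 128):
    print(ctu_size, max_wavefront_threads(1920, 1080, ctu_size))
# 32 30
# 64 15
# 128 8
```

Halving the CTU size roughly doubles the bound, which matches the observation that smaller CTUs allow more threads per coding area.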

Slices

One of the earliest and most basic methods for eliminating coding dependencies within a single encoded picture is the use of slices. Although originally designed for network streaming to improve data robustness and reliability, slices can also be considered an effective multithreading tool. A slice consists of an integer number of CTUs, and a picture can be divided into multiple slices. Since there are no cross-dependencies between slices, they can be encoded in parallel. Each slice can use the ladder (wavefront) coding approach internally, which increases the number of threads in use and helps balance the overall system load.

Although slices are not the most common choice for multithreading, they remain a viable tool for it.

Tiles

VVC provides dedicated parallelization features such as tiles (this functionality was introduced with HEVC/H.265 and improved for VVC). The picture being encoded can be divided into rectangular regions, which can then be coded in parallel within a single encoded picture. However, tiles may impose certain restrictions on the use of motion vectors and on post-processing filters, and they tie entropy coding to the tile instead of the entire encoded picture. Decoders also benefit from tiles: because the offset of each tile is signaled in the slice header, a decoder can start decoding tiles in parallel.

To avoid quality degradation at tile boundaries, each post-processing filter can optionally be applied across tile boundaries.

In practical multithreading, the most effective approach is to split the picture into several tiles of equal area before encoding. Splitting a picture in this way can speed up the entire picture encoding process. The potential theoretical performance gain from splitting the picture encoding into two vertical tiles is illustrated in Fig. 5 and 6.

Fig. 5. Wavefront of picture processed without tiles

Fig. 6. Wavefront of picture processed with tiles

To demonstrate the speed benefits of using tiles, we can conduct a synthetic test as illustrated in Fig. 5 and 6. In this test, we assume that the encoding time for each CTU is the same and that threads begin their work as soon as the necessary CTUs are encoded. If we compare the picture encoded without tiles (Fig. 5) with the picture encoded with two vertical tiles (Fig. 6), the first benefit is that the encoder starts picture processing from two independent points, effectively doubling the maximum number of threads. The second benefit is that a tile’s CTU line is shorter than the picture’s, allowing the encoder to reuse threads for the next line more frequently. In the examples shown, while the picture encoded with one tile is ~50% complete, the picture encoded with two tiles is already ~75% complete.

Under ideal conditions, where CTU encoding time remains constant and all dependent CTUs are encoded, vertical tiles can increase the encoding speed by 31% with two tiles and by 58% with four tiles.
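A back-of-the-envelope model of such a synthetic test is sketched below. It is not the authors' test setup: it assumes unit CTU time, an unlimited number of threads, tiles that are fully independent and start at the same moment, and the left / upper-right dependency inside each tile, so its ratios need not equal the percentages quoted above. The function names and the 30x17 CTU grid (a 1920x1080 picture with 64x64 CTUs) are illustrative choices.

```python
def wavefront_steps(width_ctus, height_ctus):
    """Time steps needed to finish one region under the wavefront rule.

    With unit CTU time, the CTU at (row, col) starts at step 2 * row + col,
    so the last CTU finishes at width + 2 * (height - 1).
    """
    if width_ctus == 1:
        # A single column only depends on the CTU directly above.
        return height_ctus
    return width_ctus + 2 * (height_ctus - 1)


def tiled_picture_steps(width_ctus, height_ctus, tile_columns):
    """Steps for a picture split into vertical tiles of near-equal width."""
    base, extra = divmod(width_ctus, tile_columns)
    widest_tile = base + (1 if extra else 0)
    # All tiles run concurrently, so the widest tile decides the total time.
    return wavefront_steps(widest_tile, height_ctus)


untiled = tiled_picture_steps(30, 17, 1)
for columns in (2, 4):
    tiled = tiled_picture_steps(30, 17, columns)
    print(columns, tiled, round(untiled / tiled, 3))
# 2 47 1.319
# 4 40 1.55
```

The model shows the trend of diminishing returns: each additional split shortens only the horizontal part of the wavefront, while the vertical ramp stays the same.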

Introducing Wavefront Parallel Processing (WPP)

Wavefront Parallel Processing is a parallel data processing technique used in newer codecs like HEVC/H.265 and VVC/H.266. This technology enables an encoder to write bits to the bitstream in parallel. A decoder can then read bits in parallel, because the entry point for the first CTU of each row is signaled in the bitstream. As a result, entropy processing can be accelerated in both the encoder and the decoder. For a 1080p resolution with a CTU size of 64x64, it is possible to process 16 lines at the same time.

For conventional non-WPP encoding, entropy for each CTU is coded in scan order. This means that you must first finish processing the last block in the first line before starting the second line. In other words, all entropy processing is done sequentially and cannot be parallelized within the encoded picture. WPP can provide a significant performance boost. This is especially beneficial at higher bitrates, where entropy consumes a significant portion of the encoding time.

The WPP algorithm in VVC has been enhanced to reduce dependencies between neighboring CTUs. In HEVC, encoding depends on two CTUs of the row above (see Fig. 1): the above and the above-right CTU. In VVC, this has been reduced to just the above CTU, which noticeably shortens the dependency delay of the encoding ladder and improves overall picture encoding performance. In addition to accelerating entropy coding and reducing CTU dependencies, this also limits intra prediction within a picture and simplifies some rate-distortion optimization (RDO) encoder algorithms, which further increases the encoding speed significantly. This may lead to noticeable, but acceptable, quality degradation.
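The effect of the shorter dependency delay can be estimated with a one-line formula. The sketch below models only the row start delay described above, assuming unit CTU time and one thread per CTU row; the function name and the `lag_ctus` parameter are hypothetical and do not correspond to any encoder setting.

```python
def wpp_total_steps(width_ctus, height_ctus, lag_ctus):
    """Steps to encode a picture when each row starts `lag_ctus` CTUs
    behind the row above (2 for the HEVC-style rule, 1 for the VVC-style
    rule described in the text).
    """
    # A row can never lag by more CTUs than the row above contains.
    effective_lag = min(lag_ctus, width_ctus)
    return width_ctus + effective_lag * (height_ctus - 1)


hevc_style = wpp_total_steps(30, 17, lag_ctus=2)
vvc_style = wpp_total_steps(30, 17, lag_ctus=1)
print(hevc_style, vvc_style)  # 62 46
```

For this 30x17 CTU grid, the one-CTU delay removes 16 of the 62 steps in the idealized model; real gains depend on how unevenly CTU encoding times are distributed.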

Non-standard acceleration methods

Parallel Pictures

As described above, an encoder can use the encoding ladder, tile partitioning and other tools to maximize parallelism and achieve maximum encoding performance for a single picture. However, it is also possible to start encoding multiple pictures in parallel. In the case of intra coding, this is straightforward, as the pictures have no inter dependencies, allowing their encoding processes to start in parallel.

For inter coding, the situation is significantly different: dependencies between pictures are introduced by motion vectors or other coding tools. This means that for the dependent picture, the encoding process should start as soon as the information (prediction region or motion vector) from the reference picture is encoded and ready.

Since a picture is encoded CTU by CTU starting from the top-left corner, it is possible to start encoding another picture once a portion of its reference picture is complete. This requires a synchronization grid that tracks the completed regions, which adds some overhead. Naturally, this synchronization overhead increases with a higher number of parallel pictures and with multi-reference prediction, though, overall, parallel pictures provide a significant performance boost. The MainConcept VVC/H.266 Video Encoder is optimized for high-performance encoding and is now configured to use 8 parallel pictures by default, which ensures high performance on modern desktop and server CPUs.
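One simple way to picture such synchronization is a per-picture progress counter that dependent pictures wait on. The sketch below is a generic illustration, not the MainConcept implementation: `RowProgress`, `ref_rows_needed` and `run_two_pictures` are hypothetical names, progress is tracked per CTU row rather than on a finer grid, and the extra margin needed for interpolation and in-loop filtering is ignored.

```python
import threading


class RowProgress:
    """Counts fully reconstructed CTU rows of one reference picture."""

    def __init__(self):
        self._rows_done = 0
        self._cond = threading.Condition()

    def mark_row_done(self):
        with self._cond:
            self._rows_done += 1
            self._cond.notify_all()

    def wait_for_rows(self, rows_needed, timeout=5.0):
        # Blocks until enough rows are ready; returns False on timeout.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._rows_done >= rows_needed, timeout)


def ref_rows_needed(ctu_row, ctu_size, max_mv_down_px, height_ctus):
    """Reference rows that must be complete before `ctu_row` may start,
    given the largest downward motion vector the search may produce."""
    bottom_px = (ctu_row + 1) * ctu_size + max_mv_down_px
    return min(-(-bottom_px // ctu_size), height_ctus)


def run_two_pictures(height_ctus=6, ctu_size=64, max_mv_down_px=96):
    progress, log = RowProgress(), []

    def encode_reference():
        for row in range(height_ctus):
            log.append(("ref", row))      # stand-in for real row encoding
            progress.mark_row_done()

    def encode_dependent():
        for row in range(height_ctus):
            need = ref_rows_needed(row, ctu_size, max_mv_down_px, height_ctus)
            if not progress.wait_for_rows(need):
                raise TimeoutError("reference picture made no progress")
            log.append(("dep", row))

    workers = [threading.Thread(target=encode_dependent),
               threading.Thread(target=encode_reference)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return log


log = run_two_pictures()
print(log)
```

The exact interleaving in `log` varies from run to run, but a dependent row never appears before the reference rows it needs.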

Parallel Motion Vector Search

Motion search is a complex yet well-studied area of coding algorithms and is fundamental to all block-based coding. Conceptually, the compression ratio depends heavily on the quality of the prediction signal found by the motion estimation stage of an encoder. Consequently, it is vital for an encoder to find an appropriate motion vector for each coding block efficiently. The MainConcept VVC/H.266 Video Encoder performs a high-quality motion search in two stages. The first stage achieves sufficient accuracy for initial analysis and can be performed directly on the source picture. This means that the first-stage search can run independently of the actual encoding, providing an additional opportunity for parallelization. The second stage uses the already reconstructed reference picture to refine the motion vector and select the best match.
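The two-stage idea can be illustrated with a toy block-matching search. This is a minimal sketch and not the encoder's actual algorithm: it uses an exhaustive integer-pel search with a sum-of-absolute-differences (SAD) cost, a synthetic 16x16 test pattern, and a "reconstructed" picture that is simply the source plus small artificial noise. All function names, window radii and data are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block and its displaced match."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )


def block_search(cur, ref, bx, by, size, center, radius):
    """Exhaustive search in a square window; returns (best_mv, best_cost)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            inside = (0 <= bx + dx and bx + dx + size <= width
                      and 0 <= by + dy and by + dy + size <= height)
            if not inside:
                continue  # candidate block would leave the picture
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def two_stage_search(cur_src, prev_src, prev_recon, bx, by, size):
    # Stage 1: wide search on source pictures only, so it can run before
    # (and in parallel with) the encoding of the reference picture.
    coarse_mv, _ = block_search(cur_src, prev_src, bx, by, size, (0, 0), 4)
    # Stage 2: narrow refinement on the reconstructed reference picture.
    return coarse_mv, block_search(cur_src, prev_recon, bx, by, size, coarse_mv, 1)


N = 16
prev_src = [[(3 * x * x + 5 * y * y + x * y) % 251 for x in range(N)]
            for y in range(N)]
# Current picture: previous picture content displaced by (2, 1).
cur_src = [[prev_src[min(y + 1, N - 1)][min(x + 2, N - 1)] for x in range(N)]
           for y in range(N)]
# Stand-in for a reconstructed picture: source plus noise in {-1, 0, 1}.
prev_recon = [[prev_src[y][x] + (x + y) % 3 - 1 for x in range(N)]
              for y in range(N)]

coarse_mv, (refined_mv, refined_cost) = two_stage_search(
    cur_src, prev_src, prev_recon, bx=6, by=6, size=4)
print(coarse_mv, refined_mv)  # (2, 1) (2, 1)
```

Because the first stage touches only source pictures, it could be scheduled on otherwise idle threads, leaving just the small refinement window on the critical path.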

Conclusion

We have explored the basic parallelization capabilities provided by the VVC standard. The MainConcept video codec development team strives to leverage all capabilities of the VVC standard and integrate them into our VVC encoder. Beyond these capabilities, there are many other ways to optimize encoding performance, such as the performance tuning function and AutoLive mode, although they come with certain trade-offs. These are topics that we will cover in future posts. The MainConcept VVC/H.266 Video Encoder is already available and can deliver the encoding performance you require, as vouched for by our satisfied customers.

Igor Krivtsov
Igor is a Staff Software Engineer. Inspired by multimedia technologies, particularly computer vision and video coding, he has been working at MainConcept for over a decade developing codecs and multimedia solutions. He graduated from Tomsk State University of Control Systems and Radioelectronics with a degree in Software Engineering.

Konstantin Fedorov
Konstantin is a Senior Software Engineer. He is a graduate of Tomsk Polytechnic University with a degree in Software Engineering. In terms of work on the encoder, his main focus is algorithmic and hardware acceleration, as well as the implementation of new API functionality.

Konstantin Shatlov
Konstantin is a Staff Software Engineer. He has been working at MainConcept for 15 years, developing video codecs and multimedia applications. He is a graduate of Tomsk State University of Control Systems and Radioelectronics with a degree in Software Engineering. He is currently working on improving the quality of the VVC encoder and optimizing it.
