Apple Immersive Video drives video encoder comparisons to new dimensions because of the tremendous file sizes of the simulation assets and the continuously arising question of how to perform quality measurements. This blog post presents a practical file-based approach to encoder evaluation for Apple Immersive Media, including a strategy for VMAF-based quality measurements.
Apple Immersive Video (AIV) is the signature format used to play video on the Apple Vision Pro. With its latest release, Apple has introduced new specifications for content creation, encoding, multiplexing and playback via HLS. Integrating these into the encoding pipelines of the streaming world can be challenging: stereoscopic video with a resolution of 4320x4320 samples per eye, captured at 90 fps in 10-bit HDR and delivered as a single Multiview-HEVC (MV-HEVC) bitstream, puts the format outside the “comfort zone” of most encoding toolchains. When we thought about doing an initial rate-distortion analysis for this new use case, we quickly realized that our usual methods would not get us very far. The overall process from the source format (Apple ProRes 4444 XQ at 8160x7200) via the encoder input (two 4320x4320 or 3600x3600 views) to the MV-HEVC bitstreams (an ABR ladder with rungs ranging from 100 Mbit/s down to 16 Mbit/s) involves a lot of plumbing. Furthermore, there is no established perceptual video quality metric for these high-resolution stereoscopic video sequences, which raises another key question: can we measure visual quality with VMAF?
This blog post will walk you through our process. The goal is to give everyone a basic understanding of what it takes to carry out experimental MV-HEVC encoding for the Apple Vision Pro. This can be helpful to evaluate an encoder, make a rate-distortion comparison or prepare assets for playback tests. We will cover the full end-to-end pipeline: demultiplexing and decoding the Apple ProRes source, preparing raw YUVs, encoding with the MainConcept MV-HEVC encoder and, for comparison, NVIDIA NVENC, and finally measuring video quality.
Before we start, please note that what we describe here is an internal, experimental setup that we use for codec evaluation at MainConcept and not a recipe that can be used in production. Running this workflow means moving around terabytes of uncompressed YUV, creating intermediates and chaining together command line sample applications. We do this to isolate the encoder behavior from all other steps involved, but this is the exact opposite of what you would want in production. A real-world deployment would skip raw YUV handling and drive the encoder directly through our MainConcept Codec SDK, ingesting and decoding the ProRes source at the input stage and streaming the MV-HEVC bitstream at the output stage with far less I/O. Bearing this in mind, let’s get started.
The starting point for our AIV experiments is a stereoscopic ProRes 4444 (or 4444 XQ) MOV file at 8160x7200 per view or a slightly less demanding ProRes 422 HQ. These sources can be produced, for example, directly via the new Blackmagic URSA Cine Immersive and processed with DaVinci Resolve Studio. The two views are stored as separate streams within the same container. The first job is to extract those streams into two elementary Apple ProRes streams. Using the MainConcept MPEG-4 Demultiplexer sample, this looks as follows:
Sample and reference encoders typically need raw YUV video data. ProRes, on the other hand, is a mezzanine codec, which means these encoders cannot read it directly. Therefore, we decode each view to uncompressed YUV once and feed it into as many encoders as needed. Using the MainConcept video decoder for Apple ProRes sample, the next processing step reads as follows:
The result here is an uncompressed 4:2:2 16-bit YUV file when we start with a 422 HQ ProRes. For the 4444 XQ ProRes, we would decode into a 4:4:4 16-bit YUV file. These files are enormous: at 8160x7200 and 90 fps, they correspond to data rates of 169.21 Gbit/s and 253.81 Gbit/s per view, respectively.
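The data rates of such raw files follow directly from resolution, frame rate, sample size and chroma format. The short Python sketch below reproduces the arithmetic. It assumes the 90 fps capture rate mentioned in the introduction and that every sample occupies a full 16 bits on disk; the function name is our own choice for illustration.

```python
def raw_video_bitrate_gbps(width, height, fps, bits_per_sample, chroma):
    """Return the data rate of uncompressed video in Gbit/s (10^9 bits per second)."""
    # Samples per pixel: one luma sample plus the (possibly subsampled) chroma samples.
    samples_per_pixel = {"4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}[chroma]
    bits_per_frame = width * height * samples_per_pixel * bits_per_sample
    return bits_per_frame * fps / 1e9


rate_422 = raw_video_bitrate_gbps(8160, 7200, 90, 16, "4:2:2")
rate_444 = raw_video_bitrate_gbps(8160, 7200, 90, 16, "4:4:4")
print(f"4:2:2: {rate_422:.2f} Gbit/s")  # 169.21 Gbit/s
print(f"4:4:4: {rate_444:.2f} Gbit/s")  # 253.81 Gbit/s
```

The same function can be used to estimate the storage footprint of the downsampled encoder inputs by plugging in their resolution and chroma format.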
A resolution of 8160x7200 is actually higher than the native display resolution of the Apple Vision Pro, which Apple states as being 3660x3200 per eye. Therefore, we apply downsampling to the original source material. The downsampling process is crucial for the overall visual quality observed on the head-mounted display. For this reason, we verified the quality of MainConcept’s downsampling components by visual tests on the Apple Vision Pro. These downsampling components are part of the mc_trans_video_colorspace_cl command line tool. For the sake of conciseness, the workflow-specific usage is only listed for view 0 in the snippet below:
Executed for both views, these calls produce a total of four 10-bit raw video files, namely one file per view at each of the two resolution levels, 4320x4320 pixels and 3600x3600 pixels. These files serve as input files for the sample HEVC encoder.
Encoding stereoscopic video sequences for the Apple Vision Pro is a straightforward task with MainConcept’s MV-HEVC encoder, as it comes with presets defined for the HLS bitrate ladder recommended by Apple. One example profile id is preset = 402 and an example configuration file for the sample_enc_hevc encoder is shown below:
Using this configuration file, the encoder call becomes as simple as the one listed in the next snippet. The only difference to a single layer encoder call is in the input file. We specify both views separated by a semicolon here:
This runs the HEVC encoder at performance level 15, i.e. balanced performance between encoding speed and quality. The output is an MV-HEVC bitstream. For the sake of completeness, we also provide the command that multiplexes the plain bitstream into an MP4 container that can be displayed on the Apple Vision Pro:
As already mentioned in our introduction, one crucial question is how to measure the video quality for Apple Immersive Video. Of course, PSNR measurements provide an option that accounts for the distortion in a mean-squared error sense. However, PSNR measurements have the drawback that they do not generalize over video sequences with different characteristics. This issue was addressed by VMAF, which has essentially become the industry standard for video quality assessment in recent years. One feature of VMAF is its resolution awareness. By default, it comes with models for 1080p and 4K video sequences. However, there is neither a dedicated model for the resolutions specified for Apple Immersive Video nor a dedicated model targeting the Apple Vision Pro. The closest model is the vmaf_4k_v0.6.1neg model, which was initially trained for 4K video sequences watched on a conventional display at a viewing distance of 1.5 times the display height (1.5H). At first glance, VMAF measurements therefore appear not to be relevant for video watched on the Apple Vision Pro. So, let’s look at the RD curves with distortion measured in PSNR first. The figure below depicts RD measurements averaged over multiple test sequences from a test set provided by Apple [1].
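For reference, PSNR is simply a logarithmic representation of the mean squared error between reference and decoded samples. The sketch below shows the usual definition for 10-bit content in plain Python. It is an illustration only: whether PSNR is computed on luma alone, per frame or per sequence differs between tools, and the function name and the flat-list input are our own assumptions.

```python
import math


def psnr(reference, distorted, bit_depth=10):
    """Return the PSNR in dB between two equally long sequences of integer samples."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    peak = (1 << bit_depth) - 1  # 1023 for 10-bit video
    return 10.0 * math.log10(peak * peak / mse)


# Toy example: every sample is off by one code value, so the MSE is 1.
print(round(psnr([100, 200, 300], [101, 199, 301]), 2))
```

In practice, the sample lists would be replaced by the planes read from the raw reference and decoded YUV files.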
From the plots, we can observe the very typical rate-distortion behavior of video encoders. The PSNR values lie in a range of 41.5 dB to 45 dB, which unfortunately does not reveal anything about the perceived quality. Furthermore, the plots show that MainConcept’s MV-HEVC encoder produces the same level of quality for the left and the right view of the video sequence. Consequently, the binocular rivalry effect is disregarded by default, and both eyes experience the same level of quality. This was a design decision mainly driven by the fact that there is no clear evidence regarding how quality should be distributed between left and right on the Apple Vision Pro. However, with MainConcept Codec SDK 16.2, the MV-HEVC encoder supports uneven quality distributions through the delta_qp_multiview setting. This allows users to steer the bitrate more towards the base view, which can result in a higher level of perceived quality on the headset. That brings us back to the question of how perceptual quality can be measured for stereoscopic video watched on the Vision Pro.
We already discussed how VMAF was regarded as unsuitable for the task, and it would not help to describe binocular effects for stereoscopic video. However, it fuses multiple video quality metrics into a single score via support vector machine regression. These underlying metrics still contain information on video quality that involves more sophisticated math and ideas than just some logarithmic version of the mean squared error, i.e., the PSNR. For instance, VMAF involves temporal features that account for artifacts that could be caused by motion compensation, for example. For this reason, we also measured VMAF scores for left and right view with the vmaf_4k_v0.6.1neg model independently. Even though our encoder can measure VMAF scores at encoding time, we provide an example below on how this can be approached with FFmpeg in the command line. We provide this, since not every encoder will provide measurements while processing, i.e., this call is most likely required when comparing different encoders.
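One way to set up such a measurement is sketched below. It is not necessarily the exact call used for our measurements: the Python helper only assembles the FFmpeg argument list for a single view and leaves execution to the caller. It assumes an FFmpeg build with libvmaf enabled, that each view of the bitstream has already been decoded to raw yuv420p10le at 4320x4320 and 90 fps, and that the model is available as a JSON file. All file names are hypothetical.

```python
def build_vmaf_command(distorted_yuv, reference_yuv, model_path, log_path,
                       size="4320x4320", fps=90, pix_fmt="yuv420p10le"):
    """Assemble an FFmpeg argument list that scores one decoded view with libvmaf."""
    # Raw video carries no header, so geometry, format and rate are given explicitly.
    raw_input = ["-f", "rawvideo", "-pix_fmt", pix_fmt,
                 "-video_size", size, "-framerate", str(fps)]
    # libvmaf expects the distorted video as first input and the reference as second.
    vmaf = (f"[0:v][1:v]libvmaf=model=path={model_path}"
            f":log_fmt=json:log_path={log_path}")
    return (["ffmpeg", "-hide_banner"]
            + raw_input + ["-i", distorted_yuv]
            + raw_input + ["-i", reference_yuv]
            + ["-lavfi", vmaf, "-f", "null", "-"])


# One call per view; the scores end up in the JSON log files.
for view in (0, 1):
    cmd = build_vmaf_command(f"decoded_view{view}.yuv", f"source_view{view}.yuv",
                             "vmaf_4k_v0.6.1neg.json", f"vmaf_view{view}.json")
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Paths containing colons or other characters that are special in FFmpeg filter strings would need additional escaping, which this sketch does not handle.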
The figure below shows the RD curves for the same bitstreams as above, but the video quality is measured in VMAF scores this time. Comparing these results to the PSNR-based RD curves, you can easily see that the range of the y-axis has changed from values in the 40 dB range to the typical VMAF score range, here spanning approximately 85 to 92. This correlates with the perceived quality when eyeballing sequences on the headset. Of course, to justify the accuracy of the vmaf_4k_v0.6.1neg model for Apple Immersive Video, standardized subjective testing involving many human ratings would be required. This is beyond the scope of this blog post and will potentially be subject to future research activities.
Naturally, we would like to compare the rate-distortion efficiency of our MV-HEVC encoder against others. The open-source HEVC encoder x265 is an obvious candidate and is advertised as also supporting MV-HEVC. However, after compiling a fresh 10-bit binary from the latest x265 source, we quickly discovered that 10-bit input is not supported for MV-HEVC. This was a real dealbreaker for us, as all our AIV test content is 10-bit HDR. Therefore, we turned our attention to other options.
NVIDIA’s hardware encoder also supports MV-HEVC on recent GPUs, but the interfacing compared to our encoders is slightly different. Instead of accepting two separate view files, NVENC’s sample application (AppEncCuda) expects a single input file in which the two views are temporally interleaved. Furthermore, NVENC expects NV12 as the input pixel format, which is a semi-planar 4:2:0 YUV format where the U and V chroma samples are also interleaved in pairs. The conversion from two separate, planar YUV files to single, interleaved, semi-planar YUV can be easily achieved using FFmpeg as follows:
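A conversion of this kind can be sketched as follows. The Python helper below only assembles the FFmpeg argument list and is not necessarily the exact call we used. It assumes both views are raw yuv420p10le files at 4320x4320 and 90 fps, that timestamps are counted in frame units of the filter time base, and that the FFmpeg version supports the `-fps_mode` option (older versions use `-vsync` instead). File names are hypothetical.

```python
def build_interleave_command(view0_yuv, view1_yuv, output_yuv,
                             size="4320x4320", fps=90, pix_fmt="yuv420p10le"):
    """Assemble an FFmpeg argument list that interleaves two views frame by frame."""
    raw_input = ["-f", "rawvideo", "-pix_fmt", pix_fmt,
                 "-video_size", size, "-framerate", str(fps)]
    graph = (
        "[0:v]setpts=2*N[v0];"      # view 0 gets even timestamps 0, 2, 4, ...
        "[1:v]setpts=2*N+1[v1];"    # view 1 gets odd timestamps 1, 3, 5, ...
        "[v0][v1]interleave,"       # merge frames in timestamp order
        "format=p010le[out]"        # planar 10-bit to semi-planar 10-bit
    )
    return (["ffmpeg", "-hide_banner"]
            + raw_input + ["-i", view0_yuv]
            + raw_input + ["-i", view1_yuv]
            + ["-filter_complex", graph, "-map", "[out]",
               # Pass every frame through without dropping or duplicating.
               "-fps_mode", "passthrough",
               "-f", "rawvideo", output_yuv])


cmd = build_interleave_command("view0_4320.yuv", "view1_4320.yuv", "interleaved_p010.yuv")
print(" ".join(cmd))
```

The output file then contains the frames in the order view 0, view 1, view 0, view 1 and so on, and is twice as large as a single-view file.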
The “setpts” filters assign even timestamps to the frames of the first input and odd timestamps to those of the second, so that “interleave” merges the frames of both views in alternating timestamp order. In a last filtering step, “format” converts each frame to the P010 pixel format, which is the 10-bit counterpart of NV12.
From this point, the NVENC encoding itself is straightforward. It is important to enable MV-HEVC via the “-enableMVHEVC 1” switch:
One important caveat worth mentioning: The target bitrate parameter is interpreted per-view, meaning that if an overall 100 Mbit/s MV-HEVC rendition is desired, we need to specify a bitrate of 50 Mbit/s.
The figure below shows the resulting RD curve for one specific sequence from the test set. For MainConcept’s HEVC encoder, no anomalies were observed, and the RD behavior matched the results presented above. For NVENC, we clearly observed that the target bitrate was not met accurately over the entire bitrate range. Also, the RD performance we measured was inferior to the MainConcept encoder when quality was measured in VMAF. For PSNR-based RD comparisons, NVENC seemed to perform on par with the MainConcept encoder in the higher bitrate range but fell behind at lower bitrates. Additionally, the MainConcept encoder was not operated at its highest performance level (15/31), so there is still some room for even better RD performance. However, operating at a higher performance level would result in higher computational complexity and increased runtime. As this post focuses on experimental file-based studies, we do not provide any encoding speed measurements. The raw video file-based workflow simply does not allow for accurate encoding speed measurements due to heavy I/O operations caused by very demanding file sizes and corresponding data transfer rates. Based on our experience with less demanding input data, we expect NVENC to outperform MainConcept’s encoders in terms of encoding speed but to deliver inferior RD performance.
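Bitrate accuracy can be checked directly from the size of the produced bitstream. The Python sketch below illustrates the arithmetic, including the per-view target that NVENC expects as described above. Function names and the example numbers are our own and serve illustration only; they are not measurement results.

```python
def nvenc_per_view_target(total_target_mbps, views=2):
    """Split the overall target bitrate evenly, since NVENC interprets it per view."""
    return total_target_mbps / views


def achieved_bitrate_mbps(stream_bytes, frames_per_view, fps):
    """Return the overall bitrate of a bitstream in Mbit/s from its size and duration."""
    duration_s = frames_per_view / fps
    return stream_bytes * 8 / duration_s / 1e6


def bitrate_deviation_percent(achieved_mbps, target_mbps):
    """Return the signed deviation from the target; positive means overshoot."""
    return 100.0 * (achieved_mbps - target_mbps) / target_mbps


# Hypothetical example: a 10 s clip (900 frames per view at 90 fps) of 125 MB.
achieved = achieved_bitrate_mbps(125_000_000, 900, 90)
print(nvenc_per_view_target(100))                # per-view value for a 100 Mbit/s rung
print(achieved)                                  # overall bitrate in Mbit/s
print(bitrate_deviation_percent(achieved, 100))  # deviation from the 100 Mbit/s target
```

Note that the duration is derived from the frame count per view, not from the total number of coded pictures in the MV-HEVC bitstream.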
In summary, our analysis shows that NVENC hardware encoding does not provide the same level of bitrate accuracy and bitrate efficiency as MainConcept software encoding.
Encoder comparisons for Apple Immersive Media come at a new level of complexity compared to single-view HD/UHD evaluations. We presented a practical file-based workflow combined with a basic strategy to assess the perceptual video quality using VMAF for AIV. Experimental results show that MainConcept’s MV-HEVC encoder outperforms hardware encoding provided by NVENC in terms of RD behavior. Due to the complexity of the workflow and its demands on file I/O, no meaningful encoder runtime measurements can be captured. In conclusion, the setup presented is highly experimental and meant for pure RD comparisons, which disregard practical workflows where video data would be kept in RAM to enable real-time capabilities.