For years, development in video was driven mainly by larger resolutions and new codecs. Since early 2023, however, the focus of industry and consumers alike has also shifted to immersive, object-based audio formats. At NAB 2024, Salsa Sound, MainConcept and Fraunhofer IIS presented an exciting Proof of Concept (POC) showing live MPEG-DASH content creation with MPEG-H Audio authoring and encoding on a single system.
The production process for live events is complex regardless of whether you are targeting TV broadcast or adaptive bitrate streaming workflows. The authoring step for immersive, object-based audio, which is required to create MPEG-H Audio content, needs a lot of human interaction. Moreover, this process normally relies on separate servers for contribution and distribution. Finding a way to streamline the process and reduce the required hardware to a single system is of paramount interest to businesses, as it means significantly lower expenditure.
When Salsa Sound and Fraunhofer IIS described the pain points they were experiencing in the production process for MPEG-H Audio authoring and encoding, we were excited to join them in coming up with a solution. After some initial discussion and a few emails, we came to a simple conclusion: “Why not create a joint demo for NAB?” Of course, by that point only three weeks remained to set everything up for the show, but the technical expertise and dedication of all parties involved paid off in the end.
When Fraunhofer IIS introduced us to the UK-based company Salsa Sound, Ltd. and their MIXaiR™ software, we at MainConcept were immediately interested in working together to create live MPEG-DASH content with AVC/H.264 video and MPEG-H Audio.
Fraunhofer IIS is the primary developer of MPEG-H Audio, the next-generation audio system for UHD TV and streaming. It delivers personalized, immersive sound, making for unprecedented audio experiences. MPEG-H Audio is included in the ATSC, DVB, TTA (Korean) and SBTVD (Brazilian) TV standards and in the world’s first terrestrial UHD TV service in South Korea. In Brazil, it has been selected as the sole mandatory audio system for the country’s next-generation TV 3.0 broadcast service. Other countries and organizations, including ATSC 3.0 in the US, DVB in Europe, and ARIB in Japan, have already started to evaluate MPEG-H Audio as their sole or supplemental audio format.
The core software of this NAB demo was MIXaiR from Salsa Sound. MIXaiR is an AI-driven automatic mixing tool for live sports and concerts that independently generates a stunning immersive listening experience using standard microphone setups. No additional tracking or manual operation is required; Salsa Sound’s patented AI engine processing the different microphone feeds is sufficient. Besides creating compelling and engaging pitch mixes, the tool also autonomously manages crowd, commentary and AUX-in feeds, outputting as many different mix variants as the user requires. Each mix is normalized to the required loudness standard without user interaction to ensure compliance with the target platform. The workflow within MIXaiR is completely automated for object-based audio production and immersive output configuration, ready to be sent to a live encoder to generate MPEG-H Audio content.
This is where the MainConcept Live Encoder enters the picture. The software application can be used as both a contribution and a distribution encoder and supports various input and output stream types. Historically, Salsa Sound did not touch the video stream, so the Live Encoder was the missing piece needed to encode AVC/H.264 video as well as MPEG-H Audio and output the content for MPEG-DASH streaming. Although the Live Encoder supports multiple video codecs including HEVC/H.265 and VVC/H.266, the NAB demo was created using AVC up to full HD. In the long run, HEVC up to 4K and HLS might be of interest, as well as VVC and the enhancement format LCEVC.
Reading the above paragraphs, you might expect that setting up the NAB demo was a no-brainer! Life is not that simple, and software interaction can be complex, especially when you only have a three-week timeline. But first things first!
There was only limited space at the Salsa Sound NAB booth in the South Hall to set up this POC. Instead of a large server or workstation, the team used a simple laptop running Microsoft Windows with both Salsa Sound’s MIXaiR and MainConcept’s Live Encoder installed: all the demo equipment fit into a small cabinet.
The audio and video source content was streamed to the MIXaiR software on the laptop via NDI. The tool authored the audio and created a 16-channel PCM stream that was processed into the MPEG-H Audio Production Format (MPF) for contribution. In this format, the last channel is the Control Track, which carries all necessary metadata, such as the channel layout, the positions of the objects, loudness values and instructions on how to encode the immersive MPEG-H Audio data for delivery. Both the authored audio and the passed-through video were embedded onto SDI using a renderer within a Blackmagic Design UltraStudio 4K Mini, a portable Thunderbolt 3 capture and playback box connected to the laptop. The content was then looped back into the Blackmagic Design device via SDI as ingest for the MainConcept Live Encoder.
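To make that channel layout more tangible, here is a minimal sketch, assuming 48 kHz float32 PCM and a 20 ms frame size (both assumptions made only for this example): 15 channels of authored audio and one control-track channel are packed into a single 16-channel frame. How the control-track signal itself is encoded is defined by the MPEG-H Audio tools and is not shown here.

```python
import numpy as np

SAMPLE_RATE = 48000       # assumed broadcast sample rate
FRAME_SAMPLES = 960       # 20 ms at 48 kHz, chosen only for this sketch
NUM_CHANNELS = 16         # 15 audio channels + 1 control track (MPF layout)

def pack_mpf_frame(audio: np.ndarray, control_track: np.ndarray) -> np.ndarray:
    """Pack 15 channels of authored audio plus the control track into one
    16-channel PCM frame (samples x channels)."""
    assert audio.shape == (FRAME_SAMPLES, NUM_CHANNELS - 1)
    assert control_track.shape == (FRAME_SAMPLES,)

    frame = np.empty((FRAME_SAMPLES, NUM_CHANNELS), dtype=np.float32)
    frame[:, :NUM_CHANNELS - 1] = audio          # channels 1-15: beds and objects
    frame[:, NUM_CHANNELS - 1] = control_track   # channel 16: MPEG-H Control Track
    return frame
```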
The Live Encoder was set up to receive the video and the 16-channel MPEG-H Audio contribution signal (including the control track) via SDI. The included MainConcept AVC/H.264 software encoder was configured to create multiple resolution and quality layers, with MPEG-H Audio Emission (Distribution) encoding as the audio codec. The control track in the source provided by MIXaiR contained all settings and parameters needed to create proper MPEG-H Audio, guaranteeing an immersive listening experience the user can control on their playback device. As the output option from the Live Encoder, MPEG-DASH was selected for multiplexing and packaging to allow layer switching in case of network issues.
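As a purely illustrative example of what “multiple resolution and quality layers” can look like in such a setup, the sketch below enumerates a hypothetical AVC ladder up to full HD paired with a single MPEG-H Audio emission track. The resolutions and bitrates are assumptions; they do not reflect the Live Encoder’s actual configuration format or the settings used at NAB.

```python
# Hypothetical AVC ladder, for illustration only; the real demo settings and
# the Live Encoder's configuration format are not reproduced here.
AVC_LAYERS = [
    {"width": 1920, "height": 1080, "video_kbps": 6000},
    {"width": 1280, "height": 720,  "video_kbps": 3500},
    {"width": 960,  "height": 540,  "video_kbps": 2000},
    {"width": 640,  "height": 360,  "video_kbps": 1000},
]

# In a typical MPEG-DASH setup, one audio track is shared by all video layers,
# so switching video quality never interrupts the immersive MPEG-H Audio mix.
AUDIO_TRACK = {"codec": "MPEG-H Audio", "mode": "emission (distribution)"}

for layer in AVC_LAYERS:
    print(f"{layer['width']}x{layer['height']} AVC @ {layer['video_kbps']} kbps "
          f"+ {AUDIO_TRACK['codec']}")
```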
Although the MainConcept Live Encoder offers an integrated HTTP server, an external custom server was used to stream the MPEG-DASH content to an NVIDIA Shield TV, a versatile multimedia streaming box running Android that served as the playback device. The Fraunhofer MPEG-H WL-App, a multimedia player for Android OS, was installed on the NVIDIA Shield to receive and play back the incoming streams. An MPEG-H Audio-capable soundbar was connected to the device so visitors could listen to all the showcase features displayed on the large-screen TV.
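The external server used at the show is not detailed here, but a minimal stand-in is easy to sketch: any HTTP server that exposes the encoder’s DASH output with the expected MIME types will do. The directory name and port below are assumptions made for this example.

```python
# Minimal stand-in for an external HTTP server: serves the encoder's DASH
# output directory with the MIME types a DASH player expects.
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class DashRequestHandler(SimpleHTTPRequestHandler):
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".mpd": "application/dash+xml",  # DASH manifest
        ".m4s": "video/iso.segment",     # media segments
        ".mp4": "video/mp4",             # initialization segments
    }

if __name__ == "__main__":
    handler = partial(DashRequestHandler, directory="dash_output")  # assumed path
    ThreadingHTTPServer(("0.0.0.0", 8080), handler).serve_forever()
```

The player on the NVIDIA Shield then simply requests the .mpd manifest from this address and switches between the published layers on its own.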
During the NAB show, visitors to the Salsa Sound booth could experience interacting with the exciting MPEG-H Audio format while the video was displayed on the large screen. They could change the loudness of different objects (e.g. players on the pitch, away vs. home supporters, commentary). The whole production workflow powered by Salsa Sound, Fraunhofer IIS and MainConcept clearly proved that immersive audio will play a decisive role in live sports production and beyond. And keep in mind, this was “just” a POC set up in a few weeks!
Although the POC worked as expected, the solution demonstrated at NAB is not production-ready yet. The major bottleneck is how to transfer the 16-channel PCM audio data from Salsa Sound’s MIXaiR to the MainConcept Live Encoder. The current option of using SDI out and SDI in with a single card on a single system is too expensive and not feasible for a practical production workflow.
So what options are there? The requirements are clear: we need to package and transport 16 channels of uncompressed PCM from one software application to another without corrupting or losing the control track. The agreed-upon approach will target NDI ingest on the MainConcept Live Encoder side. NDI fulfills all requirements for sending the 16-channel PCM audio from MIXaiR to the Live Encoder while preserving the control track. The goal is to integrate the NDI SDK into the MainConcept Live Encoder to streamline the communication between the two tools.
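To illustrate why NDI is a good fit, here is a minimal sketch of the layout change involved, assuming the 16-channel frames described above: NDI carries audio as uncompressed planar 32-bit float, so each channel, including the control track, can be passed through untouched. The function name and test values are purely illustrative; the actual NDI SDK integration in the Live Encoder is not shown here.

```python
import numpy as np

NUM_CHANNELS = 16  # 15 audio channels + the MPEG-H Control Track

def to_planar_float32(frame: np.ndarray) -> np.ndarray:
    """Rearrange a (samples x 16) float32 PCM frame into the planar
    per-channel layout used by NDI audio frames. No resampling and no
    dithering: every channel is passed through untouched."""
    assert frame.ndim == 2 and frame.shape[1] == NUM_CHANNELS
    assert frame.dtype == np.float32
    return np.ascontiguousarray(frame.T)  # shape: (channels, samples)

if __name__ == "__main__":
    # Fabricated test frame: 20 ms of noise standing in for audio + control track.
    rng = np.random.default_rng(0)
    frame = rng.standard_normal((960, NUM_CHANNELS)).astype(np.float32)

    planar = to_planar_float32(frame)
    # The control track (channel 16) must arrive bit-exactly, otherwise the
    # emission encoder cannot recover the MPEG-H metadata.
    assert np.array_equal(planar[NUM_CHANNELS - 1], frame[:, NUM_CHANNELS - 1])
    print("control track preserved bit-exactly; planar shape:", planar.shape)
```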
Salsa Sound, MainConcept and Fraunhofer IIS are thrilled with the way this last-minute collaboration came together. We are all eager to provide a complete, production-ready solution that enables broadcasters to bring streaming content with immersive, object-based MPEG-H Audio to a wider audience that is enthusiastic about diving deeper into live sports events.
Learn more about the future-proof technologies Salsa Sound MIXaiR and MainConcept Live Encoder and how their synergies can enhance your live and MPEG-H Audio production workflows and increase your visibility in the market. And stay tuned for a ready-to-use solution that exactly meets your requirements!