Voices of Video

AI Hype vs. Broadcast Reality: Why FFmpeg Alone Isn’t Enough

NETINT Technologies Season 3 Episode 38

The promise of “just add AI” sounds great until your live feed is eight seconds behind and the subtitles miss the moment.

In this episode of Voices of Video, we confront the gap between AI hype and broadcast reality. From FFmpeg 8’s Whisper integration to off-the-shelf transcription and auto-dubbing, we break down why demos often fall apart in real production pipelines, and what it actually takes to deliver broadcast-grade results.

🔗 FFmpeg: https://ffmpeg.org
🔗 Whisper (OpenAI): https://openai.com/research/whisper

Drawing on real-world experience building live captions at scale, we unpack the hard constraints that matter in live video: latency, context, accuracy, and workflow integrity. Translation needs context. Live pipelines force tradeoffs. And “video in, text out” quickly turns into a dozen-plus processing steps—voice detection, hallucination filtering, diarization, domain dictionaries, blacklists, subtitle formatting, and delivery.
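As a rough sketch of what that chain can look like in code, the skeleton below expands the "video in, text out" one-liner into the steps named above. The function names are placeholders chosen to mirror the list in this episode, not Cires21's actual implementation.

```python
# Hypothetical skeleton of a broadcast-grade transcription pipeline.
# Every step is a placeholder named after the stages discussed in the episode;
# real pipelines (per the episode) have grown well beyond this list.

def extract_audio(video): ...
def detect_voice_activity(audio): ...
def transcribe(speech): ...
def filter_hallucinations(segments): ...
def diarize(segments, audio): ...
def apply_domain_dictionary(segments): ...
def apply_blacklist(segments): ...
def format_subtitles(segments): ...          # line length, reading speed, timing
def package_for_delivery(subtitles): ...     # e.g. WebVTT referenced from an HLS manifest

def run_pipeline(video):
    audio = extract_audio(video)
    speech = detect_voice_activity(audio)
    segments = transcribe(speech)
    segments = filter_hallucinations(segments)
    segments = diarize(segments, audio)
    segments = apply_domain_dictionary(segments)
    segments = apply_blacklist(segments)
    subtitles = format_subtitles(segments)
    return package_for_delivery(subtitles)
```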

That reality is why fully autonomous media pipelines still fall short. Instead, we explore a human-in-the-loop approach with Media Copilot, where automation accelerates transcription, speaker detection, highlights, summaries, and social crops, while humans retain control over speakers, entities, and house style.

🔗 Media Copilot (Cires21): https://cires21.com
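As a small illustration of that division of labor, the sketch below shows automated diarization output with a human correction layered on top. The data structures and names are hypothetical, not the Media Copilot API.

```python
# Hypothetical sketch: automated diarization output plus a human override map.
# Nothing here is Media Copilot's API; it only illustrates "automation
# proposes, a human corrects" for speaker labels.

auto_segments = [
    {"start": 0.0, "end": 4.2, "speaker": "SPEAKER_1", "text": "Welcome back."},
    {"start": 4.2, "end": 9.8, "speaker": "SPEAKER_2", "text": "Thanks for having me."},
]

# Corrections entered by a human reviewer ("this is not Speaker 2, this is Manu").
speaker_overrides = {"SPEAKER_2": "Manu"}

def apply_overrides(segments, overrides):
    """Return segments with human-corrected speaker labels applied."""
    return [{**seg, "speaker": overrides.get(seg["speaker"], seg["speaker"])}
            for seg in segments]

for seg in apply_overrides(auto_segments, speaker_overrides):
    print(f'{seg["speaker"]}: {seg["text"]}')
```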

You’ll also hear how live architectures balance speed and quality today: a flagship encoder feeding a live editor for recording and clipping, with near-real-time processing in Copilot. We look ahead to a direct encoder-to-Copilot workflow using chunked processing to prepare assets before a stream even ends, and how natural-language controls let producers request clips, formats, and quotes without touching APIs.
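A minimal sketch of the chunked idea, under stated assumptions: transcribe_chunk stands in for whatever per-chunk AI work runs (it is hypothetical, not the actual encoder-to-Copilot integration), and segments are handed over as they finish so that only a cheap merge remains when the stream ends.

```python
# Hypothetical sketch of chunked near-real-time processing for a live stream.

def transcribe_chunk(segment_path: str) -> str:
    """Placeholder for per-chunk AI work (transcription, tagging, etc.)."""
    return f"transcript of {segment_path}"

def process_live(segment_paths):
    """Process each segment as soon as it is finished; by the time the stream
    ends, per-chunk work is already done and only a merge step remains."""
    partials = []
    for path in segment_paths:                    # in a live setup this iterator
        partials.append(transcribe_chunk(path))   # yields segments as they land
    return "\n".join(partials)

print(process_live(["seg_000.ts", "seg_001.ts", "seg_002.ts"]))
```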

The takeaway isn’t that AI fails - it’s that reliability requires more than a single model. Invisible AI, integrated cleanly into existing CMS and MAM workflows, is what keeps teams fast without breaking what already works.

If you care about broadcast quality, human judgment, and AI that fits real production pipelines, this conversation offers a practical blueprint.

Episode Topics

• AI hype fatigue and why “video in, text out” fails
• FFmpeg 8 with Whisper: useful, but limited
• Live captions and unavoidable latency tradeoffs
• Broadcast quality vs. consumer-grade AI outputs
• The real 12+ step pipeline behind transcription
• Human-in-the-loop workflows for trust and speed
• Encoder → live editor → near-real-time AI processing
• Direct encoder-to-Copilot with chunked workflows
• Natural-language control for clips and summaries
• Avoiding AI data silos by integrating back into CMS

This episode of Voices of Video is brought to you by NETINT Technologies.
If you’re looking for cutting-edge video encoding solutions, visit:
🔗 https://netint.com

Stay tuned for more in-depth insights on video technology, trends, and practical applications. Subscribe to Voices of Video: Inside the Tech for exclusive, hands-on knowledge from the experts. For more resources, visit Voices of Video.

Nacho Mileo, Cires21:

Yeah, we're talking about AI, to nobody's surprise. And I wanted to start by turning this on first. I wanted to start here. This is how we feel right now; this is what it looks like not only at IBC but everywhere right now: LinkedIn, everything has AI. Everything has had AI since the launch of ChatGPT a while ago. And the question we have is how the AI hype has reinforced the fact that, in reality, it is not the way it should be, or it doesn't always work the way we expect. At the same time, inside this hype we've seen a ton of wrappers, wrappers on top of wrappers. We proposed this talk before FFmpeg 8 was released, and FFmpeg 8 includes Whisper in the new release. We will see that the fact that FFmpeg 8 brings in Whisper doesn't undermine our point; it strengthens it.

We had 15 years of experience in broadcast when we started this project. It was our first approach to AI: live captions. The idea behind live captions was taking an HLS stream from a client, passing it through AI while holding back two segments of it, and giving the client HLS back with subtitles and translations inside. The goal was to make it as transparent as possible to integrate into their pipelines, and it was a very challenging project in which we learned a lot about how AI pipelines work and about the transparency we wanted to offer. It's not easy to integrate this into a pipeline without harming the client. In this case we tried our best: give us your HLS and we give your HLS back almost untouched, just with the VTTs integrated and referenced in your manifest.

One of the first challenges we hit is that we cannot do this in real time, because translations need context, so we were holding back two segments of the video to do this. If the client was using, say, four-second segments, then once we gave them the HLS back it was eight seconds behind. For some clients this was okay; for others it was a total deal breaker.

So the next question is: are we seeing progress, or is it just a ton of hype? The answer is both. We can do a ton of transcriptions everywhere, but it's never broadcast quality, to be honest. We can do very nice dubbing, but it doesn't work for fiction. And at the same time, it often disrupts the way companies work: the way companies are trying to integrate AI in general breaks their workflows or makes them choppy and strange.

So, back to the beginning: there's a fantasy that you can just do this, and that's why I mentioned the FFmpeg integration. I love the fact that they put Whisper in, but this is fantasy, absolutely fantasy. We have tried this, and it just doesn't work like that. It takes text out, yes, but it doesn't give you subtitles, and it doesn't give you any sort of precision. This is how our own transcription pipeline worked one year ago. I won't show the current one because of IP, but this is what it looks like when you go into reality and do the reality check of what it actually takes.
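For reference, this is roughly what the "video in, text out" shortcut looks like. The sketch uses the open-source openai-whisper Python package rather than the FFmpeg 8 filter (so the calls below are not FFmpeg's), and the input filename is made up; the point is that a single model call stops exactly where the problems described here begin.

```python
# A minimal "video in, text out" sketch using the open-source `openai-whisper`
# package (pip install openai-whisper). This is NOT the FFmpeg 8 filter; it
# only illustrates how far one model call gets you.
import whisper

model = whisper.load_model("base")           # small general-purpose model
result = model.transcribe("program.mp4")     # hypothetical input file; the audio
                                             # is extracted via ffmpeg under the hood

print(result["text"])                        # raw text, no broadcast formatting
for seg in result["segments"]:
    # Rough timestamps only: no diarization, no hallucination filtering,
    # no domain dictionaries, no subtitle line-length or timing rules.
    print(f'{seg["start"]:.2f} -> {seg["end"]:.2f}: {seg["text"]}')
```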
So you get the audio, but then you need to detect whether there are voices, improve the transcription, detect hallucinations, diarize it, apply dictionaries, transcribe, apply blacklists, then format the subtitles and get them out. Something that should just be video in, transcription, text out turns out to be over 12 steps. And this diagram is already out of date; right now our transcription pipeline is about 12 steps longer than this.

So we strongly believe that AI should be invisible, or at least easy to integrate. And we believe that total automation of this stuff isn't real; it's impossible to achieve, at least for now. So we built Media Copilot. Media Copilot is a service, a SaaS, that lets you ingest video and get a ton of information around the video. We transcribe the video, we detect speakers, we create highlights and summaries, we give you the ability to crop for social media, we burn titles into it. There's a bunch of stuff, but we always assume that human intervention is necessary. Humans can come in and say, this is not Speaker 2, this is Manu, or this is not Speaker 2, this is Ryan. All of this works together with the people. We don't think there's a fully autonomous solution, at least for now.

Our approach to live pipelines is this: we have our encoder, which has been our flagship product for the last 17 years. We pass content through the live editor, where we are recording and clipping, and then we send those clips to Media Copilot. That's how it works for now, and it lets you process content in Media Copilot just one or two minutes after it went live; that's what it takes to send data from one side to the other. That's how we are working with some clients right now. This is a screenshot of what it looks like: the VPU is processing the channel here, we are recording the channel, creating clips, and sending them over to be processed in Media Copilot.

The next step, for which we already have a POC running, is going straight from the encoder to Media Copilot. The idea behind this is that most of the things we do here require a ton of context and require having the video to work on. We cannot dub live for now, and we cannot generate a summary of something that is still unfolding. So what we do is chunk the video and process it faster, so that once the video finishes, you have all the assets ready to go.

Finally, the way we interact with data is changing every day, and the launch of MCP by Anthropic a while ago changed everything. This is what it looks like. It's a very new thing we launched a week or so ago, in which you interact with your content in natural language. The idea is that you talk to your content instead of using the UI or the API of Media Copilot. You just say, hey, from this asset I need you to extract a video, in this case in 4:5, about the Gaza crisis or whatever you want to pull out, and it gives you back the asset directly. So in the end it's the least intrusive approach in terms of how it fits with what you already have. There are also a couple of challenges that come with this.
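As a rough illustration of that "talk to your content" idea, the sketch below shows the shape of it: a natural-language request becomes a structured intent, which is then dispatched to an ordinary clip-extraction call. Both parse_request and extract_clip are hypothetical stand-ins, not Cires21's MCP integration or API.

```python
# Purely illustrative sketch of natural-language control over a media asset.
# parse_request stands in for an LLM/MCP step and extract_clip for a
# Media Copilot-style API; both are hypothetical placeholders.

def parse_request(text: str) -> dict:
    """Stand-in for a model call that turns a prose request into an intent."""
    return {"action": "extract_clip", "aspect_ratio": "4:5", "topic": "gaza crisis"}

def extract_clip(asset_id: str, aspect_ratio: str, topic: str) -> str:
    """Stand-in for the platform call that actually cuts and crops the clip."""
    return f"{asset_id}_{topic.replace(' ', '-')}_{aspect_ratio}.mp4"

intent = parse_request("From this asset, give me a 4:5 clip about the Gaza crisis.")
if intent["action"] == "extract_clip":
    clip = extract_clip("asset-123", intent["aspect_ratio"], intent["topic"])
    print(clip)  # the producer gets the asset back without touching the API directly
```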
And then there are data silos: you create a ton of stuff here, you pull a ton of information out of your videos, but at the same time you need to integrate all of that back into your current CMS, your MAMs, and everything else. If not, you will again be creating huge silos that are disconnected from one another. Finally, if you want to test any of this, or if you want to see how we are leveraging VPUs for AI, just come over. We're on the corner, and thank you very much.

Voices of Video:

This episode of Voices of Video is brought to you by Netint Technologies. If you are looking for cutting edge video encoding solutions, check out Netint's products at netint.com.