Voices of Video
Explore the inner workings of video technology with Voices of Video: Inside the Tech. This podcast gathers industry experts and innovators to examine every facet of video technology, from decoding and encoding processes to the latest advancements in hardware versus software processing and codecs. Alongside these technical insights, we dive into practical techniques, emerging trends, and industry-shaping facts that define the future of video.
Ideal for engineers, developers, and tech enthusiasts, each episode offers hands-on advice and the in-depth knowledge you need to excel in today’s fast-evolving video landscape. Join us to master the tools, technologies, and trends driving the future of digital video.
Voices of Video
Revolutionizing Real-Time Engagement: New Horizons in AI-Driven Video Interactions with MuseMe
Could AI be the catalyst redefining how we experience videos in real-time? Join us as we uncover the groundbreaking shifts in interactive video technology with experts Mark and Philip from MuseMe. Their pioneering work has made it possible for live streaming to feel as dynamic and engaging as a group watch party, effortlessly replicating social dynamics and elevating viewer interaction to unprecedented heights. From Mark's initial ventures at LivePeer to the innovative strategies that now power MuseMe, we're exploring a future where video content isn't just seen—it's participated in.
Harnessing AI for video object detection has historically been a labor-intensive process, but MuseMe is changing the game. Learn how their cutting-edge approach allows users to add reference images seamlessly, linking detected objects to rich metadata like NFTs, without the need for cumbersome AI retraining. We dive into the technical marvels that underpin this innovation, from color-coded overlays to distributed microservices, illustrating how MuseMe balances affordability and efficiency with powerful GPU optimization—essential components for delivering a seamless interactive experience.
Facing regulatory hurdles in the EU, MuseMe navigates the complexities of deploying AI without compromising privacy. We discuss the implications of over-regulation and how alternative methods of user identification could offer a way forward. Moreover, the potential of MuseMe extends beyond entertainment into e-commerce and gaming, opening avenues for new revenue streams and interactive consumer experiences. As we wrap up, we invite you to explore MuseMe's transformative platform—poised to redefine how we engage with video content and unlock new realms of digital interactivity.
Stay tuned for more in-depth insights on video technology, trends, and practical applications. Subscribe to Voices of Video: Inside the Tech for exclusive, hands-on knowledge from the experts. For more resources, visit Voices of Video.
Speaker 2:Voices of Video.
Speaker 1:Well, good morning, good afternoon, good evening to everyone who's watching live, wherever you are in the world. We are so happy that you're here for another edition of Voices of Video, and today we have a really exciting discussion and, of course, you know, everything is AI, right? So we're going to be talking about interactive video, but specifically the application of AI, and I think this is going to be a really enlightening session, as Mark and Philip are here from MuseMe. Guys, thank you for joining, and welcome to Voices of Video. Thank you for having us.
Speaker 2:Thanks.
Speaker 1:Yeah, exactly. So I know you're joining us from Germany, correct, right?
Speaker 2:All right, I'm close to Berlin and Mark's closer to Hamburg.
Speaker 1:Yeah, yeah, that's great, that's great. Well, good, well, why don't we just start? Give us a real quick overview of who you are, tell the audience who you are, but tell us about MuseMe. And then I know that we're going to get in and do some demos and we're going to be able to talk about the technology. So I'm really excited. But who are you? Mark, why don't you
Speaker 3:start. Okay, yeah, mark Zmolkowski, I started my career with machine vision 20 years ago, then ventured into video streaming for a while and last year got into the AI plus video world. So I'm now a machine vision, ai and video all combined, which is kind of like the story of my life.
Speaker 2:Right, and I've also been working in this industry for about 20 years. I founded one of Germany's first production companies for live streaming, later created, like, the first live-stream transcoding SaaS, called Camfoo, which was then bought by Bowser, and after that I was basically building peer-to-peer infrastructure for video with LivePeer. We've done tons of AI research with LivePeer and, like four years ago, we saw this evolution of AI where it was clear that at one point it would be good enough to automate certain labor-intensive tasks that were making certain use cases in video completely not feasible. So this is basically the origin story of MuseMe: we tried to use AI to make a richer experience for viewers, and figured out that there's a clear evolution of these tools that leads to a point where you can basically add certain features completely automatically, while you couldn't ever do it manually before.
Speaker 1:Yeah, very interesting. Now I have to ask because I'm very familiar with Livepeer and we are definitely friends of the whole team over there and I really like the approach but why did you not end up just building this into the Livepeer platform?
Speaker 2:Livepeer is a peer-to-peer infrastructure layer, right? So it's not meant to build end-user applications. And what I did with Livepeer was basically start the AI processing pipeline that they endorse today. They are now talking about Catalyst and being able to do real-time tasks on top of live streams, and that's what I started four years ago when I was with Livepeer. But when I was in the middle of it, I saw this possibility of the technology making a very specific use case in live streaming possible, one that I was personally always most excited about, and that's interactivity, right?
Speaker 2:And if you think about live streaming, it was never so much about the content. It was always about the experience; that's the nature of live, right? If you go with friends to watch a football game, you're not so much interested in the content itself, you're interested in the experience that you have with your friends while you're watching it. And if you map this onto today's kids watching Twitch and interacting with gamers, it's the same thing. They want to be part of this. It's not so interesting that they play this game, of course it's part of the story, right, but it's interesting for the kids to be in that group, being able to influence what's happening on the screen.
Speaker 1:Yeah, very much so, I agree. Well, so let's start there. Why don't you explain what you have built, and then maybe you can start out with at least a higher-level explanation of the technology, and then maybe we go into a couple of demos so that everybody can see it. Seeing is believing, right? Absolutely. So, yeah, what
Speaker 2:do users want from interactivity in video? Right, they target a higher engagement rate, and don't think about, like, Netflix; think about TikTok, real-time interactivity, where you, as the broadcaster, have the option to do little deepfakes on your face, turn your face into something different, and users are able to push emojis into your stream. They can send you calls to action. This is for the next generation, of course, right, but it's exciting them tremendously. And what do creators need to make this work? They basically can't do it manually, like they can't have an OBS and then press several buttons to make it work in all kinds of situations. They need a hands-free experience where they don't have to touch anything, or somebody else has to do it for them.
Speaker 2:And that was also what was so prohibitive before. Those who did interactive live streams or videos most likely had a team doing it for them. It was really expensive, and it's really, you know, prone to error; all kinds of things could happen. And so this was always hindering that technology from really making sense and working out.
Speaker 1:Yeah, interesting.
Speaker 2:There are solutions for it. It's not like we would be the very first to make video interactive, of course. On YouTube, for example, you could just mark an area, basically a simple shape, put an image there and say, here's my next video if you want to watch it. People were using these very sparse tools to already build multiple-choice videos where you could click through your own storyline, but that was also very similar to what Netflix was doing, where you just have one or two choices and you go left or right. So not really interactive; it's more a little bit annoying that you have to decide every time: do I want the snake or the lion, or the shield or the sword?
Speaker 1:Right, yeah, yeah, I agree. Sorry, sorry.
Speaker 2:Mark.
Speaker 3:It's still very labor-intensive, right? You have to plan all of this. You need to decide when, what appears where, and what action is attached to it. So the whole MuseMe story is about automation: take away all of that overhead and enable a lightweight production team to use it. Yeah.
Speaker 2:With the current solutions, you often run into runtime issues on the client devices, right? I mean, if you only put one simple shape on top of a video player and it's not moving, not doing much, then you might get away with it. But as soon as you have moving content and you have to track that content with the interactivity fields, it becomes a nightmare for the client side to render this and execute on it. Most likely the client side will start to lag; it's not going to work properly. And it's very labor-intensive and not feasible for live: if you switch the camera and at the same time have to press a button to switch the interactivity, it's most likely going to fail.
Speaker 1:Yeah, for sure. Well, what can you show us? I know you have a couple of demos prepared. Yeah, let me share my screen. And Mark, while Philip is getting that ready, do you have anything else
Speaker 3:to add around the origin story? Well, I joined the game late. Philip gave me a demo last year, and I jumped on board because it's totally fascinating: it combines all the areas of expertise that I had before, plus adding the AI. And now I'm getting into the architecture of building this, which is something we can talk about in a moment. But let's get Philip started.
Speaker 2:I think this is already there. Mark joined us to solve one of the most crucial problems that we have, and that's multi-level classification. But let me give you the demo first. So what you see here is a video in which Big Buck Bunny, which we all know, of course, has been made interactive. I just added two links: if I click here, I go to the Blender website or to the Wikipedia page for Big Buck Bunny. It's just to show that any area in a video now becomes a clickable item, just like in a game, and with the technology you could put on top, it could be anything. You could run JavaScript on top; it could be any type of web functionality you could imagine being put on top of Big Buck Bunny. Now, it's frame- and pixel-accurate, as you see. So basically, we track Big Buck Bunny through the whole video and create a mask for it that we then use in the player for the navigation. So if I continue and stop again, then I see, okay, now it's basically on a different frame, the mask is in a different position, and that goes through the whole video. Wherever Big Buck Bunny is when I pause, it's going to be interactive at that very point.
Speaker 2:Now, how would users do this with us? This is the editor that we have for VOD, which features automated and manual ways to make content interactive in any YouTube video. So basically, there's either the way where you search for something: I could just search for bunny and it would then go through all the pictures. Oh, look, there's a false positive there, but in that case I could still go in and manually mark it, just by setting one point, and then I would basically have it marked and could track it through the whole video. There's also another option, and that's giving us a reference image: a Big Buck Bunny reference image. One of them is enough for us to find Big Buck Bunny across the whole video. So just upload one image, go to the editor, click on find all, and it's going to find all the objects based on your reference images.
Speaker 1:Yeah, yeah, sorry, Philip, just curious about the reference. You know, obviously you have different orientations, right? So it'd be impossible to load in all the different orientations of how that object would appear throughout the video. Is there a preferred one? Do you need kind of a head-on image? Is there some orientation that works better than others?
Speaker 2:So multiple images help, but that's not to say a single image wouldn't work. I'll tell you how this technology works. So what we do is, and this is basically a smart design, maybe you want to talk about that, Mark?
Speaker 3:So obviously, most people know by now that with AI there are multiple ways you can extract information from an image. What we call them is image embeddings: it's kind of the features that you can extract from an image, and those embeddings are not limited to the actual view of that image.
Speaker 3:So if the object changes, the matching of those embeddings can still succeed. What you do is take the reference images that come in, extract the information, the embeddings, and store them in a vector database that is fast and optimized for AI search. Then, when the video is processed, you look up the embeddings from the frame of the video, go into the vector database and try to match them with the best entry in the database; you look for the nearest neighbor and you set a certain threshold. If you have five different objects in the database, you obviously need a threshold to identify what the actual object is. And that is a very fast procedure: once you have the embeddings of the reference image, the matching is pretty quick.
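A minimal sketch of the nearest-neighbor matching Mark describes, assuming cosine similarity and a plain in-memory index; the names here (ReferenceIndex, embed, the 0.8 threshold) are illustrative placeholders, not MuseMe's actual implementation.

```python
# Sketch of the reference-image matching described above. embed() is a placeholder
# for whatever image-embedding encoder produces the vectors; NumPy stands in for
# the vector database's nearest-neighbor lookup.
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

class ReferenceIndex:
    """Stores embeddings of reference images and answers nearest-neighbor queries."""

    def __init__(self, threshold: float = 0.8):
        self.labels: list[str] = []
        self.vectors: list[np.ndarray] = []
        self.threshold = threshold                    # cosine-similarity cutoff

    def add_reference(self, label: str, embedding: np.ndarray) -> None:
        self.labels.append(label)
        self.vectors.append(normalize(embedding))

    def match(self, embedding: np.ndarray):
        """Return the label of the closest reference, or None if below threshold."""
        if not self.vectors:
            return None
        query = normalize(embedding)
        sims = np.stack(self.vectors) @ query         # cosine similarity per reference
        best = int(np.argmax(sims))
        return self.labels[best] if sims[best] >= self.threshold else None

# Usage, with embed() standing in for the encoder:
# index = ReferenceIndex(threshold=0.8)
# index.add_reference("big_buck_bunny", embed(load_image("bunny_ref.png")))
# label = index.match(embed(detected_object_crop))    # "big_buck_bunny" or None
```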
Speaker 1:Interesting. How long does it take to create the reference?
Speaker 2:Just like 100 milliseconds, yeah, something like that. Wow. I think it's even less; I think the latest benchmarks were like 20 embeddings per second on a single GPU, and you could have multiple at the same time. The beauty of that technology is that you can start to detect new objects that it has never seen before.
Speaker 2:Right, the AI hasn't been trained on that specific image. You still provide it, and the embeddings are enough to identify it in the video. That means, for example, if you have a store with a lot of products in it and you want to link them to your video library, you can do this with one click of a button. It will basically take all the images that you have of your products, with their descriptions and the links to your store, find the related videos and the objects in the videos, and link them automatically. And now, if you add a new product to your store, through the new embeddings that we receive and the relative matches that we see in the database, we can add products that you add to your store today to videos that you processed a couple of weeks ago, without reprocessing the video.
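A sketch of that retroactive linking, under the assumption that the embeddings of every object detected in already-processed videos are kept alongside their video ID and timestamp; the names below (StoredDetection, link_new_product) are illustrative, not MuseMe's API.

```python
# New product embeddings are matched against stored detections from past videos,
# so only metadata is written; the videos themselves are never reprocessed.
import numpy as np
from dataclasses import dataclass

@dataclass
class StoredDetection:
    video_id: str
    timestamp: float           # seconds into the video
    embedding: np.ndarray      # embedding of the detected object crop

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def link_new_product(product_embedding: np.ndarray,
                     product_url: str,
                     detections: list[StoredDetection],
                     threshold: float = 0.8) -> list[dict]:
    """Attach the new product's link to every past detection that matches it."""
    return [{"video_id": d.video_id, "timestamp": d.timestamp, "url": product_url}
            for d in detections
            if cosine(product_embedding, d.embedding) >= threshold]
```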
Speaker 1:Yeah, amazing. And this is the advantage of this approach over, like, an AI model, right, where you have to train the model?
Speaker 2:Well, the model is also trained, right, but it's trained on a different premise. It didn't have to come up with a result that has an accuracy between 0 and 1; it just had to tell us what it understands of that image. Mark, maybe you tell him what we do with the vector database to actually match these.
Speaker 3:I kind of described it before. You extract the vectors from the different reference images and put them into a vector database, so you have the reference images in the vector database. Then you take a frame from the video and you search for the different objects in the frame. You identify an object in the frame, take the rectangle and the embeddings from that rectangle, put them into the vector database and search for the nearest entry among the reference images that are already available there, and that will give you the closest match. And if that is within a threshold that you defined before, then the answer is: oh, it's this object. Therefore, the machine learning model that you are using to get the embeddings from the image doesn't need to understand what the actual object in the frame is, because it can generalize.
Speaker 2:For us, this was crucial. At the beginning, we had the strategy to have users basically manually label videos; we were going to take the labeled data, train AI from it, and then make specific objects available for automated detection. But then we talked to the content creators and they were basically telling us: well, this is similarly labor-intensive to manually setting up interactivity, and it would be a showstopper for many to even think about labeling data.
Speaker 2:So we needed to find a way to allow custom object detection with the minimum amount of work somebody has to put in to get it started, and the minimum amount of time to get it done. And now, with this technology, you can basically add a reference image during a live stream and it will still give you the right interactive options for a completely new object. Let's say it's a new piece of art that's just about to be revealed and you want to link the NFT from OpenSea to it. Then during the live show, as soon as it's released, you could just take that reference in and link it, without the AI having to be retrained.
Speaker 1:Yeah, amazing, wow, that's really cool. Now, a question just came in. By the way, a comment for the live audience: feel free to type in questions, we're going to try and get to everything. So one question just came in, and it seems like a good place to ask. You've identified the object; are you able to change the object, like this person said? Could you make the bunny black or white, could you make him slim, could you make him more fat, you know?
Speaker 1:Put a t-shirt on him all that kind of stuff.
Speaker 2:I think he's talking about the color overlay that we use for the navigation, right, but you mean a diffusion model, you mean turning the bunny into a cat, right?
Speaker 1:Yeah, I don't know. I mean the question is written but it says you know, can you change the color of the chosen character? So that would be like you say. But it also said let's say, make Bunny black and white or slim him out, which would be you're modifying his size, or make him wear a T-shirt.
Speaker 2:So the answer is, of course, yes, but it's a little bit of a misinterpretation of what we do with the color-coded overlays right now. We use them to show people how we actually map user navigation to metadata. Why are we doing it that way, with the color-coded overlay? Because it's really compute-heavy if you want to know where the mouse cursor is relative to a video screen, then click on it and figure out what the exact pixel position is; you have to do these calculations whenever you touch a pixel if you want to immediately show a label.
Speaker 2:So what we do instead is send out a second video stream. That second video stream is a color-coded representation of the original stream, just the masks, each with a single color in it. And now if you mouse over the image, you are actually touching that stream, invisibly, and all we do then is read out from the GPU which color you are touching. And that's really affordable for any type of device. If you go on your iPhone to the MuseMe site, you're going to see you don't need an app; it works natively in Safari and it's not going to heat up your phone at all. It's not much more than just watching the video itself. So this was a prerequisite for us to be able to move these interactive fields. If we hadn't done it like this, then we wouldn't be able to move the areas where you are able to interact around without losing compatibility with all kinds of end-user devices.
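A conceptual sketch of that color-coded lookup. In the real player this happens in the browser, with the mask stream rendered invisibly and the pixel under the cursor read back from the GPU; the Python below only illustrates the idea, and the colors, URLs and function names are assumptions.

```python
# Each detected object gets one flat color in the secondary mask stream, and that
# color is the key into metadata the player already holds.
import numpy as np

COLOR_TO_METADATA = {
    (255, 0, 0): {"label": "Big Buck Bunny", "url": "https://www.blender.org/"},
    (0, 255, 0): {"label": "Butterfly", "url": "https://en.wikipedia.org/wiki/Butterfly"},
}

def lookup_object(mask_frame: np.ndarray, x: int, y: int):
    """Return metadata for the object under the cursor, or None for background.

    mask_frame is an H x W x 3 array holding the decoded color-coded mask frame;
    it can be postage-stamp sized, with x and y scaled to its resolution.
    """
    r, g, b = (int(c) for c in mask_frame[y, x])
    return COLOR_TO_METADATA.get((r, g, b))

# On every mouse move, scale the cursor position to the mask resolution and call
# lookup_object(); a hit means the label can be shown immediately, with no
# geometry math on the client.
```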
Speaker 1:That's fascinating. So you're encoding that file twice, right? One of them is masked; you're masking out all the surroundings, the trees and the grass and everything, and then, when I mouse over, you're simply showing the two images on top of each other? Correct.
Speaker 2:Now, these masks can be used for shapes that WebGL, for example, can render for you. So instead of a single color-coded mask, I could show, like, a glowing ring around it, or I could do a pop-out effect, something like that. But for now, why do we have these simple colors in there? For debugging. For
Speaker 3:us, it's then very easy to see: is the red color
Speaker 2:really linking to that metadata, does it really work out? The end product could use the information from the mask and its positioning to create custom-looking overlays, right? Interesting, yeah. And we're also thinking about how we could implement diffusion so that we could actually alter the video itself. That's still a lot further out, and it's definitely going to come for VOD long before live, because it's processing-heavy.
Speaker 1:Yeah, it makes sense, makes sense. So, you mentioned a GPU, and remind me, I don't remember what the capacity was, but what sort of GPU level are you talking about? Let's talk about how compute-intensive this is and what's required on the infrastructure side.
Speaker 3:I can talk to that. On our side, the way we designed this is to make sure that we are using affordable GPUs. We are not talking about the high-end stuff that Meta and the likes are building into their data centers. We are trying to utilize a big array of really affordable GPUs, and therefore we have to make sure that all the AI models we are using are limited in the GPU memory they need. We'd rather utilize a couple more GPUs with a couple more models and then distribute the tasks than run those high-end models on a super-expensive GPU. Therefore, our job is to split everything up into services, microservices, distribute the tasks, collect the results and then feed them back into the pipeline. So that's the main challenge here, balancing cost versus performance.
Speaker 1:Yeah, it's always a challenge, right? Are you able to get some benefit from you know, I'm thinking like an ARM CPU, for example Ampere, where you could have 128 cores. Is there anything you can push off to a CPU, for example, and get some efficiency there?
Speaker 2:The GPU is much faster than the CPU, and we came from, like, the earliest version of MuseMe using four GPUs at a time for a single stream. Wow.
Speaker 2:Okay, and we were experimenting a lot until we found ways where we could actually use tiny models to get it done that do not require us to invest heavily in GPUs, because the smallest card you could currently buy that does what we are showing here is an NVIDIA 4060 card, for like 200 euros.
Speaker 1:That's very affordable, right. You mean, like, one
Speaker 2:stream at max, and maybe it lags if you have more than 20 objects in the picture at once. So the advantage of bigger GPUs for us would be being able to process more frames per second.
Speaker 1:That gives you the latency advantage, and more objects per frame. More objects, exactly. Because, as you referenced, you can do multiple objects. But is that a linear relationship in the computing horsepower needed? Let's say I want to do three objects, does that mean 3x, or is it even more? How does that scale?
Speaker 2:It definitely takes more resources per mask, right. But there are some advantages you have with AI processing, and one is that the entropy of the information is available in a downscaled version just as it is in 4K. The images that we process, when we show them to the AI, are postage-stamp size, so we can actually shrink the compute load and the network load down before processing, which makes it work. This is also an area that's consistently improving: AI is consistently getting faster, bandwidth is more available, GPUs are faster. So we knew already a couple of years back that there is this tipping point where this stuff just works out, hand in hand.
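A small sketch of that downscaling step, assuming frames are shrunk to an illustrative 256x144 before detection and the resulting boxes are scaled back to source resolution; the sizes and function names are assumptions, not MuseMe's actual settings.

```python
# Frames are downscaled before being handed to the detection models, cutting both
# network and GPU load; detections are mapped back so overlays stay pixel-accurate.
from PIL import Image

INFERENCE_SIZE = (256, 144)    # assumed "postage-stamp" resolution for the models

def prepare_for_inference(frame: Image.Image) -> Image.Image:
    """Downscale a decoded video frame before detection/embedding."""
    return frame.resize(INFERENCE_SIZE)

def scale_box_to_source(box, src_size, inference_size=INFERENCE_SIZE):
    """Map a detection box from the small frame back to source-resolution pixels."""
    (x0, y0, x1, y1), (sw, sh), (iw, ih) = box, src_size, inference_size
    fx, fy = sw / iw, sh / ih
    return (x0 * fx, y0 * fy, x1 * fx, y1 * fy)
```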
Speaker 1:You just said something, Philip, I want to explore a bit more, because I don't think everybody's familiar with this. So are you using a hierarchical type of approach? You mentioned that the actual resolution, for example the object, is postage-stamp size, you know, quarter resolution or maybe even less. So how are you scaling that, and what is the minimum resolution that you need to be able to detect an object? I've got a couple of questions embedded in there, but there's definitely a cutoff for that, right?
Speaker 2:I was sending Minecraft screenshots to Mark all the time to test the classification on that, because, of course, I want to make Minecraft gaming work out, but also because it is already really compressed. You just have a couple of pixels and it is a sword, right? So it's kind of... yeah, exactly.
Speaker 1:Yeah, interesting. Wow, that's super fascinating. A few more questions came in, and then I think actually you're going to show this working live, or do we have a video to simulate it? One of the questions is: can this be used for live streaming? So that's sort of the setup.
Speaker 3:Yeah, that's possible if you do the processing async. Obviously you have to build the architecture in a way that it's scalable, so if the load is getting too heavy from the video streams, then the detection frequency, I'd say, goes down. But the way you do it is you grab the frames out of the live stream, you send them for processing, and when the result comes back you start to fill the metadata into the stream. So it's not there in the first second, but it comes in over time.
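A conceptual sketch of that async flow, assuming frames are sampled at a few frames per second and results are merged into the stream metadata as they return; the function names, the 300 ms processing time and the roughly 3 fps sampling rate are illustrative assumptions.

```python
# Frames are sampled from the live stream and sent off for detection; whatever
# metadata comes back is merged in as it arrives, so the video never blocks and
# objects "fill in" over time.
import asyncio
import time

async def detect_objects(frame) -> list:
    """Placeholder for the round trip to a GPU worker (detect + match)."""
    await asyncio.sleep(0.3)                       # assumed ~300 ms processing
    return [{"label": "example", "ts": time.time()}]

async def live_pipeline(frame_source, metadata_store: list, sample_interval=0.33):
    """Sample frames at roughly 3 fps and attach detection results asynchronously."""
    pending: set = set()
    async for frame in frame_source:               # frames decoded from the stream
        pending.add(asyncio.create_task(detect_objects(frame)))
        done = {t for t in pending if t.done()}
        for t in done:                             # merge whatever is ready
            metadata_store.extend(t.result())      # playback is never blocked
        pending -= done
        await asyncio.sleep(sample_interval)       # detection rate, not video rate
```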
Speaker 1:So the objects get intelligent over time, the longer the stream runs. Interesting, okay. Maybe you want to give another demo here of it working live?
Speaker 2:Absolutely, yeah, absolutely. So let me quickly see. Maybe, Mark, you can give me some time until I have this started.
Speaker 3:Yeah, I mean, I can talk about that. The live scenario that I just mentioned is really the most challenging, obviously. On the back-end side, you have to make sure that you have everything available, obviously heavily redundant. Then the main scheduler that takes in the uncompressed images decoded from the live video stream has to choose: okay, which GPU is free, which model is currently available, what's the order of processing steps that I'm doing? And then, depending on the object type that you are seeing and the results that you're getting, or that you want to get, you have to optimize that pipeline. So imagine there are a lot of GPU instances available and you have to constantly manage getting all the different images from the live streams onto the different available GPUs. That management is crucial in those scenarios.
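A minimal, single-process sketch of that scheduling idea, assuming one worker task per GPU pulling frames from a shared queue; in reality this is a set of distributed microservices, and every name and number below is an illustrative assumption.

```python
# Frames from many live streams are fanned out over whichever GPU worker is free;
# a bounded queue provides backpressure when all GPUs are busy.
import asyncio

async def gpu_worker(gpu_id: int, jobs: asyncio.Queue, results: asyncio.Queue):
    """One worker per GPU: pull a frame, run the models, push the metadata back."""
    while True:
        stream_id, frame = await jobs.get()
        await asyncio.sleep(0.05)                          # stand-in for inference
        await results.put((stream_id, {"gpu": gpu_id, "objects": []}))
        jobs.task_done()

async def run_cluster(frames, num_gpus: int = 4):
    """Distribute (stream_id, frame) pairs across the available GPU workers."""
    jobs: asyncio.Queue = asyncio.Queue(maxsize=num_gpus * 2)   # backpressure
    results: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(gpu_worker(i, jobs, results))
               for i in range(num_gpus)]
    for stream_id, frame in frames:                        # frames from all streams
        await jobs.put((stream_id, frame))                 # blocks if GPUs are busy
    await jobs.join()                                      # wait for in-flight work
    for w in workers:
        w.cancel()
    return [results.get_nowait() for _ in range(results.qsize())]
```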
Speaker 2:Right, can you see my screen? Yes, awesome, great. So this is a Twitch stream, and this is our Twitch extension. You see, if I hover over the areas, this one is wrong, right, but it adds the labels at these points, and if I enable the color-coded overlay, you see where these areas are. But it really said the lemon is ketchup.
Speaker 2:So we're not done yet, but it's getting close. So if I start that stream now, you're going to see the interactive overlay; you should see the interactive overlay move with it. Let me quickly do one thing: I have created this video just in case, since it's a live demo.
Speaker 1:Only the bravest attempt
Speaker 2:live demos, yeah, exactly. So here you see a recording of the same thing. I have a couple of objects that I filmed; they automatically are being turned into interactive points, so nobody had to tell the system, hey, there's going to be a banana, make it interactive. And this is how it looks.
Speaker 2:No, this is completely interactive, and here you see the overlays. They are being rendered in real time, currently at three frames per second; more isn't really needed. Normally you wouldn't see this, right, because this is again just showing you how it works.
Speaker 3:So usually, you wouldn't have that issue at all, right where you would Unknown caller.
Speaker 2:Cut off no. Sorry, I just had a call coming in, I think.
Speaker 3:I just wanted to mention why there would be a wrong description for one of the objects in that video. It really depends; this is a live scenario, right? There are two ways we provide the functionality. The first one is, as we mentioned before, where the system has to find all the objects in the video itself, find metadata for them itself and identify the actions that are possible with such an object itself; then it can happen that every now and then it finds an object and identifies it completely wrong. But over time, with the models improving and our data growing and us being able to train them better, this will improve. So with the amount of users that we onboard, this will get more accurate.
Speaker 1:Do you foresee a scenario, because I could see a situation where a content owner, Netflix, for example, where they have their proprietary assets, both the content that they produce and maybe even content they've licensed that they have certain rights to or whatever, they're not going to want to share all of their references with, you know, Disney, for example, and vice versa. So my question is: would I, as Netflix, in this scenario, be able to own all of my references? Disney owns theirs, and they're not getting shared back and forth? Or does this go into some big shared database? How is all of that managed?
Speaker 2:I mean, we can't dictate what customers want, right? So if there are customers who say we would not want to share our data, of course we couldn't do that then. We haven't really decided on a single way data is shared across users, but in general, from a technology standpoint, what Mark said is true: if they share the data, then they basically aggregate the information, and the metadata gets better for everybody.
Speaker 1:No, they would benefit. I mean, clearly there's a benefit there, I can see that. And if they don't
Speaker 2:want to, they could still do it. We would most likely not force any of these decisions, because we don't own the content; we would just scare everybody away if we did. Oh, of course. So there are pros and cons to sharing content. Maybe they don't even have the rights to do it, right? That's also hard, so certain things are kind of heavily prohibited, or, from a cost perspective, they prohibit small companies from doing it and competing with bigger companies through that.
Speaker 2:One example is face detection. In my original proof-of-concept demo, I had face detection already in. I can't roll it out in the EU as a feature: face detection, literally that term, and the way you do it with biometric data, is prohibited. But what we figured out is that the embeddings of the image, even though they're not using biometric data, these AI embeddings also allow us to identify people, right? So without face detection, we can still get to the point where we say, hey, this is Mark, and just-
Speaker 1:Well, I mean, it makes sense. And I essentially know nothing about machine learning and image analysis, but I know, as the saying goes, just enough to be dangerous and ask the right questions. So if you're looking at relationships of, I don't know, Mark, is it like polygons or something? But if you're looking at relationships, why in the world does it matter if it's, you know, this bottle of water, or if it's my face, right?
Speaker 2:I mean, I don't know, this is legislation, right? It doesn't have to make sense.
Speaker 2:It has to work for a specific purpose, and the purpose was preventing identification of online users, to prevent them from being targeted for political ads, stuff like that. Correct, correct. And they overdid this regulation. But you know, it doesn't really matter, because the innovation happens so much faster. The areas, for example with biometric data, that they have now forbidden to use, or made really hard to use, where you need a digital privacy officer, educated and working for you just for that feature, it's not needed anymore. And I think what we see happening now is that they realize that the over-bureaucracy has made us all stuck, and people like Mark and I are trying to circumvent this and still steer that ship, right. And it's not going to stay like this forever, we hope. We hope for change there as well, so that we are more allowed to use these technologies as they come in, and maybe, if we do something really bad with them, then yes, get prosecuted, but not try to kill the technology upfront because eventually somebody is going to do something bad with it.
Speaker 3:A really good example of that is deepfakes.
Speaker 2:Everybody is so freaked out about deepfakes, and there's that race to detect deepfakes, which is a cat-and-mouse game where the next version of the deepfakes is not detected again. But what about the user, right? The user likes these technologies, for entertainment, for fun, and this is a much bigger use case than somebody trying to push fake information. Yeah, that's right. So why stop a technology that could allow you to become the hero in the movie that you're just watching?
Speaker 1:Yeah, that's right. Absolutely. Well, it's an interesting discussion. Okay, a couple of other questions. So I intentionally delayed this, I was successful for like 41 minutes: we haven't used the word latency, but we have to, because obviously, especially if you're going to say something's interactive, it needs to operate quickly enough and be responsive enough to be useful. I can't select an object and then 15 seconds later get a response; that's not very useful. So talk to us about latency. Maybe you can explain, just at a high level, across the chain, the workflow if you will, from glass to glass, where some of the bottlenecks are, where you are today and where you're optimizing, because latency is important with interactive technologies.
Speaker 3:I would separate it into three different latencies in our case. The first one is the actual video stream: the latency that you typically see on live streams today is three seconds, and that's the same for us. That's from taking the original content from the camera to displaying it on the viewer's screen. The second latency is how much time it takes in the live stream for the objects to become interactive, and that is populated over time, depending on how many resources are available for that stream on the GPUs at that moment. So it could be very fast, within a few seconds, or it could be a few seconds more if the GPUs are busy, but that's not harmful, because it happens in the background and more and more metadata becomes available. The third is how long it takes, once an object is interactive and the user clicks it, to get feedback, and that's instantaneous, because the metadata is already available in the player at that moment. So when somebody clicks the video, the player takes the metadata, creates the overlay and provides the options to you.
Speaker 2:So we target a three-second glass-to-glass latency, and the actual AI processing only takes like 300 milliseconds of it. But we have these additional hops: we first need the server to receive the SRT stream, then we have to push it further to the GPU, the GPU has to decode it, then the models are applied on top, and that only takes one or two frames of processing latency; the AI part itself is 60 milliseconds or something. And then we have to send the content back, package it, and it goes to the playback device. So the actual AI processing is nearly smaller than the buffer that you need for decoding in the processing pipeline, as well as on the client side.
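A rough budget for that glass-to-glass target. The 3-second total, the roughly 300 ms processing hop and the roughly 60 ms AI step come from the conversation; how the rest of the time splits up is an illustrative assumption.

```python
# Rough glass-to-glass latency budget for the interactive pipeline described above.
GLASS_TO_GLASS_TARGET_MS = 3000

budget_ms = {
    "ingest (receive SRT, forward to GPU)": 120,   # assumed
    "GPU decode":                            40,   # assumed, one to two frames
    "AI models (detect + match)":            60,   # stated in the episode
    "package color-coded mask + metadata":   80,   # assumed
    "network transport + player buffering": 2700,  # remainder of the 3 s target
}

processing_ms = sum(v for k, v in budget_ms.items() if "buffering" not in k)
print(f"processing hop: ~{processing_ms} ms of a {GLASS_TO_GLASS_TARGET_MS} ms budget")
# The AI step is a small slice; most of the latency is transport and buffering.
```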
Speaker 1:Interesting. Now, do you have any limitations on the streaming protocol that is used? For example, you mentioned SRT, but what about WebRTC, or if you were to use QUIC, or HLS or DASH?
Speaker 2:It's agnostic. It can work with any of these, also because it's just a metadata stream, yeah, right.
Speaker 1:I mean, it's just a metadata stream, so as long as the protocol supports that, which they all do, then the original video is whatever you would use, right?
Speaker 2:On Twitch it was H.264 and low-latency HLS. On YouTube, what I showed you was AV1 and, yeah, AV1 over DASH. So our player technology, that interactive overlay, fits all of these technologies, and under the hood, to get the metadata in, we use WebRTC video and WebRTC data channels. So the video we send, the color-coded stream, comes with a really optimized latency already, if you're talking about the web. And it allows us to scale this really nicely, because the postage-stamp-sized color-coded video is just 120 kilobits or whatever in bandwidth. So if you stream a 3-megabit HD video, the additional bandwidth you need for interactivity is like a second audio stream, right? Yeah, which
Speaker 2:makes it really, really easy for us to scale. It's also compatible with CDNs, so we could use Cloudflare, for example, to stream it to a million users.
Speaker 1:That makes it affordable and compatible.
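A quick check of that overhead, using the figures mentioned in the conversation.

```python
# The color-coded mask stream adds roughly an audio track's worth of bitrate.
video_kbps = 3000    # 3-megabit HD stream
mask_kbps = 120      # postage-stamp color-coded stream
print(f"interactivity overhead: {mask_kbps / video_kbps:.1%}")   # -> 4.0%
```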
Speaker 1:Yeah, that's great, that's amazing. Well, this has been a great discussion. I want to wrap up with what's next, like, where are you going from here? And I would like both of you to comment from a technical perspective on what's on your roadmap, either features-wise or maybe some things that you need to work out before you can really commercialize it. I think that would be interesting; listeners would like to know that. And then the second piece is: how do you plan to commercialize this? Are you going to be licensing this to vendors, who will then be building it into their solutions? Are you going to come out with a service?
Speaker 2:We're currently in that fuck around phase of a startup. We have the technology.
Speaker 1:I love that description by the way, because most people say it a little bit differently, but that's exactly what it is for almost all of us, yeah exactly.
Speaker 2:So, trial and error, we're trying to figure it out. We see some signals that are strong from certain markets, some that are less strong. A strong signal, for example, is from the gaming world. They really love what we show there; they're completely shocked, they didn't expect that to be possible. Their minds spin and they realize, ah dang, I can use this with my audience like this, and then they all tell us, basically, this is going to lead to a much bigger bonding experience for our users with us, because they can steer us around; they can really tell us what they want and we see it in real time. So if I want to just ask them, should I eat the apple or the banana, it's an instant thing. And previously their only feedback channel was the chat, which was completely messy.
Speaker 2:As soon as you have more than 10 people chatting, you can't do it hands-free; you have to monitor and read that chat, right? That's a lacking experience for the content creator. So they love it. Then the advertising industry: they are telling us that for them, the first step was to target their ads based on who's watching the ad. I know Mark Donnigan, he loves sports cars, so I show him the Porsche commercial when he's watching the ad breaks, right? Now,
Speaker 2:what's going to happen with the rich metadata that's being extracted in real time from linear channels is that you can tell the advertiser, the ad-insertion engine: these people are just at a burger joint or whatever, so play the McDonald's advertising, because they're all hungry for burgers right now. So this is also a strong signal. But there's much more. This could be used as a man-machine interface for robotics, for example. A robot running into an edge case doesn't know: is that a plastic bottle or a baby lying in front of it? It's an edge case that might completely stop its operations, and the most likely outcome for the time we are living in right now, maybe not ten years from now, but right now, is that a human takes over the robot, tells it what to do and takes care of the edge case.
Speaker 2:So this could be a man-machine interface for cases like this.
Speaker 1:Very interesting. So, Mark, you're obviously thinking about this, as is Philip, both of you. But what's on the roadmap? What work is still needed to, you know, get your first use case out there into production? What will be the first use case? Talk to us about that.
Speaker 3:Well, actually, that's kind of the separation between the two of us: when it comes to use cases, that's Philip's world. My world, right now and going forward, will forever be a loop. It will be finding the newest, latest, greatest models, optimizing them for performance, optimizing them for accuracy, retraining them, whether it's segmentation of the video, tracking of the objects, identification of the objects, getting the embeddings from the image into the vector database, or increasing the performance of the matching operation. That's a game that I will probably be playing for a long time, and therefore improving it over time.
Speaker 2:For your question about commercialization: we're thinking about selling white labels of this to existing video service providers. They most likely have their own niches that they target their software at, and they could use this to enrich their own service. For the end user, I think, like, for the gamers it would be really, really interesting.
Speaker 2:But they have very tiny wallets, right? They don't have a lot of income through revenue. But we could potentially change this with interactivity, if we successfully introduce pay-to-interact; then they'd have a tool where they basically get more money the more people are interacting. We think it eventually is going to lead to completely new content.
Speaker 2:It's also a little scary, right, but it could really mean that you basically put in a few cents because you want the content creator to play a specific song, or similar. Interesting. And another one would be e-commerce, of course. There are strong signals for e-commerce; it's just not so easy to get into these markets. What we can do now with the automated matching of products into videos, making their whole product suite available across video libraries, that would be more of a SaaS-type business, I think. So we're trying to figure out what to do next. We are open to partnering and trying things out with anybody right now and seeing what makes the most sense from there.
Speaker 1:Yeah, I can give you a couple of hints as to where to go on the e-commerce side. An obvious one is Shopify; you should go try and get to those guys and talk to them about what you're doing. There has to be some application that they could get really excited about, and certainly their users could get excited about. In Asia there is a platform called Shopee. If you're not familiar, and for listeners who don't know Shopee, it is this phenomenon of hosts who basically become live sellers of products. So if I were an influencer, say a fashion influencer or whatever, and I've got a following, then when that person goes live, they're basically literally selling products, and people in real time are literally pointing and clicking and buying or asking questions. For some of these hosts, it's a massive business.
Speaker 1:I mean, they're selling millions and millions of dollars' worth, and Shopee would be somebody that you absolutely should go talk to about this. And then there are others; they're not the only one, but they're one of the bigger ones for sure. I don't know if they're the biggest, but yeah, they're very, very large.
Speaker 2:So I mean, you could do a garage sale and just walk with your phone through your garage and tell stories about what you did with each thing, right, and meanwhile people bid on that specific object. Exactly.
Speaker 1:Exactly. Yeah, yeah. You know, it's not a world that, like, I've never worked in e-commerce in my past, but I always just have in the back of my mind that this is a market that I feel hasn't fully been cracked in video streaming, in the interactive way, the whole idea that people want experiences, right. And when you think about it, sure, there are some things we buy that are just pure commodities; I couldn't care less, I grab one off the shelf, pay for it and go home. But there are so many things, and they don't have to be big purchases either, where, if I can have an experience with it, not only is it more enjoyable, but I might actually spend more money, or I might buy more of something, because it's more than just, I need a whatever-the-thing-is that I'm shopping for. So, yeah. Well, guys, we've gone over, because this was really enjoyable, at least for me. I hope the listeners all appreciated hearing about what MuseMe is building.
Speaker 1:And Mark and Philip, thank you for joining us. We will link up in the show notes a link to your website. You guys are both on LinkedIn, right? You're pretty active, easy to find there. Okay, so if someone wants to get in touch with you, they can easily do that. So, yeah, well, thanks, guys. Thanks for joining
Speaker 2:Voices of Video. Thank you very much. Yeah, people, please sign up with MuseMe and try it out. The whole thing is completely free right now: no credit card, nothing. Amazing. The live beta is going to start soon. We wanted to have it out in September; we're like two months late now, but it's going to come really, really soon. Awesome. This episode of Voices of Video is brought to you by NETINT Technologies.