Podcast: The Prius of Artificial Intelligence: Hybrid AI Models

In this episode, we delve into the ever-evolving world of hybrid AI and generative AI models. Join us on a journey through the latest advancements in AI technology and their impact on scalability and real-time AI experiences, and explore how on-device AI and cloud AI work together to transform various industries, from automotive to extended reality.

This podcast is sponsored by Mouser Electronics


(3:47) - The Scalability Challenge: Focusing on Hybrid AI in the Future of Generative AI Models

This episode was brought to you by Mouser, our favorite place to get electronics parts for any project, whether it be a hobby at home or a prototype for work. Click HERE to explore the guide on how TensorFlow Lite models from Google can be deployed on edge devices!


Do you love the Prius as much as I do? Because let's be honest, it was one of the greatest automotive breakthroughs of the 21st century. Well, folks, the Prius of generative AI is here. We're gonna be able to offload a lot of this stuff onto our phones, onto our edge devices, and it's probably gonna lead to a future where we have better generative AI that's more cost efficient, more private, and less energy intensive. So, if you're a ChatGPT addict or just generally curious about generative AI and its future, well then buckle up because this one's for you.

I'm Daniel, and I'm Farbod. And this is the NextByte Podcast. Every week, we explore interesting and impactful tech and engineering content from Wevolver.com and deliver it to you in bite-sized episodes that are easy to understand, regardless of your background.

Farbod: All right, people, as you heard, we're talking about Edge AI. But before we jump into today's episode, let's take a step back and talk about today's sponsor, Mouser Electronics. Now you guys know we love talking about Mouser because they're one of the world's biggest electronic suppliers. And what that means, specifically about Edge AI, is that they know the suppliers, they know what's going on in academia, and they write some pretty cool articles about it. So, before we get into today's topic, we wanted to talk about one that's actually super relevant. They're talking about how you can deploy Edge AI devices pretty conveniently with TensorFlow Lite. So, if you don't know, TensorFlow is like a set of software libraries developed by Google that have machine learning capabilities. So instead of you training your own models for audio processing, for video processing, for any of that, you can just draw from it and just get going with whatever idea you have. And that's awesome. But what if you don't wanna run it on your computer or the cloud or whatever? Well, they have TensorFlow Lite, which is a subset of those libraries specifically meant for running on edge devices. And what this article goes into is like, hey, this is how you can utilize it. And this is how you can basically make your next project in a matter of what, a couple of days with hardware that you can actually get off of Mouser. And that's the coolest part, right? Not only do they give you this like little bit of insight about if you're interested in this, it's really easy to get started. This is the 101 on how to do it. And by the way, we even have the parts if you want to get off the ground this weekend.

Daniel: Yeah, well, we love Mouser. We've always said it's not, you know, or the saying is, it's not what you know, it's who you know. Mouser knows the what and the who. And in this case, they've also been willing to give us the how. So, they lay out some basic information on what TensorFlow is, how you could create your own edge AI project. And then they provide the bill of materials and even the block diagrams to understand exactly what's going on when you execute that yourselves. Obviously for folks like us who it's our mission to provide people with information so that technology that is at the cutting edge is approachable and understandable to them. It's exciting for us that Mouser is equipping engineers and makers and garage hobbyists alike to go do the same thing with cutting edge technology like AI.

Farbod: Yeah. And you know what I appreciate? We, I think one of the big things that you and I try to do is make cool tech accessible, right? Like not intimidating, easy to understand. When I read this article, like even through the lens of someone that doesn't have an in-depth technical background or doesn't know much about AI or ML, this is pretty accessible. Like if you're just generally interested in this topic, this article can help you get off the ground, which I think is awesome.

Daniel: Yeah, that's fire. And that's why we've linked it in the show notes.

Farbod: Yeah, definitely check it out. And let's use that as a segue into today's article because we're talking about what's been a hot topic for, like, I don't even know how long at this point, like months. It's consistently the hottest topic in tech. There are countless Twitter threads and LinkedIn threads and I don't know, I'm not even on Facebook anymore. I'm guessing Facebook posts about it. Generative AI, like this is your ChatGPT, this is your Stable Diffusion that's giving you all these cool pictures and whatnot. And why are we talking about generative AI? Well, like it's becoming more used in people's day-to-day lives. Like, ChatGPT came out, people were like, oh my god, this is crazy.

Daniel: I mean, it's the fastest growing application ever.

Farbod: Yeah, if you look at the charts, you have Facebook, which was crazy fast, and then you have ChatGPT, which is blowing it out of the water. But we work in tech, so we might be in a bubble. But it's pretty much become a part of my daily life. It's just a great tool that I use on a day-to-day basis. Well, what this article is saying is because of how generally useful it is and how fast these algorithms are getting better and more complex, they're just gonna become more and more utilized, which is great because it's good for us. However, when you think about the architecture of how these solutions are deployed, it's very cloud-centric. So, like, let's think about ChatGPT for a second. I'm gonna keep referring to it as an example because it's obviously popular. Every time you run a query, it's run on these servers, it's Azure services, so all the processing is done there. You need an internet connection, et cetera, et cetera. And some of the general concerns around that are that it's very expensive. It's very energy intensive. Some people have concerns about privacy because you're sending data that goes into a server, so you gotta make sure no one's in the middle picking that up.

Daniel: I know there are lots of companies that put limitations on the use of generative AI for the same reason, right?

Farbod: Absolutely, yeah. Some of them are setting up their own instances because they don't wanna do that. There's latency because I've run into this issue before, especially because I use the free version, I don't pay for my ChatGPT. If there's a lot of people trying to use the API, it slows down a lot. And if you don't have internet connection, you're just kind of out of luck because you can't access the service anymore. So those are like the general limitations. And I think most people are kind of, at least I am, in this place of, well yeah, that's just the price you pay for using this technology. It's slightly inconvenient, but what are you going to do? Well, this article that we're talking about today is actually coming from Qualcomm. If you don't know about Qualcomm, they're one of the leaders when it comes to chip making. Snapdragon chips are probably one of the most famous chipsets in the Silicon Valley. You're talking about Apple products, smart cars, everything, drone technology. They're basically writing this article proposing a solution that they call hybrid AI.

Daniel: Well, and I think before we get into what their solution is, at least personally for me, it's valuable to zoom completely out and talk about what this is a hybrid between. We have alluded to the two opposite ends of the spectrum so far in our conversation today, but at one end of the spectrum is cloud computing, cloud AI, and this is what you're talking about with ChatGPT, right? When you're inputting queries on your device, ChatGPT, OpenAI isn't actually computing anything locally on your device. It's all happening in the cloud. And the only thing that's being communicated back and forth is your query and then the response from the model. On the complete opposite end of the spectrum, when we were talking about edge AI using TensorFlow Lite, or referring to the Mouser article at the beginning, that is completely and wholly hosted and calculated and computed on your edge device. So, say you've got a tiny little chip that's trying to do some type of image processing or audio detection, et cetera, it's doing all of that computation on the microprocessor. It's not doing anything associated with the cloud. Right. Those are the two opposite ends of the spectrum for folks that either needed a refresher like I did or are trying to understand the context of this. And what Qualcomm is proposing here is a synergy between both of those approaches. So not just choose one, not just choose the other, find the opportunity to leverage the strengths of both and make a hybrid approach that's a little bit more scalable for the future. Cause we just talked about all the scalability issues associated with these awesome cloud computing models, but we also know that they're way too computationally intensive to move all of that computing out onto the edge. So, what's the happy medium here?

Farbod: Yeah. Yeah. And I'm going to draw a parallel into just standard web technology. We have over time moved into web frameworks that have the client side actually loading and doing a lot of the processing instead of the server side. And that has allowed a lot of developers, a lot of companies to offload those things onto the end user. So now it's like my MacBook Pro that's loading YouTube.com and it's taking on a lot of the computational stuff required to do that versus the server side taking on the bulk of that work.

Daniel: It sounds parasitic. Like, oh, we're offloading all of our computation to you, maybe it is in some ways, but it also vastly has improved the user experience, right?

Farbod: I agree.

Daniel: It reduces latency, it reduces these scalability issues, and all in all, it allows companies like YouTube to reach billions of people every single month without having to charge for it, which is crazy. So, we talk about one parallel there with web technology. There's another one that I want to draw, for folks who are like big gearheads like me and focused in the automotive space. I view this approach by Qualcomm as kind of being like the Prius of artificial intelligence, right? The Prius wasn't the fastest, most expensive, most luxurious car, but over the last 20 years, it's been the clear dominating presence in terms of hybrid propulsion in vehicles. They were the first ones, I think in 1997 or something like that, to make a commercially available vehicle that was a hybrid between an internal combustion engine and a battery electric vehicle, and those felt as though they were at two opposite ends of the spectrum. You couldn't possibly have a vehicle that's both internal combustion and battery electric. You have to pick one or the other. It feels like we're at this juncture right now with artificial intelligence computing. This team from Qualcomm is saying let's grab the best parts of both, let's make this hybrid model that, you know, maybe it isn't the shiniest, maybe it isn't the fastest, but it's scalable, it's reliable. And to me, that's why I think this approach is like the Prius of artificial intelligence here.

Farbod: I'm with you, dude. And I forgot to mention it. I actually found this really interesting thread that did some back of the napkin math about the cost of running ChatGPT for OpenAI. They're basically like, look, there's 175 billion parameters for ChatGPT 3.5. You need like this many A100 GPUs to run it. It costs $3 per hour to get these from Microsoft Azure, which is what they're doing. Every query costs this much, yada, yada, yada. What you need to know is the average cost for a query from ChatGPT is about one cent per search. And with the number of daily active users that they have, people estimate that last year, with their peak 10 million users per day, they were basically losing $100,000 a day. And now that's just significantly grown. So, take that into context and try to think about all these different generative AI solutions that we might be missing out on because no one has capital like that. Even OpenAI partnered with Microsoft to soften that load. Now if we can find some happy medium where we can get the benefits and these companies don't have to pay so much money just to get started, it could pave the way for a lot better future in terms of AI.
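
[Show-notes sketch] The napkin math Farbod quotes can be reproduced in a few lines. The one-cent-per-query and 10-million-users figures come from the episode; the assumption that each user sends exactly one query per day is ours, made only so the numbers line up with the quoted $100,000 a day. Treat it as an illustration of the arithmetic, not a verified cost model.

```python
# Back-of-the-napkin serving cost, using the figures quoted in the episode.
CENTS_PER_QUERY = 1              # "about one cent" per query (from the episode)
PEAK_DAILY_USERS = 10_000_000    # "peak 10 million users per day" (from the episode)
QUERIES_PER_USER_PER_DAY = 1     # ASSUMPTION: not stated in the episode


def daily_cost_usd(users, queries_per_user, cents_per_query):
    """Return the total daily serving cost in US dollars."""
    total_cents = users * queries_per_user * cents_per_query
    return total_cents / 100


cost = daily_cost_usd(PEAK_DAILY_USERS, QUERIES_PER_USER_PER_DAY, CENTS_PER_QUERY)
print(f"Estimated daily cost: ${cost:,.0f}")
```

Raising the hypothetical queries-per-user value shows how quickly the bill scales: the cost grows linearly with every extra query that has to be served from the cloud.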

Daniel: Yeah, and let's talk about the mechanics of how that's actually achieved, right? There are certain parts of a computational task that are going to be way too computationally intensive for us to ever break it down into something that could work on something like a TensorFlow Lite, right? Something that needs to be processed at the edge on a device with limited computational power. It's just never gonna work. At the same time, there are probably really, really small, quick, easy computations as a part of a major project that you're working on that could absolutely be calculated at the edge and never sent out to the cloud and subjected to all this latency and all these other inefficiencies you're talking about. The way I view it is like maybe there's, you know, you're working on a project with someone from work, there's a really, really hard sticky task that you know you need to go ask your staff engineer for help with, or you ask your manager for help understanding the sticky part of the problem. But then there's also like a lot of really, really easy parts like responding to email that you could do, absolutely you should do it on your own. And it's actually less efficient for you to try and forward an email to your manager for them to respond on your behalf. You've got all the power, you've got all the might to respond to that email yourself. Do it right now on your computer. Right now, what people have been doing is, as part of OpenAI, as an example with ChatGPT, taking all of those requests, the big, the small, all of those queries are being processed in the cloud. What Qualcomm is suggesting here is using techniques like pruning, quantization, knowledge distillation to break down the different aspects of computation and pass the more manageable parts of computation toward a local compact model like TensorFlow Lite that can be trained and fine-tuned for very specific tasks like responding to email. 
And then the more computationally heavy stuff that requires a lot of creativity or a lot of computational power, that's what's handled in the cloud. And those two kind of work together in a symphony to cover what you're trying to achieve as a major project.
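
[Show-notes sketch] The split Daniel describes, easy requests handled on the device and hard ones sent to the cloud, can be pictured as a small router. Everything named here (`HybridRouter`, `tiny_on_device_model`, `large_cloud_model`, the 20-word threshold) is hypothetical and made up for illustration; the episode does not say how Qualcomm decides which side handles a request, and word count is only a crude stand-in for task difficulty.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HybridRouter:
    """Send short prompts to a local model and everything else to the cloud."""
    local_model: Callable[[str], str]
    cloud_model: Callable[[str], str]
    max_local_words: int = 20  # ASSUMPTION: arbitrary threshold for illustration

    def route(self, prompt: str) -> str:
        """Return 'local' or 'cloud' depending on prompt length."""
        return "local" if len(prompt.split()) <= self.max_local_words else "cloud"

    def answer(self, prompt: str) -> str:
        """Dispatch the prompt to whichever model the router picked."""
        handler = self.local_model if self.route(prompt) == "local" else self.cloud_model
        return handler(prompt)


# Stand-in models: real ones would run inference instead of echoing text.
def tiny_on_device_model(prompt: str) -> str:
    return f"[device] {prompt[:30]}"


def large_cloud_model(prompt: str) -> str:
    return f"[cloud] {prompt[:30]}"


router = HybridRouter(tiny_on_device_model, large_cloud_model)
print(router.route("Reply yes to the meeting invite"))
print(router.route(" ".join(["word"] * 50)))
```

In the email analogy, the short reply stays on your own machine, while the long, sticky request is the one that gets forwarded to the staff engineer.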

Farbod: Yeah. So, we're now folks officially talking about the secret sauce of this like hybrid approach that Qualcomm is proposing here. Now they are aware of the limitations. Like you're saying, we have to find that happy medium of what makes sense to do locally versus offloading it to some cloud to handle for us, right? One big problem is obviously gonna be hardware resources. Like, is your phone going to have the space to store a 500 gigabyte model of ChatGPT? Probably not. Is it gonna have the computational power to handle all of this? Probably not. So, like, I think you mentioned three different items that they are proposing as a part of their sauce to offload this. But the one that I found most fascinating, and I think it might really be the juicy bit of the sauce, is quantization. They have like an entire link in this article that we're talking about today, which we're gonna link in the show notes, which breaks down exactly what this is. But the best way I found that explained is imagine it just as a way of compression, but for an entire algorithm. So instead of looking at this like beautiful colored image that has all the colors on the spectrum, you just turn it into a black and white and use that context to figure out what that image actually is. What that means for the model is, it's gonna require much less space. You don't need the full, let's say 500 gigabytes. You might need a tenth of it. You don't need as much computational power so it's not as memory intensive, which means you actually don't need that much power. So, it's much easier on your system as well. But what I'm finding to be really interesting is that in some cases, it doesn't even lose accuracy. Because when I tell you the picture analogy, right? You're like, it's obviously not as nice to look at a black and white picture as it is to look at a fully colored picture. But let me now transition that analogy to Spotify.
When you and I listen to Spotify, I would say like most people are content. They're like, this is good music. But that song is compressed.

Daniel: Absolutely.

Farbod: It is compressed. Like it's now more compact so that it's cheaper for Spotify to actually allow us to access it and whatnot. But it works, it's good enough. And what these researchers at Qualcomm are doing is that they're spending a lot of resources to figure out how do we do this process so well that we get the magic captured, the accuracy captured as much as possible without compromising on all these different parameters. In fact, they've already kind of done this with Stable Diffusion. They took the 32-bit Stable Diffusion, which I think is the default version of the product, turned it into 8-bit via their quantization platform, and they ran it locally on an edge device with no internet, none of that, and they're saying that they had no loss in accuracy. And they could even optimize it to run on low power devices, which is kind of nuts to think about.
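
[Show-notes sketch] To make the 32-bit to 8-bit idea concrete, here is a minimal sketch of symmetric 8-bit quantization on a handful of made-up weights. This is a generic textbook scheme, not Qualcomm's quantization platform, whose internals the episode does not describe; the function names and sample weights are ours.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [q * scale for q in quantized]


# Made-up example weights; a real model has millions or billions of these.
weights = [0.82, -1.27, 0.003, 0.5, -0.91]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("8-bit values:", q)
print(f"scale = {scale:.4f}, worst-case rounding error = {max_error:.4f}")
print("storage per weight: 32 bits -> 8 bits (4x smaller)")
```

Each weight now fits in one byte instead of four, and the rounding error is bounded by half the scale step. Whether a model keeps its accuracy after this kind of rounding depends on the model and on how carefully the scales are chosen, which is the part the researchers are spending their effort on.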

Daniel: Well, and I think that's a really interesting proof of concept, but I like understanding also how this fits into the bigger picture, right? So, the secret sauce is awesome, as long as it allows us to achieve some real-world outcomes that are interesting for users like you and I, that's what we always talk about. There are two reasons why we wanna talk about technology on this podcast. One of them is that it's interesting. We've already checked that box here. The second part is we want to make sure that it's impactful. It's actually going to make a difference for us in our daily lives. So, I want to kind of take this opportunity to zoom out and talk about what's the so what here? So how does Qualcomm see hybrid AI models impacting our daily lives? But first, I think the bridge that helps us get there is the understanding that they're going to use quantization. They're going to try to streamline these models with pruning, which means they chop the computation into smaller bits. Bits is a bad word because that's really the size of computation; chop it into smaller parts that are fewer bits apiece. One of the things that they mentioned is where they've got really, really strong quantized AI models that can do really one specific task really, really well without spending a lot of energy, without taking a lot of memory, without costing a lot of money. They can use those as like major building blocks. And then if you want to improve the quality or the user experience for the end user, you can kind of use those like bricks and then use a generative model like GPT or BERT or Bard from Google and use that as like the mortar between the bricks to fill in the gaps, to make it feel like it's still a creative experience. And that's where it really turns into this hybrid AI model where you're using really discrete chunks that are like small enough and energy efficient enough that you could run it on like an Arduino chip, do a bunch of those.
And then in between those chunks, you can layer in some creative AI from something like ChatGPT that helps package it nicely and helps make it easy for people to understand and makes it feel like you're still getting the awesome user experience. Like the first time I used ChatGPT, I'm like, wow, it feels like I'm chatting with a human, but without costing all of that inefficiency along the way.

Farbod: Yeah. By the way, I just, not to toot our own horn here, but the analogies in this episode have just been off the hook, man. But I agree, and I feel like we didn't note it, but we talked about Qualcomm being a big chip manufacturer and that totally makes sense. But when it comes to edge computing, they're also one of the bigger players because they mentioned in the article, when it comes to assistive AI in vehicles, I think they're the leaders. They work with Honda, Volvo, a bunch of different big manufacturers. So, they have the reps in on what's actually required to make these systems work, what the gaps are, and what could potentially be a solution, hence this article that we're talking about.

Daniel: Well, I appreciated what they mentioned in this article around how exactly technology like hybrid AI could be deployed for autonomous vehicles. And you have a set of local, very fine-tuned, specific task calculations that happen on the vehicle. And they aren't that computationally intensive to do the really safety critical things like stopping when there's something in front of you, like doing things like swerving into a lane or even changing lanes, right as we get more and more rigorous assistive driving technology. You use very, very well-oiled fine-tuned efficient building blocks like that and then you have a generative AI model in the cloud that doesn't need to respond very quickly, but it helps the vehicle to string together a bunch of those tasks to plan a really efficient route to get from point A to point B. That's a really symbiotic combination of those two different types of computation, those two different types of AI, where you're able to generate the most efficient route from point A to point B by building together a bunch of little blocks of these tasks that the vehicle's actually really, really good at doing on its own without any internet connection. And they also mentioned a similar example where they're able to create immersive and responsive mixed reality for users. So, developers can create mixed reality content that's both consistent and quality. And that's where you get these really, really fine-tuned, efficient bits that the local, the edge computation part of it. But it's also highly creative, highly personalized. And that's where you can feather in some input from the cloud computation model as well. So, I think it's really interesting how they talked about real world applications of this, where this technology is already being used. And then obviously, we're excited to see how this expands everywhere else. 
But that kind of got the wheels turning in my brain around, oh, I can absolutely understand how this might be used in autonomous vehicles. I can absolutely understand how this might be used in mixed reality. And then this helps open the door for more innovative applications in education, gaming, training. They even mentioned like some opportunities for helping make sure that like pilots and stuff like that get better training. It all, I don't know. It's got the gears turning in my brain around how we could possibly use hybrid AI, and I think it's the same way that the Prius wasn't that pretty to start. It didn't appear that beautiful, but look at the way that it impacted millions and millions of customers and the way that they were able to drive more efficiently. And then it also improved their user experience. The Prius is beautiful in its own way, even though until the 2023 model, I think it was quite ugly. But the 2023 model actually-

Farbod: I agree, we've talked about it. It's a good-looking car. I gotta say though, one thing that I've been thinking about while I was reading this article is how generative AI, its utilization and this potential hybrid approach, if it does get widely adopted, is gonna change the hardware design engineer side of things. Because already we see with like iPhones, right? They have about a lifetime of let's say four to five years as you reach the older generations. I just switched from an iPhone 7. They increasingly throttle your processor because it just needs to handle a much beefier OS. So, you can either sacrifice your battery life or your performance. And a lot of times it just hits the performance. Now we're talking about, you know, your OS is getting updated every couple of months. We're talking about generative AI that's getting updates, what, every month or so and becoming more and more and more complex. So, if we're really gonna start utilizing our devices to handle this, does that mean we really need to future-proof our hardware way more than we expected? Like, how is that gonna play out? Is my iPhone going to become outdated in two years instead of five?

Daniel: That's really interesting. I didn't think about that, but I could totally see a world in which the fast iteration of generative AI models totally makes our hardware feel obsolete within one to two years because the hardware can't keep up. But I also think that we're kind of safeguarded in some ways because we can't keep up with Moore's law right now. We're trying our best, but we can't continue to create increasingly powerful hardware at the rate that we thought or the way that we've historically been able to. I'm hoping that in some way the folks that are designing hybrid AI models like this are considerate of the fact that maybe some folks have limited computational power and maybe they've got different variations of a hybrid AI model that allow certain functionality to folks with the computational power but then also let you turn down your local computation amount, kind of like the way that you can, I don't know, change the quality by which you're streaming something on YouTube, right? You can do something without completely frying your device and it still allows you to be able to consume the content.

Farbod: That's a fair point. But yeah, I don't know. I've been increasingly thinking about hardware. I mean, I work in a hardware centric role and we have a hardware background, so that makes sense. But I don't know, I feel like in a software dominated world, hardware always comes second. But with the increasing utilization of NVIDIA GPUs and the demand for hardware performance. I feel like it's become more important. And our MIT trip, which we're going to be talking about relatively soon. So, you know, a little teaser right there.

Daniel: Yeah, it will. I think this is a cool part to wrap up. If you could hit us with a quick summary, but I will say before you do that, this interview we did with a bunch of folks at MIT, I think we're thinking about the same one with Joseph Ravichandran, where he's saying that the past 20 years of cyber security has been based in software and he thinks the future of security is actually in the hardware because software has started to become so bulletproofed that we're now having to test the limitations of the hardware instead. I think we're talking about the same thing here.

Farbod: Yeah.

Daniel: With AI, right, AI will continue to iterate and to develop at such a fast pace that soon the hardware becomes the limitation not what someone can design in the software world.

Farbod: Joseph's feedback was so, it just really hit the spot. It's been on my mind.

Daniel: And we'll be sharing more information from our, more content from our MIT trip soon.

Farbod: Yeah, I'm excited about that. But let's do the quick TLDR. So, folks, generative AI has taken the world by storm and it's still a big topic. I mean, look at Twitter, Facebook, whatever. Generative AI, ChatGPT, people are loving it, but the architecture is very cloud-centric, meaning every time you do a query, it's somewhere in a server farm, GPUs processing all this stuff. And that's typically very cost intensive. It's very energy intensive. Some people have privacy concerns. They don't want their data to go all the way over there. And therefore, manufacturers, folks in academia, like Qualcomm have been thinking about what if we have a hybrid model? And what that means is instead of everything being handled by the cloud, you do partially the heavy stuff on the cloud, but then the more lightweight stuff, maybe on your phone, maybe on your tablet, maybe on your computer. This hybrid approach could allow it to be much more cost efficient, much more energy efficient. In terms of privacy, you know, you could do it all locally, so you don't have to worry about that. But there's some challenges there as well. How do you make something that big run on a computer that does not have the hardware resources that you would typically have in a server farm? They've come up with some really nifty tricks like breaking up the packets, like making more condensed versions of these algorithms. Think like how Spotify streams music to us compressed. All in all, it seems to be a viable approach and potentially the approach for mass producing generative AI.

Daniel: Nailed it, dude.

Farbod: I try. I do what I can. Now, oh, before I forget, before we wrap up the episode, we were trending for the first time ever in Taiwan.

Daniel: Yeah.

Farbod: 150, 170, 150.

Daniel: I think so, top 150.

Farbod: Yeah, let's just say top 100s to make us feel good. But to our Taiwanese friends, thank you so much. Yeah. We got to figure out a Taiwanese spot to hit this weekend, just to celebrate.

Daniel: That's the deal we made with everyone, is you get us trending in the top 200 podcasts in your country. We're gonna enjoy some food from your country. So, we're gonna figure out how, where and when to consume some Taiwanese food and let everyone know how it went.

Farbod: Yeah, and if you got recs, hit us up on Twitter, Instagram, TikTok, email.

Daniel: Smoke signal. Messenger pigeon. Yeah. All the above. Walkie talkie.

Farbod: Or, you know, classic handwritten mail. We do appreciate those.

Daniel: Yeah, or the like two little soup cans with a string tied between them.

Farbod: Oh, those are the best. Yeah, those are my favorites.

Daniel: We'll take those too.

Farbod: Yeah. All right, folks. Thank you so much for listening. And as always, we'll catch you in the next one.

Daniel: Peace.

As always, you can find these and other interesting & impactful engineering articles on Wevolver.com.

To learn more about this show, please visit our shows page. By following the page, you will get automatic updates by email when a new show is published. Be sure to give us a follow and review on Apple podcasts, Spotify, and most of your favorite podcast platforms!


The Next Byte: We're two engineers on a mission to simplify complex science & technology, making it easy to understand. In each episode of our show, we dive into world-changing tech (such as AI, robotics, 3D printing, IoT, & much more), all while keeping it entertaining & engaging along the way.


The Next Byte Newsletter

Fuel your tech-savvy curiosity with “byte” sized digests of tech breakthroughs.
