Innovations and Advancements in Generative AI at the Edge

Edge AI Technology Report: Generative AI Edition, Chapter 2. Generative AI and edge computing are transforming industries by enabling low-latency, real-time AI on edge devices and allowing efficient, private, and personalized applications without reliance on data centers.

27 Nov 2024

We once believed the cloud was the final frontier for artificial intelligence (AI), but it turns out the real magic happens much closer to home: at the edge, where devices can now think, generate, and respond in real time. The rapid evolution of AI, particularly generative AI, is fundamentally reshaping industries and challenging the existing computing infrastructure.

Traditional models, especially resource-intensive ones like Large Language Models (LLMs), have long relied on centralized cloud systems for the necessary computational power. However, as the need for AI-driven interactions grows across sectors—from autonomous vehicles to personalized content generation—there is a clear shift toward edge computing. 

Our new report, drawing from discussions with industry leaders and technologists, explores how generative AI is being harnessed and integrated into edge environments and what this means for the future of technology.

Read an excerpt from the second chapter of the report below or read the full report by downloading it now.

Innovations and Advancements in Generative AI at the Edge

The convergence of generative AI and edge computing has ushered in a new era of possibilities, transforming industries with low-latency, real-time execution of AI models. Open-source LLMs, once thought to require enterprise-grade GPUs in data centers, now have the potential to run efficiently on edge devices, thanks to recent breakthroughs in AI model performance. This convergence enables a range of applications, from on-the-fly content generation to interactive user experiences, with enhanced privacy, bandwidth efficiency, and dynamic personalization.

Recent advancements, such as those in open-source models like Falcon and Llama 2, have further reduced the computational footprint of LLMs, making them more viable for deployment on edge hardware. This opens new avenues for industries requiring instantaneous, context-aware responses and real-time decision-making. From a technological standpoint, the shift towards deploying LLMs at the edge involves creating lightweight, storage-optimized, and data-efficient versions of these models that run on devices like smartphones, IoT gateways, and edge servers.

Industry Trends, Market Analysis, and Innovation Drivers

Most innovations in generative AI at the edge today are driven by the following key trends and factors:

  • Real-Time Data Processing Needs: Industries such as automotive, healthcare, and manufacturing require real-time data processing capabilities to enhance decision-making and operational efficiency. For instance, autonomous vehicles can process synthetic sensor data generated by edge-based AI models to navigate traffic in real time, improving response times and safety measures. Moreover, future factory workers will likely have their own LLM assistants running on their smartphones or other mobile devices.

  • Privacy and Security Concerns: Processing data locally on edge devices addresses privacy and security concerns, making generative AI at the edge an attractive option for sectors handling sensitive information. By keeping critical data closer to the source, organizations minimize the risks of data breaches during transmission. The rise of open-source LLMs deployed at the edge also offers greater control over how and where data is used without total reliance on cloud-based solutions.

  • Bandwidth and Latency Reduction: Edge computing helps reduce both bandwidth usage and latency by processing data on-site. This is essential for applications like AI-powered monitoring systems, which require constant updates and instant decisions. As companies increasingly deploy generative AI models, reducing dependency on cloud infrastructure will be vital to maintaining scalable operations.

  • Personalization and User Experience: One of the most exciting aspects of generative AI at the edge is its ability to offer highly personalized and interactive user experiences. By processing real-time data, edge-driven AI models can dynamically adjust content recommendations or services based on user preferences, creating richer, more customized experiences in retail, automotive, and media sectors.

Due to these trends, the market for generative AI at the edge is already experiencing rapid growth. This growth is reflected in most projections for the edge AI and generative AI markets, which include generative AI deployments at the edge. While the global edge AI market today is valued at just over USD 21 billion, market analysts expect it to surpass the USD 140 billion mark by 2034. Likewise, the generative AI market’s worth is estimated to reach USD 356 billion by 2030, up from USD 36 billion today. 

Many semiconductor enterprises offer products that facilitate deploying and operating generative AI solutions at the edge, and these will incrementally contribute to the market growth. For instance, NVIDIA is enabling LLMs at the edge through its IGX Orin Developer Kit, which is designed to handle the computational demands of LLMs in industrial and medical environments while at the same time providing real-time, AI-enabled sensor processing. Similarly, Ambarella brings generative AI capabilities to edge devices with its N1 System-on-Chip series. This solution supports multimodal LLMs with low power consumption, making it suitable for demanding edge-LLM applications like autonomous robots.

The NVIDIA IGX Orin Developer Kit is built to meet the high computational requirements of large language models (LLMs) in industrial and medical settings. Image credit: Nvidia.com

Most importantly, partnerships between semiconductor companies and LLM vendors are ensuring optimized, configurable deployments of LLMs on edge devices. Last year, for example, Qualcomm partnered with Meta to integrate Llama LLMs directly on edge devices. Such collaborations reduce reliance on cloud-based LLM services and contribute to the projected growth of the edge AI and generative AI markets.

Industry leaders and prominent researchers are advocating for reducing the size of LLMs, through techniques like quantization, to enhance their efficiency on resource-constrained hardware. Quantization converts model weights to lower-precision formats, saving memory and improving computational efficiency at the same time.
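As a minimal illustration of the idea, the sketch below applies post-training dynamic quantization in PyTorch, which stores the weights of selected layer types in int8 and dequantizes them on the fly. The toy layer sizes are illustrative, not any particular LLM:

```python
import os
import torch
import torch.nn as nn

# Toy stand-in for one LLM feed-forward block; sizes are illustrative.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 768),
)

# Post-training dynamic quantization: weights of the listed layer types
# are stored as int8 and dequantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m: nn.Module) -> float:
    """Serialize the model and report its on-disk size in megabytes."""
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"fp32: {size_mb(model):.1f} MB -> int8: {size_mb(quantized):.1f} MB")
```

Production edge deployments often push further, to 4-bit or mixed-precision schemes, but the arithmetic is the same: lower-precision weights mean a smaller memory footprint and faster inference.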


Harnessing Generative AI for Edge Applications with Edge Impulse

LLM-based generative AI has recently become one of the fastest-growing technologies, allowing users to create elaborate outputs, including text, code, images, videos, speech, and sounds, delivered in near-perfect quality. Trained on staggeringly massive datasets comprising significant portions of the internet, these tools are versatile synthesizers, creating new content from almost any prompt. However, due to the size of the underlying models, they have been relegated to powerful GPU-powered servers in giant data centers.

Edge Impulse sits on the other side of the AI spectrum, providing a platform that allows users to efficiently access, build, and deploy their AI models to run directly on any hardware. This includes ultra-compact, resource-constrained microcontrollers and microprocessors that run at the edge without cloud connectivity.

Yet, generative AI models are too large to run directly on edge devices. As industries demand real-time results and ever-smarter solutions, how might these two approaches interface to expand the benefits of each?

Edge Impulse has developed various LLM-based features that allow developers to access the specific parts of a generative AI model that directly benefit their projects. From synthetic data to intelligent, automatic data labeling, these new interfaces let users marry the efficiency of the edge with specific benefits of generative AI, locally and in real time.


Leveraging Foundation Models for Edge Applications

LLMs are a type of foundation model that is inherently large and resource-intensive. Unlike traditional AI models, foundation models are versatile and can be fine-tuned for various applications without extensive retraining. With more demand for real-time AI on edge devices, optimized versions of foundation models can help bridge the gap between cloud-scale intelligence and local, resource-constrained applications.

Foundation models offer powerful capabilities, such as zero-shot learning, enabling them to perform tasks without task-specific training. By incorporating foundation models into development workflows, Edge Impulse enables developers to extract and use valuable insights from large models to train smaller, more efficient models suitable for edge deployment. This opens new possibilities for edge applications, ranging from predictive maintenance in manufacturing to real-time diagnostics in healthcare.
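One common pattern for transferring a large model's knowledge into a compact, edge-deployable one is knowledge distillation. The sketch below shows the standard soft-target loss from Hinton et al.; it is a generic illustration of the technique, not Edge Impulse's specific pipeline:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft-target KL term (teacher -> student) with the usual
    hard-label cross-entropy, the standard Hinton-style recipe."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example: a batch of 8 samples over 10 classes.
s = torch.randn(8, 10)            # student (compact model) logits
t = torch.randn(8, 10)            # teacher (foundation model) logits
y = torch.randint(0, 10, (8,))    # ground-truth labels
print(distillation_loss(s, t, y))
```

The student learns from the teacher's full output distribution rather than just hard labels, which is how a small on-device model can inherit part of a foundation model's knowledge.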

Daniel Situnayake, Director of Machine Learning at Edge Impulse, explains, "We don't need to wait for models like GPT to run on edge devices. There are already ways to harness the power of these foundation models without needing to deploy the full-scale versions at the edge."

Edge Impulse is harnessing foundation model capabilities in many ways:

  1. Synthetic Data Generation: Edge Impulse integrates synthetic data generation tools, such as DALL-E for images, Whisper for voice data, and ElevenLabs for sound effects. These tools allow users to create artificial datasets that mimic real-world conditions, reducing the time and cost involved in traditional data collection. This is especially useful for generating data that is difficult or expensive to capture, like certain sound effects or rare visual scenarios. “One of the exciting aspects of synthetic data,” Situnayake says, “is that it reduces training costs because the data is inherently labeled, saving significant resources on manual labeling efforts.”

  2. Data Labeling: LLMs are used to automatically label visual and audio data, reducing the manual effort required. For example, satellite imagery can be labeled quickly with GPT-based models, allowing for the rapid creation of useful models from the same dataset (see the sketch after this list). LLMs also help automate audio dataset labeling, drawing on models hosted on platforms like Hugging Face.

  3. Data Cleansing and Validation: LLMs also clean and validate datasets. This process ensures that the data used for training models is of high quality, improving the accuracy and efficiency of edge AI models. LLMs can check data for inconsistencies and help in refining datasets.

  4. Compact Model Training: Edge Impulse uses LLMs' ability to understand imagery to automatically label objects in the data. This allows the creation of object detection models that embed a portion of the LLM's object-recognition capability, improving both the development speed and the accuracy of object detection on resource-constrained devices.
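To make the labeling pattern concrete, here is a minimal sketch of zero-shot image labeling with a multimodal LLM via the OpenAI Python SDK. The model choice, file name, and class list are illustrative assumptions, and this is not Edge Impulse's internal implementation:

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def label_image(path: str, classes: list[str]) -> str:
    """Ask a multimodal LLM to pick the best label for one image."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Label this image with exactly one of: "
                         f"{', '.join(classes)}. Reply with the label only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()

# Hypothetical satellite-imagery example.
print(label_image("frame_001.jpg", ["ship", "dock", "open water"]))
```

Run over a whole directory, a loop like this turns an unlabeled image dump into a training set for a compact classification or object detection model.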

Bringing Practical AI Solutions to the Edge

Edge Impulse enables developers to build and deploy models for tasks like audio keyword detection, computer vision, activity and sleep tracking, and predictive maintenance, even on limited hardware. It integrates tools that simplify dataset labeling, reducing the traditionally time-consuming process. Its integration of models like Segment Anything, OWL-ViT, and GPT-4o automates the labeling of training data for object detection, cutting down on manual effort.

As industries demand real-time AI applications, Edge Impulse enables models to run locally, reducing latency and the need for constant connectivity. In healthcare, this allows for on-device diagnostics and decision support, while in industrial automation, edge devices can monitor equipment in real time, identifying anomalies in production output or predicting maintenance needs before failures occur. Edge Impulse enables practical, real-world AI solutions at the edge, improving performance where needed most.

Scaling for Real-World Applications

Edge Impulse is actively scaling AI for edge use cases by optimizing models for efficient deployment on constrained devices. A primary challenge in edge AI is ensuring models remain lightweight and resource-efficient without sacrificing performance. By focusing on domain-specific models, Edge Impulse fine-tunes AI solutions for real-time use cases, minimizing power consumption while maintaining accuracy.

The platform provides an end-to-end workflow, covering everything from data collection and model training to deployment while incorporating advanced signal processing to extract features from sensor data. This holistic approach ensures models perform efficiently on edge devices without needing large-scale LLMs to run locally.

The Future of AI at the Edge

Edge Impulse is driving advancements in AI deployment strategies that minimize reliance on cloud computing. According to Situnayake, "We are rapidly approaching a future where edge devices will be able to handle more complex AI tasks autonomously, reducing the need for cloud-based processing and opening up new possibilities for real-time, on-device AI applications." This shift toward more independent edge computing aligns with trends like reduced network latency, enhanced data privacy, and bandwidth efficiency—key factors for the future of generative AI at the edge.

Looking ahead, the combination of more efficient models and advancements in hardware will allow even more sophisticated applications, such as autonomous robotics and real-time video generation, to run directly on edge devices. Situnayake paints an intriguing picture of generative AI: “Imagine a future where instead of streaming Netflix, you have a box generating TV shows in real-time based on your preferences. I think there's going to be all sorts of crazy stuff like that.” At the current pace of technological advancement, this type of content will eventually be generated directly at the edge. As AI continues to move toward that future, Edge Impulse’s platform is leading the way by bringing LLM capabilities and edge AI together, providing developers with the tools to build the next generation of AI-driven products.

AI Workloads: From the Far Edge to the Cloud

The emergence of the edge-LLM solutions we have mentioned enables a new class of generative AI solutions that distribute workloads across the edge-cloud computing continuum. Specifically, generative AI workloads in edge solutions are designed to operate in a coordinated manner, often escalating from the far edge to the cloud. This hierarchical approach ensures efficient data processing and resource utilization and relies on the following coordination and escalation path:

  1. Far-Edge Generative AI: At the far edge, generative AI data generation and initial processing occur on local devices such as sensors, cameras, or IoT devices. This stage focuses on real-time data analysis and immediate decision-making, using compressed, resource-efficient generative AI models with a fraction of the parameters of large-scale LLMs.

  2. Near-Edge Generative AI: Generative AI interactions that require further processing are transmitted to near-edge devices or edge servers. These servers handle more complex computations and aggregations, enabling deeper analysis based on larger LLMs than those deployed at the far edge.

  3. Cloud Generative AI: For extensive data analysis, long-term storage, and LLM interactions requiring very complex reasoning (e.g., decision support based on large amounts of data), data is escalated to the cloud. The cloud provides vast computational resources and storage capabilities, which enable the operation of the largest and most advanced generative AI models.

This multi-tiered approach allows for efficient AI processing, with immediate actions taken at the edge and more complex tasks handled in the cloud. Such coordination ensures optimal performance, reduced latency, and enhanced scalability while offering opportunities to use the most advanced LLM capabilities when required.
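In code, this escalation path can be as simple as a confidence-gated cascade. The sketch below is hypothetical: the three tier functions are placeholders for a compact on-device model, a mid-size edge-server model, and a cloud LLM, and the thresholds are illustrative:

```python
import random

def far_edge_model(prompt: str) -> tuple[str, float]:
    # Placeholder for a compact, quantized model on the device itself.
    return "local answer", random.uniform(0.5, 1.0)

def near_edge_model(prompt: str) -> tuple[str, float]:
    # Placeholder for a mid-size model on a nearby edge server.
    return "near-edge answer", random.uniform(0.6, 1.0)

def cloud_model(prompt: str) -> str:
    # Placeholder for a call to a full-scale cloud LLM.
    return "cloud answer"

def answer(prompt: str, local_threshold: float = 0.9,
           near_threshold: float = 0.7) -> str:
    """Escalate along the far-edge -> near-edge -> cloud path: each tier
    answers only when it is confident enough; the rest moves up."""
    text, conf = far_edge_model(prompt)
    if conf >= local_threshold:
        return text              # handled locally, lowest latency
    text, conf = near_edge_model(prompt)
    if conf >= near_threshold:
        return text              # handled at the near edge
    return cloud_model(prompt)   # escalated to the cloud

print(answer("Summarize the last hour of sensor readings."))
```

Real systems gate on richer signals than a single confidence score (token entropy, task type, privacy class), but the control flow follows this shape.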

Early generative AI and LLM deployments at the edge-cloud computing continuum have demonstrated the merits of this integration. At the same time, they have given rise to additional research and innovation activities that promise to revolutionize the deployment and efficiency of generative AI applications. These activities address critical challenges associated with cloud-based LLMs, such as latency, bandwidth usage, and data privacy.

Edge Training and Inference

Edge training and edge inference techniques are being developed to facilitate the efficient deployment of LLMs on resource-constrained edge devices. Innovation around edge LLMs increasingly focuses on enabling both training and inference at the edge, driven by the need for real-time processing and offline functionality in latency-sensitive applications like robotics and autonomous systems.

The combination of generative AI and edge intelligence (EI) offers a new growth trajectory by distributing lightweight models closer to terminal devices, as noted in a 2024 analysis by Chen et al. By 2025, an estimated 30.9 billion IoT devices will be connected globally, generating data volumes expected to reach 79.4 zettabytes. However, limitations in edge model scale often lead to unsatisfactory edge training and inference outcomes. The interaction between generative AI and EI aims to bridge this gap, enhancing training convergence and inference performance at the edge.

“We will witness the transformation of the network paradigm from the Internet of everything to the intelligence of everything, in which native AI will sink from distant cloud servers to the edge of the network.”

– Chen N. et al., IEEE members

Multimodal LLMs at the Edge

Another key trend is the development of multimodal LLMs at the edge, which can process and generate content simultaneously across various modalities, such as text, images, videos, and audio. These models are particularly suited for edge deployments where the ability to handle diverse data types locally and in real time can significantly enhance application performance and user experience. Gartner predicts that 40% of generative AI solutions will be multimodal by 2027, a significant shift from a mere 1% in 2023.

Prominent examples of such multimodal applications include OpenAI's Sora for text-to-video generation and Google's Gemini models, designed for multimodal applications with varying resource constraints. More fundamentally, the transformer architecture introduced by Vaswani et al. in 2017 enabled more efficient model designs that eliminate the need for convolutions and recurrence, capturing long-range dependencies while reducing training time, a key advantage for resource-constrained edge environments.

In addition, advancements in smaller LLMs are facilitating their deployment in edge environments, particularly for scenarios that require less precision or involve resource limitations. For example, BitNet and Gemma 2B models are optimized for edge devices, providing energy-efficient, cost-effective alternatives that approach the accuracy of full-precision models. These advancements allow LLMs to scale down in memory usage and energy consumption while maintaining robust capabilities for real-time, multimodal tasks.
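Running such a scaled-down model locally now takes only a few lines. The sketch below assumes a 4-bit GGUF build of a small instruction-tuned model and the llama-cpp-python bindings; the file name and generation settings are illustrative:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a 4-bit-quantized GGUF checkpoint (illustrative file name).
llm = Llama(model_path="gemma-2b-it-q4_k_m.gguf", n_ctx=2048, n_threads=4)

out = llm(
    "Summarize this vibration log in one sentence: ...",
    max_tokens=64,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```

On a laptop-class or high-end embedded CPU, a model of this size can respond interactively without a GPU or a network connection.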

Similarly, non-transformer models like Mamba and RWKV are breaking ground in computational efficiency, addressing challenges associated with sequence processing. These models, which incorporate elements like structured state spaces and linear attention, offer new possibilities for edge-based LLM deployments.

The progress in these multimodal and smaller models is particularly advantageous for edge-based generative AI applications that require lower latency, enhanced efficiency, and localized data processing, reducing reliance on cloud infrastructures while optimizing for the constraints of edge environments.

Reduced Connectivity Dependency

A primary motivation for shifting LLM inference to the edge is to reduce dependency on stable network connections. Edge-based LLMs can function effectively with limited or intermittent connectivity, which is critical for remote or mobile applications. Such reduced reliance on continuous connectivity is particularly beneficial in industrial or rural environments, where maintaining constant network access can be challenging.

Deployment of LLMs at the 5G and 6G Edge

The deployment of LLMs within 5G systems is already making headway, mainly through the use of Mobile Edge Computing (MEC). This approach leverages the high bandwidth and low latency of 5G networks to enable real-time processing and AI model execution closer to the data source, reducing latency and enhancing privacy by minimizing the need for data to travel back to centralized cloud servers.

However, the vision extends further with 6G. As 6G technology emerges, LLMs are expected to play a central role in advanced MEC architectures, enabling even more efficient processing through techniques like split learning, quantization, and parameter-sharing inference. These advancements will help address some of the current limitations in bandwidth, response times, and data privacy, especially in applications that require real-time decision-making at the edge. While 5G networks are facilitating the current deployment of edge-based AI models, the full promise of LLMs in edge AI is expected to be realized with the advent of 6G, projected closer to 2030.
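A toy sketch of the split-inference idea follows: the first few transformer blocks run on the device, and only the intermediate activations cross the network to the MEC server. Both halves live in one process here for illustration; in a real deployment they would run on separate machines with the activation tensor serialized over the link:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a transformer: eight identical encoder blocks.
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    for _ in range(8)
])
SPLIT = 2  # the device runs blocks[:SPLIT]; the MEC server runs the rest

def device_half(x: torch.Tensor) -> torch.Tensor:
    for blk in blocks[:SPLIT]:
        x = blk(x)
    return x  # this activation is all that crosses the 5G/6G link

def server_half(h: torch.Tensor) -> torch.Tensor:
    for blk in blocks[SPLIT:]:
        h = blk(h)
    return h

x = torch.randn(1, 16, 256)   # a sequence of 16 token embeddings
hidden = device_half(x)       # on-device compute
y = server_half(hidden)       # remote compute on the edge server
print(y.shape)
```

The split point trades device compute against uplink bandwidth, and choosing it dynamically per network condition is exactly the kind of optimization 6G-era MEC research targets.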

Personalization and User Experience

Edge-based LLMs enable dynamic personalization through real-time user interactions, allowing AI systems to adapt to individual preferences continuously. This is becoming increasingly crucial as companies strive to deliver hyper-personalized experiences. According to McKinsey, personalization at scale has become a competitive differentiator, with companies that excel in this area achieving up to 40% more revenue growth than their peers.

What sets edge-based personalization apart is its ability to keep sensitive data local. This local processing enables LLMs to generate tailored responses and recommendations based on immediate user inputs, enhancing both privacy and real-time responsiveness. Moreover, analyzing data in real time allows instant feedback loops, which are critical in sectors like retail and healthcare, where consumer behavior or patient data requires continuous adjustment to improve outcomes.

Furthermore, edge-based personalization opens new doors for consumer and industrial applications. From personalized marketing experiences, which McKinsey describes as the "holy grail" of customer engagement, to real-time industrial systems that adjust based on operator behavior, the integration of generative AI at the edge is set to transform user experiences across industries. As edge infrastructure continues to evolve and LLMs become more efficient, these systems will deliver even more accurate, contextual, and meaningful interactions.

By leveraging real-time adaptive models, edge-based LLMs will drive personalization and enable businesses to harness customer insights while safeguarding privacy—a critical balance in today’s data-driven world.

Edge LLM Solutions for Industrial Environments

Various companies are experimenting with edge-LLM deployments for industrial environments, since merging LLMs with edge computing can drive IT/OT convergence and enhance operational efficiencies. This convergence bridges the gap between information technology (IT) and operational technology (OT), enabling smarter, more responsive systems.

Edge-LLM platforms, such as those showcased by Edge Impulse, bring the power of LLMs directly into industrial environments by processing data locally on compact computing platforms, operating on data specific to the industrial processes at hand. This improves real-time decision-making and reduces latency, allowing critical operations to function without delay.

In environments like manufacturing or energy management, edge LLMs are vital in transforming large streams of sensor data into actionable insights through natural language processing and interpretation. These systems enable advanced diagnostics and predictive maintenance by monitoring industrial equipment in real time. Furthermore, LLMs deployed on edge AI platforms are highly scalable and customizable to the specific needs of industrial processes, creating more intelligent and efficient production lines.

Many leading companies and researchers are pushing the boundaries by enhancing the hardware and computational capabilities required to support LLMs at the edge. This evolution facilitates better handling of industrial automation tasks, empowering industries to adopt LLM-driven solutions that improve both the accuracy of real-time data interpretation and energy efficiency. Integrating LLMs at the edge with compact AI platforms is ushering in a new era of industrial operations, with clear benefits in areas such as predictive maintenance, fault detection, and streamlined communication across systems.

LLM Agents at the Edge

The development of LLM agents is transforming how AI models interact autonomously at the edge. Platforms like Mistral AI offer built-in agent development capabilities that allow developers to create, manage, and coordinate multiple LLM agents. These agents can be organized into workflows to tackle complex tasks, breaking down large problems into smaller, more manageable pieces that operate close to the data source. NVIDIA’s work in this space highlights how LLM agents optimize edge environments by running localized, self-contained processes for real-time decision-making and efficient communication.

LLM agents are built for autonomous interactions, leveraging the capabilities of multiple models to respond dynamically to tasks such as natural language understanding or content generation. The architecture of these agents is designed to scale, supporting diverse applications ranging from robotics to industrial automation, with frameworks like LangChain enhancing their flexibility. The future of LLM agents at the edge is set to grow, with research focusing on optimizing deployment across distributed networks and enabling localized intelligence for edge applications. By 2025, significant advancements are expected in LLM agents’ coordination and integration into real-world environments, driving further improvements in real-time AI deployments.
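At its core, an LLM agent is a loop: the model proposes an action, the runtime executes it, and the observation feeds back in. The sketch below is a generic, hypothetical illustration; the tool registry and JSON action format are invented for this example rather than taken from Mistral AI's or LangChain's APIs:

```python
import json

# Hypothetical tool registry for an edge agent: plain Python functions
# the model can invoke by name.
TOOLS = {
    "read_sensor": lambda sensor_id: {"sensor": sensor_id, "value": 42.0},
    "raise_alert": lambda message: f"ALERT sent: {message}",
}

def run_agent(llm, task: str, max_steps: int = 5) -> str:
    """Minimal agent loop: ask the model for the next action as JSON,
    execute the chosen tool, feed the observation back, stop on 'final'."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = json.loads(llm("\n".join(history)))
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](**action["args"])
        history.append(f"Observation: {result}")
    return "Stopped: step budget exhausted."

# A scripted stand-in for a real local model, for demonstration only.
responses = iter([
    '{"tool": "read_sensor", "args": {"sensor_id": "pump-3"}}',
    '{"final": "Pump-3 reads 42.0, within its normal range."}',
])
print(run_agent(lambda history: next(responses), "Check pump-3."))
```

A production agent would add validation of the model's JSON output, tool-permission checks, and persistent memory, but the decide-act-observe cycle is the same.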

Conclusion

Overall, a wide range of R&D activities towards the convergence of generative AI and edge computing are underway. Recent advancements in these fields are driving real-time data processing, enhancing privacy and security, and enabling dynamic personalization. The market for generative AI and edge computing is poised for significant growth, driven by increasing demand for advanced LLM capabilities and real-time data processing at various edge-cloud configurations. In this context, research and industrial organizations are establishing collaborative initiatives aimed at materializing these advancements faster and more effectively.