Live translation devices will benefit from advances in model efficiency. Image credit: TimeKettle
In today’s evolving technology landscape, AI is not merely “artificial”; it is artfully intelligent. But unlike art, AI’s practical applicability requires high levels of efficiency, a challenge the industry still grapples with.
Deep neural networks make this challenge especially visible. While these networks dazzle us with their ability to recognize faces, translate languages, and make complex decisions, they also come with a voracious appetite for resources. High memory, compute, and energy consumption have become almost defining characteristics of contemporary AI models, holding AI back from true ubiquity and rendering the need for operational efficiency ever more pressing.
In this article, we take a deeper look at Qualcomm AI Research’s model efficiency research, which is aimed at enhancing power efficiency and boosting AI performance. From the intricacies of Neural Architecture Search (NAS) to groundbreaking quantization methods, we will discover how Qualcomm AI Research is leading the charge to unlock the full potential of AI.
Qualcomm AI Research is at the cutting edge of AI, committed to tackling the intricate challenge of AI model efficiency through practical strategies that aim to enhance both power efficiency and performance. With its aim to make AI’s core capabilities ubiquitous across devices, Qualcomm AI Research has adopted a holistic approach to AI model efficiency, delving into the practical aspects of full-stack AI optimization.
Addressing AI's resource-intensive nature requires an approach that encompasses machine learning hardware, software, and algorithms. Accordingly, Qualcomm AI Research's efforts aim to ensure that AI systems operate efficiently without compromising functionality, making AI more adaptable and efficient across a diverse range of devices, particularly mobile devices. As Jilei Hou, VP Engineering at Qualcomm AI Research, states, “Our holistic systems-approach to full-stack AI is accelerating the pipeline from research to commercialization.”
But what exactly is this holistic approach to AI model efficiency? How is it implemented, and what angles does Qualcomm AI Research tackle the model efficiency challenge from?
Qualcomm AI Research's mission to unlock the full potential of AI centers around a comprehensive strategy that attacks the model efficiency challenge from multiple angles. This multifaceted approach is driven by the recognition that optimizing AI models requires a combination of techniques.
Basically, there is no one-size-fits-all solution to AI model efficiency. Just like art, shrinking AI models down and running them efficiently on hardware can and should be approached in multiple ways, including quantization, compression, Neural Architecture Search (NAS), and compilation.
Qualcomm AI Research is a strong believer in quantization, as evidenced by its leading research and products in the market. Quantization enables an AI model to run efficiently on dedicated hardware, enhancing its performance while minimizing its consumption of power and memory bandwidth. With a focus on quantization-aware training and post-training quantization, Qualcomm AI Research’s results demonstrate how effective integer inference is in improving the tradeoff between accuracy and on-device latency.
With compression, Qualcomm AI Research aims to reduce the size of AI models by removing parameters without compromising their functionality. In other words, with no sacrifice to model accuracy, Qualcomm AI Research’s compression technique systematically removes activation nodes and connections between nodes, rendering the AI model smaller and more efficient. Quantization and compression of neural networks are supported by Qualcomm Innovation Center’s AI Model Efficiency Toolkit (AIMET) library.
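As a rough illustration of what removing connections can look like, the sketch below applies magnitude pruning, a common baseline that zeroes the smallest-magnitude weights. This technique is an assumption chosen for illustration; it is not Qualcomm AI Research's compression method and does not use the AIMET API. The helper name `prune_by_magnitude` and the sample weights are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (hypothetical helper)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    # A zeroed weight stands for a removed connection.
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


# Made-up weights for a single layer, flattened into a list.
layer_weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
sparse_weights = prune_by_magnitude(layer_weights, sparsity=0.5)
print(sparse_weights)
```

In practice a pruned model is usually fine-tuned afterwards to recover accuracy, and removing whole nodes or channels (rather than individual connections) tends to map better onto hardware.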
Another approach is Neural Architecture Search (NAS). Their research in NAS focuses on automating the design of efficient neural networks that are much smaller than a large state-of-the-art model. Qualcomm AI Research seeks to develop search algorithms and methodologies that automatically create optimal network architectures tailored to specific tasks. This streamlines the process of designing AI models, enhancing their efficiency and performance.
Furthermore, advanced compilation techniques optimize the execution of AI models on various hardware platforms. By tailoring AI models to the underlying hardware architecture, they ensure that the models run efficiently, achieving the desired performance while conserving resources.
In the following subsections, we look deeper into quantization and NAS, highlighting how Qualcomm AI Research is enabling these approaches to unlock AI’s potential even further.
Quantization reduces the number of bits necessary for representing information in the weights and activations of AI models, boosting power efficiency and performance while reducing memory bandwidth and storage. It also helps enable simultaneous use of multiple AI models, a use case found in demanding application areas like mobile, automotive, and XR.
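To make the idea of using fewer bits concrete, here is a minimal sketch of uniform affine quantization, one standard way to map floating-point values onto an integer grid. It is a generic illustration under stated assumptions, not the scheme used in any particular Qualcomm product; the function names and sample values are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1] (illustrative)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Make sure 0.0 is inside the range so it has a grid point of its own.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


def max_error(values, num_bits):
    """Largest absolute round-trip error at the given bit-width."""
    q, scale, zp = quantize(values, num_bits)
    restored = dequantize(q, scale, zp)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # made-up sample values
error_8bit = max_error(weights, num_bits=8)
error_4bit = max_error(weights, num_bits=4)
print(f"8-bit max error: {error_8bit:.4f}, 4-bit max error: {error_4bit:.4f}")
```

Storing each value in 8 bits instead of a 32-bit float cuts storage by 4x, and 4 bits cuts it by 8x, at the price of a coarser grid and therefore a larger rounding error.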
Qualcomm AI Research has been at the forefront of research in quantization techniques that are revolutionizing power-efficient AI. Their research covers both quantization-aware training and post-training quantization for efficient integer AI inference. Their products support quantization down to 4-bit integers, showcasing their commitment to making AI more power-efficient.
In fact, some of their recent findings revealed issues that affect model accuracy, one of which is weights oscillating between adjacent quantization levels during quantization-aware training. With state-of-the-art (SOTA) methods called oscillation dampening and iterative freezing, accuracy improves even at bit-widths under 4 bits. Qualcomm AI Research has also explored the role of floating-point arithmetic in AI inference, shedding light on the balance between precision and efficiency. This research weighs the trade-offs between floating-point and integer representations in AI models and concludes that integer inference is the best approach. Another recent study, on post-training quantization (PTQ) and its impact on softmax activation, highlights a technique proposed by Qualcomm AI Research called offline bias correction. This method enhances the quantizability of softmax with no additional compute during deployment, significantly improving the accuracy of 8-bit quantized softmax.
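To see how a weight can oscillate during training, consider a toy case, sketched here under stated assumptions: a single weight is trained through a rounding quantizer with a straight-through gradient, and its ideal value (0.25) lies between two grid points (0 and 1). The freezing rule below is a simplified stand-in for illustration only, not the published oscillation dampening or iterative freezing algorithms; every name and hyperparameter is hypothetical.

```python
import math


def quantize(w, step=1.0):
    """Round to the nearest grid point (ties round up)."""
    return math.floor(w / step + 0.5) * step


def train_toy_weight(steps=40, lr=0.1, target=0.25, w=0.4, freeze_after=None):
    """Train one latent weight so its quantized value approaches `target`.

    Returns the quantized value at every step and the number of times that
    value flipped between grid points.
    """
    history = []
    flips = 0
    frozen_value = None
    previous_q = quantize(w)
    for _ in range(steps):
        if frozen_value is not None:
            history.append(frozen_value)  # frozen weights no longer update
            continue
        q = quantize(w)
        if q != previous_q:
            flips += 1
        previous_q = q
        history.append(q)
        if freeze_after is not None and flips >= freeze_after:
            # Simplified rule: freeze to the grid point visited most often.
            frozen_value = max(set(history), key=history.count)
            continue
        # Straight-through estimator: gradient of 0.5 * (q - target)**2
        # is applied directly to the latent weight.
        w -= lr * (q - target)
    return history, flips


free_history, free_flips = train_toy_weight()
frozen_history, frozen_flips = train_toy_weight(freeze_after=3)
print(f"flips without freezing: {free_flips}")
print(f"flips with freezing:    {frozen_flips}, final value: {frozen_history[-1]}")
```

Because neither grid point matches the target, the latent weight hovers around the rounding boundary and the quantized value keeps flipping; freezing stops the flipping by pinning the weight to one grid point.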
Quantization is a highly effective approach to AI model efficiency. Qualcomm AI Research's leading quantization research, combined with Qualcomm’s leading processor solutions with AI acceleration supporting 8-bit or 4-bit integers, has resulted in industry-leading performance per watt, showing the importance of tailoring AI models to the specific constraints of edge devices.
With Neural Architecture Search (NAS), Qualcomm AI Research focuses on automating the design and training of efficient neural networks, aligning them more closely with dedicated tasks and optimizing their overall performance. NAS consists of four components:
A search space (what types of networks can be searched over)
An accuracy predictor (how accurate a network will be)
A latency predictor (how fast a network will run)
A search algorithm (what the best architecture is for a particular task)
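The four components listed above can be sketched in a few lines. This is a toy illustration under stated assumptions: the search space options, both predictor formulas, and the exhaustive search are all made up for the example and do not represent Qualcomm AI Research's methods.

```python
from itertools import product

# 1. Search space: every combination of these hypothetical options is a candidate.
SEARCH_SPACE = {
    "depth": [2, 3, 4],
    "kernel_size": [3, 5, 7],
    "expansion": [2, 4, 6],
}


def predict_accuracy(arch):
    """2. Accuracy predictor: a made-up proxy, not a trained model."""
    return (
        0.60
        + 0.04 * arch["depth"]
        + 0.01 * arch["kernel_size"]
        + 0.015 * arch["expansion"]
    )


def predict_latency_ms(arch):
    """3. Latency predictor: a made-up cost model that grows with network size."""
    return 0.5 * arch["depth"] * arch["expansion"] * arch["kernel_size"] ** 2 / 9


def search(latency_budget_ms):
    """4. Search algorithm: pick the most accurate candidate within the budget."""
    best = None
    for values in product(*SEARCH_SPACE.values()):
        arch = dict(zip(SEARCH_SPACE, values))
        if predict_latency_ms(arch) > latency_budget_ms:
            continue  # too slow for the target device
        if best is None or predict_accuracy(arch) > predict_accuracy(best):
            best = arch
    return best  # None when no candidate fits the budget


best_arch = search(latency_budget_ms=5.0)
print(best_arch, predict_accuracy(best_arch), predict_latency_ms(best_arch))
```

Exhaustive enumeration only works here because the toy space has 27 candidates. Realistic search spaces are far too large to enumerate, which is why practical NAS relies on predictors and smarter search strategies.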
However, NAS still presents some intricate challenges. One is defining an expansive search space, the set of potential configurations an AI model could take, in which countless network architectures can be explored. The challenge lies in navigating this vast terrain to discover the most efficient and effective neural network for a given task, while also addressing the high compute cost, unreliable estimates of hardware performance, and inefficient scaling of existing methods.
That’s why Qualcomm AI Research has addressed these challenges with their NAS research called DONNA (short for Distilling Optimal Neural Network Architectures).
An efficient NAS with hardware-in-the-loop optimization, DONNA is a scalable method capable of finding optimal network architectures with high accuracy and minimal latency for virtually any hardware platform at low cost. This revolutionary method tackles the model deployment challenges in real scenarios effectively thanks to its:
Diverse search space
Low compute cost
Reliable direct hardware measurement
Its diverse search space includes the usual variables (kernel size, expansion rate, depth, number of channels, and cell type), but can also search over activations and attention, which are vital to finding optimal architectures. Its hardware predictor is hardware agnostic with a low startup cost, enabling NAS to scale to multiple hardware devices at low cost. Furthermore, DONNA is able to capture the run-time version and hardware architecture, enabling users to find real and accurate latency values rather than simulated ones.
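The difference between measured and simulated latency can be illustrated with a small timing harness. This is only a sketch under stated assumptions: it times a stand-in workload on whatever machine runs the script, whereas a hardware-in-the-loop setup would run the candidate network on the target device. It is not DONNA's implementation, and `measure_latency_ms` and `dummy_network` are hypothetical names.

```python
import statistics
import time


def measure_latency_ms(run_inference, warmup=3, runs=10):
    """Return the median wall-clock latency of `run_inference` in milliseconds."""
    for _ in range(warmup):
        run_inference()  # warm-up runs are discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to occasional slow runs than the mean.
    return statistics.median(samples)


def dummy_network():
    """Stand-in workload; a real setup would run a model on the target device."""
    return sum(i * i for i in range(10_000))


latency_ms = measure_latency_ms(dummy_network)
print(f"median latency: {latency_ms:.3f} ms")
```

Measured values like this reflect the actual runtime and hardware in use, which is the property the paragraph above attributes to direct hardware measurement.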
DONNA also provides pre-trained blocks that enable fast fine-tuning, so networks reach full accuracy sooner. It is also a single scalable solution: it applies directly to downstream tasks and non-CNN neural architectures, for instance, with no need for conceptual code changes.
Qualcomm AI Research believes that open-source projects are a great way to share knowledge for scaling model-efficient AI to the masses. That’s why Qualcomm Innovation Center has provided two open-source GitHub projects based on Qualcomm AI Research’s advanced research:
AIMET is a toolkit that offers advanced quantization and compression techniques, while AIMET Model Zoo offers pre-trained 8-bit quantized models with high accuracy. With these open-source projects, Qualcomm aims to boost the transition to fixed-point quantized models, which would radically enhance AI applications, particularly those requiring high performance, low latency, and low power consumption, such as mobile platforms.
As Qualcomm’s OnQ blog series puts it, “The future of AI is hybrid.” AI is evolving in that direction, with AI processing being distributed between edge devices and the cloud. On-device AI provides benefits in cost, power consumption, performance, personalization, privacy, and security. But for all this to happen, model efficiency is key, and that’s where Qualcomm is leading the pack as the on-device AI leader.
Qualcomm AI Research envisions a future where AI processing is highly efficient for all devices across all applications all over the world. With generative AI pushing the computing requirements even higher, Qualcomm’s focus on efficiency and performance alongside their open ecosystem is enabling generative AI and other upcoming revolutionary technologies.
Whether it's the quantization approach, NAS (and DONNA), compression, or compilation, Qualcomm AI Research will continue to find and develop ways of making AI processing ever more efficient so you and I can enjoy the perks of amazing AI technologies on almost every device you could think of.
Join Qualcomm’s open-source projects (AIMET and AIMET Model Zoo) and be part of the transformation towards efficient AI processing. Interested in learning more about Qualcomm AI Research’s work? Check out the following links: