The shift from cloud processing to edge AI processing has introduced key challenges around limited memory and the high volume of incoming data. Power consumption in embedded machine learning must stay low over long deployment periods, which calls for an efficient design flow. Deploying TinyML workloads on battery-constrained edge devices therefore requires high computational energy efficiency. One recent development addressing this challenge for edge AI applications is heterogeneous in-memory computing.
Before looking at the propositions made in the paper “A Heterogeneous In-Memory Computing Cluster for Flexible End-to-End Inference of Real-World Deep Neural Networks,” let us understand how analog in-memory computing was introduced to address power consumption in compute-intensive, mission-critical edge applications. To make edge ML inference power-efficient, the memory structure must be designed to reduce the power required for computation. With an in-memory compute methodology, the amount of incoming data that has to be moved from memory to the ALU is reduced, and it is this reduction in data transfer that yields the energy efficiency.
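The energy argument above can be illustrated with some back-of-the-envelope arithmetic. The numbers below are assumed for illustration only, not taken from the paper; the point is simply that moving operands typically costs far more energy than operating on them, so keeping the multiply-accumulates inside the memory array pays off.

```python
# Illustrative energy model with assumed per-operation costs (arbitrary
# units, NOT figures from the paper): data movement dominates compute.

E_MAC = 1.0      # energy of one multiply-accumulate
E_MOVE = 100.0   # energy to move one operand between memory and the ALU

n = 256 * 64     # MACs in one 256x64 matrix-vector product

# Conventional flow: every weight is fetched from memory before each MAC.
e_von_neumann = n * (E_MAC + E_MOVE)

# In-memory flow: weights stay in the array; only the 256 inputs and
# 64 outputs cross the memory interface.
e_imc = n * E_MAC + (256 + 64) * E_MOVE

print(e_von_neumann > e_imc)  # True: far fewer transfers, far less energy
```

Under these assumed costs the in-memory version spends over an order of magnitude less energy, which is the intuition behind the TOPS/W figures quoted later in this article.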
Researchers Angelo Garofalo, Gianmarco Ottavi, Francesco Conti, Geethan Karunaratne, Irem Boybat, Luca Benini, and Davide Rossi have proposed a heterogeneous in-memory compute architecture that integrates 8 RISC-V cores with an in-memory compute accelerator and digital accelerators. Results on heavy, data-intensive AI workloads show tremendous improvement across several experimental setups; in one, end-to-end inference of MobileNetV2 demonstrates an enhancement of two orders of magnitude compared to an existing heterogeneous solution integrating analog in-memory computing cores.
Analog in-memory compute (AIMC) has been around for a while as an answer to the memory bottleneck problem. The term refers to data processing that takes place inside the memory arrays themselves, making it one of the promising routes to accelerating deep neural network workloads. In recent times, several AIMC-based architectures have emerged in the area of energy-efficient DNN inference acceleration. With support from both industry and academia, AIMC-based architectures have pushed energy efficiency up to hundreds of TOPS/W. Yet despite being among the most efficient solutions, they face a fundamental challenge: the limited flexibility of IMC arrays constrains performance on certain DNN workloads.
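The core AIMC idea can be sketched numerically. In the sketch below (an illustration of the general principle, not the paper's implementation), weights are stored as device conductances; applying input voltages to the rows produces column currents that sum automatically, so an entire matrix-vector product is read out of the array in one step.

```python
import numpy as np

def crossbar_mvm(conductances: np.ndarray, voltages: np.ndarray) -> np.ndarray:
    """One analog MVM: column current I[c] = sum over rows r of G[r, c] * V[r].

    In a real crossbar this summation is Kirchhoff's current law, not a
    loop: the multiply-accumulate happens inside the memory array itself.
    """
    return conductances.T @ voltages

rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1.0, size=(256, 64))  # 256x64 array of NVM cells
V = rng.uniform(0.0, 1.0, size=256)        # input activations as row voltages
I = crossbar_mvm(G, V)                     # 64 column currents, one read-out
print(I.shape)  # (64,)
```

Because the weights never leave the array, the only traffic on the memory interface is the input vector in and the output vector out, which is where the flexibility limitation also comes from: operations that are not matrix-vector shaped map poorly onto the crossbar.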
In the paper, the researchers address the above-mentioned challenges: flexibility and bandwidth. The complete system consists of an analog in-memory accelerator and a digital depth-wise accelerator, both integrated into a PULP cluster. The baseline of the system is 8 RISC-V cores featuring a 4-stage in-order single-issue pipeline based on the RV32IMCXpulpV2 instruction set architecture. XpulpV2, the custom extension employed in the RISC-V processor cores, is designed to accelerate arithmetic kernels. The accelerators are interfaced with the PULP cluster through two interfaces exposed by the hardware processing engines (HWPEs): a control interface and a data interface.
The hardware processing engine, housing the IMA subsystem and the depth-wise accelerator subsystem, comes with three main blocks: the controller, the engine, and the streamer. The data interfaces of both subsystems are multiplexed towards the tightly coupled data memory (TCDM), labeled L1 memory in the architecture. The IMA subsystem operates on inputs stored in this L1 memory and uses the same HWC data-storage layout as the depth-wise accelerator.
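The shared HWC (height-width-channel) layout can be made concrete with the address arithmetic below. This is a hypothetical sketch of how such a layout indexes activations, not code from the paper: the property that matters is that all channels of one pixel sit at consecutive offsets, so either accelerator can fetch a full pixel from L1 in a single contiguous burst.

```python
def hwc_offset(h: int, w: int, c: int, W: int, C: int) -> int:
    """Element offset of activation (h, w, c) in an HWC buffer."""
    return (h * W + w) * C + c

def chw_offset(h: int, w: int, c: int, H: int, W: int) -> int:
    """Same element in a CHW buffer, shown for contrast: here the
    channels of one pixel are W*H elements apart, not adjacent."""
    return (c * H + h) * W + w

H, W, C = 4, 4, 8
# In HWC, the 8 channels of pixel (h=1, w=2) occupy consecutive offsets:
offsets = [hwc_offset(1, 2, c, W, C) for c in range(C)]
print(offsets)  # [48, 49, 50, 51, 52, 53, 54, 55]
```

Sharing one layout between the IMA and the depth-wise accelerator also means no reshuffling pass is needed in L1 when a tensor is handed from one subsystem to the other.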
The proposed heterogeneous cluster is analyzed to measure the power and performance trade-offs behind its energy efficiency. The IMC approach shows massive improvements in energy efficiency for matrix-vector multiplications, which are mapped onto the crossbars of the non-volatile memory arrays. Such arrays of non-volatile memory (NVM) devices are considered one possible path towards better energy efficiency in neuromorphic computing systems. Moreover, the methodology demonstrates an 11.5x performance improvement and 9.5x better energy efficiency on heterogeneous workloads such as the Bottleneck layer. For end-to-end inference of DNN workloads such as MobileNetV2, execution is 10x faster with 2.5x lower energy consumption.
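Why the Bottleneck layer is called a "heterogeneous" workload can be sketched as follows. In a MobileNetV2 bottleneck, the two pointwise (1x1) convolutions are per-pixel matrix-vector products, a natural fit for the analog crossbars, while the 3x3 depth-wise convolution maps poorly onto a crossbar and goes to the digital accelerator instead. The function names and shapes below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def pointwise_on_imc(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """1x1 conv over an HWC tensor == one MVM per pixel (crossbar-friendly)."""
    H, W, Cin = x.shape
    return (x.reshape(H * W, Cin) @ weights).reshape(H, W, -1)

def depthwise_on_digital(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """3x3 depth-wise conv (stride 1, no padding), one kernel per channel."""
    H, W, C = x.shape
    out = np.zeros((H - 2, W - 2, C))
    for c in range(C):                      # channels never mix: bad MVM fit
        for i in range(H - 2):
            for j in range(W - 2):
                out[i, j, c] = np.sum(x[i:i+3, j:j+3, c] * kernels[:, :, c])
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 16))
expand = rng.standard_normal((16, 96))    # 1x1 expansion  -> IMC crossbar
dw = rng.standard_normal((3, 3, 96))      # 3x3 depth-wise -> digital accel
project = rng.standard_normal((96, 24))   # 1x1 projection -> IMC crossbar
y = pointwise_on_imc(depthwise_on_digital(pointwise_on_imc(x, expand), dw), project)
print(y.shape)  # (6, 6, 24)
```

A cluster with only analog IMC cores would stall on the depth-wise stage; offloading it to a dedicated digital engine is what the reported 11.5x performance and 9.5x energy-efficiency gains rest on.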
More details on the implementation of the proposed architecture are available in the research paper, which is published with open access on arXiv, the free distribution service provided by Cornell University. If you are interested in reading more about this heterogeneous in-memory computing solution for low-power, high-performance edge devices, head to the official article page.
Angelo Garofalo, Gianmarco Ottavi, Francesco Conti, Geethan Karunaratne, Irem Boybat, Luca Benini, Davide Rossi: A Heterogeneous In-Memory Computing Cluster for Flexible End-to-End Inference of Real-World Deep Neural Networks. arXiv:2201.01089 [cs.AR]
 Michael Gautschi, Pasquale Davide Schiavone, Andreas Traber, Igor Loi, Antonio Pullini, Davide Rossi, Eric Flamand, Frank K. Gürkaynak, and Luca Benini: Near-Threshold RISC-V Core with DSP Extensions for Scalable IoT Endpoint Devices. DOI: https://doi.org/10.1109/TVLSI.2017.2654506
A. Fumarola et al.: Non-filamentary non-volatile memory elements as synapses in neuromorphic systems. 19th Non-Volatile Memory Technology Symposium (NVMTS), 2019, pp. 1-6. doi: 10.1109/NVMTS47818.2019.8986194