We have already covered Meta AI’s Data2vec, a generalized framework for self-supervised learning in computer vision, natural language processing, and speech recognition. But when it comes to efficient computation on embedded edge devices, a group of researchers from Google, Purdue University, and Harvard University has proposed CFU Playground, a full-stack open-source framework designed to accelerate ML models on FPGAs through a configured collection of software, gateware, and hardware.
As the name suggests, a Custom Function Unit (CFU) is accelerator hardware tightly integrated into the pipeline of the CPU core; it adds custom instructions that complement standard functions such as arithmetic and logical operations. Through CFU Playground, the aim is to run ML models efficiently on edge devices while benchmarking and profiling their performance. The framework also brings substantial improvements in software and gateware.
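To make the idea concrete, here is a minimal software model of the kind of custom instruction a CFU might implement. This is a hypothetical example, not code from CFU Playground: a common choice for quantized ML is a packed multiply-accumulate, where the CPU hands the CFU two 32-bit register values, each holding four signed 8-bit values, and the CFU returns their dot product in a single instruction.

```python
# Hypothetical software model of a CFU-style custom instruction
# (illustrative only; not the CFU Playground API).

def unpack_int8x4(word: int):
    """Split a 32-bit word into four signed 8-bit lanes."""
    lanes = []
    for shift in (0, 8, 16, 24):
        b = (word >> shift) & 0xFF
        lanes.append(b - 256 if b >= 128 else b)
    return lanes

def cfu_simd_macc(rs1: int, rs2: int) -> int:
    """Model of the custom instruction: a 4-lane int8 dot product."""
    return sum(a * b for a, b in zip(unpack_int8x4(rs1), unpack_int8x4(rs2)))

# Example: both operands pack the lanes [1, 2, 3, 4].
word = (4 << 24) | (3 << 16) | (2 << 8) | 1
print(cfu_simd_macc(word, word))  # 1 + 4 + 9 + 16 = 30
```

In gateware, this whole loop collapses into combinational logic that completes in one or a few cycles, which is where the speedup over a plain soft CPU comes from.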
A comparison of CFU Playground with other FPGA toolchains for embedded ML shows the tight coupling of the accelerator with the CPU, which most existing design flows do not offer. As the researchers make clear, “there is no ‘one size fits all’ approach for designing custom ML accelerators.” Moreover, the new approach brings dimensional flexibility to resource-constrained embedded systems.
The block diagram of CFU Playground shows the combination of software, gateware, and hardware working together for effective deployment of ML models. From the bottom up, the hardware is built around the Xilinx Artix-7 35T FPGA, featuring 33,000 logic cells and 50 36-kbit block RAMs, sufficient for a soft CPU. The development board also integrates 256 MB of external DDR RAM and a USB serial connection to a host computer.
The gateware, designed using the LiteX framework for creating FPGA cores, provides a VexRiscv soft CPU with a CFU extension and a USB-UART that serves as a serial terminal to the host. When the CPU executes a CFU opcode, it passes the contents of the designated source registers to the CFU, waits for the response, and writes the result into a third register. An important property of the CFU is that it has no direct access to memory and always relies on the CPU to move data. The framework is intended for designing CFUs for distinct ML tasks; the advantage of CFUs is that they balance acceleration with flexibility and reduce the overhead that comes with task-specific accelerators.
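The CPU↔CFU contract described above can be sketched as a small simulation. This is an illustrative model, not the actual LiteX/VexRiscv interface: the CPU decodes a CFU opcode, forwards two source-register values plus a function selector, waits for the CFU's response, and writes it back to a destination register. Note that only the CPU touches the register file and memory; the CFU is purely a function of its inputs.

```python
# Toy simulation of the CPU <-> CFU handshake (hypothetical, for illustration).

class ToyCFU:
    """Stateless accelerator: the funct field selects the operation."""
    def execute(self, funct: int, rs1: int, rs2: int) -> int:
        if funct == 0:   # byte-wise sum of one operand (made-up op)
            return sum((rs1 >> s) & 0xFF for s in (0, 8, 16, 24))
        if funct == 1:   # saturating 32-bit add (made-up op)
            return min(rs1 + rs2, 0xFFFFFFFF)
        raise ValueError("unknown funct")

class ToyCPU:
    def __init__(self, cfu: ToyCFU):
        self.regs = [0] * 32   # register file lives in the CPU, not the CFU
        self.cfu = cfu

    def cfu_op(self, funct: int, rd: int, rs1: int, rs2: int) -> None:
        # The CPU forwards register *contents*, waits for the response,
        # and stores the result; the CFU never accesses memory itself.
        self.regs[rd] = self.cfu.execute(funct, self.regs[rs1], self.regs[rs2])

cpu = ToyCPU(ToyCFU())
cpu.regs[1] = 0x01020304
cpu.cfu_op(funct=0, rd=3, rs1=1, rs2=0)
print(cpu.regs[3])  # 1 + 2 + 3 + 4 = 10
```

This separation is exactly why the design stays lightweight: adding a CFU changes neither the memory system nor the load/store path of the soft CPU.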
“This rapid, lightweight framework lets the user achieve large returns out of a relatively small investment in customized hardware and is particularly useful for the long tail of low-volume applications, which emerge in embedded ML use cases,” the researchers explain. “Our framework’s open-source toolchain bundles together open-source software (TensorFlow Lite Micro, GCC), open-source RTL generation IP and toolkits (LiteX, VexRiscv, Migen, nMigen), and open-source.”
Getting started with CFU Playground requires an Arty A7-35T board or one of the other supported boards (iCEBreaker, OrangeCrab, ULX3S, Fomu, and Nexys Video). Following a few instructions provided in the documentation, the user can start testing ML inference through an iterative deploy-profile-optimize loop. The ability to focus on a chosen layer of the stack, measure performance at fine granularity, and apply custom optimizations repeatedly makes the framework a significant step forward for ML deployment on edge devices. To work with this open-source framework, the developers expect users to have a basic understanding of C and C++ for the microcontroller, Python (which the nMigen framework requires), and working familiarity with Linux and Git.
The testing is based on two common tinyML applications: image classification accelerated on the Arty and keyword spotting accelerated on the Fomu. The experiments show how CFU Playground lets a developer improve performance through repeated optimizations with ease; the keyword-spotting tests in particular demonstrate co-optimization of the CPU and CFU under resource constraints. Within five weeks of part-time effort, a senior engineer obtained a 55x speedup accelerating MobileNetV2, while an undergraduate intern achieved a cumulative 75x speedup on the keyword-spotting application.
To summarize, the new methodology offers an open-source framework that improves ML acceleration on FPGAs through:
A full-stack open-source framework that integrates publicly available tools across the complete stack to allow a community-centered ecosystem.
A holistic approach that lets researchers iterate on designs to improve performance and acceleration for resource-constrained, latency-bound ML tasks.
An open-source design flow that achieves significant speedups through improved visibility and flexibility in the design space between the CPU and the CFU.
Finally, the researchers note that “the best choice of workflow is task-dependent and up to the developer.” For more details on the methodology, check out the research paper. The implementation is also available on Google's GitHub repository for community contributions.
Shvetank Prakash, Tim Callahan, Joseph Bushagour, Colby Banbury, Alan V. Green, Pete Warden, Tim Ansell, Vijay Janapa Reddi: CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs. arXiv:2201.01863 [cs.LG]