The Art and Science of Benchmarking Neural Network Architectures


16 Sep, 2022


Proven benchmarks provide a structured method for comparing ML/DL products and services.

This article is based on two blogs by Rouzbeh Shirvani and Sean McGregor, with additions and editing by John Soldatos.  

In recent years, there has been a surge of interest in Machine Learning (ML) and Artificial Intelligence (AI) solutions. This interest is driven by advances in parallel hardware, scalable software systems, and advanced machine learning frameworks, as well as cloud/edge computing. Furthermore, the explosion of collected digital data makes it possible to develop more effective and accurate ML systems based on deep neural networks, since deep learning (DL) outperforms classical ML when large volumes of training data are available.

State-of-the-art ML/DL systems span both cloud and edge computing infrastructures, and can have varying characteristics such as ML model performance (e.g., classification accuracy), training speed, and energy efficiency. These characteristics determine whether a system can meet specific application requirements such as latency and environmental performance. In most cases, it is not possible to optimize all of these parameters simultaneously. Therefore, ML/DL developers often have to identify the relevant trade-offs and optimize performance according to their specific application needs.
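As a toy illustration of navigating such trade-offs, the sketch below filters a set of hypothetical system configurations down to the Pareto-optimal ones, i.e., those for which no other configuration is at least as good on every metric and strictly better on one. All numbers are made up for illustration:

    # Hypothetical configurations: (name, accuracy, latency_ms, energy_mJ).
    configs = [
        ("A", 0.76, 12.0, 4.0),
        ("B", 0.74, 3.0, 1.2),
        ("C", 0.75, 13.0, 5.0),
        ("D", 0.73, 8.0, 3.5),
    ]

    def dominates(x, y):
        # x dominates y if it is at least as good on every metric (higher
        # accuracy, lower latency, lower energy) and strictly better on one.
        at_least_as_good = x[1] >= y[1] and x[2] <= y[2] and x[3] <= y[3]
        strictly_better = x[1] > y[1] or x[2] < y[2] or x[3] < y[3]
        return at_least_as_good and strictly_better

    pareto = [c for c in configs if not any(dominates(o, c) for o in configs)]
    print([name for name, *_ in pareto])  # -> ['A', 'B']; C and D are dominated

Only the undominated configurations are worth considering; which of those to pick is exactly the application-specific judgment described above.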

The choice of processor is key in setting the performance characteristics of a model. Larger, more power-hungry processors may appear to be the higher-performing option, but purpose-designed neural processors, or accelerators, often provide the better power/performance tradeoff. Complicating matters, deep learning systems are statistical and non-deterministic, which means there is rarely a single optimal system configuration that yields the best performance and accuracy. Rather, alternative configurations of hardware and algorithms can be deployed with comparable performance and efficiency.

Introducing ML/DL and Neural Architecture Benchmarks

The variety of available neural network architectures, configurations, and performance-related parameters makes it challenging to compare alternative ML/DL products and services. To facilitate such comparisons, engineers often turn to proven benchmarks for compute-intensive and data-intensive systems. For example:

  • The Standard Performance Evaluation Corporation (SPEC) family of benchmarks focuses on computational workloads over different computing architectures including cloud-based systems.
  • The LINPACK benchmarks measure a computer’s floating-point rate of execution, i.e., they focus on raw floating-point computing power (a rough timing sketch of this idea follows the list).
  • The Transaction Processing Performance Council (TPC), a consortium of industry leaders in computing systems and data management, provides a suite of benchmarks for stress testing the transaction processing and data warehousing capabilities of data-intensive systems.
  • The High Performance Conjugate Gradients (HPCG) benchmark provides novel metrics that enable the evaluation and ranking of High Performance Computing (HPC) systems.
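As a rough illustration of what a LINPACK-style measurement captures, the Python sketch below times a dense matrix multiplication and estimates the achieved floating-point rate. The matrix size is arbitrary, and a real LINPACK run solves a linear system rather than a bare multiply:

    import time
    import numpy as np

    n = 2048
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    start = time.perf_counter()
    c = a @ b
    elapsed = time.perf_counter() - start

    # A dense n-by-n matrix multiply performs roughly 2 * n^3 floating-point
    # operations, so the achieved rate is:
    gflops = 2 * n**3 / elapsed / 1e9
    print(f"{elapsed:.3f} s elapsed, ~{gflops:.1f} GFLOP/s")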

These benchmarks aid in evaluating different components of ML/DL systems, which are typically deployed over computationally intensive platforms. However, they focus on conventional computational workloads and are therefore not sufficient for training and executing deep learning tasks, which are typically more complex.

To provide proper benchmarks for ML systems, an end-to-end approach is needed. It must consider all the steps involved in developing and executing an ML/DL pipeline, from training an accurate model to evaluating the different implementation and computation options for optimizing system performance (e.g., execution speed).
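To make the end-to-end idea concrete, here is a minimal Python sketch (using scikit-learn purely for illustration; MLPerf defines its own reference models and datasets) that reports training time, achieved accuracy, and per-sample inference latency for a single pipeline:

    import time
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Training cost: wall-clock time to fit the model.
    start = time.perf_counter()
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    train_s = time.perf_counter() - start

    # Model quality: accuracy on held-out data.
    accuracy = model.score(X_te, y_te)

    # Inference cost: average latency per single-sample prediction.
    start = time.perf_counter()
    for x in X_te:
        model.predict(x.reshape(1, -1))
    latency_ms = (time.perf_counter() - start) / len(X_te) * 1e3

    print(f"train: {train_s:.2f} s, accuracy: {accuracy:.3f}, "
          f"latency: {latency_ms:.3f} ms/sample")

An end-to-end benchmark reports all three kinds of numbers together, rather than a single computational metric in isolation.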

Independent benchmarks provide a credible and impartial evaluation of alternative solutions, giving stakeholders invaluable insights into the available market offerings. Moreover, they provide guidance for the co-design of ML/DL solutions by ML modelers, data scientists, and embedded systems engineers.

Specifically, the benchmarks provide these teams with tangible metrics that can help them evaluate alternative system designs for a variety of popular use cases. Furthermore, ML benchmarking suites enable market players to compete against the same set of metrics, fostering innovation and healthy competition. 

The MLPerf Benchmarking Suite

MLPerf is one of the most popular benchmarking suites for ML/DL systems. It comprises a rich set of system tests that span the ML models, software, hardware, and power efficiency characteristics of an ML/DL system. MLPerf was established in 2018 by several organizations, including Google, Harvard University, and Stanford University. Since then, it has greatly boosted innovation in the ML space and has helped vendors deliver high-performance, energy-efficient ML/DL systems.

The suite includes several benchmarks for different ML tasks. Each benchmark is defined by the following characteristics (a code sketch of this structure follows the list):

  • The name of the task (e.g., Image Classification), which indicates the benchmarking context. For instance, the Image Classification task aims at selecting the class that best describes the content of an image.
  • The dataset, which is typically a popular dataset for the task at hand. For instance, the ImageNet database is commonly used for benchmarking Image Classification tasks.
  • The model to be used, which identifies the ML/DL model that will be used to gauge the computational performance of the ML hardware and software system. As an example, several variations of the Residual Network (ResNet) are commonly used to benchmark hardware performance in computer vision tasks like Image Classification.
  • The Quality Threshold, which denotes the level of quality that a benchmarked system must reach for its result to count. For instance, a 74.9% Top-1 classification accuracy can be used as the quality threshold for an Image Classification system.
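To make this structure concrete, the sketch below captures the four fields as a small Python data structure. The class and field names are illustrative, not MLPerf’s actual configuration format; the example values are drawn from the list above:

    from dataclasses import dataclass

    @dataclass
    class BenchmarkSpec:
        task: str                 # benchmarking context, e.g. "Image Classification"
        dataset: str              # reference dataset for the task
        model: str                # reference model used to exercise the system
        quality_threshold: float  # minimum quality a run must reach to count

    # Example drawn from the list above: ImageNet classification with ResNet-50.
    image_classification = BenchmarkSpec(
        task="Image Classification",
        dataset="ImageNet",
        model="ResNet-50 v1.5",
        quality_threshold=0.749,  # 74.9% Top-1 accuracy
    )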

Along with the above-listed parameters, MLPerf standardizes and specifies other aspects of the benchmarking process such as the initialization of parameters, the optimizer schedule to be used, and data augmentation processes. In this way, it ensures that the benchmarked systems are comparable and can be ranked in an unbiased way. 

The results of the MLPerf benchmarking process are published in a transparent way. This makes it easy for market stakeholders to understand how different systems perform in the various tasks and how they compare to each other. The publication of the results for each system includes information about the system’s benchmark scores, its market availability, the models used (e.g., MobileNet-v1 for mobile phones and ResNet-50 v1.5 for bigger accelerators), as well as the scenarios and configurations of the benchmarking process. 

The scenarios are designed to cover different deployment configurations, including inference at the edge and inference in cloud data centers. Moreover, there is information about the number of accelerators used by the systems under comparison, as the MLPerf benchmark is designed to rank entire systems rather than individual components.
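As a simplified illustration of an edge-style scenario, the sketch below issues queries one at a time, as a single-stream deployment would, and reports a tail-latency percentile. The harness and the stand-in model are assumptions; the real benchmark uses MLPerf’s own load generator:

    import time
    import numpy as np

    def single_stream_p90(run_inference, queries):
        # Issue queries one at a time and record per-query latency,
        # then report the 90th percentile (a tail-latency metric).
        latencies = []
        for q in queries:
            start = time.perf_counter()
            run_inference(q)
            latencies.append(time.perf_counter() - start)
        return np.percentile(latencies, 90)

    # Stand-in "model": replace with a real inference call.
    queries = [np.random.rand(224, 224, 3) for _ in range(200)]
    p90_s = single_stream_p90(lambda q: q.mean(), queries)
    print(f"p90 latency: {p90_s * 1e3:.3f} ms")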

In practice, the process of running the MLPerf benchmarks is quite straightforward. It starts with cloning the MLPerf repository and setting up the configuration parameters. Next, the system under test is added to the list of available systems, and a Docker image for the tests is built. The image packages data preprocessing, as well as the execution of validation models and tests.
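A minimal sketch of those steps, driven from Python, is shown below. The repository URL is MLCommons’ public inference repo; the image tag and the assumption that a Dockerfile sits at the given build path are illustrative, as the actual location varies by benchmark:

    import subprocess

    # 1. Clone the MLPerf (MLCommons) inference repository.
    subprocess.run(
        ["git", "clone", "https://github.com/mlcommons/inference.git"],
        check=True,
    )

    # 2. Configuration: describe the system under test in the repo's
    #    system list / config files (edited by hand, not shown here).

    # 3. Build a Docker image packaging data preprocessing and the
    #    validation models and tests. The build context below is an
    #    assumption; point it at the benchmark directory you target.
    subprocess.run(
        ["docker", "build", "-t", "mlperf-under-test", "inference"],
        check=True,
    )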

Syntiant’s Performance in the MLPerf Benchmarking

Syntiant is a strong leader in tinyML keyword spotting, according to the respective MLPerf benchmarking results. The company’s performance in the benchmark is summarized as follows:

  • Achieved Performance: Syntiant’s NDP120 ran the tinyML keyword spotting benchmark in 1.80 ms. This is a remarkably fast inference, making the company the clear winner in this category. 
  • Energy Usage: The benchmarked system used 49.59 μJ of energy at 1.1 V/100 MHz. Energy efficiency was further improved by lowering the supply voltage to 0.9 V and reducing the clock frequency to 30 MHz, resulting in only 35.29 μJ of energy being consumed. Of course, reducing the clock frequency increases the latency to 4.30 ms, but this highlights both the energy and latency floor of the device, leaving it to the system designer to optimize power and performance based on their specific application needs (the average power implied by each operating point is worked out below).
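A quick way to compare the two operating points reported above is the average power each implies, i.e., energy per inference divided by inference latency. A back-of-the-envelope Python sketch using only the reported numbers:

    operating_points = {
        "1.1 V / 100 MHz": (49.59e-6, 1.80e-3),  # (energy in J, latency in s)
        "0.9 V / 30 MHz": (35.29e-6, 4.30e-3),
    }
    for label, (energy_j, latency_s) in operating_points.items():
        print(f"{label}: {energy_j / latency_s * 1e3:.1f} mW average power")
    # -> roughly 27.6 mW at the fast point vs 8.2 mW at the low-power point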

The NDP120 is based on Syntiant’s second-generation AI accelerator core, Syntiant Core 2®, alongside a Tensilica HiFi DSP (Digital Signal Processor) for feature extraction and an Arm Cortex-M0 core for system management. Syntiant did not enter benchmark scores for the NDP120 in any other category, as the chip was architected for inference on time-series data, such as speech and audio events.

Syntiant Core 2® is a highly flexible, ultra-low-power deep neural network inference engine with a highly configurable audio front-end interface. Thanks to this front-end interface, the NDP120 is easy to program. Deployment is further aided by its native support for all major neural network architectures, allowing the direct execution of neural network layers without intermediary processing steps.

The NDP120 can run multiple models concurrently, such as a keyword spotting model and an audio event detection model, and supports flexible feature extraction and acoustic far-field processing with the included DSP. Using multiple neural network layers, the system provides accurate, high-performance inference and classification, which supports use cases in a variety of applications. For instance, it can accurately capture voice commands (e.g., Alexa prompts) in very noisy environments.

The device is ideal for ultra-low-power, always-on edge applications, where inference on time-series audio or sensor data is required. Deployed solutions using the NDP120 span consumer and industrial market verticals, including security devices detecting acoustic events, mobile phones, tablets, smart speakers, smart displays, and smart home appliances. As with all solutions in the Syntiant Core 2 family, it provides optimized memory access and at-memory compute, ensuring high efficiency and extremely low power consumption. The latter is clearly evident when its energy consumption is benchmarked against its peers.

For more information about Syntiant NDP120 visit: https://www.syntiant.com/ndp120

About the sponsor: Syntiant

Syntiant Corp. is a leader in delivering end-to-end deep learning solutions for always-on applications by combining purpose-built silicon with an edge-optimized data platform and training pipeline. The company’s advanced chip solutions merge deep learning with semiconductor design to produce ultra-low-power, high-performance, deep neural network processors for edge AI applications across a wide range of consumer and industrial use cases, from earbuds to automobiles. Syntiant’s Neural Decision Processors™ typically offer more than a 100x efficiency improvement, while providing a greater than 10x increase in throughput over current low-power MCU-based solutions, thereby enabling larger networks at significantly lower power.


More by John Soldatos

John Soldatos holds a PhD in Electrical & Computer Engineering from the National Technical University of Athens (2000) and is currently Honorary Research Fellow at the University of Glasgow, UK (2014-present). He was Associate Professor and Head of the Internet of Things (IoT) Group at the Athens In...