Novel Hiddenite accelerator aims to offer dramatic improvements in deep-learning energy efficiency

Designed to keep as much of the computation on-chip as possible to reduce expensive calls to external memory, the prototype Hiddenite accelerator offers state-of-the-art performance based on the "lottery ticket" hypothesis.

04 Mar, 2022

The prototype Hiddenite processor, built on a 40nm process node, has already proven its creators' concepts in testing.

Machine learning via neural networks has proven its worth across a broad swathe of problem domains, but its wide deployment brings a problem of its own: high computational demand, which translates into a need for large amounts of energy.

Dedicated accelerator hardware, better-suited to the workload than the general-purpose processors on which it was previously run, could hold the answer — and one new accelerator, dubbed Hiddenite, is showing impressive results in silicon thanks to a novel three-pronged approach built around the concept of "hidden neural networks" and keeping as much of the computation on-chip as possible.

Hidden knowledge

Hiddenite — short for the Hidden Neural Network Inference Tensor Engine — is, its creators claim, the first accelerator chip to target hidden neural networks (HNNs). Designed to simplify neural network models without damaging accuracy, the HNN concept builds on Jonathan Frankle and Michael Carbin's "lottery ticket hypothesis," which suggests that a randomly-initialized deep neural network contains "winning ticket" subnetworks capable of matching the accuracy of the full trained network. By pruning the rest of the network away and keeping only these subnetworks, the complexity of the network is reduced at equivalent accuracy.
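
To make the idea concrete, here is a minimal NumPy sketch of a hidden-network layer in the spirit of the lottery ticket hypothesis: the weights stay frozen at their random initialization, and only a binary "supermask" selecting the subnetwork is learned via a score attached to each weight. The layer sizes, the 30% keep ratio, and the top_k_mask helper are illustrative choices, not details from the Hiddenite paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# The layer's weights are randomly initialized and then frozen: they are
# never trained. Only the binary "supermask" selecting a subnetwork is
# learned, via a score assigned to every weight.
weights = rng.standard_normal((256, 128)).astype(np.float32)
scores = rng.standard_normal(weights.shape).astype(np.float32)

def top_k_mask(scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Binary mask keeping the highest-scoring fraction of weights."""
    threshold = np.quantile(scores, 1.0 - keep_ratio)
    return (scores >= threshold).astype(np.float32)

mask = top_k_mask(scores, keep_ratio=0.3)  # keep 30% of connections
x = rng.standard_normal(128).astype(np.float32)

# Inference uses only the masked subnetwork hidden in the random weights.
y = (weights * mask) @ x
print(f"active connections: {int(mask.sum())} of {mask.size}")
```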

The Hiddenite accelerator aims to improve the efficiency of calculating these hidden neural networks, lowering computational complexity and thus power requirements — and doing away, where possible, with expensive calls to off-chip memory which can otherwise hamper a promising chip design.

Block diagram of the Hiddenite accelerator: a model construction controller (MCC), activation memory (AMEM), a 2D barrel shifter, a supermask expansion unit (SEU), a 4D processing-element tensor with 4K PEs, a weight generation unit (WGU), a controller, and a post-processing unit (PPU). The Hiddenite accelerator aims to find and calculate "lottery ticket" subnetworks to boost the energy efficiency of neural networks.

“Reducing the external memory access is the key to reducing power consumption,” explains project co-lead Masato Motomura, of the Tokyo Institute of Technology. “Currently, achieving high inference accuracy requires large models. But this increases external memory access to load model parameters. Our main motivation behind the development of Hiddenite was to reduce this external memory access.”

The architecture of the accelerator chip is split into three key sections: a supermask expansion unit, which allows compression to be used to shrink the binary masks that select subnetworks; a weight generation unit, which exploits the discovery that weights can be regenerated using a random number generator and a hashing function, avoiding the need to store weights or seed values; and a high-density four-dimensional parallel processor, which prioritizes data reuse for boosted efficiency.
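
The weight generation unit is the most unusual of the three. The chip's actual random number generator and hashing function haven't been published, so the following sketch only illustrates the principle: a deterministic hash of each weight's coordinates regenerates the same pseudo-random value on every access, meaning weights never need to be stored or fetched from external memory. The hash constants and the mapping onto [-1.0, 1.0) are assumptions for illustration.

```python
MASK64 = (1 << 64) - 1  # keep intermediate values within 64 bits

def generated_weight(layer: int, row: int, col: int) -> float:
    """Regenerate a pseudo-random weight on demand from its coordinates.

    A simple integer mixing hash stands in for the chip's (unpublished)
    hardware RNG and hashing function. The same coordinates always yield
    the same value, so the weights themselves never need to be stored.
    """
    h = (layer * 0x9E3779B97F4A7C15) & MASK64
    h ^= (row * 0xBF58476D1CE4E5B9) & MASK64
    h ^= (col * 0x94D049BB133111EB) & MASK64
    h ^= h >> 31
    # Map the 64-bit hash onto [-1.0, 1.0) as an illustrative weight value.
    return h / 2.0**63 - 1.0

# Deterministic: fetching "weight (0, 3, 7)" twice gives the same value,
# with no external memory traffic involved.
assert generated_weight(0, 3, 7) == generated_weight(0, 3, 7)
print(generated_weight(0, 3, 7))
```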

Computationally efficient

To prove the concept, Motomura and colleagues fabricated a prototype Hiddenite processor on Taiwan Semiconductor Manufacturing Company (TSMC)'s 40nm process node — a considerably older and larger node than is typically used for state-of-the-art devices in the field of deep learning. Measuring 3×3mm (around 0.12×0.12"), the chip's footprint is primarily taken up by memory — 8Mb of activation memory (AMEM), 256kb of supermask memory (SMEM), and 128kb of zero run-length memory (ZMEM) — with the logic found at the center of the die.

Die shot of the prototype Hiddenite chip: large AMEM blocks at either side, a small ZMEM block, two SMEM blocks, and logic at the center, plus a small piece of silicon art showing the Tokyo Tech ArtIC and Hiddenite logos. Produced on a 40nm node, the processor has the bulk of its footprint dominated by its memories.

Key to the Hiddenite concept is performing as much of the work on-chip as possible. The weight generator means there's no need to store and load weights from external memory, while the supermask expansion hardware means model parameters are less likely to exceed the available on-chip memory, the team explains. The parallel processor, meanwhile, boosts efficiency by maximizing data reuse.
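
The team hasn't detailed the supermask compression format, but the chip's zero run-length memory (ZMEM) suggests a run-length coding of the zeros in the mask. Here is one plausible sketch of the decompression step, with an invented decode_zero_runs helper — an assumption for illustration, not the chip's actual format.

```python
from typing import Iterable, List

def decode_zero_runs(runs: Iterable[int]) -> List[int]:
    """Expand a zero run-length stream into a flat binary supermask.

    Each entry counts the zeros preceding the next kept (1) bit, so a
    sparse mask needs far fewer stored values than raw bits.
    """
    bits: List[int] = []
    for zero_run in runs:
        bits.extend([0] * zero_run)  # the skipped (pruned) connections
        bits.append(1)               # the next kept connection
    return bits

# Three stored values expand to a ten-bit mask:
print(decode_zero_runs([2, 0, 5]))  # [0, 0, 1, 1, 0, 0, 0, 0, 0, 1]
```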

“The first two factors are what set the Hiddenite chip apart from existing DNN inference accelerators,” says Motomura. “Moreover, we also introduced a new training method for hidden neural networks, called ‘score distillation,’ in which the conventional knowledge distillation weights are distilled into the scores because hidden neural networks never update the weights. The accuracy using score distillation is comparable to the binary model while being half the size.”

A comparison table pits the Hiddenite chip against rival accelerator designs from 2018, 2020, and 2021. Despite being built on a much larger process node than any of them, Hiddenite's efficiency offers stiff competition.

Early testing certainly shows promise: the 40nm Hiddenite prototype was able to beat rival designs built on considerably smaller nodes, from 28nm down to 5nm, in the performance-per-watt metric, offering between 8.1 and 34.8 trillion operations per second (TOPS) per watt depending on voltage and the model used — a state-of-the-art showing which doesn't take into account the additional power efficiency gains made by removing the off-chip memory access required by its contemporaries.
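
For perspective, TOPS per watt is simply operations per joule, so the reciprocal of the quoted figures gives the energy budget of a single operation:

```python
# TOPS/W is (trillions of) operations per joule, so its reciprocal is
# the energy spent on a single operation.
for tops_per_watt in (8.1, 34.8):
    joules_per_op = 1.0 / (tops_per_watt * 1e12)
    print(f"{tops_per_watt:>5} TOPS/W -> {joules_per_op * 1e15:.1f} fJ per operation")
# 8.1 TOPS/W -> 123.5 fJ per operation
# 34.8 TOPS/W -> 28.7 fJ per operation
```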

The team's work was presented at the International Solid-State Circuits Conference 2022 (ISSCC 2022), as Session 15.4. No paper has yet been published for public consumption.

References

Kazutoshi Hirose, Jaehoon Yu, Kota Ando, Yasuyuki Okoshi, Ángel López García-Arias, Junnosuke Suzuki, Thiem Van Chu, Kazushi Kawamura, and Masato Motomura: Hiddenite: 4K-PE Hidden Network Inference 4D-Tensor Engine Exploiting On-Chip Model Construction Achieving 34.8-to-16.0TOPS/W for CIFAR-100 and ImageNet, International Solid-State Circuits Conference 2022, Session 15.4.

Jonathan Frankle and Michael Carbin: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Computing Research Repository (CoRR), arXiv:1803.03635 [cs.LG].

A freelance technology and science journalist and author of best-selling books on the Raspberry Pi, MicroPython, and the BBC micro:bit, Gareth is a passionate technologist with a love for both the cutting edge and more vintage topics.
