Advances in deep learning have inspired a rapid evolution of hardware accelerators for machine learning applications in recent years. These accelerators have enabled state-of-the-art results in a variety of classification and regression tasks. However, many challenges remain in deploying neural networks (NNs) to edge devices. One of the biggest performance bottlenecks of today's NN accelerators is off-chip memory access.
Embedded non-volatile memories (eNVMs) are a promising solution for increasing on-chip storage density. eNVMs are generally denser and more energy efficient than SRAM. Moreover, storage density can be increased further by storing multiple bits in a single memory cell using multi-level cell (MLC) programming. While MLC encoding can potentially eliminate all off-chip weight accesses, it also increases the probability of faults.
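As a rough illustration of this trade-off (not the fault model used in the talk), the sketch below quantizes weights onto MLC levels and injects adjacent-level faults whose probability grows with the number of levels per cell; the fault rates, quantization scheme, and weight distribution are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_to_mlc(weights, bits_per_cell):
    """Map weights to discrete MLC levels (uniform quantization sketch)."""
    levels = 2 ** bits_per_cell
    w_min, w_max = weights.min(), weights.max()
    step = (w_max - w_min) / (levels - 1)
    codes = np.round((weights - w_min) / step).astype(int)
    return codes, w_min, step

def inject_level_faults(codes, levels, fault_prob):
    """Shift a faulty cell to an adjacent level; denser encodings get a higher assumed fault_prob."""
    faulty = rng.random(codes.shape) < fault_prob
    shift = rng.choice([-1, 1], size=codes.shape)
    return np.where(faulty, np.clip(codes + shift, 0, levels - 1), codes)

# Toy example: Gaussian "weights", comparing 1-bit and 3-bit cells
weights = rng.normal(0, 0.05, size=10_000)
for bits, p_fault in [(1, 1e-4), (3, 1e-2)]:   # assumed, illustrative fault rates
    codes, w_min, step = quantize_to_mlc(weights, bits)
    noisy = inject_level_faults(codes, 2 ** bits, p_fault)
    recon = w_min + noisy * step
    mse = np.mean((weights - recon) ** 2)
    print(f"{bits} bit(s)/cell, fault prob {p_fault}: reconstruction MSE = {mse:.2e}")
```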
In this talk, I will discuss the benefits of co-designing NN weights and memories so that their properties complement each other and faults cause no noticeable loss in NN accuracy. In the extreme case, the weights in fully connected layers can be stored using a single transistor. Combined with weight pruning and clustering, the co-design technique reduces memory area by over an order of magnitude compared to an SRAM baseline. For VGG16 (130M weights), we are able to store all the weights in 4.9 mm², well within the area allocated to SRAM in modern NN accelerators.
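A minimal sketch of how pruning plus clustering shrinks per-weight storage follows; it is illustrative only, and the pruning threshold, cluster count, and resulting storage figures are assumptions rather than the numbers reported in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(0, 0.05, size=100_000)      # stand-in for one layer's weights

# Prune: zero out small-magnitude weights (90% sparsity is an assumed target)
threshold = np.quantile(np.abs(weights), 0.9)
surviving = weights[np.abs(weights) > threshold]

# Cluster: quantize surviving weights onto a small codebook via 1-D k-means
def kmeans_1d(values, k, iters=20):
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = values[assign == j].mean()
    return centers, assign

k = 16                                           # 16 clusters -> 4-bit indices per weight
centers, assign = kmeans_1d(surviving, k)

# Storage estimate: 32-bit floats vs 4-bit cluster indices for surviving weights
dense_bits = weights.size * 32
clustered_bits = surviving.size * int(np.log2(k))
print(f"dense: {dense_bits / 8 / 1024:.0f} KiB, "
      f"pruned + clustered indices: {clustered_bits / 8 / 1024:.1f} KiB")
```

In this sketch the cluster indices are what would be mapped onto MLC eNVM cells (e.g., two 2-bit cells or one 4-bit cell per surviving weight); the codebook of cluster centers is small enough to be negligible.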