Deep Learning Hardware Design Challenges
A significant challenge facing Deep Learning hardware startups is that the ground continues to shift underneath them. By that I mean the rapid pace of innovation in Deep Learning network algorithms and general computational approaches makes them a challenging workload to target. The development time for new silicon is far greater than the development time for new software and algorithmic approaches.
This is not as severe an existential consideration for silicon targeting edge or IoT devices, where the primary design goal is low-power inference, but for training in the datacenter it is a different matter.
For inference, an efficient CNN implementation for segmentation and recognition tasks (such as tracking customers in a retail setting), or an LSTM RNN implementation for basic speech or other sequence classification, is often sufficient. For chip startups targeting inference at the edge/IoT, the focus is on low-power implementation techniques such as sub-threshold logic, asynchronous logic, low-voltage computing or signaling approaches, analog computing techniques, and denser, lower-power memories.
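To make the inference case concrete, here is a minimal sketch, assuming PyTorch, of the two fixed workload shapes described above: a small CNN classifier and a small LSTM sequence classifier running forward-pass only. The layer sizes and input shapes are purely illustrative, not taken from any particular product.

```python
# Minimal sketch (PyTorch assumed) of the kind of fixed inference workloads
# described above: a small CNN for vision and a small LSTM for sequence
# classification. Layer sizes and input shapes are illustrative only.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Small convolutional classifier, stand-in for an edge vision workload."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class TinyLSTM(nn.Module):
    """Small LSTM classifier, stand-in for a speech/sequence workload."""
    def __init__(self, num_features=40, hidden=64, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(num_features, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        _, (h, _) = self.lstm(x)      # classify from the final hidden state
        return self.classifier(h[-1])

with torch.no_grad():                  # inference only, no training state
    cnn_out = TinyCNN()(torch.randn(1, 3, 64, 64))
    lstm_out = TinyLSTM()(torch.randn(1, 100, 40))  # 100 frames of 40-dim features
    print(cnn_out.shape, lstm_out.shape)
```

The point is that these workloads are small and fixed, so the silicon can be tuned aggressively for power rather than flexibility.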
For training, the task is trickier. A general datacenter-hosted training solution must support training for a wide variety of workloads/networks. However, to achieve sufficient efficiency or performance improvements over incumbent solutions, new entrants (as well as the incumbents) will attempt to differentiate through approaches such as:
- Employing a “workload characterization” strategy: synthesizing some architectural advantage(s) by carefully characterizing known “embarrassingly parallel” Deep Learning workloads and targeting the architecture at them (a toy illustration of this kind of characterization follows this list). This appears to be the strategy of Cambricon, a startup that recently closed $100M in early funding. However, large monopolistic players such as Google, Facebook, and Amazon have a significant advantage in their ability to observe workloads in their own datacenters, which one would think gives them significantly better insight into workload evolution than just about anyone else – but even they can be blindsided by rapid developments on the algorithm/software side.
- Employing a “big bet” strategy – such as putting all memory on chip with little to no off-chip memory access capability (GraphCore), or using wafer-scale integration techniques (rumored for Cerebras). As for bets on wafer-scale approaches – those readers with grey hair may remember Gene Amdahl and Trilogy. As for trying to put all the memory on chip, I think that is a risky bet in general, and the leakage power alone from 600MB of SRAM should give one pause. To get a sense of the magnitude of leakage power that this much SRAM can burn: for a popular 16nm FinFET process, 600MB of SRAM composed of standard 64KB single-port SRAM macros in a ULVT (ultra-low threshold voltage => high performance) process can exhibit worst-case leakage (at 125°C) in the 90 amp range. That’s correct, 90A! – dissipating about 70W at a typical operating voltage of 0.8V (a back-of-the-envelope version of this arithmetic follows this list). And this does not include the additional SRAM required to support redundancy and ECC.
- Employing an “ensemble bet” approach – an amalgam of architectural and tools techniques that are hopefully synergistic. Wave Computing has placed their chips (gambling metaphor there…) behind a “spatial computing” approach that blends a dataflow-graph computing model (designed to train without much host CPU overhead), a homogeneous fabric of tens of thousands of small processors, asynchronous computing, static scheduling, and SAT-solver-based tools for high QoR (quality of results).
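As a toy illustration of the workload-characterization idea in the first bullet, the sketch below (assuming PyTorch and a recent torchvision; the model choice and the MAC-counting hooks are my own illustrative construction, not any company's methodology) runs a representative CNN once and tallies where the multiply-accumulate work lands by operator type – the sort of data an architect would use to decide which primitives deserve dedicated silicon.

```python
# Hypothetical sketch of "workload characterization": run a representative
# network once and tally multiply-accumulate (MAC) work by operator type.
# Assumes PyTorch + a recent torchvision; numbers are illustrative.
from collections import Counter
import torch
import torch.nn as nn
from torchvision.models import resnet50

macs = Counter()

def count_macs(module, inputs, output):
    if isinstance(module, nn.Conv2d):
        # MACs = output elements * (kernel area * input channels / groups)
        k = module.kernel_size[0] * module.kernel_size[1]
        macs["Conv2d"] += output.numel() * k * module.in_channels // module.groups
    elif isinstance(module, nn.Linear):
        macs["Linear"] += output.numel() * module.in_features

model = resnet50(weights=None).eval()
hooks = [m.register_forward_hook(count_macs)
         for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

for h in hooks:
    h.remove()

# For ResNet-50 the breakdown is overwhelmingly Conv2d, which is exactly the
# kind of observation a "characterize and target" architecture is built on.
for op, m in macs.most_common():
    print(f"{op}: {m / 1e9:.2f} GMACs")
```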
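And here is the back-of-the-envelope leakage arithmetic referenced in the second bullet. The per-macro leakage figure is an assumed, illustrative value chosen to be consistent with the ~90A total quoted above; it is not vendor data for any specific 16nm library.

```python
# Back-of-the-envelope check of the SRAM leakage claim above. The per-macro
# leakage current is an assumed/illustrative figure for a 16nm FinFET ULVT
# 64KB single-port macro at 125C, not actual vendor data.
SRAM_TOTAL_BYTES  = 600 * 1024 * 1024   # 600 MB of on-chip SRAM
MACRO_BYTES       = 64 * 1024           # built from 64KB macros
LEAKAGE_PER_MACRO = 9.4e-3              # amps per macro, worst case (assumed)
VDD               = 0.8                 # typical operating voltage, volts

num_macros    = SRAM_TOTAL_BYTES // MACRO_BYTES   # 9,600 macros
leakage_amps  = num_macros * LEAKAGE_PER_MACRO    # ~90 A
leakage_watts = leakage_amps * VDD                # ~72 W

print(f"{num_macros} macros -> {leakage_amps:.0f} A leakage, "
      f"{leakage_watts:.0f} W at {VDD} V")
```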
But as the ground will continue to shift, a prudent architectural choice would seem to be one that leans toward general programmability, has good support for off-chip memory, and can sustain high-bandwidth communication between large numbers of chips. I would put both Wave and nVidia in this category.
As an example of relatively recent tectonic movement, consider the paper published by Facebook research in the middle of last year (“Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, Goyal et al.). This paper demonstrated a linear (up to 30x!) reduction in ImageNet training time for the ResNet-50 network with mini-batch sizes up to 8K images (with no loss of accuracy). The catch is that the per-worker (chip) workload is large, and the overall system must support high-bandwidth connections between large numbers of workers – in this case up to 32 nVidia DGX platforms, each of which interconnects 8 large Tesla P100 GPGPUs via a high-speed NVLink fabric. This type of disruptive development, requiring large per-“worker” loads and significant high-bandwidth network capability between large numbers of workers, does not bode well for “big bets” such as the “all memory on chip” bet made by GraphCore.
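To see why this kind of result stresses inter-worker bandwidth, consider a minimal sketch of synchronous data-parallel training, here assuming PyTorch's DistributedDataParallel (the launcher command, batch sizes, and the fake data loop are illustrative, and the warmup schedule and real ImageNet pipeline are omitted). Every optimizer step all-reduces the full ResNet-50 gradient set – roughly 25M parameters, ~100MB – across all workers, and the base learning rate is scaled linearly with the global minibatch size as in the Facebook paper.

```python
# Minimal sketch (PyTorch DistributedDataParallel assumed) of the large-minibatch
# recipe: each worker computes gradients on its local shard, and every step the
# full gradient set is all-reduced across all workers -- exactly the high-bandwidth,
# many-worker communication pattern described above.
# Launch with e.g. `torchrun --nproc_per_node=8 train.py` (script name illustrative).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import resnet50

dist.init_process_group("nccl")        # NCCL collectives ride NVLink / the network fabric
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(resnet50(weights=None).cuda(), device_ids=[local_rank])

# Linear learning-rate scaling: base LR of 0.1 per 256 images, scaled to the
# global minibatch (e.g. 8192 across 256 GPUs). Warmup omitted for brevity.
per_worker_batch = 32
global_batch = per_worker_batch * dist.get_world_size()
lr = 0.1 * global_batch / 256
opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10):                 # random tensors stand in for a real data loader
    images = torch.randn(per_worker_batch, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 1000, (per_worker_batch,), device="cuda")
    opt.zero_grad()
    loss_fn(model(images), labels).backward()   # backward() triggers the gradient all-reduce
    opt.step()
```

The gradient all-reduce on every step is the reason architectures with strong off-chip memory and chip-to-chip bandwidth are well positioned for this style of scaling, while “all memory on chip” designs are not.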