A Brief Survey of Deep-Learning Chip Startups
I thought I would provide a brief survey of Deep Learning chip startups with some links and comments. As I learn more based on new info including reader feedback and comments, I will add to this page. Financial data should be considered approximate and is based primarily on sources such as Pitchbook and Crunchbase.
Cambricon: http://www.cambricon.com/ Cambricon was founded in 2016 by two young professors at the Institute of Computing Technology at the Chinese Academy of Sciences, Chen Yunji and Chen Tianshi (CEO). Cambricon closed a small seed round in August of that year. The company published a paper at the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture entitled “Cambricon: An Instruction Set Architecture for Neural Networks”. In the paper, which builds on previous Machine Learning acceleration work called the DianNao Project, they propose a novel domain-specific Instruction Set Architecture for Neural Network (NN) accelerators that is based on a comprehensive analysis of ten different existing NN techniques and workloads. I thought this to be rather nice work, using fairly classic Patterson/Hennessy analysis techniques. They even acknowledge that their approach was “inspired by the success of RISC ISA design principles”. The chip uses a load-store architecture and uses a scratchpad memory instead of a vector register file. Emphasis is placed on code density, and to that end a number of vector instructions (including vector load) and six matrix instructions are defined. Normally multi-cycle instructions (such as vector load) are not favored by classic RISC designs because of the difficulty of handling exceptions or interrupts in the middle of a multi-cycle instruction – but that is not a factor with this type of domain-specific processor.
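To make the flavor of such an ISA concrete, here is a minimal Python sketch of a load-store accelerator that works out of a scratchpad and exposes a matrix-times-vector instruction. The instruction names, operand fields, and scratchpad size are illustrative assumptions on my part, not Cambricon's actual encoding.

```python
import numpy as np

# Toy model of a load-store accelerator with an on-chip scratchpad instead of
# a vector register file. Instruction names and fields are illustrative only.
SCRATCHPAD = np.zeros(64 * 1024, dtype=np.int16)  # 128 KB scratchpad (16-bit words)

def vload(dram, dram_addr, sp_addr, length):
    """Multi-cycle vector load: DRAM -> scratchpad."""
    SCRATCHPAD[sp_addr:sp_addr + length] = dram[dram_addr:dram_addr + length]

def mmv(sp_out, sp_matrix, sp_vector, rows, cols):
    """Matrix-times-vector instruction, operating entirely out of the scratchpad."""
    m = SCRATCHPAD[sp_matrix:sp_matrix + rows * cols].reshape(rows, cols)
    v = SCRATCHPAD[sp_vector:sp_vector + cols]
    SCRATCHPAD[sp_out:sp_out + rows] = (m.astype(np.int32) @ v.astype(np.int32)).astype(np.int16)

# One fully-connected layer: load weights and activations, then one matrix instruction.
dram = np.arange(5000, dtype=np.int16)
vload(dram, dram_addr=0,   sp_addr=0,    length=16 * 16)  # 16x16 weight block
vload(dram, dram_addr=256, sp_addr=1024, length=16)       # input vector
mmv(sp_out=2048, sp_matrix=0, sp_vector=1024, rows=16, cols=16)
```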
The simulation results reported in the paper claimed 130x lower power consumption than an nVidia K40M board. The paper reported plans to tape out a 65nm demonstration chip. This work was rewarded with a $100M Series A round that closed in August 2017. The round was led by State Development & Investment Corporation (SDIC) with participation from Alibaba Group, Lenovo Capital and Incubator Group, Chinese Academy of Sciences Investment Management, Turing, Oriza Seed Venture Capital and Yonghua Capital.
Chen Tianshi said the company is valued at one billion US dollars: “We aim to account for 30 percent of China’s high-performance AI chip market in three years.” In November 2017, the company launched three new AI chips, the 1H8, 1H16 and 1M, to address different markets such as image recognition, self-driving and other scenarios.
Cerebras: https://www.cerebras.net/ Cerebras is a stealthy startup that was founded in 2016 by Andrew Feldman and Gary Lauterbach. Andrew and Gary previously founded SeaMicro, which was acquired by AMD in 2012. SeaMicro built a datacenter server using a fabric of small low-power CPUs (such as the Intel Atom). These CPUs were interconnected using a high-bandwidth, low-latency 3D torus fabric that also utilized I/O virtualization. The fabric was realized as a 90nm chip that implemented the fabric switch, I/O virtualization, and node management.
The company raised a $27M A round in May 2016 from investors including Benchmark Capital, Foundation Capital, and Open Field Capital. A $25M B round closed in January 2017 (with a $220M pre-money valuation), and the company is raising a $60M round scheduled to close in Q1 of 2018. As is mentioned elsewhere on this page, it is interesting that Foundation Capital is invested in Cerebras given their investment in Graphcore as well.
The author is not (yet) aware of Cerebras’ architecture or implementation techniques. I have heard mention of special attention paid to sparse matrix arithmetic techniques to minimize power and memory bandwidth requirements – but this is puzzling as it is fairly standard stuff. I have also heard some rumors of a wafer scale approach – which would be quite ambitious. A review of Gene Amdahl’s experience with Trilogy might be in order there.
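Nothing below is based on disclosed Cerebras internals, but as a reminder of why sparsity matters for power and memory bandwidth, here is a small Python sketch (using NumPy and SciPy) showing how a compressed sparse row (CSR) representation of a heavily pruned weight matrix stores and moves roughly a tenth of the bytes of its dense equivalent while producing the same result:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustration only: with ~90% of the weights pruned to zero, a CSR
# representation stores and moves roughly 10% of the data a dense layer would.
rng = np.random.default_rng(0)
dense = rng.standard_normal((1024, 1024)).astype(np.float32)
dense[rng.random(dense.shape) < 0.9] = 0.0          # prune ~90% of weights

sparse = csr_matrix(dense)
x = rng.standard_normal(1024).astype(np.float32)

dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"dense: {dense_bytes/1e6:.1f} MB, CSR: {sparse_bytes/1e6:.1f} MB")

# Same result either way; the sparse product reads far fewer weight bytes.
assert np.allclose(dense @ x, sparse @ x, atol=1e-3)
```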
ETA Compute: https://www.etacompute.com/ ETA, co-founded in 2015 by Gopal Raghavan (former co-founder and CTO of Inphi), is based in Westlake Village, CA and is a developer of “ultra-low” power processors for mobile and edge device machine intelligence. On Jan 23, 2018 they announced the close of an $8M A round led by Walden International (Lip-Bu Tan) with participation by Acorn Pacific Ventures and Walden-Riverwood Ventures.
Unlike the other startups in this brief survey, ETA’s business model is one of IP licensing rather than selling physical chips. An IP model is generally not favored by institutional investors, as it usually takes longer to achieve a significant revenue rate and there are not very many existence proofs of big successful IP companies (with Rambus and ARM being the primary examples). However, Lip-Bu Tan, who leads both Walden and Walden-Riverwood, is also CEO of Cadence, which has a significant IP licensing component to its business, so Lip-Bu clearly appears more comfortable with an IP model.
ETA claims their neural network implementation is based on a spiking neuron model (a neuromorphic approach), using an underlying asynchronous technology which they call DIAL, for Delay Insensitive Asynchronous Logic. It is the asynchronous technology which they claim enables “extraordinary low power”, and asynchronous approaches have been adopted by a number of other players in this space to manage power – including Wave Computing and REM. ETA told EE Times that their asynchronous ARM Cortex-M3 runs on 2 milliwatts (mW) at 65 MHz. Spiking neural model-based algorithms developed by ETA are run on the M3. As to the adoption of a spiking neuron model, the author is somewhat skeptical of that approach, especially if it fairly slavishly imitates the human brain, for reasons that are well presented by Yann LeCun in a Facebook post (YANN-LECUN-CRITIQUE-SPIKING-NEURON-NEUROMORPHIC-APPROACHES).
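ETA has not published the details of its algorithms, but as a generic illustration of the spiking approach, here is a textbook leaky integrate-and-fire neuron in Python (all parameters are illustrative, not ETA's): the membrane potential leaks every step, accumulates weighted input spikes, and emits an output spike only when it crosses a threshold, so work happens only on events.

```python
# Generic leaky integrate-and-fire (LIF) neuron -- a textbook spiking model,
# not ETA's implementation. Computation happens only when input spikes arrive.
def lif_neuron(spike_times, weights, threshold=1.0, leak=0.95, steps=100):
    v = 0.0
    out_spikes = []
    for t in range(steps):
        v *= leak                                   # membrane potential leaks each step
        v += sum(w for (ts, w) in zip(spike_times, weights) if ts == t)
        if v >= threshold:                          # fire and reset
            out_spikes.append(t)
            v = 0.0
    return out_spikes

print(lif_neuron(spike_times=[2, 3, 4, 40], weights=[0.5, 0.4, 0.3, 0.2]))  # -> [4]
```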
Graphcore: https://www.graphcore.ai/ Graphcore is based in Bristol, England and is staffed by ex-XMOS, Icera, and Picochip veterans. The locus of chip/processor design talent in the Bristol area is one of the few, if not the only, remaining locations in Europe with the critical mass necessary to prosecute a processor design of the scope of Graphcore’s. XMOS and Icera were based in Bristol, and Picochip was in nearby Bath. The company has raised ~$110M total starting in July of 2016, with the most recent $50M round led by Sequoia. Amadeus, Robert Bosch Venture Capital, C4 Ventures, Dell Technologies Capital, Draper Esprit, Foundation Capital, Pitango, Samsung Strategy and Innovation Center and Atomico also participated. Foundation Capital, Amadeus, Draper Esprit and Robert Bosch were also investors in XMOS. Amadeus was also an investor in Icera. Interestingly, Foundation Capital is also an investor in Cerebras – which at first blush would seem a conflict.
Graphcore places a fundamental emphasis on power minimization, as they believe it to be the primary performance-limiting factor (which leads to the dark silicon problem). To that end they emphasize low-precision floating point, high-dimensional sparse graph computation approaches, and recomputing network parameters/weights on the fly rather than bringing them in from external memory or even from local cache. This is touched on in a presentation that the CTO, Simon Knowles, gave at the 3rd RAAIS summit: Simon_Knowles_Talk_RAAIS
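The recompute-rather-than-fetch idea is essentially what the training literature calls rematerialization or gradient checkpointing. The Python sketch below is purely illustrative and reflects nothing of Graphcore's actual software: a chain of layers can either keep every intermediate activation live, or keep only the input and recompute an activation from scratch when it is needed, trading extra arithmetic for a much smaller live memory footprint.

```python
import numpy as np

# Toy illustration of recompute-vs-store (not Graphcore's software): discard
# intermediate activations and regenerate them on demand instead of storing them.
def layer(x, seed):
    w = np.random.default_rng(seed).standard_normal((x.size, x.size)) * 0.1
    return np.tanh(w @ x)

def forward_storing(x, depth):
    acts = [x]
    for i in range(depth):
        acts.append(layer(acts[-1], seed=i))
    return acts                 # keeps depth+1 activations live in memory

def activation_recompute(x, i):
    a = x
    for j in range(i):          # recompute activation i from the input each time
        a = layer(a, seed=j)
    return a                    # only one activation live at a time

x = np.ones(64)
assert np.allclose(forward_storing(x, 4)[3], activation_recompute(x, 3))
```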
The energy cost due to the distance of the data from the computation is well captured by Mark Horowitz in his classic analysis:
Diagram of Data Fetch Energy vs Distance (from Horowitz)
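The figure itself is not reproduced here, but the commonly cited 45nm energy numbers from Horowitz's ISSCC 2014 keynote make the point numerically (the values below are approximate and vary with process and circuit details): fetching a 32-bit word from DRAM costs orders of magnitude more energy than the arithmetic performed on it.

```python
# Approximate 45 nm energy costs (pJ) as commonly cited from Horowitz's
# ISSCC 2014 keynote; exact values depend on process and circuit details.
energy_pj = {
    "8b int add":          0.03,
    "32b int add":         0.1,
    "16b fp mult":         1.1,
    "32b fp mult":         3.7,
    "32b SRAM read (8KB)": 5.0,
    "32b DRAM read":       640.0,
}
for op, e in energy_pj.items():
    print(f"{op:22s} {e:7.2f} pJ  ({e / energy_pj['32b fp mult']:6.1f}x a 32b fp mult)")
```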
Based on what is said in that video, and what I have heard elsewhere, I would expect to see an architecture that de-emphasizes off-chip memory bandwidth and instead places a large amount of SRAM on the die. That is in fact what was revealed in a recent presentation by Graphcore, where they showed an implementation containing two very large die (near the reticle limit – what we used to call “aircraft carrier sized” die) in the nVidia Volta size range of ~800 mm². These two die are interconnected (probably via SERDES) to provide 600MB of on-chip SRAM. An image of their Colossus (an aptly chosen name) product is shown below. The two die together are designed to operate within the maximum 300W TDP (Thermal Design Power) of a PCI board. This approach is believed to minimize overall system power while enabling very high computation rates, although their “no off-chip memory” philosophy would appear to place a hard ceiling on the size of problems that can be worked on and makes the company vulnerable to disruptive computational approaches – as is discussed further in another post here: DEEP-LEARNING-HARDWARE-DESIGN-CHALLENGES. However, Graphcore says that their Bulk Synchronous Parallel compute model can be extended to clusters of the chips, so additional memory can be had by interconnecting more of these very expensive chips together.
nVidia has taken the approach of both using external memory and interconnecting their chips together, on products like the DGX, using a toroidal-topology NVLink-based GPU interconnect technology. This interconnect technology has been refined over time and was originally developed to facilitate GPGPUs tackling very large HPC (High Performance Computing) problems.
Graphcore’s Bulk Synchronous Parallel compute model serializes computation and communication by leveraging the same processors to perform both tasks in a TDM (Time Division Multiplex) fashion. In contrast, in Wave Computing’s DPU fabric (see further down in this post for more detail) local communication is integral to the core compute model so that nearest neighbor communication does not introduce any latencies (quite the opposite in fact). Longer distance communication is handled by a switch fabric layered “on top” of the compute fabric that is statically scheduled by the same SAT Solver based placer/scheduler/router. Hence computation and communication occur simultaneously throughout the fabric.
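To make the BSP contrast concrete, here is a bare-bones Python sketch of the Bulk Synchronous Parallel pattern (illustrative only – this is not Graphcore's software stack, and the worker count and data are arbitrary): every worker computes on local data, hits a barrier, exchanges results, hits another barrier, and only then computes again, so compute and communication phases never overlap.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Minimal Bulk Synchronous Parallel skeleton (illustrative only). Each
# superstep is: local compute, barrier, exchange, barrier -- compute and
# communication never overlap.
N_WORKERS = 4
barrier = threading.Barrier(N_WORKERS)
state = [1.0 * i for i in range(N_WORKERS)]
outbox = [0.0] * N_WORKERS

def superstep(wid):
    local = state[wid] * 2.0                 # 1) local compute on tile-local data
    barrier.wait()                           #    everyone finishes computing...
    outbox[(wid + 1) % N_WORKERS] = local    # 2) ...then everyone exchanges
    barrier.wait()
    state[wid] = outbox[wid]                 # 3) next superstep sees exchanged data

with ThreadPoolExecutor(N_WORKERS) as pool:
    list(pool.map(superstep, range(N_WORKERS)))
print(state)   # [6.0, 0.0, 2.0, 4.0]
```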
Graphcore Colossus Processor Die-Pair Plot
Groq: https://groq.com/ Groq, which raised $10M from Social Capital in March 2017, made a splash early in 2017 by “poaching” eight engineers from Google’s TPU-0 team. Jonathan Ross (CTO) and Doug Wightman (CEO) are founders from Google, where the TPU-0 grew out of Jonathan’s “20% project”. Krishna Rangasayee of Xilinx took the COO slot in August 2017. Their website claims an 8 TOPS/W single-chip processor with 400 TOPS peak performance. The website also quotes a 1ms inference latency, which is a hot button for Google and nVidia: at last year’s GTC conference nVidia’s CEO Jen-Hsun Huang went to great pains to quote performance at inference latencies of no more than 7ms, to match what Google said was the maximum latency allowed for datacenter inference applications. The website also says the chip is to be released in 2018, but this is hard to believe unless they are a sales channel for something Google is building. You can’t develop a bleeding-edge processor on $10M, and I have heard nothing about additional funding.
Gyrfalcon: https://gyrfalcontech.com/ Gyrfalcon is a Milpitas, CA based Deep Learning processor startup that touts an APiM (AI Processing in Memory) technology capable of both RNN and CNN network evaluation, utilizing a specialized floating point format (5-bit mantissa and 4-bit exponent). They claim 28K parallel computing cores in the MPE (Matrix Processing Engine) that “does not require external memory for inference”. Gyrfalcon has recently announced their Lightspeeur 2801S AI processor in both Laceli USB 3.0 AI compute stick and PCIe board form factors. The company was co-founded by VP Engineering C.J. Tornq, who holds a PhD in Materials Science and has significant experience in STT (Spin Torque Transfer) MRAM technology (the author does not know if this novel memory technology is used in the Lightspeeur chip). The company is rumored to be funded in part by the CIC (China Investment Corporation).
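Gyrfalcon has not published the details of its number format, so the following Python sketch is purely illustrative: one plausible decode of a 9-bit value with a 5-bit mantissa and a 4-bit exponent, where the exponent bias, the absence of a sign bit, and the treatment of the mantissa as a fraction are all assumptions on my part.

```python
# Purely illustrative decode of a 9-bit float with a 5-bit mantissa and a
# 4-bit exponent -- the actual Gyrfalcon encoding has not been published, so
# the bias, sign handling, and normalization here are assumptions.
def decode9(bits, exp_bias=7):
    exponent = (bits >> 5) & 0xF          # top 4 bits
    mantissa = bits & 0x1F                # low 5 bits, treated as a fraction
    return (mantissa / 32.0) * 2.0 ** (exponent - exp_bias)

for b in (0b0001_10000, 0b0111_11111, 0b1111_00001):
    print(f"{b:09b} -> {decode9(b):.6f}")
```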
Hailo Technologies: Based in Tel Aviv and currently headed by CEO Orr Danon, “Hailo is a stealth mode startup in the area of deep learning, harnessing deep technology and a multidisciplinary approach to unleash machine learning potential.” Hailo is an alumnus of the Technion DRIVE accelerator – DRIVE was founded by Mayer Cars and Trucks Ltd and is sponsored by leading partners in the international and Israeli automotive industry. Here Hailo-Simplebooklet they say “Hailo’s processor has unprecedented performance and offers 1000x improvement in power and cost efficiency over existing solutions, offering a truly unique, groundbreaking product to the automotive industry. Hailo is working towards rolling out its first device throughout 2018.” Supposedly sucking up good local talent, and something to watch. Can be reached at: contact@hailotech.com
Horizon Robotics: http://en.horizon.ai/ Horizon Robotics is a Chinese start-up founded in 2015 and, along with Cambricon, is another Chinese Deep Learning chip startup to recently close a $100M A round (last October). Only this round was led by Intel Capital (ICAP). Other investors include GMO VenturePartners, Visionnaire Ventures, Itochu Technology Ventures, NTT DATA, Morningside Venture Capital, Wu Capital, Linear Venture, Hillhouse Capital Group, the Innovative Venture Fund Investment jointly managed by NEC Capital Solutions and the Investments Department at SMBC Venture Capital, and Archetype Ventures.
They call their technology the BPU (Brain Processing Unit). An early application is autonomous vehicles (which is no doubt of interest to Intel); the company's stated goal is "to have their technology in cars on Chinese roads by 2019".
Mythic: https://www.mythic-ai.com Mythic started life in 2012 as Isocline, founded by Mike Henry and David Fick shortly after Mike received his PhD from Virginia Tech and David received his PhD from U Michigan. They survived for the first 4-5 years on $2.25M of STTR-type government contracts, developing ultra-low power GPS receivers leveraging proprietary analog computing techniques. So they are an interesting player in the Deep Learning chip space as they bring fundamentally different implementation technology, and like Wave, they have been around long enough to mature it. Furthermore their technology would seem to directly address what others (including Graphcore) have pointed out is the fundamental limiting factor to performance – power dissipation.
The company has offices in Redwood City and Austin. The chip engineering is done in Austin. Mythic has raised ~$56M which includes a ~$9M A round in April 2017. Investors include DFJ (Steve Jurvetson – lead), Lux Capital (Shahin Farshchi), and DCVC (Matt Ocko). The company announced the close of a $40M Series B round at the end of March 2018. The round was led by SoftBank with participation by the existing insiders. Lockheed Martin Ventures and Andy Bechtolsheim also participated.
Mythic’s website claims they perform in-memory (Flash memory) inference calculations using a hybrid digital/analog calculation approach providing significant power advantages. Since network evaluation is relatively tolerant of noise (noise is often deliberately ADDED during training, for example in back-propagation, to improve convergence behavior), an analog computing approach could be very interesting, especially since embedded flash is ~10x denser than SRAM. Of course Flash has write wear-out issues, but if the network is not modified that often, that issue could be readily managed. There is a patent application (#20160048755) titled “Floating-Transistor Array for Performing Weighted Sum Computation” which perhaps gives an indication of the techniques employed.
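As a toy illustration of that noise tolerance (this is not Mythic's circuit, and the 2% error figure and layer sizes are arbitrary assumptions), the Python sketch below perturbs the analog-style weighted sums of a classifier layer by a few percent and counts how often the predicted class actually changes:

```python
import numpy as np

# Toy model of a noisy analog weighted sum (not Mythic's circuit): perturb the
# accumulation of each output neuron by a few percent and check how often the
# predicted class (argmax) actually flips.
rng = np.random.default_rng(1)
W = rng.standard_normal((10, 256))        # 10 classes, 256 inputs
flips, trials = 0, 1000
for _ in range(trials):
    x = rng.standard_normal(256)
    clean = W @ x
    noisy = clean * (1.0 + 0.02 * rng.standard_normal(10))   # ~2% analog error
    flips += int(np.argmax(clean) != np.argmax(noisy))
print(f"prediction changed in {flips}/{trials} trials")
```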
Their website claims 50x lower power than an all-digital approach, with negligible loss of accuracy for inference. Mythic is initially targeting edge devices such as drones, battery-operated sensors, etc., as opposed to the datacenter.
Reduced Energy Microsystems (REM): http://remicro.com/ REM is a scrappy startup targeting low-power inference for the Edge/IoT for tasks such as image recognition. REM was founded in 2014 by Dylan Hand (Head of HW), Leazar Vega Gonzalez (Head of SW), and William Koven (CEO).
REM employs low-voltage (near-threshold) design techniques as well as asynchronous techniques to reduce power consumption. Peter Beerel is chief scientist; Peter has worked on asynchronous technology for many years, starting as an associate professor at USC. He advised Fulcrum and then founded Timeless, which was acquired by Fulcrum, and Fulcrum (a maker of high-performance asynchronous switch fabric chips) was itself later acquired by Intel.
According to a public press release, REM has joined the RISC-V Foundation and is building a low-power (probably asynchronous) RISC-V core in GlobalFoundries’ 22FDX process.
According to Crunchbase, REM received $2M in seed funding from Draper Associates in March of 2017. REM also received some funding as part of an SBIR grant awarded in August 2017.
SambaNova Systems: https://sambanovasystems.com/ SambaNova recently (March 2018) announced the close of a $56M A round led by Walden (Lip-Bu Tan) and Google Ventures (GV), with Atlantic Bridge Ventures and Redline Capital also participating. The company is led by Rodrigo Liang (ex-Afara, Sun, Oracle), Kunle Olukotun (Afara, Mines.io, Stanford), and Chris Re (Stanford). SambaNova is focused on “building machine learning and big data analytics platforms”. Their solution is claimed to “enable the acceleration of fast-evolving algorithms in data analytics and machine learning” – which no doubt plays to the concern that architectures built around analysis of today’s targeted workloads may lack the flexibility to cope with rapid technology development. The company has a presence in Palo Alto and Austin, TX. Current headcount is claimed to be 50, only 5 months after being founded.
ThinCI: https://thinci.com/ ThinCI (ThinC Innovations) was founded in 2010 with a focus on high-performance, low-power vision processing for consumer products, and ran for quite some time on a low burn rate as they matured their technology. The company closed a B round in Oct 2016 with an investor syndicate that included Denso, Magna International, Intercept Ventures and a host of Angels. Dado Banatao was an early supporter of the company and sits on the board, along with Tony Cannestra of Denso.
The original architecture focus was a multi-core, multi-threaded architecture for 3D Graphics, Multimedia, and DSP with an emphasis on maximizing performance/area/watt. Early target markets included Digital Television and Network Graphics.
As the company has matured, the focus has shifted somewhat to energy- and resource-efficient camera sensor processing. The original targets were classic CV tasks such as edge detection and corner detection, but as Deep Learning became the dominant approach for vision tasks such as scene segmentation, face detection, recognition and tracking, and pedestrian detection, the company has adapted the architecture to DL-network-based inference processing.
With recent investments from Denso and Magna the company likely has traction in the automotive space for vision processing tasks.
In a recent Hot Chips presentation, ThinCI rolled out their Streaming Graph Processor (SGP) architecture, claiming PCIe development board shipments in Q4 2017 and touting native graph (dataflow graph) processing. Wave Computing is another company that supports native dataflow graph processing, and this is a good match for Machine Learning and existing DL Frameworks. At a lower architectural level, ThinCI uses fine-grained thread scheduling and hardware instruction scheduling.
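Neither ThinCI nor Wave has published compiler or scheduler internals, but the essence of "native dataflow" execution can be sketched generically in a few lines of Python (the graph and operations below are made up for illustration): a node fires as soon as all of its operands are available, rather than waiting its turn in a sequential instruction stream.

```python
import operator
from collections import deque

# Generic dataflow-graph execution sketch (not ThinCI's or Wave's actual
# scheduler): a node fires as soon as all of its input operands are available.
graph = {                      # node: (operation, input nodes)
    "a": (None, []), "b": (None, []),
    "c": (operator.add, ["a", "b"]),
    "d": (operator.mul, ["c", "b"]),
}
values = {"a": 3, "b": 4}      # graph inputs arrive first

ready = deque(n for n, (op, ins) in graph.items()
              if ins and all(i in values for i in ins))
while ready:
    node = ready.popleft()
    op, ins = graph[node]
    values[node] = op(*(values[i] for i in ins))          # fire the node
    ready.extend(n for n, (o, i) in graph.items()         # firing may enable others
                 if n not in values and n not in ready
                 and all(j in values for j in i))
print(values)   # {'a': 3, 'b': 4, 'c': 7, 'd': 28}
```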
The chip is fabbed in 28nm at TSMC, and ThinCI claims a power dissipation of ~2.5W, which would make their chip a good match for automotive or other edge Deep Learning inference vision/image processing applications.
Vathys: http://vathys.ai Vathys (formerly Ingemini) is a Portland-based startup founded by Tapa Ghosh that is targeting the high-performance end of the deep learning chip market (along with Graphcore, Cerebras, and Wave). The company has backing from Y Combinator and is planning a 28nm test chip to prove the technology behind some bold claims. These enabling claims include:
- Vathys uses asynchronous logic (as do Wave, REM, and ETA). They claim a new form of asynchronous logic that has only a 10% area overhead and a 10+ MHz equivalent computation rate at 28nm.
- A new memory cell that is 5x denser than standard 6T SRAM.
- A new wireless interconnect technology to facilitate memory die stacking.
Tapa recently gave an EE380 presentation at Stanford, and a video of that presentation can be found here Vathys-Presentation-Video, along with slides for the presentation Vathys-Presentation-Slides.
Wave Computing: https://wavecomp.ai/ Wave, founded in late 2009, is located in Campbell, CA and recently closed a Series D round of $53.7M, bringing the total investment in the company to $117M (according to Crunchbase). The company has announced a 3U rack-mount Deep Learning compute appliance with ~3 PetaOps/s of performance and > 2TB of memory, initially supporting TensorFlow. Wave also announced an early access program for beta machines. This product is based on Wave’s massively parallel Dataflow Processor called the DPU (Dataflow Processing Unit).
Four DPU chips are packaged per board per the figure below, putting a total of 64K 8-bit processors to work per board (these processors can be ganged together on a cycle-by-cycle basis to perform 8/16/32/64-bit operations).
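To show what ganging 8-bit processors into wider operations means in principle (the DPU's actual carry-chaining mechanism is not public, so this is only an illustration), here is a Python sketch that composes a 32-bit add out of four 8-bit adds by propagating the carry from slice to slice:

```python
# Illustrative bit-slice composition (the DPU's real carry chaining is not
# public): build a 32-bit add out of four 8-bit adds by propagating the carry.
def add8(a, b, carry_in):
    s = a + b + carry_in
    return s & 0xFF, s >> 8          # 8-bit sum, carry out

def add32_from_slices(x, y):
    result, carry = 0, 0
    for i in range(4):               # four 8-bit processors, LSB slice first
        a = (x >> (8 * i)) & 0xFF
        b = (y >> (8 * i)) & 0xFF
        s, carry = add8(a, b, carry)
        result |= s << (8 * i)
    return result & 0xFFFFFFFF

assert add32_from_slices(0x1234ABCD, 0x0F0F0F0F) == (0x1234ABCD + 0x0F0F0F0F) & 0xFFFFFFFF
print(hex(add32_from_slices(0xFFFFFFFF, 1)))    # 0x0 -- wraps like 32-bit hardware
```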
At the recent Hot Chips conference in August 2017, the CTO (Dr. Chris Nicol) claimed a 1000x performance advantage over GPU. TheNextPlatform gave a pretty good summary of that talk here: TheNextPlatform-Wave
Several unique features of the product as pointed out in the NextPlatform article include:
- A central Host CPU is not needed, minimizing the “Amdahl’s law” performance limitations of Host CPU interactions
- The fabric is self-timed using a globally asynchronous/locally synchronous approach, with an average instruction rate of 6.7 GIPS per processor
- There is a switch fabric that is also instruction driven and statically scheduled that runs at the same rate as the processors
- Programs are loaded in and out of an on-chip cache using DMA without host CPU intervention
- Dataflow graphs are taken directly from TensorFlow and mapped into the DPU using Wave’s spatial compiler to schedule, map, and route the graph directly onto the DPU fabric. In a recent presentation by Samit Chaudhuri, Wave’s VP of SW, it was disclosed that Wave uses a SAT Solver approach in its spatial compiler to achieve very high QoR (a toy version of this mapping problem is sketched below).
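Wave attacks the real problem with a SAT solver over a fabric of thousands of processors; the tiny Python sketch below is entirely my own illustration, with a made-up four-operation graph and a 2x2 grid, and only shows the shape of the constraints: every operation gets exactly one processing element, and every producer must land adjacent to its consumer.

```python
# Toy version of the spatial mapping problem (Wave attacks the real thing with
# a SAT solver; exhaustive search here just shows the constraints): place the
# ops of a small dataflow graph onto a 2x2 PE grid so that every producer ends
# up adjacent to its consumer.
from itertools import permutations

ops = ["load", "mul", "add", "store"]
edges = [("load", "mul"), ("mul", "add"), ("add", "store")]
pes = [(0, 0), (0, 1), (1, 0), (1, 1)]          # 2x2 grid of processing elements

def adjacent(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1

for placement in permutations(pes):
    where = dict(zip(ops, placement))
    if all(adjacent(where[a], where[b]) for a, b in edges):
        print(where)                             # first placement meeting all constraints
        break
```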
Some history of my time as founder/CEO of Wave can be found on this site here: WAVE-COMPUTING
Several of the block diagrams from Dr. Nicol’s Hot Chips presentation are included below:
Wave DPU Chip Characteristics and Features
Wave DPU Board Block Diagram and Features