ACCOMPLISHMENTS as Founder & CEO at Wave Computing: (12/'09 - 6/'16)
- As an Executive-in-Residence, incubated Wave Semiconductor within Tallwood Venture Capital starting in late 2008
- Acquired key Asynchronous Logic Technology and Dataflow Architecture IP from third parties
- Launched Wave as founder and CEO in Dec 2009 with investment from Tallwood VC and Southern Cross VP
- Recruited a world-class leadership team (VP SW, VP Eng, CTO, VP Marketing, VP Biz Dev, VP SoC Integration)
- Closed series B funding in May 2012
- Demonstrated successful 28nm prototype silicon (TSMC) in early 2013, proving out the dataflow architecture and asynchronous technology at 10GHz. Achieved this on a total burn to that point of ~$12M and an FTE headcount of 15
- Grew team to 40 US FTE, 12 US consultants, and 85 contractors overseas at multiple sites in India, Sri Lanka, and Armenia
- Secured $20M+ development contract
- Successfully pivoted the company to Deep Learning in mid-2015, renaming it Wave Computing with a focus on Deep Learning training acceleration, and changed the business model from fabless semiconductor (with an initial market focus of boards for HPC) to rack-mount server solutions for datacenter Deep Learning network training
- Presented Wave’s asynchronous technology at multiple venues, including Stanford SystemX, FDSOI Forum in Shanghai and UC Berkeley
- Secured ~$40M in equity, notes and NRE during my 6.5yr tenure as CEO
- Wave closed a Series D of ~$56M in Q3 2017, bringing in new investors and bringing total capital raised to ~$117M (Crunchbase data), and the company seems well positioned for success
SOME HISTORY
As of this writing in 2017, Wave Computing has announced a 3U rack mount Deep Learning compute appliance with ~3 PetaOps/s of performance and > 2TB of memory initially supporting TensorFlow. For more information visit Wave’s website: https://wavecomp.ai/
Wave’s Genesis – The High Cost of ASIC Initiative: Deep submicron chips cost too much to develop and take too long to design and build. This is a big problem for companies of all sizes and stages and creates a large unmet need in the market. Dado Banatao, the founder and managing general partner of Tallwood VC, started searching for a solution to this unmet need in the mid 2000s by creating an initiative within Tallwood called “Cost of ASIC”.
Theseus Research: A company that caught Dado’s attention early was Theseus Research, where Karl Fant had developed a novel, very robust, data-driven, low-power, dual-rail asynchronous technology called Null Convention Logic (NCL). However, NCL by itself was insufficient to develop a Cost of ASIC solution; a few years later Theseus Research returned with a dataflow computational architecture (called the FlowGraph Machine, or FGM) utilizing NCL, and in 2008 I began incubating Wave Semiconductor within Tallwood to develop a Cost of ASIC solution incorporating the FGM.
FPGA and “The Giant Sucking Sound”: While incubating Wave, several other trends became apparent that bore on Wave’s technical approach and business plan. The first was that the only realistic alternative to ASIC in the market was FPGA, but the performance of LookUp Table (LUT) based FPGAs had stalled to the point where effective clock rates realized for most designs were rarely above 200-300MHz. This created a significant market opportunity if an alternative with better performance and shorter development times could be developed. The second trend was that the inexorable march of Moore’s law was making it difficult for targeted ASICs to survive as standalone chips for very long before being absorbed into a system SoC, transforming a fabless semiconductor business model into an IP licensing model – which generally was not positive for a company’s valuation. The latter issue argued for a “platform” approach for Wave’s product that would span a breadth of applications in a manner similar to FPGA. Although Wave’s technology and solution were very different from classic FPGA, and we were addressing a much broader market, I had prior due diligence exposure to FPGA, as well as experience pitching an FPGA fabless semiconductor business model at Leopard Logic, described elsewhere on this site: LEOPARD-LOGIC
Assembling the IP Foundation: By early 2009 NCL had been acquired from Theseus by a third party, and even though one of my investors suggested I simply license NCL and the FGM, I felt it was crucial for Wave to take the time and effort to own these core technologies rather than license them: ownership would make the company less vulnerable to product development schedules and market entry windows, allow us to improve and extend the acquired patent portfolio, and eliminate the possibility of competition from other licensees. Hence, as part of the company formation process, Wave acquired both NCL and Theseus’ flowgraph architecture.
Initial Product Requirements and Business Model: Wave’s product was to be a chip that would:
- Be programmable in a Hardware Description Language (HDL) such as Verilog
- Achieve a “functional density” that was superior to FPGA
- Consume less power than an FPGA
- Be faster than an FPGA
The chip would be offered as a platform, similar to an FPGA model, initially introduced into the market targeting specific niche(s).
Launch: After a typical Tallwood Executive-in-Residence tenure of 3+ years, I departed Tallwood in August ’09, and in mid-December Wave was launched with seed funding from Tallwood and Southern Cross VP, with three founders – myself as CEO, Karl Fant, and Karl’s business development partner, Ken Wagner.
Technology De-Risking: A key use of the initial funding was to de-risk (prove) critical proprietary technologies. Wave’s de-risking milestones were:
- Demonstrate the underlying asynchronous technology worked and was scalable
- Demonstrate the data flowgraph fabric worked and was scalable
- Demonstrate we could compile a clocked Hardware Description Language (Verilog) into untimed FlowGraph form, map it to the dataflow fabric, and obtain functionally correct results
Initial Product – the Software Defined ASIC: Wave’s initial product goal as a fabless semiconductor company addressing the “Cost of ASIC Problem” was to create a massively parallel dataflow computational fabric that would enable a “Software Defined ASIC” – later called the DPU (DataFlow Processing Unit). The goal of the DPU was to emulate the designer’s RTL (logic description) so efficiently that it could “be the chip”, providing an alternative way for systems designers to implement their ASICs, and thus bridging the growing gap between the LUT (LookUp Table) fabric used by the FPGA companies Xilinx and Altera and the extremely expensive deep submicron custom ASIC design that fewer and fewer (generally large) companies were capable of pulling off. The chip would also displace FPGA in certain applications. For such applications there was little need for floating point capability.
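To make the dataflow execution model concrete, here is a minimal sketch – my own toy illustration, not Wave’s implementation – of dataflow “firing” semantics: each operation executes as soon as all of its input operands have arrived, with no program counter sequencing the work.

```python
# Minimal sketch of dataflow "firing" semantics (an illustration,
# not Wave's implementation): a node executes as soon as all of its
# input tokens have arrived -- there is no program counter.

class Node:
    def __init__(self, name, op, inputs):
        self.name = name        # node label
        self.op = op            # function applied to the input tokens
        self.inputs = inputs    # names of upstream producers
        self.tokens = {}        # operands that have arrived so far

    def offer(self, src, value):
        """Accept a token; fire and return a result once all operands are present."""
        self.tokens[src] = value
        if len(self.tokens) == len(self.inputs):
            args = [self.tokens[i] for i in self.inputs]
            self.tokens = {}
            return self.op(*args)   # result token flows downstream
        return None

# Dataflow graph for y = (a + b) * (a - b)
add = Node("add", lambda x, y: x + y, ["a", "b"])
sub = Node("sub", lambda x, y: x - y, ["a", "b"])
mul = Node("mul", lambda x, y: x * y, ["add", "sub"])

for src, val in [("a", 7), ("b", 3)]:
    for n in (add, sub):
        out = n.offer(src, val)
        if out is not None:                  # n fired; forward its token
            result = mul.offer(n.name, out)
            if result is not None:
                print("result:", result)     # result: 40
```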
Meeting with Flip: I met with Flip Gianos of Interwest in mid 2009 to discuss Wave (Interwest was an original lead investor in Xilinx in 1984, and at that time Flip was Chairman), and he told me the Xilinx board was concerned that the LUT fabric had “run out of gas”, and urged me to talk to Xilinx research. I did not follow up on his suggestion, as I had hoped Flip might take off his Xilinx hat and put on his investor hat – but it only reinforced what I already knew, which was that FPGA LUT fabric performance improvements had stalled, and that the FPGA companies were morphing into SoC (Systems on Chip) developers in no small part because the performance of their LUT fabrics was stalling out.
Bits vs Bytes: The next logical step up in operand granularity above the bit-oriented LUT for a programmable approach was the 8-bit byte – so a dataflow fabric of simple byte processors was chosen by Wave as the target approach. One can think of Wave’s fabric as a mesh of tens of thousands of minimalist 8-bit microprocessors like those used in the Apple II, only each processor runs 10,000 times faster, and many threads of instruction execution flow in both space and time through the fabric. Larger precision operations would be composed using multiple byte processors ganged together, as sketched below.
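As a toy illustration of composing wider precision from byte-wide units (my own example, not Wave’s microcode), here is a 32-bit add built from four 8-bit adds chained by carry:

```python
# Illustrative sketch (my example, not Wave's microcode): composing a
# 32-bit add from four 8-bit adds with carry, the way ganged byte
# processors can build wider-precision operations.

def add8(a, b, carry_in):
    """One byte-processor step: 8-bit add with carry in/out."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8   # (byte result, carry out)

def add32(x, y):
    """Four byte-lane adds, least significant byte first."""
    result, carry = 0, 0
    for lane in range(4):
        xb = (x >> (8 * lane)) & 0xFF
        yb = (y >> (8 * lane)) & 0xFF
        sb, carry = add8(xb, yb, carry)
        result |= sb << (8 * lane)
    return result & 0xFFFFFFFF

assert add32(0x12345678, 0x11111111) == 0x23456789
assert add32(0xFFFFFFFF, 1) == 0   # wraps, as 32-bit hardware would
```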
Enabling Approaches: Extremely high performance was required of the DPU to achieve a functional density exceeding what an FPGA can emulate, so silicon efficiency and high instruction execution rates (5G+ instructions/sec/core) were crucial for product success. However, density and very high speed beget high power dissipation – and here is where our proprietary asynchronous technology was believed to be a key enabler, as it made possible a low power implementation using logic that was glitch free, data driven, and did not require clocks or other highly conditioned (and power hungry) signals.
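NCL-style logic is built from threshold gates with hysteresis: a gate asserts its output only once a complete set of DATA inputs has arrived, and deasserts only once all inputs have returned to NULL – the state-holding behavior that makes the logic glitch free and data driven. Below is a behavioral sketch, a simplification of my own rather than Wave’s WTL circuit designs:

```python
# Behavioral sketch of an NCL-style threshold gate with hysteresis
# (a simplification for illustration, not Wave's WTL circuits).
# A THmn gate asserts when at least m of its n inputs are asserted,
# and deasserts only when ALL inputs have returned to NULL (0).
# The hysteresis is what makes the output glitch-free and data-driven.

class ThresholdGate:
    def __init__(self, m, n):
        self.m, self.n = m, n
        self.out = 0   # gate state persists between evaluations

    def eval(self, inputs):
        assert len(inputs) == self.n
        high = sum(inputs)
        if high >= self.m:
            self.out = 1          # complete DATA set: assert
        elif high == 0:
            self.out = 0          # complete NULL wavefront: deassert
        return self.out           # otherwise hold previous state

th22 = ThresholdGate(2, 2)        # TH22 behaves like a Muller C-element
print(th22.eval([1, 0]))          # 0 -- holds NULL until both arrive
print(th22.eval([1, 1]))          # 1 -- DATA wavefront complete
print(th22.eval([1, 0]))          # 1 -- holds DATA until both return to NULL
print(th22.eval([0, 0]))          # 0 -- NULL wavefront complete
```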
I made the somewhat controversial decision that, given the limited resources of the company at the time, and the fact that full custom silicon design in a bleeding edge fabrication technology was required to be power, performance, and price competitive, the design should consist of a small number of full custom tiles that Wave could afford to highly optimize, which in turn could be regularly “stepped and repeated” to arbitrarily scale the DPU size as needed. Furthermore, all routing (signal connections) within the DPU was to be achieved simply by abutment of the tiles (aka “routing by abutment”). This decision to select a homogeneous tileable architecture would serendipitously prove to have profoundly positive consequences for the compiler approach and QOR, as explained further below.
Static vs Dynamic Scheduling: The FGM dataflow architecture acquired from Theseus Research was fully asynchronous and dynamically scheduled. A dynamically scheduled architecture, while wonderfully flexible and insensitive to operand arrival times, pays a penalty for that flexibility in area and speed. Despite my concerted efforts to streamline the dynamically scheduled architecture, the costs associated primarily with operand queuing and tagging left area and performance roughly a factor of three from where they needed to be, and there were also concerns about potentially chaotic behavior under congestion. It was not until Rado Danilak and Dmitry Vyshetsky suggested converting to a statically scheduled version of the DPU that the necessary breakthrough in density and performance was achieved. The dataflow model remains, but the low level coordination between processors in the fabric and all flow of data are statically scheduled in time by software. With this architectural modification, much of the work of coordinating operation on the chip shifted from the hardware to the compiler/placer/scheduler/router, with a significant improvement in run-time power, area, and performance – provided the tools could deliver a sufficient QOR! A toy illustration of the contrast with the dynamic firing sketch above follows.
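Here is a minimal, assumption-laden toy (not Wave’s tools) showing the shift: instead of hardware matching operands at run time, the compiler fixes every operation’s (processing element, cycle) slot ahead of time, and execution becomes simple table-driven lockstep.

```python
# Toy illustration (not Wave's tools) of the dynamic -> static shift:
# the "compiler" output below fixes every operation's (PE, cycle)
# slot ahead of time, so no run-time tags or operand queues are needed.

# Same graph as before: y = (a + b) * (a - b)
schedule = {
    0: {"PE0": ("load_a", ()), "PE1": ("load_b", ())},
    1: {"PE0": ("add", ("PE0", "PE1")), "PE1": ("sub", ("PE0", "PE1"))},
    2: {"PE0": ("mul", ("PE0", "PE1"))},
}

ops = {
    "load_a": lambda: 7, "load_b": lambda: 3,
    "add": lambda x, y: x + y, "sub": lambda x, y: x - y,
    "mul": lambda x, y: x * y,
}

regs = {}  # value each PE produced in the previous cycle
for cycle in sorted(schedule):
    next_regs = dict(regs)
    for pe, (op, srcs) in schedule[cycle].items():
        next_regs[pe] = ops[op](*(regs[s] for s in srcs))
    regs = next_regs   # all PEs advance in lockstep
print(regs["PE0"])     # 40
```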
Staffing Wave’s Senior Leadership: The senior leadership team I recruited included Samit Chaudhuri, who led the software/tools team after joining the company early on. GP Singh, who excels at high performance and full custom design, was initially brought in as a consultant in Sept 2010 but later joined the company full time. In addition to the full custom design of the core processing elements, GP led the effort to significantly improve and evolve the NCL logic into three distinct logic families (which became known as WTL – Wave Threshold Logic), and he, John Li, Chris Nicol, Samit Chaudhuri, Roger Carpenter, and others really stepped up to get the 28nm test chip (described below) done and out the door. In January 2010 I was introduced to Chris Nicol by Larry Marshall of Southern Cross VP, and that introduction kicked off a collaboration with NICTA, an Australian national research institute where Chris was CTO of Embedded Systems. Chris joined the company full time in May 2011, initially as VP WTL IP Products, and later assumed roles as VP Systems Engineering and CTO. Chris has since led the DPU architectural effort, making it real: designing an elegant DMA architecture, the switch fabric that sits on top of the processor fabric, the I/O interfaces, etc. To drive the marketing side, in March 2011 the company brought in Richard Terrill as a consultant, and Richard later joined the company full time. In Feb 2012 I brought in Derek Meyer in a consulting VP Business Development role. In late 2015, as the product took on more of an SoC character, I felt the company needed additional leadership in higher level silicon systems implementation, in particular design verification, and in November Darren Jones, who had led DV (Design Verification) on the 16nm Xilinx Zynq SoC product, started consulting for us and shortly afterwards, in mid-December, joined full time. Also in late 2015, after I had pivoted the company to a Deep Learning systems solution focus, I brought in Jin Kim, who had extensive experience in Machine Learning/Deep Learning, as VP Marketing, and Richard Terrill moved to VP Ops.
Wave Campbell HQ
Location, Location, Location: The office environment and physical location of the company HQ play a significant role in defining the character of a company, as well as in attracting and retaining talent. By mid 2013 Wave needed to move out of its third location (at that time in Sunnyvale), and with Silicon Valley commutes becoming bad enough to significantly degrade quality of life, special attention was paid to commute impact on all employees as we searched for a new location. After an extensive search I settled on the current HQ a block west of downtown Campbell. The building is across the street from the Campbell community center (where company all-hands meetings were frequently held), which has many exercise facilities (track, tennis, pool), and it is a block from downtown Campbell, where many restaurants and shops are within easy walking distance. The company has been able to stay in the three story building by taking more of the space as it has grown.
Shelby Test Chip Plot
28nm Test Chip Eval Board
Test Chip Eval Bench Setup
Shelby 28nm Test Chip: In early 2013 Wave saw first silicon on a 28nm test chip, internally called Shelby, that was fabbed at TSMC (a photo of the eval board with test chip is shown to the left). The chip successfully demonstrated Wave’s asynchronous logic and the operation of a core tile consisting of a quad (4) of full custom asynchronous Processing Elements, as well as other functionality. Shelby had 16 different power domains, demonstrated robust operation across temperature and voltage, and operated down to 300mV. When the voltage was pushed a bit, the chip achieved 10GHz operation – a remarkable full custom design achievement by a small, cash strapped startup in a 28nm process – and I’m sure this demonstration played a role in TSMC granting us, a small startup, early access to their 16nm Process Design Kit (PDK). Shelby was an achievement I remain proud of. At that time the company had only 15 full time employees and about 10 local consultants, and the full time engineering team that delivered the chip and the evaluation board numbered less than 10. By this time we had also demonstrated that we could take a non-trivial amount of clocked RTL, convert it into untimed flowgraph form, map it into the fabric, execute the flowgraph, and achieve functionally correct results.
SAT Solver Based Place, Route, & Scheduling – a Really Big Deal: The quality of dataflow graph placement, routing, and – with the move to a statically scheduled architecture – scheduling is critical to overall DPU performance on every metric: throughput, power consumption, latency, and functional density, all of which can simply be described as Quality of Result (QOR). Place and Route (P&R) for ASIC design or for FPGA is a perennial challenge, and for FPGA in particular, once resource utilization – particularly routing resource utilization – exceeds 70% or so, the P&R system becomes quite brittle and small additions/changes can have a large negative impact on QOR. In early 2015 a senior team member (Asmus Hetzel) suggested that the DPU’s homogeneous regular array of simple processing elements, coupled with static scheduling, would enable a different approach using a SAT Solver (also called Propositional or Boolean Satisfiability).
Attempting to “compile” – or place, route, and schedule – dataflow graphs on a spatially resource rich non-Von Neumann architecture like the DPU is very hard. Traditional approaches and even manual compilation simply cannot explore all of the potential alternatives available, and so “leave a lot on the table”. The DPU has a great many resources and alternatives for data locality, including a temporal dimension that relates latency of access to spatial distance.
A SAT Solver approach requires that the architecture’s spatial resources and temporal behavior be formally describable as “propositions” or “templates”, and is not likely adoptable by FPGA or GPU – but with the homogeneous, fairly simple, statically scheduled DPU fabric it was doable.
SAT solvers historically suffer from “combinatorial explosion” when attempting larger problems, but the Wave team has invented some “divide and conquer” techniques that make the approach not only amenable to larger problems (say up to 100 or so processing elements), but also reduce the compile time.
So far the QOR of the SAT solver “compilation” approach greatly exceeds either manual or conventional heterogeneous/ensemble approaches – and in my mind represents a breakthrough enabling the full potential of a massively parallel homogeneous CGRA (Coarse Grained Reconfigurable Array) like the DPU to become the dominant compute implementation approach for “embarrassingly parallel” workloads such as those found in Deep Learning. A recent paper on Wave’s SAT solver based compiler presented by Samit Chaudhuri at ICCAD 2017 can be found here: WAVE’S-SAT-SOLVER-BASED-COMPILER-PAPER
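To give a flavor of the idea – the encoding below is my own toy construction, not the formulation in Wave’s paper – here is a tiny place-and-schedule problem expressed as boolean clauses and handed to an off-the-shelf solver via the open-source python-sat package:

```python
# Toy SAT encoding of place-and-schedule (my illustrative construction,
# not the formulation in Wave's ICCAD paper). Requires the open-source
# python-sat package: pip install python-sat
from pysat.solvers import Glucose3
from itertools import combinations, product

ops  = ["a", "b", "mul"]                  # mul consumes a and b
deps = [("a", "mul"), ("b", "mul")]
PEs, CYCLES = 2, 2

# Boolean variable x(op, pe, t): op runs on processing element pe at cycle t
vid = {}
for o, p, t in product(ops, range(PEs), range(CYCLES)):
    vid[(o, p, t)] = len(vid) + 1

s = Glucose3()
for o in ops:
    slots = [vid[(o, p, t)] for p in range(PEs) for t in range(CYCLES)]
    s.add_clause(slots)                            # each op gets >= 1 slot
    for v1, v2 in combinations(slots, 2):
        s.add_clause([-v1, -v2])                   # ...and <= 1 slot
for p, t in product(range(PEs), range(CYCLES)):
    for o1, o2 in combinations(ops, 2):            # no slot double-booked
        s.add_clause([-vid[(o1, p, t)], -vid[(o2, p, t)]])
for early, late in deps:                           # dependencies: strict order
    for p1, t1, p2, t2 in product(range(PEs), range(CYCLES), repeat=2):
        if t2 <= t1:
            s.add_clause([-vid[(early, p1, t1)], -vid[(late, p2, t2)]])

if s.solve():
    model = set(s.get_model())
    for key, v in vid.items():
        if v in model:
            print("op %s -> PE%d, cycle %d" % key)
```

A real fabric has thousands of slots and many more constraint types (routing, memory ports, latency), which is where the combinatorial explosion – and the divide-and-conquer partitioning mentioned above – comes in.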
Pivot to Deep Learning: By mid 2014 Wave was examining a number of potential go-to-market products in the High Performance Computing (HPC) market and had changed its business model from fabless semiconductor to PCIe board level products. In May 2015, after additional market analysis and discussions with others in industry, including Sumit Gupta (then VP and GM of HPC at IBM, who had previously played a significant role in NVIDIA’s pivot to deep learning), it became clear the DPU was exceptionally well suited to “embarrassingly parallel” Deep Learning computational tasks. The DPU architecture has many characteristics advantageous for Deep Learning computation, such as an abundance of uniformly distributed high speed memory very close to the (also uniformly distributed) loci of computation – ideal from both a minimum power dissipation and a maximum performance perspective. The dataflow architecture of the fabric is also a good “impedance match” to the dataflow model of neural network computation, enabling the DPU to operate with minimal external host CPU interaction. It was not long afterwards that I pivoted Wave to Deep Learning: Wave Semiconductor became Wave Computing, with a business model change from PCIe boards for HPC to rack mounted Deep Learning acceleration systems. By this time Wave had also secured a $20M+ customer contract to deliver silicon based on the DPU.
Headcount Growth: By early 2016, Wave had a peak local headcount of 40 FTE, with about a dozen local consultants and ~85 contractors overseas at four locations. Full custom design requires a lot of heads for chip layout, simulation, verification, etc. – and we used the overseas resources for these types of tasks, with all the decision making occurring in Campbell. This extensive headcount bias (2.5:1) towards contractors was driven by cost efficiency requirements, but it placed a heavy management burden on the small home team, and the quality of many of the overseas heads was not satisfactory. However, with up to a 4:1 cost advantage, the cash flow incentive was compelling.
Capital Efficiency: Over my 6+ yr tenure as CEO, Wave received roughly $40M in equity, convertible promissory note, and NRE funding. This was during a long drought when funding by institutional investors for semiconductor startups was virtually nonexistent. I am most grateful to Wave’s investors, particularly Dado at Tallwood, for believing in the company, the technology, and the opportunity, and for continuing to support the company prior to the Deep Learning revolution and the revival of interest by institutional investors in hardware/systems solutions for specialized workloads such as Deep Learning.
The company burned $12M in its first 3.5 years, which included delivery of the Shelby 28nm test chip. However, during my tenure the company never had a month-end cash-on-hand in excess of $6M, and had 28 months where month-end cash-on-hand was less (often much less) than $500K. At the end of each of roughly 45 months there was less than a 90-day cash runway, and for roughly 17 months there was less than a 30-day cash runway – so yes, I know something about capital efficiency. The cost of a mask set alone to fabricate a 16nm chip is on the order of $6M, and annual costs for chip design tools at bleeding edge process nodes can easily run into the millions.
As any startup CEO can tell you, recruiting experienced valley engineers when there is a short cash runway is a challenge, but fortunately Wave recently (~Q3 2017) closed a $56M+ Series D round (according to Crunchbase), so the company currently has the kind of capital muscle necessary to deliver to the market a sophisticated rack-mount system based on a proprietary 5+ GHz massively parallel dataflow processor fabbed on a bleeding edge process node. The future looks bright for Wave.
Wave has had the time to develop a significant IP portfolio, and has had the time to mature the software (a process that simply takes time). Immature software/tools and the resulting poor QOR have historically been one of the two main failure mechanisms for processor and reconfigurable-computing startups (the other is making the mistake of fabbing in an older, “cheaper” process node).
After a very long drought, the market and the investment community have (re)awakened to the value of dedicated silicon targeting specialized “embarrassingly parallel” workloads such as Deep Learning. Deep Learning has an infinite appetite for compute power, and we haven’t seen anything remotely like it since the early days of 3D graphics. There has been a significant uptick in investment activity for dedicated silicon solutions, with recent funding for Deep Learning silicon startups such as Cerebras, Cambricon, Graphcore, Mythic, and others – including Wave. There is a brief survey of Deep Learning silicon startups on a post on this site that can be found here: DEEP-LEARNING-SILICON-STARTUP-SURVEY
The Importance of Power: Managing power dissipation is crucial to the success of any Deep Learning training acceleration system. I would repeatedly implore the team at all-hands meetings that my biggest worry was power, and that every engineer, for every design decision he or she makes, should consider the power dissipation implications – it is the major reason I pushed for systemic adoption of a data driven asynchronous computing approach by Wave. Techniques such as placing data as close to computation as possible, putting as much memory on chip as possible to minimize off chip memory access, recomputing network parameters on the fly, various network pruning and pooling approaches, using integer instead of floating point, variable precision computing with stochastic rounding, asynchronous computing approaches, sub-threshold voltage operating techniques, analog computing approaches (as Deep Learning algorithms are relatively tolerant of some imprecision), and other more “neuromorphic” approaches – all of these techniques will be employed by players in this market. A brief review of Deep Learning silicon startups can be found on a post on this site here: BRIEF-REVIEW-OF-DEEP-LEARNING-CHIP-STARTUPS
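Of the techniques listed above, stochastic rounding is the easiest to illustrate. Rounding up with probability equal to the fractional remainder makes the quantization error zero-mean, so tiny gradient updates that round-to-nearest would erase at low precision still accumulate in expectation. A sketch (the quantization grid and update sizes are my own assumptions):

```python
# Sketch of stochastic rounding, one of the low-precision techniques
# listed above (illustrative; the grid and update sizes are assumptions).
import random

def stochastic_round(x, step=1.0):
    """Round x to a multiple of `step`, rounding up with prob = fractional part."""
    scaled = x / step
    floor = int(scaled // 1)
    frac = scaled - floor
    return (floor + (1 if random.random() < frac else 0)) * step

random.seed(0)
# Accumulate 1000 updates of 0.001 on a grid of step 0.01:
# round-to-nearest would stay at 0 forever, while stochastic
# rounding converges to ~1.0 in expectation.
acc = 0.0
for _ in range(1000):
    acc = stochastic_round(acc + 0.001, step=0.01)
print(round(acc, 2))   # close to 1.0 on average
```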
Moving On: I departed Wave in June 2016 as the last remaining founder. Derek Meyer, our acting VP Business Development consultant, took the helm.
Puzzling Acquisition of MIPS by Wave: After my departure, Wave spent a significant sum improving, and then moving into, a much larger and more expensive office, and then acquired MIPS – claiming synergies between the two companies that made little sense to me. Wave also took on substantial amounts of debt as well as other liabilities (known as “hair” in the trade). MIPS had sold the majority of its patent portfolio years prior and seemed more like a “dead man walking”, just milking the “long tail” of its licensing business. After raising some more funding, but then facing lawsuits (both threatened and filed), Wave ultimately filed for Chapter 11 in the spring of 2020.
I am proud to have founded and incubated Wave and then led the growth of the company as CEO for 6+ years under far more stringent capital constraints than other fabless semiconductor startups addressing the same market.
SOME TAKEAWAYS:
1) Optics are really important in startups. Unfortunately, optics can trump technical or development merit when it comes to fundraising. It is what it is. If a startup does not follow a “typical” funding cadence and development trajectory, it will create headwinds with potential new investors, customers, and new-hire prospects.
2) Seneca said that “Luck is what happens when preparation meets opportunity”. Start with differentiating and protectable technology that addresses a market with a large unmet need, build a solid IP base, build a great team and hold that team together, put one foot in front of the other and build value day by day, never forget what you have and make sure you remind the team of that – but most importantly stay in the game. You can’t seize opportunity if you are not there to do so.