Asynchronous Technology - Has Its Time Finally Arrived?
Asynchronous logic has something in common with fusion energy – they are both technologies that have always seemed to be “just around the corner”. There are a few isolated cases of adoption, including use in smart cards and, most notably, by Fulcrum in fast Ethernet switch silicon (Fulcrum was acquired by Intel). But recently the author has seen a significant increase in the adoption of asynchronous logic, in particular by a number of low-power edge/IoT inference chip startups. Wave Computing incorporated asynchronous logic into its technology kitbag with the acquisition of NCL and Theseus Research’s dynamically scheduled asynchronous data flowgraph architecture in late 2009.
The Relationship between Clock and Logic – an Unhealthy Co-dependency?: If one steps back and considers how clocked standard Boolean logic based digital systems function one is struck by the co-dependency between two disparate domains – the clock domain, and the logic domain. Neither can do anything useful without the other – but they are very different in structure, implementation and constraints. Clocks are highly conditioned signals with very fast rail-to-rail rise and fall times, they must be distributed throughout the chip with minimum skew and jitter and are input to high capacitance clocked state-holding elements. Clock power can easily be 50% of total power for high performance chips – and the primary clock generally has the highest duty cycle of any signal in the system.
Determining Completion of Computation: There is no way of knowing by simply observing classic Boolean logic (AND, OR, XOR gates etc) outputs when a computation is completed, especially since logic outputs will glitch as signals taking different paths through the logic arrive at different times to various gates. Hence standard logic cannot self-coordinate – so in a way the clock is there to SLOW DOWN the computation occurring in the logic by erecting spatial/temporal fences using state-holding elements such as latches or registers that are “opened” and “closed” by the clock.
Glitching and Glitch Power in Standard Boolean Logic: Standard Boolean logic, on the other hand, is generally designed for fast evaluation speed and minimum area. However, the output glitching that occurs in standard logic as arriving inputs are evaluated generates “glitch power”, because small delays in the time of arrival of inputs result in short duration DC “crowbar” currents and large fast swings in the gate outputs. In certain functional blocks, such as multiplier adder trees, glitch power can be a significant percentage (>50%) of overall power.
Hence one role of the clock is to prevent glitches from traveling farther than the temporal fence (registers etc) until the logic has finished evaluating, the glitching has stopped, and the system has “settled”.
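To make the glitch mechanism concrete, here is a minimal sketch in Python. It is a toy unit-delay model of my own choosing, not a circuit simulation: the gate, the delay value, and all names are hypothetical. It evaluates y = a AND (NOT a), which is logically always 0, but lets the inverter output lag the input by one time step.

```python
def and_with_inverted_self(a_wave, inv_delay=1):
    """Evaluate y = a AND (NOT a) at each time step.

    Logically y is always 0, but the inverter output lags the input by
    `inv_delay` steps (a hypothetical unit-delay model), so the AND gate
    briefly sees both of its inputs high when `a` rises.
    """
    y = []
    for t, a in enumerate(a_wave):
        a_seen_by_inverter = a_wave[max(t - inv_delay, 0)]
        not_a = 1 - a_seen_by_inverter
        y.append(a & not_a)
    return y


a_wave = [0, 0, 1, 1, 1, 1]            # input rises at t = 2
y_wave = and_with_inverted_self(a_wave)

# Time steps at which the output is (incorrectly) high.
glitch_steps = [t for t, y in enumerate(y_wave) if y == 1]

# What a register would capture if it samples only after the logic settles.
registered_value = y_wave[-1]
```

The output pulses high for one step when the input rises, and a register that samples only after settling never sees that pulse. That is the “temporal fence” role of the clock described above.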
This co-dependent relationship between clock and logic seems rather unnatural and forced – but it has been rich fodder for the EDA tools vendors as there has been substantial investment in the tools necessary to shore up the relationship – including clock gating tools, timing analyzers, setup and hold time analyzers, tools to measure clock skew, clock jitter, optimizers to fit logic evaluation within the allowed clock period, etc.
Increasing RC Delays and the Performance Plateau: As the industry has followed Moore’s law and steadily marched down into finer process geometries the behavior of clock and data has further diverged and driven a deeper wedge into the co-dependency. Rapidly rising sheet resistance of metal interconnect, while not causing significant issues for short distance communications associated with local logical computations, is a big headache for communicating long distance signals such as the clock due to increasing RC delay. In part because of these issues (and because POWER dissipation has become such a problem) microprocessor clock speeds stalled in ~2003, but the computation performance of localized blocks of logic continues to track Moore’s law reasonably well.
Goals for Asynchronous Logic: Asynchronous logic is an attractive alternative for mitigating the issues discussed above. Its goals include one or more of the following:
- Reducing power consumption
- Increasing the voltage operating range
- Increasing system performance (speed)
- Increasing process insensitivity
- Increasing system security by making the system less susceptible to side channel attack, such as DPA (Differential Power Analysis).
The Muller C-Element: A logic element used in many flavors of asynchronous logic is the Muller C-element, first described in 1955 by David Muller. Muller C-elements exhibit hysteresis – implying they must be state holding logic elements. For the element to transition high – all of its inputs must go high. The element output will then stay high (hold its state) until all the inputs transition low.
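The behavior just described can be sketched as a small behavioral model in Python. This is an illustration only; the class name and the two-input stimulus are my own choices, not part of any library.

```python
class CElement:
    """Behavioral model of a Muller C-element with any number of inputs."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if all(inputs):          # every input high: output goes high
            self.out = 1
        elif not any(inputs):    # every input low: output goes low
            self.out = 0
        # Mixed inputs: hold the previous output (hysteresis).
        return self.out


c = CElement()
stimulus = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
trace = [c.update(*pair) for pair in stimulus]
```

The output rises only on (1, 1), holds through (0, 1), and falls only when both inputs have returned low.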
Merging Data and Clock: There are many different flavors of asynchronous logic and associated design methodologies, including Huffman, bundled data and micropipelines (Sutherland), but I will not review them here. For me there is a key dividing line between the various types of asynchronous logic, and that is whether there are control signals, separate from the data, which coordinate behavior. I believe that the most robust, densest, fastest, and lowest power asynchronous logic is one where the data is the clock – where the data and the clock are merged into a unified paradigm. This can be realized by efficiently merging the state-holding behavior of the Muller C-element with a dual-rail M of N (M ≤ N) thresholding logic that is as general as the standard Boolean logic it replaces.
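As a rough sketch of what an M of N threshold gate with hysteresis does, consider the following behavioral model in Python. The class and variable names are hypothetical, and the model says nothing about the transistor-level implementation: the output goes high once at least M of the N inputs are high, and then holds until all N inputs have returned low.

```python
class ThresholdGate:
    """Behavioral model of an M-of-N threshold gate with hysteresis."""

    def __init__(self, m, n):
        if not 1 <= m <= n:
            raise ValueError("require 1 <= m <= n")
        self.m = m
        self.n = n
        self.out = 0

    def update(self, *inputs):
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        asserted = sum(inputs)
        if asserted >= self.m:   # threshold met: set the output
            self.out = 1
        elif asserted == 0:      # all inputs low: reset the output
            self.out = 0
        # Otherwise hold the previous output (state-holding behavior).
        return self.out


# A 2-of-3 gate: sets on two inputs, holds on one, resets on none.
th23 = ThresholdGate(2, 3)
th23_trace = [th23.update(*v)
              for v in [(1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 0)]]

# With M == N the gate behaves like a Muller C-element.
th22 = ThresholdGate(2, 2)
th22_trace = [th22.update(*v)
              for v in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
```

Note that the M = N case reduces to the C-element described earlier, which is one way to see the thresholding logic as a generalization of it.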
Ideal Behavior in an Asynchronous Logic: But to meet the goals above and realize an asynchronous logic that merges the data and the clock, we would ideally like to see the following behavior and characteristics from the logic:
- It must be glitch free
- There must be a completeness indication
- It must be fully general (be as expressive as standard Boolean)
- It must be fully standard CMOS compatible
- It must be delay, voltage, and temperature insensitive
NCL (Null Convention Logic): An asynchronous logic which exhibits this ideal behavior is NCL (Null Convention Logic) – which was invented by Karl Fant some 20 years ago. Wave acquired NCL in 2009 and has since evolved and refined NCL into multiple versions that are now known collectively as WTL (Wave Threshold Logic). The improvements to NCL that led to the multiple libraries comprising WTL, as well as WTL-BB (WTL Body Bias), which uniquely leverages the capabilities of FDSOI technology, were developed primarily by GP Singh at Wave and are patent protected. WTL-BB is presented in the slide show at the bottom of this post.
NCL is a dual rail logic, where each bit is encoded on two wires – this allows the creation of a crucial new state associated with the data – which is the NOT DATA or NULL state.
To be precise though – NCL is not entirely delay insensitive: there is a hazard that can arise in the layout and interconnection of the logic called the orphan gate. In practice however this hazard is easy to check for, and is less of a concern than the isochronic fork hazard found in other asynchronous logic such as the style generally known as QDI (Quasi Delay Insensitive), and Wave has found NCL/WTL to be basically bullet-proof, enabling a “compose and forget” approach to design. The 28nm test chip that Wave built and tested in early 2013 operated robustly anywhere from 300 mV to 2 V, with operating speed strictly depending on supply voltage. Prior to Wave’s test chip there were some 20 test chips built in NCL that all worked on first silicon.
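A minimal sketch of dual-rail encoding in Python follows. The rail ordering (rail 0 first, rail 1 second) and all names here are assumptions made for illustration; the essential point is only that “neither rail asserted” is a distinct NULL state and that both rails asserted is never a legal value.

```python
# Each bit is a pair (rail0, rail1). The ordering is an assumed convention.
NULL = (0, 0)    # no data present
DATA0 = (1, 0)   # logic 0: rail0 asserted
DATA1 = (0, 1)   # logic 1: rail1 asserted


def encode_bit(bit):
    return DATA1 if bit else DATA0


def decode_bit(pair):
    if pair == DATA0:
        return 0
    if pair == DATA1:
        return 1
    if pair == NULL:
        return None  # NULL carries no value
    raise ValueError("both rails asserted is not a legal dual-rail value")


def wavefront_state(word):
    """Classify a multi-bit word as fully NULL, fully DATA, or in between."""
    if all(pair == NULL for pair in word):
        return "NULL"
    if all(pair in (DATA0, DATA1) for pair in word):
        return "DATA"
    return "PARTIAL"  # still arriving (or contains an illegal pair)


word = [encode_bit(b) for b in (1, 0, 1)]
decoded = [decode_bit(pair) for pair in word]
```

Because every bit announces its own arrival, a word can be observed to be complete (“DATA”), empty (“NULL”), or still in flight (“PARTIAL”) simply by looking at the data wires themselves.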
The monotonic (glitch free) behavior of WTL facilitates high speed as it enables optimization focus on high-to-low voltage transitions in the logic and fosters a “seamless” domino style circuit design.
NCL Logic Area Overhead: A significant area penalty is normally associated with asynchronous logic. In practice Wave found the area overhead of WTL to be in the 30%-40% range, depending on the function being implemented. Since WTL is state holding, all state holding elements found in normal logic, such as registers, flip-flops, and latches are eliminated. Hence for high speed logic, where there is a high percentage of registers relative to the depth of logic between the registers, the area of a WTL implementation can actually be less than that of the clocked equivalent.
Dual rail encoding needs more metal as there are more signal wires, but modern processes have 11-13 layers of metal, there is no clock to route, and supply rail metal can generally be thinner (lower continuous as well as instantaneous current demands) – hence the increase in metal does not strongly impact overall NCL/WTL logic area.
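The register-elimination argument above can be turned into a back-of-the-envelope break-even estimate. The sketch below rests on a simplifying assumption of mine that the source does not state: that the 30%-40% overhead applies only to the combinational logic, and that the registers disappear entirely. Under that reading, if a fraction r of the clocked design's area is registers, the WTL area relative to the clocked area is (1 + k)(1 - r) for overhead k, so WTL is smaller whenever r > k / (1 + k). The function names are hypothetical.

```python
def wtl_area_ratio(register_fraction, overhead):
    """WTL area divided by clocked area, under the assumptions above.

    register_fraction: share of the clocked design's area spent on registers.
    overhead: WTL logic area overhead, e.g. 0.3 for 30%.
    """
    logic_fraction = 1.0 - register_fraction
    return (1.0 + overhead) * logic_fraction


def break_even_register_fraction(overhead):
    """Register share above which the WTL version is the smaller one."""
    return overhead / (1.0 + overhead)


low = break_even_register_fraction(0.3)    # roughly 0.23
high = break_even_register_fraction(0.4)   # roughly 0.29
```

On this simplified model, a clocked design that spends somewhere around a quarter or more of its area on registers is where WTL would start to come out ahead. Real numbers depend on the function and the library, as noted above.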
The schematic below shows two “functionally equivalent” circuits. These circuits use roughly the same number of transistors but the NCL circuit (the upper one) does not require a clock (the fairly significant logic/area cost associated with generating, conditioning, and controlling/gating the clock is not counted in the lower circuit), and will evaluate as fast as the logic will allow for the given voltage and temperature.
Diagram Comparing NCL 2-stage 1-bit Pipeline with Cascaded D Flip-Flops
To provide the reader a rough sense of the relative size of some WTL vs Boolean standard cells, the figure below shows to-scale area comparisons, in the same process, of standard 28nm TSMC standard cells vs 28nm WTL cells of roughly the same output drive strength, from several different WTL libraries. For WTL cells the input transistors are almost always minimum size, with cell drive strength determined by a single final stage output inverter. The constant minimum size of WTL gate inputs also means gate input capacitance is generally lower than Boolean equivalents, contributing to lower power and higher speed.
Example Std Cell Area Comparison of 28nm Boolean vs 28nm WTL for Two WTL Libraries
The NULL Cycle: NCL uses four-phase signaling, which means the completion signal (Ready For Data or RFD as shown above) that is fed back to the previous stage to apply back pressure also resets the logic by propagating a NULL wavefront after the data evaluation. Hence each propagating DATA wavefront is separated by a NULL wavefront. This means at face value that throughput is ½ of what might be expected, however it turns out in practice that much of the NULL cycle overhead can be hidden via micro-pipelining and using the “self-resetting” version of WTL called Flash WTL. One can argue that the Ready for Data feedback signal is a form of control and is not purely data – and there is some merit to that – but this is the cleanest approach I have seen to controlling evaluation flow to prevent evaluation wavefronts from overrunning each other. The logic depth of each “cycle” in the logic, that is the number and depth of gates encompassed by an acknowledgment signal, is an arbitrary design choice – similar to a choice of clock rate in a synchronous design. One can think of this logic depth as akin to a clocked system pipeline depth (number of gates between registers) – but NCL also enables “pipelining” in multiple dimensions. In the example above that 1D depth is a minimum value of 1 gate.
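The alternation of DATA and NULL wavefronts can be sketched with a toy model of a single stage in Python. This is an abstraction for illustration, not a description of Wave's circuits: the class name, the use of `None` to stand for a NULL wavefront, and the boolean `ready_for_data` flag are all my own assumptions.

```python
class NclStage:
    """Toy model of one stage using four-phase (DATA / NULL) signaling."""

    def __init__(self):
        self.ready_for_data = True   # models the RFD feedback signal
        self.accepted = []           # DATA values that passed through
        self.wavefronts = 0          # total wavefronts accepted (DATA + NULL)

    def offer(self, wavefront):
        """Offer a wavefront; None stands for NULL. Returns True if accepted."""
        is_data = wavefront is not None
        if is_data != self.ready_for_data:
            return False             # back pressure: wrong kind of wavefront
        if is_data:
            self.accepted.append(wavefront)
        self.wavefronts += 1
        self.ready_for_data = not is_data   # expect the opposite kind next
        return True


stage = NclStage()
# The second DATA value is refused until a NULL wavefront has passed.
results = [stage.offer(w) for w in (7, 8, None, 8, None)]
```

Two data values cost four wavefronts in this model, which is the face-value halving of throughput mentioned above, before any of the NULL cycle is hidden by pipelining.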
Other Benefits with Asynchronous Logic: Other benefits of asynchronous logic include much lower electromagnetic and radio frequency emissions, reduced metal migration concerns, and reduced requirements for on-chip bypass capacitance. The last two are very helpful in modern Deep Submicron nodes as FinFET gates have much higher capacitance than planar gates, thereby increasing capacitance seen by the clock drivers and hence increasing switching currents, metal migration issues, and requirements for on-chip bypass capacitance. These benefits all come from the fact there is no fast rise/fall time clock driving all the high capacitance inputs and clock lines at once, which results in significant periodic current spikes in the chip power supply rails. Another benefit, which applies in particular to WTL because the gates are state-holding, is that scan path insertion is straightforward with minimum area penalty.
So What’s the Holdup?: In a word – Tools. Tools and integration with existing chip design workflows. Early in Wave’s life we showed WTL and the benchmarks we had done (which included the conversion of an ultra low-power 16-bit hearing aid DSP) to a large US semiconductor OEM who said – “Wow, this is really great – come back to me when you can compile million gate designs from standard untouched clocked Verilog and integrate seamlessly into our tools flow and we will adopt this!”. These requirements can be met, and Wave made significant strides on the compilation side, but ultimately this is a ~$25M+ EDA tools company proposition to deal with continued compiler development and maturation, cell library development/porting, DFT, logic optimization, integration with current timing analysis/timing closure flows, shuttle runs, etc. That was not Wave’s mission. But if what you want to do is build a few highly optimized full custom tiles that are stepped and repeated to build a fabric – well then it can work just fine as a “secret sauce under the hood”.
Stanford SystemX Presentation on WTL: Below is a video of a presentation on WTL that I gave at a Stanford SystemX seminar in 2015.
WTL-BB for FDSOI: WTL is exceptionally well suited to an FD-SOI (Fully Depleted Silicon on Insulator) chip fabrication technology that has been developed by STMicroelectronics. This technology provides the equivalent of a “second gate” on the underside of the planar transistors using Body Bias (BB). The Ready For Data (RFD) acknowledgment signal in NCL/WTL can be input into transistor footers to “gate off” gates (isolate them from the power rails) when that block of logic is waiting for new data. The best isolation is achieved using standard voltage threshold transistors (RVT) under Reverse Body Bias (RBB). Since the circuits spend most of their time turned off, low threshold transistors (LVT) under strong Forward Body Bias (FBB) can be used for the forward data evaluation path transistors. These transistors will be blazingly fast, but will leak a lot – but we don’t care because they are immediately switched off by the gated footers upon data evaluation by the logic. Essentially this gives the designer the equivalent of ultra-fine grained power gating (and since there is no clock – clock gating) “for free”. Neither the designer nor the tools need to “know” or “manage” this capability.
The slide show below is a presentation I gave on WTL-BB at the FDSOI Forum in Shanghai in 2014.