# Performance Comparison between Self-timed Circuits and Synchronous Circuits Based on the Technology Roadmap of Semiconductors

Masashi Imai, Takashi Nanya

Komaba Open Laboratory, Research Center for Advanced Science and Technology, The University of Tokyo
4–6–1 Komaba, Meguro-ku, Tokyo, 153–8904, Japan
{miyabi, nanya}@hal.rcast.u-tokyo.ac.jp

#### **Abstract**

Synchronous circuits must have a large clock margin to guarantee correct operation in the presence of delay variations, which are becoming a main issue in the Moore's Law-based trend in VLSI chip development. Self-timed asynchronous circuits are a feasible solution to these timing-related problems, although they suffer from large energy dissipation because signal transitions occur in all bits every cycle. The purpose of this study is to compare self-timed asynchronous circuits with synchronous circuits from the viewpoint of speed performance and energy dissipation in future technologies based on the Technology Roadmap of Semiconductors. The results show that the cycle time of synchronous circuits becomes large as the process feature size decreases, since the maximum delay becomes large due to large variations, while the cycle time of self-timed circuits does not become so large, since it depends on the average delay. We conclude that self-timed asynchronous circuits are effective in future technologies.

# 1 Introduction

As VLSI technology advances, delay variations due to fabrication process variations and environmental changes become more serious [1]. Many factors cause delay variations in VLSI design and implementation. They are typically classified as "systematic variations" and "random variations" [1, 2]. With systematic variations, the delay variations in local parts of a chip can be assumed to be correlated. Thus, timing-dependent circuits that utilize delay information, such as traditional synchronous circuits and asynchronous bundled-data transfer circuits [3], may be effective from the viewpoint of performance.
On the other hand, among the many variation factors, it has been recognized that random factors such as random process variations [2], supply voltage noise, heat generation in each device, and crosstalk noise caused by dynamic transitions between adjacent wires will become main issues in future technologies [1]. These random variations exhibit almost complete randomness even between neighboring devices. Thus, the timing margin that guarantees the correct ordering of signal transitions must be decided under the worst conditions, in which the timing signal is fast and the corresponding combinational circuit is slow. Consequently, timing-dependent circuits must have a large margin to guarantee correct operation, resulting in slow circuits.

Self-timed asynchronous circuits based on the Quasi-Delay-Insensitive (QDI) model [3] are a feasible solution to the timing margin problems, since they operate in accordance with the actual delay variations. In QDI-model-based design, data input to and output from combinational circuits must be encoded in a delay-insensitive manner. Some encoding methods are as follows: Delay-Insensitive-Minterm-Synthesis [4], in which Muller's C-elements generate all minterms of the input variables; Null Convention Logic circuits, which use m-of-n threshold gates [5]; dual-rail encoded circuits, in which a logical value is represented by two physical lines [3, 6-8]; 1-out-of-4 encoded circuits, in which only one bit line of every four physical lines makes a rising transition and a falling transition every cycle [9, 10]; and heterogeneous m-out-of-n encoded circuits [11]. It has been recognized that these circuits tolerate any delay variations with low performance overhead, while they suffer from large energy dissipation due to signal transitions in all the bits every cycle.

The purpose of this study is to compare self-timed circuits and synchronous circuits in the Moore's Law-based trend in chip development.
We show some evaluation results using 90nm, 65nm, 45nm, and 32nm process technologies.

The remainder of this paper is organized as follows. We first explain technology models and parameters for future technologies in Section 2. Second, we explain the compared circuits and evaluation conditions in Section 3. Then, we show some evaluation results for the future technologies in Section 4. Finally, we describe our conclusions in Section 5.

# 2 Technology models and parameters

A number of prediction models for VLSI design and implementation have been proposed [1, 12, 13]. In order to evaluate performance under the Moore's Law-based trend in CMOS technology development, appropriate transistor models and parameters are needed.

The International Technology Roadmap for Semiconductors (ITRS) is a well-known assessment of the semiconductor industry's technology requirements [1]. The objective of the ITRS is to ensure cost-effective advancements in the performance of integrated circuits. The ITRS describes many technology challenges and needs for future technologies; however, it does not provide transistor models.

The Predictive Technology Model (PTM) provides SPICE-compatible parameters for future technology generations [12] and an online tool that can customize parameters according to the user's technology specifications. It provides not only nominal model files but also variational model files for process corners. Thus, we can evaluate the variance of delays in the future technologies.

In this paper, we refer to the process parameters in the Process Integration, Devices & Structures chapter of the ITRS and use the customization tool to generate SPICE models, in order to compare performance based on the ITRS. Table 1 shows the technology specifications based on the ITRS [1, 14] that we used to customize parameters in the PTM. We evaluate the performance of typical functional circuits using both high performance transistor models and low standby power transistor models.
In Table 1, the transistor model "HP" means a High-Performance model, which refers to chips of high complexity, high performance, and high power dissipation. The transistor model "LSP" means a Low-Standby-Power model, which refers to chips of lower performance with the lowest possible static power dissipation, for mobile systems where the allowable leakage currents are limited. $L_{\rm eff}$, $V_{\rm th}$, $V_{\rm dd}$, $T_{\rm ox}$, and $R_{\rm dsw}$ mean effective channel length, threshold voltage, power supply voltage, oxide thickness, and effective parasitic series source/drain resistance, respectively. We use these parameters to make SPICE models and use the HSPICE analog simulator for performance comparison.

Table 1. Technology parameters.

| | 90nm HP | 90nm LSP | 65nm HP | 65nm LSP | 45nm HP | 45nm LSP | 32nm HP | 32nm LSP |
|---|---|---|---|---|---|---|---|---|
| $L_{\rm eff}$ [nm] | 32 | 32 | 25 | 25 | 18 | 18 | 13 | 13 |
| $V_{\rm th}$ [V] | 0.195 | 0.482 | 0.134 | 0.534 | 0.103 | 0.535 | 0.093 | 0.547 |
| $V_{\rm dd}$ [V] | 1.1 | 1.2 | 1.1 | 1.2 | 1.0 | 1.1 | 0.9 | 0.95 |
| $T_{\rm ox}$ [nm] | 1.2 | 2.1 | 1.1 | 1.9 | 0.65 | 1.4 | 0.5 | 1.1 |
| $R_{\rm dsw}$ [$\Omega$] | 180 | 180 | 200 | 180 | 180 | 180 | 170 | 180 |

We evaluate the performance under the standard case conditions shown in Table 2 (a), and under the worst case conditions and the best case conditions shown in Table 2 (b) and Table 2 (c), in order to evaluate the variance of delays. In Table 2, PMOS transistors and NMOS transistors with average performance are taken as the standard for the evaluation, and this pair is designated as (PMOS, NMOS) = (Center, Center). The process parameters "Slow" and "Fast" mean the $+3\sigma$ and $-3\sigma$ device corner models, where $\sigma$ is the standard deviation.
We assume that the standard value of the supply voltage is the value of $V_{\rm dd}$ in Table 1, and that the worst value and the best value are the standard value minus 0.1 volts and plus 0.1 volts, respectively. We also assume that the temperature under the standard case conditions is 50 degrees Celsius, and that the worst case and the best case are 100 degrees Celsius and 25 degrees Celsius, respectively.

Table 2. Standard conditions and parameter variations.

| | (a) Standard | (b) Worst | (c) Best |
|---|---|---|---|
| Process (PMOS, NMOS) | (Center, Center) | (Slow, Slow) | (Fast, Fast) |
| Supply voltage [V] | Standard value (Table 1: $V_{\rm dd}$) | Standard value − 0.1 | Standard value + 0.1 |
| Temperature [°C] | 50 | 100 | 25 |

# 3 Circuit comparison

## 3.1 Self-timed circuit

Figure 1 (a) shows a typical Register Transfer Level (RTL) structure of traditional synchronous circuits, in which source latches and destination latches are all updated at the same time by being synchronized with the global clock. On the other hand, in self-timed asynchronous circuits, the destination latches may be updated independently based on the request-and-acknowledge handshake protocol, as shown in Figure 1 (b).

Figure 1. Typical RTL structures.

There are mainly two handshake protocols for transferring data between latches based on the QDI model, as shown in Figure 2. In the 2-phase handshake protocol, transition signaling circuits [15] can be used for handshake circuits, but it is difficult to apply these circuits to combinational circuits. Thus, in this paper, we use the 4-phase handshake protocol, which has a working phase and an idle phase. In the working phase, encoded data is processed in a combinational circuit. In the 4-phase handshake protocol, combinational circuits must be initialized before the next data is inserted.
The spacer, in which all symbols are zero, flows through the circuit to initialize it in the idle phase.

In QDI-model-based circuits, data input to and output from combinational circuits must be encoded in a delay-insensitive manner. Among the many coding methods mentioned in Section 1, in this paper we focus on the dual-rail encoded circuits and the 1-out-of-4 encoded circuits using standard cell libraries that contain static CMOS gates [10], in order to compare them with synchronous circuits using the same standard cell libraries.

Figure 2. Handshake protocols.

## 3.2 Evaluation setup

We used the Synopsys Design-Analyzer for technology mapping, the Synopsys Astro for CTS (Clock-Tree-Synthesis) and place-and-routing, the Mentor Calibre for extracting SPICE files from the placed-and-routed data, and the HSPICE analog simulator for performance evaluation. When we made floorplans of the synthesized circuits, we assumed a cell utilization of 60%.

We designed a 32-bit Ripple-Carry-Adder (RCA) circuit for the self-timed circuits and, for the synchronous circuits, a 32-bit Carry-Lookahead-Adder (CLA) circuit synthesized by the Synopsys Design-Analyzer. We first synthesized an add circuit without timing constraints and obtained a minimum-area circuit. Then, we imposed timing constraints on the circuit, ranging from 0.1 ns to the latency of the minimum-area circuit, and obtained a set of synthesized circuits. Finally, we selected the circuit with the minimum "delay × power" among the synthesized circuits. In addition, we designed a 32-bit Binary-tree Carry-Lookahead-Adder (BLA) [16], which is the fastest circuit using CMOS standard cell libraries.

We assumed that the cycle time of synchronous circuits is calculated as follows: "the maximum delay of combinational circuits" + "the set-up time of latch circuits" + "the delay of writing data in latch circuits."
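The cycle-time assumption above is a plain sum of three terms. The following is a minimal sketch of that calculation; the function name `sync_cycle_time`, its parameter names, and the numeric delays are hypothetical placeholders for illustration, not values from the evaluation.

```python
def sync_cycle_time(path_delays_ns, latch_setup_ns, latch_write_ns):
    """Cycle time of a synchronous stage, following the three-term sum in the text.

    path_delays_ns: delays of the combinational paths (hypothetical input);
                    the maximum of these is the first term.
    latch_setup_ns: set-up time of the latch circuits.
    latch_write_ns: delay of writing data into the latch circuits.
    """
    if not path_delays_ns:
        raise ValueError("at least one combinational path delay is required")
    return max(path_delays_ns) + latch_setup_ns + latch_write_ns


# Illustrative placeholder numbers only (not measured results).
example_cycle_ns = sync_cycle_time([0.5, 1.0, 0.75], 0.25, 0.25)
print(example_cycle_ns)
```

Only the slowest combinational path contributes to the result, which is why a single slow path sets the clock period for the whole stage.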
When we evaluated the delay of synchronous circuits, we assumed the device parameters shown in Table 2 (b), since a clock cycle must be decided under the worst case conditions. When we evaluated self-timed circuits, we assumed that the cycle time, which contains the delay of a working phase and the delay of an idle phase, is the average delay over 100 random inputs based on the 4-phase handshake protocol, since the performance of self-timed circuits is determined by the average delay.

Dynamic energy generally depends on input vectors. In this paper, we evaluated the total energy under the standard conditions shown in Table 2 (a) as the average value over the same 100 random input vectors using the HSPICE simulator. It can be assumed that the switching factor of the input vectors is about 0.5. Static energy also depends on input vectors. We evaluated the static energy of synchronous circuits as the average value over 100 random inputs using the HSPICE simulator. On the other hand, when we evaluated the static energy of self-timed circuits, we used the spacer as the input vector, since it can be assumed that the circuit is in the idle phase.

# 4 Simulation results and discussions

## 4.1 Cycle-time comparison

Figure 3 and Figure 4 show the cycle time of add function circuits using High-Performance transistors and Low-Standby-Power transistors, respectively. The horizontal axis represents the process feature size (nm) and the vertical axis represents the cycle time (ns). In Figure 3 and Figure 4, "Sync-CLA", "Sync-BLA", "Self-Dual", and "Self-1o4" represent the 32-bit synchronous CLA circuit, the 32-bit synchronous Binary-tree CLA circuit, the dual-rail encoded 32-bit asynchronous RCA circuit, and the 1-out-of-4 encoded 32-bit asynchronous RCA circuit, respectively.

From Figure 3, it can be seen that the cycle time of both synchronous circuits and self-timed circuits becomes smaller with scaling while the feature size is larger than 32nm, and becomes larger in the 32nm process technology.
It can also be seen that, in the 32nm process technology, the cycle time of the synchronous CLA circuit, which has long critical paths, becomes about 0.9ns larger than in the 45nm process technology, while the cycle time of the synchronous BLA circuit and the self-timed circuits becomes about 0.1ns larger than in the 45nm process technology. It can be considered that self-timed circuits and small-depth circuits, in which the number of gates in critical paths is small, are effective in the future process technologies.

Figure 3. Delays of add function circuits using High-Performance transistors.

Figure 4. Delays of add function circuits using Low-Standby-Power transistors.

From Figure 4, it can be seen that the cycle time of both synchronous circuits and self-timed circuits becomes large as the feature size decreases. It can also be seen that the delay of the synchronous CLA circuit in the 32nm process technology is 7.85 times larger than that in the 90nm process technology, and that the delay of the synchronous BLA circuit is larger than that of the self-timed circuits in the 32nm process technology. This indicates that the variation of Low-Standby-Power transistors is larger than that of High-Performance transistors and that it directly affects the performance of synchronous circuits.

From Figure 3 and Figure 4, it seems reasonable to consider that self-timed asynchronous circuits are effective from the viewpoint of speed performance in the future technologies.

Figure 5. Delay variations of add function circuits using High-Performance transistors.

Figure 6. Delay variations of add function circuits using Low-Standby-Power transistors.

Figure 5 and Figure 6 show the delay variations of add function circuits using High-Performance transistors and Low-Standby-Power transistors, respectively. The vertical axis represents the scaling ratio [17], which is the ratio of the varied delay in each condition to the delay in the standard condition shown in Table 2 (a).
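The scaling ratio defined above is a simple quotient, and a short sketch may make the normalization concrete. The function name `scaling_ratio`, the dictionary keys, and the delay values below are assumptions chosen for illustration; they are not data from Figure 5 or Figure 6.

```python
def scaling_ratio(varied_delay_ns, standard_delay_ns):
    """Ratio of the delay under a given condition to the delay under the standard condition."""
    if standard_delay_ns <= 0:
        raise ValueError("standard delay must be positive")
    return varied_delay_ns / standard_delay_ns


# Hypothetical delays of one circuit under the three conditions of Table 2.
delays_ns = {"standard": 2.0, "worst": 3.0, "best": 1.5}

# Normalize every condition by the standard-condition delay.
ratios = {
    condition: scaling_ratio(delay, delays_ns["standard"])
    for condition, delay in delays_ns.items()
}
print(ratios)
```

By construction the standard condition always maps to a ratio of 1, so circuits with different absolute delays can be compared on the same axis.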
From Figure 5 and Figure 6, it can be seen that the delay variance of the circuits using High-Performance transistors remains almost the same. In contrast, the delay variance of the circuits using Low-Standby-Power transistors increases as the feature size decreases. It can also be seen that the delay variance of the synchronous BLA is smaller than that of the others; in other words, the delay variance depends on the logic depth of the combinational circuits when we use the corner models of delay variations.

## 4.2 Energy dissipation

Figure 7 and Figure 8 show the average energy dissipation per cycle, which includes both dynamic energy and static energy, using High-Performance transistors and Low-Standby-Power transistors, respectively. The horizontal axis represents the process feature size (nm) and the vertical axis represents the energy dissipation (pJ) per cycle. From Figure 7 and Figure 8, it can be seen that the energy dissipation of the circuits using High-Performance transistors increases as the feature size decreases, while that of the circuits using Low-Standby-Power transistors decreases. It can also be seen that the difference between synchronous circuits and self-timed circuits decreases. From Figure 7 and Figure 8, it seems reasonable to consider that self-timed asynchronous circuits are also effective from the viewpoint of energy dissipation in the future technologies.

Figure 7. Energy dissipation of add function circuits using High-Performance transistors.

Figure 8. Energy dissipation of add function circuits using Low-Standby-Power transistors.

Figure 9 and Figure 10 show the leakage current using High-Performance transistors and Low-Standby-Power transistors, respectively. The vertical axis represents the leakage current ($\mu$A). From Figure 9 and Figure 10, it can be seen that the leakage current increases as the feature size decreases and that the leakage current of High-Performance transistors is about 175 to 280 times larger than that of Low-Standby-Power transistors.

Figure 9. Leakage current of add function circuits using High-Performance transistors.

Figure 10. Leakage current of add function circuits using Low-Standby-Power transistors.

Generally, leakage current depends on the number of transistors. Table 3 shows the area of the placed-and-routed circuits in the 90nm process technology and the number of transistors. The number of transistors of self-timed circuits tends to become large compared with the synchronous circuits, as shown in Table 3. This is a drawback of self-timed circuits in future technologies. Consequently, it can be considered that methods for reducing the leakage current of self-timed circuits will become a main issue in the future technologies.

Table 3. Area and the number of transistors of placed-and-routed circuits.

| | Sync-CLA | Sync-BLA | Self-Dual | Self-1o4 |
|---|---|---|---|---|
| Area [$\mu\mathrm{m}^2$] | 3658 | 9145 | 6706 | 7316 |
| # of transistors | 4400 | 9736 | 6374 | 7882 |

# 5 Conclusion

Many factors cause delay variations in VLSI design and implementation. Among them, it has been recognized that random delay variations, which exhibit almost complete randomness even between neighboring devices, will become main issues in the future technologies. In such a situation, timing-dependent circuits must have a large margin to guarantee correct operation, resulting in slow circuits. Asynchronous self-timed circuits are a feasible solution to the timing margin problems.

In this paper, we have compared traditional synchronous adder circuits with self-timed asynchronous adder circuits using the same standard cell libraries, based on the Technology Roadmap of Semiconductors. We have shown that the cycle time of synchronous circuits becomes large as the process feature size decreases, since the maximum delay becomes large due to large variations, while the cycle time of self-timed circuits does not become so large, since it depends on the average delay.
We have also shown that the difference between the energy dissipation of self-timed circuits and that of synchronous circuits decreases as the feature size decreases. As a result, we can conclude that self-timed asynchronous circuits are effective from the viewpoint of both speed performance and energy dissipation in the future technologies.

#### **Acknowledgments**

This work was supported in part by Japan Society for the Promotion of Science Grant-in-Aid for Scientific Research (19700039) and (19300009). It was partially supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys, Inc. and Cadence Design Systems, Inc.

#### References

- [1] International technology roadmap for semiconductors. http://public.itrs.net/, The 2007 Edition.
- [2] Shin-ichi Ohkawa, Masakazu Aoki, and Hiroo Masuda. Analysis and characterization of device variations in an LSI chip using an integrated device matrix array. *IEEE Transactions on Semiconductor Manufacturing*, Vol. 17, No. 2, pp. 155–165, May 2004.
- [3] Scott Hauck. Asynchronous design methodologies: An overview. *Proceedings of the IEEE*, Vol. 83, No. 1, pp. 69–93, January 1995.
- [4] Jens Sparsø and Steve Furber, editors. *Principles of Asynchronous Circuit Design: A Systems Perspective*. Kluwer Academic Publishers, 2001.
- [5] Karl M. Fant and Scott A. Brandt. NULL convention logic: A complete and consistent logic for asynchronous digital circuit synthesis. In *International Conference on Application-specific Systems, Architectures, and Processors*, pp. 261–273, 1996.
- [6] Danil Sokolov, Julian Murphy, Alexander Bystrov, and Alex Yakovlev. Design and analysis of dual-rail circuits for security applications. *IEEE Trans. Computers*, Vol. 54, No. 4, Apr. 2005.
- [7] L. A. Plana, P. A. Riocreux, W. J. Bainbridge, A. Bardsley, J. D. Garside, and S. Temple. SPA - a synthesisable Amulet core for smartcard applications. *Proc. Async 2002*, pp. 201–210, Apr. 2002.
- [8] Akihiro Takamura, Masashi Kuwako, Masashi Imai, Taro Fujii, Motokazu Ozawa, Izumi Fukasaku, Yoichiro Ueno, and Takashi Nanya. TITAC-2: An asynchronous 32-bit microprocessor based on scalable-delay-insensitive model. In *Proc. International Conf. Computer Design (ICCD)*, pp. 288–294, October 1997.
- [9] John Bainbridge and Steve Furber. CHAIN: A delay-insensitive chip area interconnect. *IEEE Micro*, Vol. 22, No. 5, pp. 16–23, 2002.
- [10] Masashi Imai and Takashi Nanya. A design method for 1-out-of-4 encoded low-power self-timed circuits using standard cell libraries. *Proc. ACSD 2008*, Jun. 2008 (to appear).
- [11] W. B. Toms, D. A. Edwards, and A. Bardsley. Synthesising heterogeneously encoded systems. *Proc. Async 2006*, pp. 138–148, Mar. 2006.
- [12] Yu Cao, Takashi Sato, Michael Orshansky, Dennis Sylvester, and Chenming Hu. New paradigm of predictive MOSFET and interconnect modeling for early circuit simulation. *Proc. of CICC*, pp. 201–204, 2000.
- [13] Shinichiro Uemura, Akira Tsuchiya, Masanori Hashimoto, and Hidetoshi Onodera. A predictive transistor model based on ITRS roadmap. *Proc. IEICE General Conference*, p. 81, 2006 (in Japanese).
- [14] International technology roadmap for semiconductors. http://public.itrs.net/, The 2006 Update.
- [15] Montek Singh and Steven M. Nowick. MOUSETRAP: Ultra-high-speed transition-signaling asynchronous pipelines. In *Proc. International Conf. Computer Design (ICCD)*, pp. 9–17, November 2001.
- [16] R. P. Brent and H. T. Kung. A regular layout for parallel adders. *IEEE Trans. on Computers*, Vol. C-31, pp. 260–264, Mar. 1982.
- [17] Masashi Imai, Metehan Özcan, and Takashi Nanya. Evaluation of delay variation in asynchronous circuits based on the scalable-delay-insensitive model. *Proc. Async 2004*, pp. 62–71, Apr. 2004.