Circuit-Technology Co-Optimization of SRAM Design in Advanced CMOS Nodes

Publication date: 2024-06-27

Author:

Liu, Hsiao-Hsuan

Abstract:

Modern computing engines, such as central processing units (CPUs), graphics processing units (GPUs), and neural processing units (NPUs), play a pivotal role in advancing general-purpose computing, artificial intelligence (AI), machine learning (ML), and deep learning (DL) applications. These silicon-based microprocessors require substantial amounts of static random-access memory (SRAM) for level-2 (L2) and level-3 (L3) caches. For instance, the AMD Ryzen™ 7 5800X3D CPU features a 96-MB L3 cache, while Nvidia's GeForce RTX 4090 GPU carries a 72-MB L2 cache. The increasing demand for L2 and last-level cache (LLC) capacity underscores the urgency of improving on-die SRAM density, while also exerting significant pressure on speed and energy consumption. Simultaneously, a level-1 (L1) cache is essential for each core to store frequently accessed data and instructions, operating at the same speed as the CPU. The increasing number of on-chip cores (e.g., from 4 to 8 cores from AMD Zen 1 to Zen 4) necessitates an increase in the number of L1 caches. These L1 caches dictate CPU performance and consume a significant portion of power due to their high activity factor in modern multi-core systems-on-chip (SoCs).

Therefore, this doctoral research focuses on identifying power-performance-area (PPA) boosters in deeply scaled nodes, covering two primary areas: (1) the continuation of SRAM bitcell scaling and (2) the exploration of alternative SRAM subarray designs. This dissertation evaluates the impact of both bitcell scaling and alternative subarray design on achieving notable PPA enhancements with technology scaling. The progression of SRAM bitcell scaling encompasses a systematic research cycle that involves identifying scaling bottlenecks, developing novel device architectures based on complementary field-effect transistor (CFET) technologies, assessing the associated challenges (e.g., process and routing complexities), and proposing potential solutions from a bottom-up design-technology co-optimization (DTCO) standpoint. A variety of design alternatives have been scrutinized using a dedicated PPA simulation framework, aimed at selecting the most promising candidate for future SRAM bitcell scaling strategies. The conventional write margin methodology has proven inadequate in resistance-dominated technology nodes, as it fails to account for the time-dependent effects of the complete set of parasitic bitline (BL) components. To address this, a novel write margin methodology has been developed to mitigate the risk of underestimating yield loss. The scaling of cell height in CFET SRAM bitcells is constrained by the gate cut (GC), which occupies up to 44% of the total cell height in the 5-Å-compatible technology (A5). By replacing the GC with a dielectric isolation wall (DIW) and assuming self-aligned M0A/gate merges, up to 29% bitcell area scaling can be achieved when transitioning to the 3-Å-compatible technology (A3). Moreover, challenges associated with insufficient back-end-of-line (BEOL) space for routing signal and power lines in functional SRAM bitcell designs have been effectively resolved through innovative approaches. To further improve performance and power, a hybrid CFET has been integrated into the SRAM bitcell design, resulting in significant reductions of 41%, 25%, and 35% in wordline (WL) resistance, BL capacitance, and WL capacitance, respectively, compared to its sequential (seq) and monolithic (mono) counterparts.
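The following minimal sketch illustrates, in heavily simplified form and not as the dissertation's methodology, why a write-margin check that ignores the time-dependent behavior of the full parasitic BL network can be optimistic: with a finite wordline pulse, the bitline at the selected cell settles to only a fraction of its target swing, and that fraction shrinks as the BL RC time constant grows. All component values below are hypothetical placeholders.

# Minimal sketch (not the dissertation's write-margin methodology): a finite wordline
# pulse lets the BL at the far-end cell settle to only part of its full swing, and the
# settled fraction drops as the parasitic BL RC grows. All numbers are hypothetical.
import math

def bl_settling(r_bl_ohm, c_bl_f, pulse_s, n_segments=64):
    """Fraction of the full BL swing reached at the far-end cell within the write pulse.

    The BL is approximated as a uniform RC ladder; its far-end response is modeled with a
    single pole at the Elmore time constant (a common first-order approximation).
    """
    r_seg, c_seg = r_bl_ohm / n_segments, c_bl_f / n_segments
    tau = sum(k * r_seg * c_seg for k in range(1, n_segments + 1))  # Elmore delay
    return 1.0 - math.exp(-pulse_s / tau)

if __name__ == "__main__":
    C_BL = 40e-15       # total BL capacitance (hypothetical)
    PULSE = 50e-12      # wordline pulse width (hypothetical)
    for r_bl in (0.5e3, 2e3, 8e3):  # BL resistance rising with scaling (hypothetical)
        frac = bl_settling(r_bl, C_BL, PULSE)
        print(f"R_BL = {r_bl/1e3:4.1f} kohm -> settled BL swing = {frac:.1%}")

In this toy model, the settled swing drops from roughly 99% at 0.5 kohm to well below 30% at 8 kohm, which is the qualitative effect a resistance-aware, time-dependent write-margin methodology must capture.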
The A5 CFET SRAM macro demonstrates a 50-70% improvement in energy-delay-area product (EDAP) compared to its 14-Å-compatible technology (A14) counterpart, underscoring the importance of continued bitcell-level scaling. However, macro-level EDAP improvement saturates beyond A5: A3 CFET SRAM macros offer only a marginal additional improvement of 10-20% over their A5 counterparts, at the expense of a more intricate process and heightened cost. This finding highlights the need to explore alternative SRAM subarray designs as more potent avenues for optimizing PPA metrics in the subsequent phases of this dissertation.

In the second part of this dissertation, the design strategy shifts towards a top-down DTCO approach, driven by the diminishing returns of bitcell scaling beyond A5 at the macro level. The adoption of a larger subarray size is anticipated to yield a significant macro-level EDAP gain owing to reduced inter-subarray interconnect overhead. However, this gain cannot be attained with the standard (Std.) subarray design, owing to write failures induced by the increased resistance and capacitance (RC). Hence, there is strong motivation to investigate large subarray designs that address these write-failure concerns. Initially, conventional (Conv.) divided-BL and divided-WL techniques are employed to decrease parasitic RC through hierarchical subarray design in A14. Twelve distinct subarray sizes and their corresponding design parameters are optimized with a focus on ensuring writability. This study demonstrates that the Conv. divided subarray design can effectively mitigate the write-failure risk, extending the upper limit of subarray size from 256 columns × 256 rows (for the Std. subarray) to 1024 columns × 512 rows. However, the Conv. divided design incurs a substantial subarray area penalty of up to 30%, leading to macro-level PPA degradation despite subarray-level power-performance and write-margin benefits. To address this challenge, the concept of active interconnect (AIC) is introduced, which relocates the additional logic gates from the front-end-of-line (FEOL) to a BEOL active layer. Integration of A14 carbon nanotube FETs (CNFETs) as BEOL-compatible devices facilitates remarkable macro-level EDAP improvements of up to 65% when transitioning from Std. subarrays (128 columns × 128 rows) to AIC divided subarrays (1024 columns × 512 rows). These compelling outcomes illuminate the future trajectory of the SRAM subarray design roadmap in deeply scaled nodes.
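The following minimal sketch shows how a macro-level EDAP comparison of the kind quoted above can be tabulated. The configuration names mirror those in the text, but every energy, delay, and area value is a hypothetical placeholder chosen only so the example reproduces a 65% reduction consistent with the figure quoted above; it is not dissertation data.

# Minimal sketch of a macro-level EDAP comparison. All energy/delay/area values are
# hypothetical placeholders, not dissertation data.

def edap(energy_pj, delay_ns, area_um2):
    """EDAP = energy x delay x area (lower is better)."""
    return energy_pj * delay_ns * area_um2

def improvement(ref, new):
    """Relative EDAP reduction of `new` versus the reference design, in percent."""
    return 100.0 * (ref - new) / ref

if __name__ == "__main__":
    # (energy pJ/access, delay ns, macro area um^2) -- hypothetical placeholders
    macros = {
        "Std. 128x128 subarray":         (10.0, 1.00, 20000.0),
        "AIC divided 1024x512 subarray": (7.0, 0.80, 12500.0),
    }
    ref = edap(*macros["Std. 128x128 subarray"])
    for name, spec in macros.items():
        e = edap(*spec)
        print(f"{name:32s} EDAP = {e:9.0f}  improvement vs. Std. = {improvement(ref, e):5.1f}%")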