# Circuit Design of Programmable Logic and Interconnect Blocks using Spin Transfer Torque RAM for Non-Volatile FPGAs

Karrar Hussain Department of Electronics & Communication Engineering, CVR College of Engineering Telangana, India *karrarhussain@cvr.ac.in* 

> Dr. C V Krishna Reddy Director, NNRES NallaNarsimha Reddy Engineering Society Telangana, India *director@nnres.org*

Naveen Pitla Department of Electronics & Communication Engineering, CVR College of Engineering Telangana, India *naveen.pitla116@gmail.com* 

> Dr. K Lal Kishore Dean-Research, CVR College of Engineering Telangana, India *research@cvr.ac.in*

*Abstract*- Most Field-Programmable Gate Arrays (FPGAs) are currently SRAM based. Conventional SRAM has been the primary choice for memory storage in the Configurable Logic Blocks (CLBs) as well as for the configuration bits of the reconfigurable interconnects. However, SRAM-based FPGAs are volatile and need an external non-volatile memory to store the configuration data. In addition, SRAM leakage currents increase as technology scales towards smaller nodes. The use of non-volatile memories such as Spin-Transfer Torque (STT)-RAM helps to overcome the drawbacks of SRAM-based FPGAs without a significant speed penalty. In this paper we present the design of simple non-volatile CLBs using STT-RAM technology. To verify the design, these CLBs have been programmed to implement various functions. The design has been simulated and verified using Cadence tools in 40 nm CMOS technology.

Keywords- FPGAs, Configurable Logic Blocks, STT-RAM, Magnetic Tunnel Junction (MTJ)


# I. INTRODUCTION

SRAM-based FPGAs have been the predominant technology over the last few decades. The advantage of SRAM-based FPGAs is that vendors can leverage state-of-the-art CMOS technology [1]. Nevertheless, as technology nodes scale deeper into the nanometer range, SRAM suffers from high leakage currents, which lead to increased power dissipation in FPGAs [2].

Another major limitation of SRAM-based FPGAs is that they are volatile. The configuration data is lost when the power is switched off and has to be reloaded every time the FPGA is switched on, which adds latency at power-up. A further limitation is that, since the configuration data has to be stored in a separate non-volatile ROM, the data can be prone to theft. Encrypting the configuration data adds considerable overhead.

In this paper we demonstrate the realization of simple CLBs and programmable interconnects using STT-RAM, which can be used for non-volatile FPGAs. The advantages of realizing CLBs using STT-RAM are manifold. Firstly, the storage element itself contributes almost no leakage current, because the main element used is the Magnetic Tunnel Junction (MTJ), which is made of thin ferromagnetic films. Secondly, static power dissipation is reduced: since the MTJs are non-volatile, the circuit can be kept in an idle state. Thirdly, the configuration data need not be stored in a separate ROM, which improves data security. Fourthly, the FPGA need not be configured every time it is switched on. This significantly reduces latency and provides an instant-on feature.

The rest of this paper is divided into five parts. Part 1 gives a brief introduction to STT-RAM technology. Part 2 describes the basic non-volatile flip-flop (NVFF) with which we realize simple CLBs. Part 3 explains the circuit design of these non-volatile CLBs (NVCLBs) and interconnects. Part 4 shows the simulation results of some example circuits that have been realized using these NVCLBs. Part 5 concludes our work.

# II. STT-RAM TECHNOLOGY

STT-RAM is an emerging non-volatile memory based on spintronics. It has all the features of a universal memory [3]. The advantages of this memory technology compared to other technologies are shown in Fig. 1. The basic element in STT-RAM is the MTJ, which consists of two ferromagnetic layers separated by an MgO insulator [4].

|                             | SRAM              | DRAM             | Flash<br>(NOR)  | Flash<br>(NAND) | FeRAM            | MRAM              | PRAM            | RRAM            | STT-<br>MRAM      |
|-----------------------------|-------------------|------------------|-----------------|-----------------|------------------|-------------------|-----------------|-----------------|-------------------|
| Non-volatile                | No                | No               | Yes             | Yes             | Yes              | Yes               | Yes             | Yes             | Yes               |
| Cell Size [F <sup>2</sup> ] | 50-120            | 6-10             | 10              | 5               | 15-34            | 16-40             | 6-12            | 6-10            | 6-20              |
| Read Time [ns]              | 1-100             | 30               | 10              | 50              | 20-80            | 3-20              | 20-50           | 10-50           | 2-20              |
| Write/Erase<br>Time [ns]    | 1-100             | 15               | 1µs/1ms         | 1ms/0.1ms       | 50/50            | 3-20              | 50/120          | 10-50           | 2-20              |
| Endurance                   | 10 <sup>16</sup>  | 10 <sup>16</sup> | 10 <sup>5</sup> | 10 <sup>5</sup> | 10 <sup>12</sup> | >10 <sup>15</sup> | 10 <sup>8</sup> | 10 <sup>8</sup> | >10 <sup>15</sup> |
| Write Power                 | Low               | Low              | Very High       | Very High       | Low              | High              | Low             | Low             | Low               |
| Other Power<br>Consumption  | Leakage           | Refresh          | None            | None            | None             | None              | None            | None            | None              |
| High Voltage<br>Required    | No                | 3V               | 6-8V            | 16-20V          | 2-3V             | 3V                | 1.5-3V          | 1.5-3V          | <1.5V             |
|                             | Existing Products |                  |                 |                 |                  |                   |                 | Prototypes      |                   |

Figure 1. Comparison of different memory technologies.

When the magnetizations of these two ferromagnetic layers are parallel to each other, the MTJ is in the low-resistance state ($R_p$), which can be interpreted as logic '0'. On the other hand, when the two layers are antiparallel, the MTJ is in the high-resistance state ($R_{ap}$), i.e. logic '1'. Fig. 2 shows the MTJ structure. The resistance characteristics of the MTJ are shown in Fig. 3. The hysteresis curve shows that the two resistance states are asymmetric, meaning that currents of different magnitude are required for switching between the two states. An important parameter that is often used to measure this resistance ratio is the Tunneling Magnetoresistance (TMR), defined in equation (1).



Figure 2. MTJ structure showing Antiparallel (high resistance) and Parallel (low resistance) states.

$$\mathrm{TMR} = \frac{R_{ap} - R_p}{R_p} \tag{1}$$

A high TMR makes this resistance change easier to measure and allows a sense amplifier to be designed with a good sensing margin. The other important parameters of the MTJ device are: aspect ratio, resistance-area product (RA), critical current ($I_c$) and switching probability ($P_{sw}$) [5].
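
As a small numerical illustration of equation (1), the sketch below computes the TMR from a pair of resistance values and, conversely, recovers $R_{ap}$ from $R_p$ and a given TMR. The function names and the resistance values are hypothetical and serve only to show the arithmetic; they are not measured values from our design.

```python
def tmr(r_ap: float, r_p: float) -> float:
    """Tunneling magnetoresistance ratio, equation (1): (R_ap - R_p) / R_p."""
    if r_p <= 0:
        raise ValueError("r_p must be positive")
    return (r_ap - r_p) / r_p


def r_ap_from_tmr(r_p: float, tmr_ratio: float) -> float:
    """Invert equation (1): R_ap = R_p * (1 + TMR)."""
    return r_p * (1.0 + tmr_ratio)


# Hypothetical resistances in ohms, chosen only to illustrate the arithmetic.
R_P = 4000.0
R_AP = 12000.0

print(tmr(R_AP, R_P))            # 2.0, i.e. a TMR of 200 %
print(r_ap_from_tmr(R_P, 2.0))   # 12000.0
```

A larger TMR widens the gap between the two resistance values that the sense amplifier has to distinguish.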



The aspect ratio of an MTJ device is the ratio of its longer dimension to its shorter dimension. A typical ratio of 2:1 is chosen to maintain the required thermal stability. RA is the resistance-area product of an MTJ device, which is determined by the material and structure of the MTJ. The critical current $I_c$ is the current magnitude required to switch the MTJ state. $I_c$ is proportional to the physical area of the MTJ device; therefore, as MTJ dimensions become smaller, $I_c$ decreases accordingly. Thus STT-RAM technology is well suited to further scaling. Theoretically, the switching current is given by equation (2).

$$I_c = I_{c0}\left[1 - \left(\frac{KT}{E}\right)\ln\left(\frac{\tau}{\tau_0}\right)\right] \tag{2}$$

where $\tau$ = write pulse width, $I_c$ = critical switching current, $I_{c0}$ = critical switching current at 0 K, $E$ = magnetization stability barrier, $\tau_0$ = write pulse width at 0 K, $K$ = Boltzmann constant, and $T$ = operating temperature (K).
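
To make the dependence in equation (2) concrete, the following sketch evaluates the critical current for a few write pulse widths. All numerical values (the 0 K critical current, the ratio $E/(KT)$ and the reference pulse width) are assumptions picked for illustration and are not taken from Table I or from our simulations; the function name is likewise hypothetical.

```python
import math


def critical_current(ic0: float, delta: float, tau: float, tau0: float) -> float:
    """Equation (2): Ic = Ic0 * [1 - (K*T/E) * ln(tau / tau0)].

    delta is the ratio E / (K*T), so K*T/E equals 1 / delta.
    """
    if ic0 <= 0 or delta <= 0 or tau <= 0 or tau0 <= 0:
        raise ValueError("all arguments must be positive")
    return ic0 * (1.0 - math.log(tau / tau0) / delta)


# Assumed values, for illustration only.
IC0 = 50e-6    # critical current at 0 K, in amperes
DELTA = 40.0   # E / (K*T), dimensionless
TAU0 = 1e-9    # reference pulse width, in seconds

for tau in (1e-9, 10e-9, 100e-9):
    ic = critical_current(IC0, DELTA, tau, TAU0)
    print(f"tau = {tau * 1e9:.0f} ns -> Ic = {ic * 1e6:.2f} uA")
```

Under this expression a longer write pulse lowers the required switching current, which is the usual trade-off between write speed and write current.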

The MTJ switching probability $P_{sw}$ characterizes the switching behavior of the device. Switching of the free-layer magnetization direction depends on the height of the magnetization stability energy barrier. Table I gives the MTJ parameters we use for simulation in this paper. The compact model for MTJ simulation is provided in [6]. These device parameters mainly depend on the process and mask design, and designers can change them to suit their requirements. The default shape of the MTJ surface is circular (a = b).
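
For orientation, the default values in Table I can be turned into approximate resistance values. The sketch below assumes that the parallel-state resistance is simply RA divided by the junction area, that a and b are the full axis lengths of an elliptical (here circular) surface, and that the zero-bias TMR applies; the compact model in [6] may treat these details differently, and the function name is hypothetical.

```python
import math


def mtj_resistances(ra_ohm_um2: float, a_nm: float, b_nm: float, tmr_ratio: float):
    """Return (R_p, R_ap) in ohms for an elliptical MTJ surface with axes a and b."""
    # Ellipse area with full axis lengths a and b; 1 nm^2 = 1e-6 um^2.
    area_um2 = math.pi * (a_nm / 2.0) * (b_nm / 2.0) * 1e-6
    r_p = ra_ohm_um2 / area_um2        # assumed: R_p = RA / area
    r_ap = r_p * (1.0 + tmr_ratio)     # from equation (1)
    return r_p, r_ap


# Default values from Table I: RA = 5 ohm*um^2, a = b = 40 nm, TMR = 200 %.
r_p, r_ap = mtj_resistances(5.0, 40.0, 40.0, 2.0)
print(f"R_p ~ {r_p:.0f} ohm, R_ap ~ {r_ap:.0f} ohm")
```

With these assumptions the two states come out at roughly 4 kΩ and 12 kΩ.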

# II. DESIGN OF NON-VOLATILE FLIP FLOP

The basic element required to design a CLB is the memory storage element, or memory point. In this section we describe the circuit design of a non-volatile flip-flop (NVFF) using STT-RAM, which stores a single bit of information. W. Zhao et al. [7] have proposed an NVFF using MTJs that stores one bit of information. However, in that implementation the flip-flop does not have a reset input.

| Parameter | Description                    | Unit               | Default value (range) |
|-----------|--------------------------------|--------------------|-----------------------|
| RA        | Resistance-area product        | Ω·µm<sup>2</sup>   | 5 (5-15)              |
| Tsl       | Thickness of the free layer    | nm                 | 1.3 (0.8-2)           |
| a         | Length of surface (long axis)  | nm                 | 40                    |
| b         | Width of surface (short axis)  | nm                 | 40                    |
| Tox       | Thickness of the oxide barrier | nm                 | 0.85 (0.6-1.2)        |
| TMR       | TMR(0) at zero bias voltage    |                    | 200% (50%-600%)       |

Table I. MTJ parameters used for simulation.

We have augmented the design to include a reset input. The schematic of this NVFF is shown in Fig. 4. This NVFF consists of a sensing structure along with a bidirectional current source. The sensing structure is used to read the state of the flip flop and the bidirectional current source can be used to write logic states into the NVFF. A reset input at the slave latch can be used to asynchronously make the output Q=0.



Figure 4. NVFF with reset input

The complete truth table of the NVFF is given in Table II. Note that 'X' in the truth table indicates a don't-care condition. The simulation results of this NVFF with reset are shown in Fig. 5. We have obtained a read time of 80 ps. The total propagation delay is around 540 ps. Therefore, the maximum operating frequency of this NVFF is about 1.8 GHz.
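
The frequency estimate can be checked by taking the reciprocal of the total propagation delay, assuming the clock period is bounded only by that delay (the symbols $f_{max}$ and $t_{pd}$ are notation introduced here for this check):

$$f_{max} \approx \frac{1}{t_{pd}} = \frac{1}{540\ \mathrm{ps}} \approx 1.85\ \mathrm{GHz}$$

This agrees with the figure of about 1.8 GHz quoted above.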

Table II. Truth table of NVFF with reset.

| Input | CLK | Reset | MTJ1     | MTJ2     | Description  |
|-------|-----|-------|----------|----------|--------------|
| 1     | 1   | 0     | $R_p$    | $R_{ap}$ | Write '1'    |
| 0     | 1   | 0     | $R_{ap}$ | $R_p$    | Write '0'    |
| X     | 0   | 0     | $R_p$    | $R_{ap}$ | Read '1'     |
| X     | 0   | 0     | $R_{ap}$ | $R_p$    | Read '0'     |
| X     | X   | 1     | X        | X        | Output (Q=0) |
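
The behavior in Table II can be captured in a short behavioral model. This is only a sketch of the truth table, not of the circuit in Fig. 4: the class name, the string encoding of the MTJ states and the power-up state are hypothetical, and it is assumed that reset forces only the output while leaving the MTJ states untouched, and that Q holds its previous value during a write.

```python
from dataclasses import dataclass

R_P, R_AP = "Rp", "Rap"   # symbolic low- and high-resistance states


@dataclass
class NVFF:
    mtj1: str = R_AP      # assumed power-up state: stored '0'
    mtj2: str = R_P
    q: int = 0

    def step(self, d: int, clk: int, reset: int) -> int:
        """Apply one row of Table II and return the output Q."""
        if reset:
            self.q = 0    # reset row: Q forced to 0, MTJs are don't-care
        elif clk:
            # write rows: the complementary MTJ pair stores the input
            self.mtj1, self.mtj2 = (R_P, R_AP) if d else (R_AP, R_P)
        else:
            # read rows: (Rp, Rap) is sensed as '1', (Rap, Rp) as '0'
            self.q = 1 if (self.mtj1, self.mtj2) == (R_P, R_AP) else 0
        return self.q


ff = NVFF()
ff.step(d=1, clk=1, reset=0)          # write '1'
print(ff.step(d=0, clk=0, reset=0))   # read back: 1
print(ff.step(d=0, clk=0, reset=1))   # asynchronous reset: 0
print(ff.step(d=0, clk=0, reset=0))   # stored '1' is read again: 1
```

The last line reflects the assumption that reset acts only on the slave latch output, so the non-volatile content survives a reset.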



Figure 5. Simulation outputs of NVFF with reset.

# III. CIRCUIT DESIGN OF CLBS AND SWITCH MATRIX

The basic structure of an FPGA includes programmable logic elements, programmable interconnects and input-output blocks (IOBs). In this section we present the circuit design of STT-RAM based non-volatile CLBs and programmable interconnect. An FPGA architecture in which CLBs are arranged in a 2-D grid and interconnected by programmable routing resources and IOBs is shown in Fig. 6.



Figure 6. FPGA architecture.

## A. Look-Up Table (LUT) Design

Memory points are an essential component of LUTs. They store the logical values corresponding to the truth table of the circuit to be realized. The LUTs in our design have been built using the NVFF presented in the previous section. This makes the LUT non-volatile, so it retains data even when power is switched off. A truth table fully characterizes a binary function. For example, a 3-input function needs an array of eight memory points inside the LUT to store the complete truth table, bit[0] to bit[7], as shown in Fig. 7. This LUT has three main inputs A, B and C. The output $F_{out}$ can be any logical function of A, B and C. The three inputs form an address i between 0 and 7, so that $F_{out}$ takes the value of bit[i]. A 3x8 decoder is required to select one memory location out of the eight. In Fig. 7 the inputs ABC = 100 (i = 4) select bit[4] to be routed to the output. Table III shows an example of a 3-input NAND gate implemented using this LUT.



Figure 7. LUT with 8 memory points

Table III. 3-input NAND gate implementation in LUT.

| A | B | C | Fout = (A & B & C)' | Value assigned to bit   |
|---|---|---|---------------------|-------------------------|
| 0 | 0 | 0 | 1                   | Bit[0]                  |
| 0 | 0 | 1 | 1                   | Bit[1]                  |
| 0 | 1 | 0 | 1                   | Bit[2]                  |
| 0 | 1 | 1 | 1                   | Bit[3]                  |
| 1 | 0 | 0 | 1                   | Bit[4]                  |
| 1 | 0 | 1 | 1                   | Bit[5]                  |
| 1 | 1 | 0 | 1                   | Bit[6]                  |
| 1 | 1 | 1 | 0                   | Bit[7]                  |
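
The addressing scheme of the LUT can be sketched in a few lines. The helper name `make_lut` is hypothetical; the only facts taken from the text are that the eight memory points hold bit[0] to bit[7] and that A is the most significant address bit (ABC = 100 selects bit[4]).

```python
from itertools import product


def make_lut(bits):
    """Return F(a, b, c) that reads bits[i] with i = 4*a + 2*b + c."""
    if len(bits) != 8:
        raise ValueError("a 3-input LUT needs exactly 8 memory points")
    table = tuple(bits)

    def lut(a: int, b: int, c: int) -> int:
        index = (a << 2) | (b << 1) | c   # A is the most significant bit
        return table[index]

    return lut


# Contents of bit[0]..bit[7] from Table III (3-input NAND).
nand3 = make_lut([1, 1, 1, 1, 1, 1, 1, 0])

for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "->", nand3(a, b, c))
```

Loading a different list of eight bits turns the same structure into any other 3-input function.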

Decoder design is an important part of the LUT design. A decoder is required to select one bit among the available memory bits. A conventional CMOS AND decoder requires a two-level NAND-NOT structure, which increases the number of transistors, the power consumption and the delay. A pre-decoding scheme reduces the gate count and also the number of stages from input to output. Therefore, pre-decoders are used in our LUT design to reduce power. The results obtained are compared in Table IV. Similarly, a 4-input LUT requires a 4x16 decoder to select among the 16 memory points.
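
The idea of pre-decoding can be illustrated functionally. The sketch below compares a direct 3x8 decode, in which every output is an AND of three address literals, with a variant that first decodes A and B into four one-hot lines and then gates each line with C or its complement. The grouping of address bits shown here is an assumption made for illustration; the exact partitioning used in our pre-decoders is not implied by this example.

```python
from itertools import product


def decode_direct(a: int, b: int, c: int):
    """Conventional 3x8 decode: each output is a 3-input AND of address literals."""
    return [
        (a if i & 4 else 1 - a) & (b if i & 2 else 1 - b) & (c if i & 1 else 1 - c)
        for i in range(8)
    ]


def decode_predecoded(a: int, b: int, c: int):
    """Pre-decode (a, b) into four one-hot lines, then gate each with c or its complement."""
    pre = [(1 - a) & (1 - b), (1 - a) & b, a & (1 - b), a & b]   # index = 2*a + b
    c_lines = [1 - c, c]                                         # index = c
    return [pre[i >> 1] & c_lines[i & 1] for i in range(8)]


for a, b, c in product((0, 1), repeat=3):
    assert decode_direct(a, b, c) == decode_predecoded(a, b, c)
print("both decoders produce the same one-hot outputs")
```

Both functions select the same line for every address; the second shares the four pre-decoded terms between pairs of outputs, which is where a hardware implementation can save gates.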

Table IV. Power dissipation comparison of conventional and pre-decoding scheme.

| Decoder | Using NAND gates | Using pre-decoding |
|---------|------------------|--------------------|
| 3x8     | 340 nW           | 205 nW             |
| 4x16    | 447 nW           | 347 nW             |
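
Expressed as relative savings, the figures in Table IV correspond to:

$$\frac{340 - 205}{340} \approx 39.7\%\ \text{(3x8)}, \qquad \frac{447 - 347}{447} \approx 22.4\%\ \text{(4x16)}$$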

## B. CLB Design

The CLBs are the programmable elements inside the FPGA. There exist numerous possible structures for design of CLBs. A simple CLB consisting of a LUT, NVFF and some multiplexer cells is shown in Fig 8.



Figure 8. Simple CLB with LUT, NVFF and multiplexer cells.

The two active elements are the LUT and the NVFF, which may work independently or be combined. Both active elements are designed using NVFFs; thus this CLB design is completely non-volatile. The CLB has three outputs. Out3 is the direct output from $F_{out}$. $F_{out}$ can also serve as input data for the NVFF cell via a multiplexer controlled by S1. Out1 can simply pass the signal DI, in which case the CLB is transparent. Out1 can also pass NQ, the complemented output of the NVFF, depending on the select line S2. Out2 gives the registered output of the NVFF on the clock edge. To allow serial programming, we chain all the memory points in the CLB. The bit that ends up at the far end of the register chain is defined in the first cycle, while the closest bit is configured by the data bit present at the last active clock edge. The CLB is configured on active negative clock edges.
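
The ordering of the serial configuration data can be illustrated with a simple shift-register model. The function name, the chain length used in the example and the all-zero initial state are assumptions; the model only reproduces the statement that the first bit shifted in ends up at the far end of the chain.

```python
def shift_in(chain_length: int, serial_bits):
    """Shift bits into a chain of memory points, one bit per active clock edge.

    chain[0] is the memory point closest to the serial input,
    chain[-1] is the far end of the chain.
    """
    chain = [0] * chain_length          # assumed initial contents
    for bit in serial_bits:             # one iteration per negative clock edge
        chain = [bit] + chain[:-1]      # every stored bit moves one position away
    return chain


# The '1' sent in the first cycle reaches the far end after four clock edges.
print(shift_in(4, [1, 0, 0, 0]))   # [0, 0, 0, 1]
```

In other words, the serial stream has to be sent in reverse order relative to the position of the memory points along the chain.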

## C. Switch Box and Connection Box Design

In an FPGA the routing resources consume most of the chip area and are responsible for most of the circuit delay. Therefore, the routing architecture is the most important factor in determining system speed and the achievable logic density [1]. Fig. 8 shows the routing resources inside an FPGA. They comprise wire segments and two kinds of modules, Switch Boxes (SB) and Connection Boxes (CB).



Figure 8. Routing resources inside an FPGA.

The intersection of horizontal and vertical channels is referred to as a Switch Box (SB); the Switch Box serves to connect wire segments, and this requires using programmable switches inside it. Connection Boxes are used to connect CLB module pins to wire segments. Logic circuits are implemented in an FPGA by partitioning logic into individual logic modules and then interconnecting the modules by programming the switches in switch and connection boxes. Fig 9 shows the detailed routing architecture used in our design.



Figure 9. Detailed routing architecture showing switch & connection box.

The fraction of wire segments in a channel which connect to an input logic block pin is the input connection block flexibility, F<sub>cin</sub>. Similarly, the fraction of wire segments in a channel which connect to an output logic block pin is the output connection block flexibility, F<sub>cout</sub>. The number of possible connections a wire segment can make to other wire segments is the switch block flexibility, F<sub>s</sub>. The routability of common switch block styles, such as the Disjoint [8] and Wilton [9] switch blocks, is shown in Fig. 10. In our design we use the disjoint switch for the switch box and a multiplexer as the connection box. The realization of the disjoint switch box using STT-RAM is shown in Fig. 11. A wire entering a disjoint switch block can only connect to other wires with the same numerical designation. Despite its limited routing flexibility, the disjoint switch has been used in a number of commercial FPGAs, including devices from the Xilinx XC4000 family.
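
The connectivity rule of the disjoint switch block can be enumerated directly. The sketch below assumes a four-sided switch block and a hypothetical channel width; the side names and function names are illustrative and do not describe the transistor-level realization in Fig. 11.

```python
from itertools import combinations

SIDES = ("north", "east", "south", "west")


def disjoint_switch_points(channel_width: int):
    """List the programmable connections of a disjoint switch block.

    Track t on one side may connect only to track t on each other side.
    """
    return [
        ((s1, t), (s2, t))
        for t in range(channel_width)
        for s1, s2 in combinations(SIDES, 2)
    ]


def flexibility(points, wire):
    """Number of other wires a given wire can reach (the Fs of that wire)."""
    return sum(wire in pair for pair in points)


points = disjoint_switch_points(channel_width=4)   # assumed channel width
print(len(points))                                 # switch points in the block
print(flexibility(points, ("north", 2)))           # connections of one wire
```

Each wire reaches exactly one wire on each of the other three sides, which is why the disjoint topology offers limited routing flexibility compared with the Wilton style.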



Figure 10. Routability of (a) Disjoint and (b) Wilton switches



Figure 11. Realization of Disjoint switch box using STT-RAM.

# IV. SIMULATION RESULTS

This section shows the simulation results of the implementations described in the previous section. All simulations are done in 40 nm CMOS technology using Cadence tools. To verify the design of this STT-RAM based FPGA, we have programmed the CLBs to implement different circuits. Two types of CLBs, i.e. 3-input and 4-input CLBs, have been designed. We have taken a 2x2 and a 3x3 matrix structure to implement a 2-bit comparator and a ripple carry adder (RCA), respectively. The serial bit streams used to configure the CLBs and the switch matrix for these two examples were also determined.

Fig 12 shows the 2x2 matrix structure which contains 4 CLBs and 9 switch boxes (S1 to S9). The serial bit stream used to configure this 2x2 structure as a 2-bit comparator is given in Fig 13. The simulation outputs are presented in Fig 14.



Figure 12. 2x2 matrix structure.

|                 | BIT STREAM                              |
|-----------------|-----------------------------------------|
| CLB 1           | 0000100011001110 (A>B)                  |
| CLB 2           | 1000010000100001 (A=B)                  |
| CLB 3           | 0111001100010000 (A < B)                |
| CLB 4           | 000000000000000                         |
| SWITCH MATRIX 1 | 011010010010000010000010                |
| SWITCH MATRIX 2 | 00000010000000000000000                 |
| SWITCH MATRIX 3 | 01000000000001010011010                 |
| SWITCH MATRIX 4 | 0000000001000000000000                  |
| SWITCH MATRIX 5 | 000000000000000000000000000000000000000 |
| SWITCH MATRIX 6 | 000000000000000000000000000000000000000 |
| SWITCH MATRIX 7 | 000000000000000000000000000000000000000 |
| SWITCH MATRIX 8 | 000000000000010000000                   |
| SWITCH MATRIX 9 | 000000000000000000000000000000000000000 |
| SERIAL DATA 1   | 1110101111101011                        |
| SERIAL DATA 2   | 1110101111101011                        |

Figure 13. Serial bit stream used to configure the 2x2 structure as 2-bit comparator.
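
The CLB contents listed in Fig. 13 can be reproduced from the comparator truth tables. The sketch below assumes that the leftmost character of each string is bit[0] and that the 4-bit LUT address is formed as A1 A0 B1 B0; this ordering is inferred from the listed strings rather than stated explicitly, and the helper names are hypothetical.

```python
def lut_bits(func, n_inputs: int = 4) -> str:
    """Evaluate func for every address; bit[i] is placed at string position i."""
    return "".join(str(int(func(i))) for i in range(2 ** n_inputs))


def split(i: int):
    """Interpret a 4-bit address as A (upper two bits) and B (lower two bits)."""
    return i >> 2, i & 0b11


greater = lut_bits(lambda i: split(i)[0] > split(i)[1])
equal = lut_bits(lambda i: split(i)[0] == split(i)[1])
less = lut_bits(lambda i: split(i)[0] < split(i)[1])

print(greater)   # compare with CLB 1 in Fig. 13
print(equal)     # compare with CLB 2 in Fig. 13
print(less)      # compare with CLB 3 in Fig. 13
```

Under these assumptions the three generated strings coincide with the CLB 1 to CLB 3 entries of Fig. 13.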

International Journal on Recent and Innovation Trends in Computing and Communication Volume: 4 Issue: 12



Figure 14. Simulation outputs of 2-bit comparator.

Similarly, Fig. 15 shows another example, a 3x3 matrix structure used to realize a ripple carry adder (RCA). It contains 9 CLBs and 16 switch boxes. The corresponding serial bit stream is given in Fig. 16, and the simulation outputs are presented in Fig. 17.



Figure 15. 3x3 matrix structure.

|                  | BIT STREAM                              |
|------------------|-----------------------------------------|
| CLB 1            | 0001011100010111                        |
| CLB 2            | 0110100101101001                        |
| CLB 3            | 0001011100010111                        |
| CLB 4            | 0110100101101001                        |
| CLB 5            | 0001011100010111                        |
| CLB 6            | 0110100101101001                        |
| CLB 7            | 000000000000000000000000000000000000000 |
| CLB 8            | 0110100101101001                        |
| CLB 9            | 0001011100010111                        |
| SWITCH MATRIX 1  | 00001000101000100000000                 |
| SWITCH MATRIX 2  | 000010001000000000000000                |
| SWITCH MATRIX 3  | 100001000001000000000000                |
| SWITCH MATRIX 4  | 000000000000000000000000000000000000000 |
| SWITCH MATRIX 5  | 000010001010100000000000                |
| SWITCH MATRIX 6  | 000011001000000000000000                |
| SWITCH MATRIX 7  | 1000000000100000000000                  |
| SWITCH MATRIX 8  | 00000001000000000000000                 |
| SWITCH MATRIX 9  | 000000000000000000000000000000000000000 |
| SWITCH MATRIX 10 | 1100000000000010000000                  |
| SWITCH MATRIX 11 | 0000010000000000000000000               |
| SWITCH MATRIX 12 | 000000010000000000000000                |
| SWITCH MATRIX 13 | 000000000000000000000000000000000000000 |
| SWITCH MATRIX 14 | 000000000000000000000000000000000000000 |
| SWITCH MATRIX 15 | 110000000000000000000000000000000000000 |
| SWITCH MATRIX 16 | 000000000000000000000000000000000000000 |
| SERIAL DATA 1    | 0010010000100100                        |
| SERIAL DATA 2    | 0010100000101100                        |
| SERIAL DATA 3    | 0010110000101000                        |
| SERIAL DATA 4    | 000000010001100                         |
| SERIAL DATA 5    | 1000110000000000                        |

Figure 16. Serial bit stream used to configure the 3x3 structure as RCA.
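
The two distinct CLB patterns in Fig. 16 match the sum and carry functions of a full adder. The sketch below regenerates them under the assumption that the leftmost character is bit[0], that the three low address bits carry the two operand bits and the carry-in, and that the fourth (most significant) LUT input is unused; because both functions are symmetric, the assignment of the three inputs does not matter. The helper names are hypothetical.

```python
def lut_bits(func, n_inputs: int = 4) -> str:
    """Evaluate func for every address; bit[i] is placed at string position i."""
    return "".join(str(func(i)) for i in range(2 ** n_inputs))


def adder_inputs(i: int):
    """Use the three low address bits as a, b, carry-in; the top bit is ignored."""
    return (i >> 2) & 1, (i >> 1) & 1, i & 1


def sum_bit(i: int) -> int:
    a, b, cin = adder_inputs(i)
    return a ^ b ^ cin


def carry_bit(i: int) -> int:
    a, b, cin = adder_inputs(i)
    return (a & b) | (b & cin) | (a & cin)


print(lut_bits(sum_bit))     # compare with CLB 2, 4, 6, 8 in Fig. 16
print(lut_bits(carry_bit))   # compare with CLB 1, 3, 5, 9 in Fig. 16
```

Each 16-bit string is the 8-bit truth table repeated twice, which is what one would expect when a 3-input function is stored in a 4-input LUT whose top address bit is ignored.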

# V. CONCLUSIONS

This paper has presented the circuit design of simple CLBs and a programmable switch matrix for non-volatile FPGAs using STT-RAM technology. Two circuits, i.e. a 2-bit comparator and a ripple carry adder, have been implemented using the non-volatile FPGA designed above. STT-RAM technology provides a good alternative for designing non-volatile FPGAs in the near future.



Figure 17. Simulation outputs of RCA.

## ACKNOWLEDGMENT

The authors would like to thank Yue Zhang et al. [6] for providing the MTJ models for simulation.

## REFERENCES

- [1] Ian Kuon et al., "FPGA architecture: survey and challenges", Electronic Design Automation, vol. 2, no. 2, pp. 135-253, 2007.
- [2] Jason Helge Anderson, "Power optimization and prediction techniques for FPGAs", PhD thesis, University of Toronto, 2005.
- [3] Mohamad T. Krounbi et al., "Status and challenges for non-volatile STT-RAM", Int. Symp., New York, 2010.
- [4] Cheng T. Horng, "A high performance MTJ element for STT-RAM and method of making", Patent No. EP2073285, July 2007.
- [5] Hai Li, Yiran Chen, Non-volatile Memory Design, CRC Press, 2012.
- [6] Yue Zhang, Weisheng Zhao et al., "A compact model of perpendicular magnetic anisotropy magnetic tunnel junction", IEEE Transactions on Electron Devices, vol. 59, pp. 819-826, 2012.
- [7] W. Zhao et al., "New non-volatile logic based on spin-MTJ", Phys. Stat. Sol. (a), vol. 205, no. 6, pp. 1373-1377, 2008.
- [8] Y. L. Wu, M. Marek-Sadowska, "Orthogonal greedy computing, new optimization approach for 2-D FPGAs", Proceedings of the ACM/IEEE Design Automation Conference, pp. 508-573, 1995.
- [9] S. Wilton, "Architectures and algorithms for field programmable gate arrays with embedded memories", PhD thesis, University of Toronto, 1997.