# Analysis of $G F\left(2^{m}\right)$ Multiplication Algorithm: Classic Method v/s KaratsubaOfman Multiplication Method 

Manish Kumar Goyal<br>Lecturer, Government Polytechnic College, Bagidora<br>manishbmgoyal@gmail.com

Shiv Karan Meghwal<br>Lecturer, Government Polytechnic College, Dungarpur<br>shivkaran_sonel@yahoo.co.in


#### Abstract

In recent years, finite field multiplication in $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ has been widely used in various applications such as error correcting codes and cryptography. One of the motivations for fast and area efficient hardware solution for implementing the arithmetic operation of binary multiplication, in finite field $\mathrm{GF}\left(2^{\mathrm{m}}\right)$, comes from the fact, that they are the most time-consuming and frequently called operations in cryptography and other applications. So, the optimization of their hardware design is critical for overall performance of a system. Since a finite field multiplier is a crucial unit for overall performance of cryptographic systems, novel multiplier architectures, whose performances can be chosen freely, is necessary. In this paper, two Galois field multiplication algorithms (used in cryptography applications) are considered to analyze their performance with respect to parameters viz. area, power, delay, and the consequent Area×Time (AT) and Power×Delay characteristics. The objective of the analysis is to find out the most efficient $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ multiplier algorithm among those considered.


## 1. InTRODUCTION

A variety of computer arithmetic techniques can be used to implement a digital multiplier. Most techniques involve computing a set of partial products, and then summing the partial products together. This method, involving partial products is mathematically correct, but it has the following two serious engineering problems:

1. It involves 32 intermediate additions in a 32-bit computer, or 64 intermediate additions in a 64-bit computer. These additions take a lot of time. The engineering implementation of binary multiplication consists, in fact, of simplification of the mathematical process and compromising with the increased complexity hence enforced, in order to do fewer additions. When implemented in software, long multiplication algorithms have to deal with overflow during additions, which can be expensive.
2. The basic school method handles the sign with a separate rule ("+ with + yields +", "+ with - yields -", etc.). Modern computers embed the sign of the number in the number itself, usually in the two's complement representation. That forces the multiplication process to be adapted to handle two's complement numbers, complicating the process a bit more.

To multiply two numbers with $n$-digits using this method, one needs about $n^{2}$ operations. More formally: using a natural size metric of number of digits, the time complexity of multiplying two $n$-digit numbers using long multiplication is $\Theta\left(n^{2}\right)$.Therefore, to solve above problem modular multiplication can be used. The modular exponentiation applies modular multiplication repeatedly. Modular
multiplication is a mathematical operation on integer $A . B \bmod$ $M$ with $A, B<M$ where by $A$ and $B$ are the operand and $M$ is moduls. Modular multiplication $A \times B \bmod M$ can be performed in two different ways:

- Multiplying, i.e. computing $P=A \times B$; then reducing, i.e. $R$ $=P \bmod M$
- Interleave the multiplication and the reduction steps.


Figure-1.1: Flow-chart of steps undertaken for analysis of different Multiplier Algorithms

So the performance of systems based on modular multiplication, e.g. the public key Cryptosystem, is primarily
determined by the implementation efficiency of the modular multiplication and exponentiation. As the operands are usually large (i.e. 1024 bits or more), it is necessary to improve upon the computation time of the encryption/decryption operations. Hence, it is essential to minimize the number of modular multiplications performed and to reduce the time required by a single modular multiplication. There are various algorithms that implement modular multiplication. Here following Two algorithms have been considered:

## 1 Classic Multiplier

2 Karatsuba-Ofman multiplier
Starting with each of the above mentioned algorithms, the following methodology has been adopted for the analysis of a particular design based on design parameters viz. area-onchip, power consumption and related dlays. Subsequent comparison among the various designs based on the extracted values of the design parameters mentioned earlier has been provided

## 2. The Basic Modular Multiplier: TwoStep Classic Multiplier

The two-step classic multiplication in $G F\left(2^{m}\right)$ is a straightforward translation of the classic school multiplication algorithm. This method has an asymptotic complexity $O\left(n^{2}\right)$ [6]. In the two-step multiplication, Let $a(x)$ and $b(x)$ be two field elements, the field product $c(x)$ given in Eq. (2.1).

$$
\begin{equation*}
c(x)=a(x) b(x) \bmod f(x) \tag{2.1}
\end{equation*}
$$

Where,

$$
f(x)=x^{m}+f_{m-1} x^{m-1}+\ldots+f_{1} x+f_{0} \ldots \ldots .
$$

And, $\quad f_{i} \in G F(2)=\{0,1\}$, the set $\left\{1, x, \ldots, x_{m-1}\right\}$ is polynomial basis in $G F\left(2^{m}\right)$.

### 2.1 FUNDAMENTAL CONCEPT

This involves two steps for multiplication

1. Polynomial multiplication
2. Reduction modulo an irreducible polynomial [5]

### 2.1.1 POLYNOMIAL MULTIPLICATION

The product $d(x)$ of the polynomials $a(x)$ and $b(x)$, i.e. $d(x)=$ $a(x) b(x)$, is a polynomial with maximum degree $(2 m-2)$. Polynomial multiplication $d(x)$ can be written in matrix form [3]. The coefficients of $d(x)$ are determined by the following expression:

$$
d_{k}=\left\{\begin{array}{c}
\sum_{i=0}^{k} a_{i} b_{k-i} ; k=0, \ldots, m-1  \tag{2.3}\\
\sum_{i=k}^{2 m-2} a_{k-i+(m-1)} b_{i-(m-1)} ; k=m, \ldots, 2 m-2
\end{array}\right.
$$

This expression have addition and multiplication in $G F(2)$. Assume that the following two functions can compute addition and multiplication for mod 2, using logical operation $x$ XOR $y$ and $x$ AND $y$ respectively.

> function m2xor(x, y: bit) return bit
> function m2and(x, y: bit) return bit

The total gate complexity for the bit-parallel computation of the matrix-vector product given in Eq. (2.3) is $m^{2} A N D$ gates and $(m-1)^{2}$ XOR gates. The AND gates operate all in parallel and require a single $A N D$ gate delay $T_{A N D}$, while the $X O R$ gates are organized as a binary tree of depth $\left\lceil\log _{2} j\right\rceil$ in order to add $j$ operands. The total time complexity is then found by considering the largest number of terms, which is equal to $m$ for the computation of $d_{m-1}$. Therefore, the total delay complexity for the bit parallel matrix-vector product is $\mathrm{T}_{A N D}+$ $\left[\log _{2} j\right\rceil \mathrm{T}_{\text {XOR. }}$.

### 2.1.2 REDUCTION MODULO AN IRREDUCIBLE POLYNOMIAL

After the polynomial multiplication $d(x)=a(x) \cdot b(x)$, the next step is a reduction modulo an irreducible polynomial $f(x)$ must be performed. In modular reduction $\quad c(x)=d(x) \bmod$ $f(x)$, the degree $(2 m-2)$ polynomial $d(x)$ is reduced by the degree $m$ irreducible polynomial $f(x)$, resulting in a polynomial $c(x)$ with degree $\quad \operatorname{deg}(c(x)) \leq(m-1)$.

$$
\begin{align*}
c(x) \quad & =d(x) \bmod f(x) \\
& =\left(d_{2 m-2} x^{2 m-2}+\ldots+d_{1} x+d_{0}\right) \bmod f(x) \\
& =c_{m-1} x^{m-1}+\ldots+c_{1} x+c_{0} \tag{2.4}
\end{align*}
$$

Reduction modulo $f(x)$ can be viewed as a linear mapping of the $(2 m-1)$ coefficients of $d(x)$ into the $m$ coefficients of $c(x)$. Matrix consists of an $(m \times n)$ identity matrix and an $(m \times$ $\square m-1$ ) matrix $\mathbf{R}$ named reduction matrix. The $\mathbf{R}$ matrix is a function only of the irreducible polynomial $f(x)$ as in Eq (2.5). Therefore, a reduction matrix $\mathbf{R}$ is uniquely assigned to every $f(x)$. The $r_{j, i} \in G F(2)$ coefficients of $\mathbf{R}$ can be recursively computed in function of $f(x)$ as follows:
$r_{j, i}=\left\{\begin{array}{c}f_{j} ; j=0, \ldots, m-1 ; i=0 \\ r_{j-1, i-1}+r_{m-1, i-1} r_{j, 0} ; j=0, \ldots, m-1 ; i=1, \ldots, m-2\end{array}\right.$

$$
\text { Where, } \mathrm{r}_{\mathrm{j}-1, \mathrm{i}-1}=0 ; \quad \text { if } \mathrm{j}=0
$$

$\mathbf{R}$ is function of the selected irreducible polynomial. Therefore, by choosing an appropriate reduction polynomial $f(x)$ the complexity of this operation can be reduced. The following function can compute the reduction matrix $\mathbf{R}$

## Function

reduction_matrix_R(f: poly_vector) return poly_ matrix_mlm2

Where,
poly_matrix_m1m2 is an $(m \times m-1)$ matrix of bits.
Finally, the two-step classic multiplication performing $c(x)=$ $a(x) b(x) \bmod f(x) \quad=d(x) \bmod f(x)$ using Eq.(2.4) and the reduction matrix computed with Eq.(2.5) can be given,
where the previously defined functions poly_multiplication and reduction_matrix_ $R$ are used.

### 2.2 SIMULATION , SYNTHESIS RESULTS

A VHDL model for the classic multiplication algorithm has two components poly_multiplier and poly_reducer that implement the polynomial multiplication and the reduction modulo $f(x)$, respectively. The VHDL code for each value of $m$ has been synthesized using Design Compiler with 180nm UMC library. The script for Design Compiler has been written so as to include the capability to generate reports for Power, Time and Area of the synthesized design. The procedure has been repeated for different values of $m$, and the values of each of the design parameters, hence obtained, has been provided in tabular format below.

Table-2.1: For different $m$ Power requirement of Classic Multiplier Algorithm

| $\boldsymbol{m}$ | Cell Internal <br> Power <br> $(\boldsymbol{m} \boldsymbol{W})$ | Net Switching <br> Power <br> $(\boldsymbol{m} \boldsymbol{W})$ | Total Dynamic <br> Power <br> $(\boldsymbol{m} \boldsymbol{W})$ | Cell <br> Leakage <br> Power <br> $(\boldsymbol{\mu} \boldsymbol{W})$ | Total cell <br> Area | Data <br> Arrival <br> Time <br> $(\boldsymbol{n} \boldsymbol{s})$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 8 | 3.9246 | 1.8381 | 5.7627 | 0.0272528 | 4119.096191 | 0.81 |
| 16 | 11.3900 | 5.6327 | 17.0227 | 0.0746353 | 11695.97754 | 1.07 |
| 32 | 52.6769 | 28.8272 | 81.5041 | 0.3519799 | 54402.57813 | 1.45 |
| 64 | 234.1922 | 208.9930 | 443.1851 | 1.4260 | 223284.0156 | 1.95 |
| 128 | 909.1452 | 1690.3625 | 2599.5077 | 5.429 | 854699.1563 | 4.283 |
| 163 | 1371.254 | 3473.258 | 4844.512 | 8.295 | 1301869.125 | 7.943 |
| 233 | 2225.3654 | 102960.145 | 105185.5104 | 14.83 | 2295698.589 | 24.471 |

## 3. KARATSUBA-OFMAN MULTIPLICATION

The Karatsuba-Ofman algorithm is a recursive method for efficient polynomial multiplication or efficient multiplication in positional number systems. It has recursive application of the divide-and-conquer, thus, Karatsuba algorithm leads to a running time of $O\left(n^{\log 23}\right) \approx O\left(n^{1: 585}\right)$ for $n=2^{i}(i>0)$ [6]. It is known that two arbitrary polynomials in one variable of degree less or equal to $(m-1)$ with coefficients from a field $G F\left(2^{m}\right)$ can be multiplied with not more than $m^{2}$ multiplications in $G F\left(2^{m}\right)$ and $(m-1)^{2}$ additions in $G F\left(2^{m}\right)$.

### 3.1. FUNDAMENTAL CONCEPT

The Karatsuba-Ofman algorithm provides a recursive algorithm which reduces the above multiplicative and additive (for large enough $m$ ) complexities [12]. A Karatsuba-Ofman algorithm restricted to polynomials,
where $m=2^{t}$ with $t$ an integer, Let $a(x)$ and $b(x)$ be two elements in $G F\left(2^{m}\right)$. We are interested in finding the product $d(x)=a(x) b(x)$, with degree $\leq(2 m-2)$. Both elements can be represented in the polynomial basis as follow [10][14] :
$a(x)$
$=x^{m / 2}\left(x^{m / 2-1} a_{m-1}+\ldots+a_{m / 2}\right)+\left(x_{m / 2-1} a_{m / 2-1}+\ldots+a_{0}\right)$
$=x^{m / 2} A_{H}+A_{L}$
$b(x)$
$=x^{m / 2}\left(x^{m / 2-1} b_{m-1}+\ldots \ldots+b_{m / 2}\right)+\left(x_{m / 2-1} b_{m / 2-1}+\ldots . .+b_{0}\right)$
$=x^{m / 2} B_{H}+B_{L}$
Using above expression, the polynomial product is given as $d(x)=x^{m} A_{H} B_{H}+x^{m / 2}\left(A_{H} B_{L}+A_{L} B_{H}\right)+A_{L} B_{L}$ $\qquad$
Let us define the following auxiliary polynomials
$M_{0}{ }^{(1)}=A_{L}(x) B_{L}(x) \quad \ldots \ldots(3.4)$
$M_{I}^{(1)}=\left[A_{L}(x)+A_{H}(x)\right]\left[B_{L}(x)+B_{H}(x)\right]$
$M_{2}{ }^{(1)}=A_{H}(x) B_{H}(x)$
Then the product is given by:
$d(x)$
$=x^{m} M_{2}^{(1)}(x)+x^{m / 2}\left[M_{I}^{(1)}(x)+M_{0}^{(1)}(x)+M_{2}^{(1)}(x)\right]+M_{0}^{(1)}(x)$

The algorithm becomes recursive if it is applied again to the polynomials given in Eq. (3.3) [10]. The next iteration step splits the polynomials $A_{L}, B_{L}, A_{H}, B_{H},\left(A_{L}+A_{H}\right)$, and $\left(B_{L}+\right.$ $B_{H}$ ) again in half [11]. With these newly halved polynomials, new auxiliary polynomials $M^{(2)}(x)$ can be
defined in a similar way to Eq. (3.6). The algorithm eventually terminates after $t$ steps. In the final step the polynomials $M^{(t)}(x)$ are degenerated into single coefficients. Since every step halves the number of coefficients, the algorithm terminates after $t=\log _{2} m$ steps [9][10].

### 3.2 SIMULATION, SYNTHESIS AND RESULTS

The VHDL code for the Karatsuba-Ofman algorithm has been simulated on the ModelSim-6.2c for different values of the $m$. The corresponding results obtained post-simulation were found to match the theoretical results. The VHDL code for each value of $m$ has been synthesized using Design Compiler with 180 nm UMC library, following results have been obtained.

Table-3.1: For different $m$ Power requirement of Karatsuba-Ofman multiplier

| $\boldsymbol{m}$ | Cell Internal <br> Power <br> $(\boldsymbol{m} \boldsymbol{W})$ | Net Switching <br> Power <br> $(\boldsymbol{m} \boldsymbol{W})$ | Total <br> Dynamic <br> Power <br> $(\boldsymbol{m} \boldsymbol{W})$ | Cell Leakage <br> Power <br> $(\boldsymbol{n} \boldsymbol{W})$ | Total cell Area | Data <br> Time <br> $(\boldsymbol{n s})$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 8 | 0.8923746 | 0.2291768 | 1.1216 | 4.4110 | 599.960938 | 0.32 |
| 16 | 1.0196 | 0.278 .8957 | 1.2985 | 5.0031 | 677.375977 | 0.33 |
| 32 | 1.8970 | 0.4995943 | 2.3966 | 9.3389 | 1270.885132 | 0.32 |
| 64 | 3.9009 | 1.0411 | 4.9420 | 19.195 | 2612.731689 | 0.32 |
| 128 | 7.9043 | 2.1214 | 10.0257 | 38.9081 | 5296.433594 | 0.32 |
| 163 | 15.98 | 4.282 | 20.262 | 78.58 | 10629.8964 | 0.32 |
| 233 | 20.3058 | 5.4825 | 25.7883 | 99.8702 | 13592.66992 | 0.33 |

## 4. Analysis Of Polynomial Multiplier

The hardware design of the multiplier is desired to have the following properties:

1. Minimum Area $\times$ Time (AT)
2. Minimum power $\times$ Delay

### 4.1 AT ANALYSIS

Graph-4.1 shows Area $\times$ Time (AT) characteristics of multiplication algorithms. This is one of important characteristics to analyze the performance of multiplication algorithm design. It can be inferred from the above graph that for small value of $m, A T$ is highest for the Classic multiplier. Therefore, interleaved multiplication algorithm has minimum $A T$ for all values of $m$.


Graph-4.1: (Area $\times$ Time) graph of Multiplier Algorithms for different $m$

### 4.2 Power $\times$ Delay ANALYSIS



Graph-4.2: Power $\times$ Delay Vs $m$ for different algorithm
Another figure of merit for the design of a multiplier is the product of the power and delay values. Power $\times$ Delay graph is shown in the Graph-4.2. It shows that the designs based on Classic Multiplier algorithms have large values of the power-delay product, hence making them impractical for the design purpose. The design based on the interleaved multiplication algorithm has significantly low values of the product, making them viable options for design of a multiplier.

## 5. Conclusion

Performance analysis of two finite field multiplier algorithms has been done in terms of area, power and time. Further a comparative study between the performances of these algorithms has been done with the following inferences:

1. Synthesis results for the Classic Multiplier Algorithm show that total cell area, power and delay increase with $m$ in orders greater than unity. It has large values of the power-delay product. It implies that hardware design of a multiplier based on the Classic Multiplier Algorithm is impractical for higher values of $m$.
2. For all values of $m$, small as well as large, the KaratsubaOfman multiplier yields best characteristics for area, power requirements and Area $\times$ Time (AT) characteristics. It has a significantly lower value of the power-delay product compared to Classic Multiplier. The design for Karatsuba-Ofman multiplier stands out to be best suitable for multiplier systems for cryptography, errorcontrol coding and computer algebra

Based on the study of the performance of the design for each of the individual multiplier, and a comparative study between their performance parameters, it can be concluded that the Karatsuba-Ofman multiplier is the best option in two method for hardware design of an efficient multiplier. This design can be used as a multiplier in systems cryptography and error correcting codes.

## References

[1]. J.S. Milne, "Fields and Galois Theory", Version 4.21, September, 2008.
[2]. Chen Guanghua, Liu Ming, Zhu Jingming and Zheng Weifeng, "Improvement of Interleaved Multiplication Algorithm", the 1st International Conference on Information Science and Engineering (ICISE2009), 978-0-7695-38877/09, pp. 1752-1755
[3]. Peter Kornerup, "High-Radix Modular Multiplication for Cryptosystems", IEEE 1063-6889/93, pp. 277 - 283
[4]. David Narh Amanor, Christof Paar, Jan Pelzl, Viktor Bunimov, Manfred Schimmler "Efficient Hardware Architectures for Modular Multiplication on FPGAS" 0-7803-9362-7/05, pp.539-542
[5]. Diego Viot, Rodolfo Aurélio, Helano Castro and Jardel Silveira, "Modular Multiplication Algorithm For Pkc"
[6]. H. Fan and M.A. Hasan, "Alternative to the Karatsuba algorithm for software implementations of GF(2n) multiplications", IET Inf. Secur., 2009, Vol. 3, Iss. 2, pp. 6065
[7]. Gang Zhou, Harald Michalik, and Laszlo Hinsenkamp, "Complexity Analysis and Efficient Implementations of Bit Parallel Finite Field Multipliers Based on Karatsuba-Ofman Algorithm on FPGAs", IEEE Transactions On Very Large Scale Integration (VLSI) Systems, 1063-8210, pp.1-10.
[8]. El Hadj Youssef Wajih, Zeghid Medien, Machhout Mohsen, Bouallegue Belgacem and Tourki Rached "Efficient Hardware Architecture of Recursive Karatsuba-Ofman Multiplier", 2008 International Conference on Design \& Technology of Integrated Systems in Nanoscale Era, 978-1-4244-1577-9/08, pp.1-6.
[9]. Nadia Nedjah and Luiza de Macedo Mourelle, "A Reconfigurable Recursive and Efficient Hardware for Karatsuba-Ofman's Multiplication Algorithm", IEEE 2003, 0-7803-7729-X/03,pp.1076-1081
[10]. Serdar S. Erdem and C, etin K. Koc "A Less Recursive Variant of Karatsuba-Ofman Algorithm for Multiplying Operands of Size a Power of Two", Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH'03), 1063-6889/03
[11]. Steffen Peter and Peter Langend"orfer, "An Efficient Polynomial Multiplier in $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ and its Application to ECC Designs" EDAA-2007, 978-3-9810801-2-4
[12]. Viktor Bunimov and Prof. Dr. Manfred Schimmler, "Area and Time Efficient Modular Multiplication of Large Integers", Proceedings of the Application-Specific Systems, Architectures, and Processors (ASAP'03) ISBN0-7695-1992-X/03 pp 1-10.
[13]. E. Savas, A.F. Tenca, M.E. C, iftc, ibasi and C, .K. Koc, "Multiplier architectures for GF (p) and GF $\left(2^{\mathrm{n}}\right)$ ", IEEE2004 Comput. Digit. Tech., Vol. 151, No. 2,pp 147-160 March 2004
[14]. J. Bhasker, "A VHDL Primer", Thrid Edition, Pearson Prentice Hall, 2008.
[15]. "Design Compiler Command-Line Interface Guide" Version 2006.06, synopsys.

