Thesis 237
Public Key Cryptography in 32-bit for
Ultra Low-Power Microcontrollers
P.J. de Clercq
Without written permission of the thesis supervisor and the author it is forbidden
to reproduce or adapt in any form or by any means any part of this publication.
Requests for obtaining the right to reproduce or utilize parts of this publication should
be addressed to Departement Elektrotechniek, Kasteelpark Arenberg 10 postbus
2440, B-3001 Heverlee, +32-16-321130 or by email info@esat.kuleuven.be.
A written permission of the thesis supervisor is also required to use the methods,
products, schematics and programs described in this work for industrial or commercial
use, and for submitting this publication in scientific contests.
List of Figures
3.1 The input parameters are split into three equal parts and the following multiplication methods are used: a) Schoolbook multiplication b) Operand-scanning c) Optimized operand-scanning.
3.2 Splitting the input parameters a) Symmetrically into two parts of equal length b) Asymmetrically on the word size of the processor. The word size for this example is 32-bit.
3.3 An asymmetric split causes the most significant polynomial to be completely populated with bits and will therefore require a larger lookup table.
3.4 Cost estimation for different values of w.
List of Tables
4.1 The energy used per cycle for different instructions. The clock frequency is 48 MHz.
4.2 Energy measurements of point multiplications. The clock frequency is 48 MHz. The measured energy column indicates the average measured energy, and the energy column indicates the consumed energy per operation.
4.3 Timings for point multiplications. All timings are given in milliseconds and energy is given in millijoules (mJ). The ATMega128L runs at 7.37 MHz except where indicated otherwise, the MSP430 runs at 8.192 MHz, the ARM7TDMI runs at 80 MHz, and the ARM Cortex-M0+ runs at 48 MHz.
4.4 Timings in cycles for modular multiplication and modular squaring on different platforms.
4.5 Timings in cycles for field arithmetic algorithms in F_{2^233}.
4.6 Total accumulated timings per operation for random and fixed-point multiplication.
4.7 Optimal values for w on F_{2^233}.
4.8 Memory operations on the lookup tables required for the Asymmetrical Hybrid, Symmetrical Hybrid and Symmetrical Karatsuba methods by using a factor two split in F_{2^233}.
4.9 Memory operations on the lookup tables required for the Symmetrical Hybrid vs Symmetrical Karatsuba methods by using a factor two split in F_{2^233}.
4.10 Memory overhead for Symmetrical and Asymmetrical methods in F_{2^233}. Symmetrical overhead is the cost of performing the pre- and post-shift operations (Eq. 3.7). Asymmetrical overhead is calculated from the additional memory accesses due to the larger lookup table.
4.11 Memory operations to the intermediate values stored inside memory in F_{2^233}. The Symm/Asymm column indicates that both the Asymmetrical Hybrid and the Symmetrical Hybrid methods do not use any intermediate values stored in memory.
List of Algorithms
Abbreviations, parameters, and
mathematical symbols
← Multi-precision assignment
⊕ Bitwise XOR
w Window parameter
Eq. Equation
Fig. Figure
Preface
P.J. de Clercq
Contents

List of Figures
List of Tables
Preface
Abstract
1 Introduction
1.1 An introduction to Public-Key Cryptography
1.2 PKC on wireless sensor networks
1.3 Goals of the thesis
1.4 Thesis outline
2 Theory and Background
2.1 Introduction
2.2 Elliptic Curve Protocols
2.3 Elliptic Curve Arithmetic
2.4 Field Arithmetic
2.5 Low-Power Processors and Microcontrollers
2.6 Conclusion
3 Implementation
3.1 Matching a curve to the architecture
3.2 Actual algorithms used
3.3 Analysis of algorithms
3.4 Optimizations
3.5 Conclusion
4 Results
4.1 Measurement setup
4.2 Measurement results
4.3 Comparison with other libraries
4.4 Analysis of the effect of different processor word sizes
4.5 Conclusion
5 General Conclusion
5.1 Future work
Bibliography
Abstract
Chapter 1
Introduction
The purpose of this chapter is to serve as a general introduction to the thesis topic.
The research field of the topic is described in a broad context with an introduction
to public-key cryptography and a discussion on the goals of the thesis.
Public-key cryptography (PKC) systems are typically used to provide three types of services: key establishment, digital signatures, and encryption. Key establishment protocols are used to generate a secret key between two parties over an insecure channel. Digital signatures provide authentication, which means that the two communicating parties can guarantee that they are the ones generating the messages between each other. Encryption provides secure communications between parties. PKC is the only type of system that can achieve non-repudiation, the ability to prove the integrity and origin of the data. While PKC systems can be used for many different cryptographic tasks, their biggest drawback is that they are computationally very demanding. The best symmetric key schemes are typically several orders of magnitude faster than a properly implemented PKC system. This gives rise to the use of hybrid
cryptosystems where the key exchange is done using a public key system and the
computationally less demanding symmetric key cryptosystem is used to efficiently
perform the encryption and decryption operations. This removes the need for each user to store a key for every other user: the two communicating parties can simply exchange their public keys over the insecure channel, generate a secret key by means of a key establishment protocol, and use this key to encipher communications between each other.
The first public-key cryptographic system ever published in the literature is due to Rivest, Shamir and Adleman (RSA) in 1978 [3]. It is based on the difficulty of factoring a large integer which is composed of two large prime factors.
Miller [4] and Koblitz [5] independently introduced the idea of Elliptic Curve Cryptography (ECC) in the mid-1980s. It is based on the ability to perform a multiplication of a scalar k with a point P, and the difficulty of obtaining the scalar k given the original point P and the product kP. This is called the Elliptic Curve Discrete Logarithm problem. ECC is mathematically more complex than RSA, but its key size is smaller and its operations can be performed faster for the same level of security.
Wireless sensor nodes cannot generally be considered tamper proof because they are often made to be as cheap as possible, with no tamper-resistance. As these devices are made to be economically viable, they have a limited amount of energy, computation power, memory and communication abilities. A
node’s lifetime is also directly influenced by the amount of energy that it uses to
perform computations and is therefore also directly influenced by the efficiency of its
algorithms.
Due to their high computational requirements, RSA and DSA are considered
to be impractical for use in WSNs where the devices are very constrained in processing
power and energy. ECC is an attractive alternative due to its low computational and
memory requirements, and is particularly attractive for use in Hybrid cryptosystems
where PKC is used for key exchange, and symmetric cryptography is used for the
efficient encryption of data. Digital signatures can also be useful for WSNs as they can be used to guarantee the authenticity and integrity of the data. To perform a key exchange the Elliptic Curve Diffie-Hellman exchange (ECDH) can be used, and for generating digital signatures the Elliptic Curve Digital Signature Algorithm (ECDSA) can be used.
Chapter 2

Theory and Background
This chapter will serve as a general introduction to the theory of Elliptic Curve
Cryptography (ECC), and provide some background on processors which will be
discussed in the following chapters, including the target processor, the ARM Cortex-
M0+.
First, the introduction section discusses the relative key strength of ECC systems compared to other cryptographic systems and gives a brief motivation for the choice of ECC, followed by a discussion on elliptic curve parameters, including the three-level hierarchy of an ECC system. Next, the Elliptic Curve Discrete Logarithm (ECDL) problem is introduced, followed by a description of two ECC protocols: (1) Elliptic Curve Diffie-Hellman (ECDH), and (2) the Elliptic Curve Digital Signature Algorithm (ECDSA). Subsequently, elliptic curves are defined and the point addition and point doubling operations are discussed, followed by a brief introduction to a number of different point multiplication algorithms. Then finite fields are introduced with a brief mathematical overview of group laws, coordinate systems, and field arithmetic, including field multiplication, addition, squaring, reduction and inversion over binary fields as well as prime fields. Finally, we introduce several processors, including the target platform and processors for which implementations were found in the literature and in software libraries.
2.1 Introduction
This section discusses the relative key strength of ECC systems compared to other cryptographic systems, motivates the choice of ECC, and discusses elliptic curve parameters, including the three-level hierarchy of an ECC system.
ECRYPT [2] performed an analysis of the key lengths of various cryptographic methods. The authors classified a number of different threats according to the threats' capabilities. A small organization is defined as having a $10k budget and an FPGA, a medium organization has a $300k budget and FPGA/ASIC, a large organization has a $10M budget and FPGA/ASIC, and an intelligence agency has a $300M budget and ASIC. Table 2.1 shows the duration that a key strength will
provide protection against different threats. We can see that a symmetric key size of
80 bits will only provide very short term protection against agencies. The minimum
recommended symmetric key size that will provide medium term protection against
agencies is 112 bits.
Table 2.2 shows a key strength comparison between symmetric cryptography,
ECC, and RSA [3]. The results from [8, 9, 10, 2, 11] are used to make a comparison of
the recommended key lengths for ECC, RSA and symmetric key cryptography. The
reason why the number of bits for ECC and RSA is displayed as a range of values is that different authors used different methods to determine the strength of each algorithm. An online tool to easily compare the recommendations
for key lengths from the mentioned publications can be found at [12].
Table 2.2: The key strengths of various elliptic curve key lengths compared to those of RSA and symmetric key cryptography.
ECRYPT II [2] notes that the elliptic curve key length recommendations are
only applicable for prime fields. Cryptosystems based on binary curves are required
to have a field length of approximately 10 bits more than prime curves to have the
same relative key strength.
From the above comparisons it is clear that the key size required by ECC is
always significantly smaller than that of RSA and this makes ECC particularly
appealing for use in constrained devices. Alternative public-key systems exist, but they are beyond the scope of this thesis.
[Figure 2.1: The three-level hierarchy of an ECC system: the protocol layer, the curve arithmetic layer, and the field arithmetic layer.]
The protocol layer can be seen as a type of application layer as it depends on the type of elliptic curve cryptography operation being performed. The parameters for this layer consist of the elliptic curve protocol. The curve arithmetic layer depends on this layer.
The elliptic curve arithmetic layer consists of all the operations which are performed on the curve. The parameters for this layer include the choice of coordinate
system (which depends on the underlying field type), algorithm selection for the
point addition, and point doubling operations (which depend on the projective
coordinate system used), as well as the algorithm for point multiplication. The
point multiplication algorithm typically depends on the protocol: for example, when performing the Elliptic Curve Diffie-Hellman (ECDH) key exchange protocol, one random point multiplication and one fixed-point multiplication are required. However, when performing the Elliptic Curve Digital Signature Algorithm (ECDSA) protocol, the signature generation is done with a random point multiplication and the signature verification is typically done by performing two simultaneous point multiplications.
By implementing the point multiplication according to the application (protocol) it
is possible to get a more efficient implementation.
Field arithmetic forms the foundation of an ECC system as most of the execution
time is spent in this layer. The parameters for this layer include the field type
(F_p, F_{2^m} or F_{p^m}), the field order, and the field arithmetic algorithms for field addition,
multiplication, reduction and inversion. Optimizations in the field arithmetic layer
will lead to a speed increase of the whole cryptosystem. The field multiplication
routine is typically where the most time is spent inside an elliptic curve cryptosystem,
and therefore a fast implementation of this algorithm is crucial for an efficient
implementation.
The following sections will provide some background related to the three layers in
order to aid in the process of selecting the correct parameters for the target platform
and application.
2.2 Elliptic Curve Protocols
ECDSA signature generation.
Input: Curve parameters (a, b, P, m), private key d and message msg
Output: Signature r, s
1: Select k ∈ [1, m − 1]
2: Compute kP = (x_1, y_1) and convert x_1 to an integer x̄_1
3: Compute r = x̄_1 mod m. If r = 0 then go to step 1
4: Compute e = Hash(msg)
5: Compute s = k^{-1}(e + dr) mod m. If s = 0 then go to step 1.
6: return (r, s)
ECDSA signature verification.
Input: Curve parameters (a, b, P, m), public key Q, message msg and signature (r, s)
Output: Accept or Reject signature
1: Verify that r, s ∈ [1, m − 1]
2: Compute e = Hash(msg)
3: Compute w = s^{-1} mod m
4: Compute u_1 = ew mod m and u_2 = rw mod m
5: Compute X = u_1 P + u_2 Q
6: If X = ∞ then return("Reject signature")
7: Convert the x-coordinate x_1 of X to an integer x̄_1
8: Compute v = x̄_1 mod m
9: If v = r then return("Accept signature") else return("Reject signature")
After receiving the public keys, both parties can calculate the shared secret by:
dB · PA = dA · PB = dA · dB · G.
An attacker cannot determine the shared secret from the shared curve parameters
and the base point G due to the difficulty of solving the Elliptic Curve Discrete
Logarithm problem.
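As an illustration only, the flow on party A's side might look like the following C sketch; the point type and the point_mul routine are hypothetical placeholders for the point multiplication discussed in the next section, not the interface of an actual library.

#include <stdint.h>

typedef struct { uint32_t x[8]; uint32_t y[8]; } point_t;  /* affine point, 8 words for a ~233-bit field */

/* Hypothetical scalar multiplication routine: r = k*p. */
void point_mul(point_t *r, const uint32_t k[8], int m, const point_t *p);

void ecdh_side_a(const uint32_t dA[8],   /* A's private scalar            */
                 const point_t *G,       /* shared base point             */
                 const point_t *PB,      /* B's public key, PB = dB*G     */
                 point_t *PA,            /* out: A's public key PA = dA*G */
                 point_t *S)             /* out: shared secret            */
{
    point_mul(PA, dA, 233, G);   /* PA is sent to B over the insecure channel */
    point_mul(S,  dA, 233, PB);  /* S = dA*PB = dA*dB*G = dB*PA               */
}

Party B performs the mirror-image computation with dB and PA, arriving at the same point S.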
Y^2 Z + a_1 XYZ + a_3 Y Z^2 = X^3 + a_2 X^2 Z + a_4 X Z^2 + a_6 Z^3, (2.1)

The equation can be rewritten in the affine Weierstrass form by setting x = X/Z and y = Y/Z:

y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6, (2.2)

y^2 = x^3 + ax + b, (2.3)
2.3 Elliptic Curve Arithmetic
Figure 2.2: A point addition is performed on an elliptic curve using the tangent-
and-chord method.
left-to-right. For each bit inside k, a point doubling is performed, together with a point addition whenever k_i equals 1. This is the slowest point multiplication algorithm and it has an expected running time of (m/2)A + mD, since a random k will have on average m/2 bits set equal to one. (This is a reasonable assumption because each of the bits inside the scalar can be seen as a random variable with an expected value of 0.5; on average half the bits are set to one and the other half to zero.)
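A minimal C sketch of this binary method is given below; point_t and the point_add/point_double/point_set_infinity routines are assumed placeholders for the curve arithmetic of section 2.4, not actual thesis code.

#include <stdint.h>

typedef struct { uint32_t x[8]; uint32_t y[8]; } point_t;

void point_add(point_t *r, const point_t *p, const point_t *q);
void point_double(point_t *r, const point_t *p);
void point_set_infinity(point_t *r);

/* Left-to-right binary double-and-add: scans the m-bit scalar k from the
 * most significant bit down; every bit costs one doubling and every set
 * bit additionally costs one addition, giving (m/2)A + mD on average. */
void point_mul(point_t *r, const uint32_t k[8], int m, const point_t *p)
{
    point_set_infinity(r);
    for (int i = m - 1; i >= 0; i--) {
        point_double(r, r);
        if ((k[i / 32] >> (i % 32)) & 1)
            point_add(r, r, p);
    }
}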
k = k_1 + k_2 λ mod n,

where k_1 and k_2 are approximately half the bit-length of k; this is called a balanced length-two representation of k. The point multiplication can then be rewritten as
kP = k_1 P + k_2 λP = k_1 P + k_2 φ(P). (2.6)
The TNAF method can be seen as the τ-adic analogue of the NAF method. When the TNAF method is used in a binary point multiplication scheme, the running time is approximately (m/3)A, because the number of non-zero elements is on average m/3. A squaring operation can be computed very quickly on binary curves. Just like the NAF method, this method can be accelerated considerably by employing a windowing method. This method is used for simulation in section 3.1.2, as well as in an implementation in section 3.4.2.
2.4 Field Arithmetic
Polynomial representation
A useful way to represent a binary field is with the polynomial basis representation:
F_{2^m} = { a_{m-1} z^{m-1} + a_{m-2} z^{m-2} + · · · + a_1 z + a_0 : a_i ∈ {0, 1} }.

Here we represent the elements of F_{2^m} as polynomials whose coefficients are either 0 or 1. The degree of the polynomial is at most m − 1. An irreducible polynomial f(z) of degree m is used for reduction after a field multiplication or squaring operation.
Coordinate systems
If the standard affine coordinate system is used, then both the point addition, and
point doubling operations require multiple field multiplications, as well as a field
inversion. It is common to find that the inversion operation is significantly more
expensive than the multiply operation in an actual implementation. If this is the case
then it is often more efficient to represent the point in another coordinate system,
because in a non-affine system no inversion is required. However, inversion is required
to change the system, but only once.
Affine coordinates The affine coordinates (x, y) represent a point on the curve and must satisfy Eq. 2.4.
Both the point doubling operation and the point addition operation can be performed in any one of the above-mentioned coordinate systems, which eliminates the need for a field inversion. Furthermore, a mixed-coordinate point addition is also possible, where the two input points use different coordinate systems. By doing so the number of multiplication operations can be reduced even further than is possible when both input points use the same coordinate system.
Table 2.3: The number of operations required to perform a point addition and a point doubling, for the Affine (A), Projective (P), Jacobian (J) and López-Dahab (LD) coordinate systems on binary fields. The field multiplication operation is indicated with M, and a field inversion is indicated with I.

Coordinate system over binary fields   Doubling   General Addition   Mixed Addition
Affine                                 1I + 1M    1I + 1M            -
Projective                             7M         13M                12M (P + A)
Jacobian                               5M         14M                10M (J + A)
López-Dahab                            4M         14M                8M (LD + A)
Table 2.3 shows the number of field operations required for a point doubling, a general addition and a mixed-coordinate addition. The results in this table only show the multiply and inversion operations because it was assumed that field additions and squarings are cheap in comparison with multiplications and inversions. The operation counts for the different coordinate systems, Affine (A), Projective (P), Jacobian (J) and López-Dahab (LD), are written in terms of the number of field operations (multiplication (M) and inversion (I)) required to perform the point operation. It can be seen that the mixed-coordinate addition of a López-Dahab and an Affine point provides the fastest point addition, with 8 field multiplication operations. The fastest point doubling operation is for a point in López-Dahab coordinates, with 4 field multiplications.
For a mixed point addition in F_{2^m}, with P(X_1, Y_1, Z_1) in López-Dahab coordinates and Q(X_2, Y_2) in affine coordinates, the following operations need to be performed:
A ← Y_2 · Z_1^2, B ← X_2 · Z_1 + X_1, C ← Z_1 · B, D ← B^2 · (C + a Z_1^2), Z_3 ← C^2, E ← A · C, X_3 ← A^2 + D + E, F ← X_3 + X_2 · Z_3, G ← (X_2 + Y_2) · Z_3^2, Y_3 ← (E + Z_3) · F + G.
For a point doubling with P(X_1, Y_1, Z_1) in López-Dahab coordinates, the following operations need to be performed:
For an algorithmic listing of these methods please see [15, p. 94].
Group Laws
The group laws for the non-supersingular form are as follows:
λ = (y_1 + y_2)/(x_1 + x_2)
λ = x_1 + y_1/x_1.
Field addition
The addition of two polynomials x and y in a binary field is done by performing the bit-wise exclusive-or (XOR) operation on x and y. This method is used extensively in many of the algorithms listed in this document.
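In C, the field addition amounts to a word-wise XOR loop; a sketch, assuming a 32-bit word size and n = 8 words for F_{2^233}:

#include <stdint.h>
#define NWORDS 8   /* ceil(233/32) words for F_2^233 */

/* Addition in F_2^m: coefficient-wise XOR of the two polynomials. */
void f2m_add(uint32_t c[NWORDS], const uint32_t x[NWORDS], const uint32_t y[NWORDS])
{
    for (int i = 0; i < NWORDS; i++)
        c[i] = x[i] ^ y[i];
}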
Square
The squaring operation is obtained by inserting a 0 between each of the bits of the input polynomial c(z). If c(z) = c_{m-1} z^{m-1} + · · · + c_2 z^2 + c_1 z + c_0 then c(z)^2 = c_{m-1} z^{2m-2} + · · · + c_2 z^4 + c_1 z^2 + c_0.
This process can be accelerated considerably by means of a lookup table. This is discussed in section 3.2.3.
Multiplication

Left-to-right comb multiplication in F_{2^m}:
Input: Binary polynomials x(z) and y(z) of degree at most m − 1, processor word size W
Output: c(z) = c[0 · · · 2n − 1] = x(z)y(z)
1: c[0 · · · 2n − 1] ← 0
2: for k ← W − 1 downto 0 do
3:   for j ← 0 to n − 1 do
4:     if bit k of x[j] = 1 then
5:       for l ← 0 to n do
6:         c[l + j] ← c[l + j] ⊕ y[l]
7:       end for
8:     end if
9:   end for
10:  if (k ≠ 0) then
11:    c(z) ← c(z) · z
12:  end if
13: end for
14: return c
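A direct C rendering of the listing above could look as follows; a sketch assuming W = 32 and n = 8, with the inner accumulation loop running over the n words of y:

#include <stdint.h>
#define N 8                                    /* words per operand */

static void shl1(uint32_t c[2 * N])            /* c(z) <- c(z) * z  */
{
    uint32_t carry = 0;
    for (int i = 0; i < 2 * N; i++) {
        uint32_t next = c[i] >> 31;
        c[i] = (c[i] << 1) | carry;
        carry = next;
    }
}

void f2m_mul_comb(uint32_t c[2 * N], const uint32_t x[N], const uint32_t y[N])
{
    for (int i = 0; i < 2 * N; i++) c[i] = 0;
    for (int k = 31; k >= 0; k--) {            /* scan bit k of every word */
        for (int j = 0; j < N; j++) {
            if ((x[j] >> k) & 1)
                for (int l = 0; l < N; l++)    /* add y into c at word offset j */
                    c[l + j] ^= y[l];
        }
        if (k != 0)
            shl1(c);
    }
}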
Reduction
Inversion
The Extended Euclidean Algorithm for polynomials is used to perform inversion and
is shown in algorithm 10. This method is used in an implementation in section 3.4.2.
Coordinate systems
If the standard affine coordinate system is used, then both the point addition and point doubling operations require multiple field multiplications, as well as a field inversion. Recall that the inversion operation is significantly more expensive than the multiply operation. Hence it is often more efficient to represent the point in another coordinate system:
Affine coordinates The affine coordinates (x, y) represent a point on the curve and must satisfy Eq. 2.3.
Y^2 Z + a_1 XYZ + a_3 Y Z^2 = X^3 + a_2 X^2 Z + a_4 X Z^2 + a_6 Z^3
Both the point doubling operation and the point addition operation can be performed in any one of the above-mentioned coordinate systems, which eliminates the need for a field inversion. Furthermore, a mixed-coordinate point addition is also possible, where the two input points use different coordinate systems. By doing so the number of multiplication operations can be reduced even further than is possible when both input points use the same coordinate system.
Table 2.4: The number of operations required to perform a point addition and a point doubling, for the Affine (A), Projective (P), Jacobian (J) and Chudnovsky (C) coordinate systems on prime fields. The field multiplication operation is indicated with M, the field squaring operation is indicated with S and a field inversion is indicated with I.

Coordinate system   Doubling     General Addition   Mixed Addition
Affine              1I, 2M, 2S   1I, 2M, 1S         -
Projective          7M, 3S       12M, 2S            -
Jacobian            4M, 4S       12M, 4S            8M, 3S (J + A)
Chudnovsky          5M, 4S       11M, 3S            11M, 3S (J + C); 8M, 3S (C + A)
Table 2.4 shows the number of field operations required for a point doubling, a general addition, and a mixed-coordinate addition. The results in this table only show the multiply, square and inversion operations because it was assumed that field additions are cheap in comparison with multiplications, squarings and inversions. The operation counts for the different coordinate systems, Affine (A), Projective (P), Jacobian (J) and Chudnovsky (C), are written in terms of the number of field operations (multiplication (M), squaring (S) and inversion (I)) required to perform the point operation. It can be seen that the mixed-coordinate addition of a Jacobian and an Affine point provides the fastest point addition, with 8 field multiplications and 3 field squarings. The fastest point doubling operation is for a point in the Jacobian coordinate system, with 4 field multiplications and 4 field squarings.
A ← Z_1^2, B ← Z_1 · A, C ← X_2 · A, D ← Y_2 · B, E ← C − X_1, F ← D − Y_1, G ← E^2, H ← G · E, I ← X_1 · G, X_3 ← F^2 − (H + 2I), Y_3 ← F · (I − X_3) − Y_1 · H, Z_3 ← Z_1 · E
For the complete algorithmic listing please see [15, p. 89].
Group Laws
The group laws for the prime form are as follows:
Negation If P = (x, y) ∈ E(F_p), then (x, y) + (x, −y) = O. The point (x, −y) is denoted by −P. Thus, P + (−P) = O.

λ = (3x_1^2 + a)/(2y_1).

X_3 ← D^2 − 2B, Y_3 ← D · (B − X_3) − C, Z_3 ← 2Y_1 · Z_1
For the complete algorithmic listing please see [15, p. 91].
In the following text we discuss a number of different basic algorithms to perform
prime field arithmetic.
Field addition
Addition in F_p is done by first performing a multi-precision integer addition on the two input parameters x and y, followed by a reduction modulo p. Algorithm 11 shows the multi-precision integer addition. This method is provided for reference only.
Algorithm 11: Addition in F_p.
Input: Integers x, y ∈ F_p
Output: c = x + y mod p
{Multi-precision integer addition}
1: c ← x + y.
2: if c ≥ p then
3: c ← c − p.
4: end if
5: return c
Algorithm 12: Subtraction in F_p.
Input: Integers x, y ∈ F_p
Output: c = x − y mod p
{Multi-precision integer subtraction}
1: c ← x − y.
2: if c < 0 then
3: c ← c + p.
4: end if
5: return c
Field subtraction
Subtraction in F_p is done by first performing a multi-precision integer subtraction on the two input parameters x and y, followed by a reduction modulo p. Algorithm 12 shows the multi-precision integer subtraction. This method is provided for reference only.
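A C sketch of Algorithm 11 is shown below (illustrative only, and not constant-time); a 64-bit temporary carries the word sums, and p is subtracted at most once:

#include <stdint.h>
#define NW 7                     /* e.g. 7 x 32-bit words for a 224-bit prime */

/* c = x + y mod p: full multi-precision add, then one conditional
 * subtraction of p if the result is >= p or the addition overflowed. */
void fp_add(uint32_t c[NW], const uint32_t x[NW], const uint32_t y[NW], const uint32_t p[NW])
{
    uint64_t carry = 0;
    for (int i = 0; i < NW; i++) {
        uint64_t t = (uint64_t)x[i] + y[i] + carry;
        c[i] = (uint32_t)t;
        carry = t >> 32;
    }
    int ge = (int)carry;                 /* overflow implies c >= p */
    if (!ge) {
        ge = 1;                          /* compare c with p, most significant word first */
        for (int i = NW - 1; i >= 0; i--) {
            if (c[i] != p[i]) { ge = (c[i] > p[i]); break; }
        }
    }
    if (ge) {
        uint64_t borrow = 0;
        for (int i = 0; i < NW; i++) {
            uint64_t t = (uint64_t)c[i] - p[i] - borrow;
            c[i] = (uint32_t)t;
            borrow = (t >> 32) & 1;
        }
    }
}

The subtraction of Algorithm 12 is the mirror image: subtract with borrow, and add p back once if the result went negative.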
Field multiplication
Field multiplication is accomplished by first performing a multiplication with the
two operands x and y, followed by a reduction modulo p. A number of different methods for performing the multiplication exist, including the operand-scanning form, product-scanning or Comba [16], the Hybrid method [17], operand caching [18] and the Karatsuba-Ofman [19] method. Each of these methods has its advantages and disadvantages, and the appropriate method needs to be selected based on the underlying architecture.
The operand-scanning form is the most basic way to perform multiplication and requires n^2 word multiplications. Both the product-scanning form and the Hybrid method reduce the number of memory accesses compared to the operand-scanning form. The operand-caching method reduces the number of memory load instructions by caching the operands in a novel way. The Karatsuba-Ofman method reduces the number of multiplications at the cost of a more complex way of adding the multiplication results.
Algorithm 13 shows the operand-scanning form of a field multiplication in F_p. (U V) denotes a double word of 2W bits. This is a way to carry out the multiplication without explicitly handling
carry bits, but the underlying platform needs to provide a two-word register to make efficient use of it.
All the methods listed here, except for the Hybrid method and the Operand-
scanning form are provided for reference purposes. The Hybrid method and operand-
scanning forms are used for simulation in section 3.1.1.
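The following C sketch shows the operand-scanning loop, with a uint64_t product playing the role of the (U V) double word; an illustration, not the thesis implementation:

#include <stdint.h>
#define NW 7

/* Operand-scanning multiplication: c gets 2*NW words; the reduction
 * modulo p is performed separately afterwards. */
void fp_mul_operand_scanning(uint32_t c[2 * NW], const uint32_t x[NW], const uint32_t y[NW])
{
    for (int i = 0; i < 2 * NW; i++) c[i] = 0;
    for (int i = 0; i < NW; i++) {
        uint32_t u = 0;                            /* running carry word U */
        for (int j = 0; j < NW; j++) {
            uint64_t uv = (uint64_t)x[i] * y[j] + c[i + j] + u;
            c[i + j] = (uint32_t)uv;               /* low word V           */
            u = (uint32_t)(uv >> 32);              /* high word U          */
        }
        c[i + NW] = u;
    }
}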
Square
Squaring in F_p can be accomplished by first performing the square operation, followed by a reduction modulo p. This operation is essentially a field multiplication with x = y. Because the two operands are equal, the number of word multiplications is reduced to n · (n + 1)/2. This method is provided for reference only.
Reduction
If the prime is not of a special form then Barrett reduction [20] and Montgomery reduction [21] can be employed to perform the reduction operation. Barrett reduction works by using an estimation technique to compute an estimate of the quotient ⌊z/p⌋, followed by a subtraction of the quotient multiplied by p. Thereafter p is subtracted from the remainder until the remainder is smaller than p.
Pseudo-Mersenne primes have a special form which make them very suitable for
fast modular reduction techniques which are not available to general primes. The
set of NIST prime curves all have this property so that they allow for a very fast
reduction operation. For a thorough description of this method please see [15]. This
method is provided for reference only.
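To illustrate the folding idea behind special-form reduction, here is a toy C example for the Mersenne prime p = 2^61 − 1, which is not one of the NIST primes: since 2^61 ≡ 1 (mod p), the high bits can simply be added back onto the low bits.

#include <stdint.h>

static const uint64_t P61 = ((uint64_t)1 << 61) - 1;

/* Toy fast reduction for p = 2^61 - 1:
 * x = hi*2^61 + lo = hi + lo (mod p), so folding the high bits down
 * twice, plus one conditional subtraction, reduces any 64-bit input. */
uint64_t mod_p61(uint64_t x)
{
    x = (x & P61) + (x >> 61);   /* first fold: result <= p + 7  */
    x = (x & P61) + (x >> 61);   /* second fold: result <= p     */
    if (x >= P61) x -= P61;      /* handle x == p                */
    return x;
}

The NIST pseudo-Mersenne primes admit the same style of reduction, with a handful of word-aligned shifts and additions instead of a single fold.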
Inversion
The extended Euclidean method can be used to perform inversion but the drawback
with it is that it contains a large number of expensive divisions. The extended binary
Euclidean (Algorithm 14) method can be used to replace the expensive division
operations with cheaper shift and subtraction operations. This method is provided
for reference only.
2.5 Low-Power Processors and Microcontrollers

The power consumption of a CMOS processor can be approximated as

P = α C V^2 f + V I_leak,

where α is the fraction of gates actively switching, C is the total capacitive load of all gates, V is the supply voltage, f is the clock frequency, and I_leak is the leakage current. The first term is the dynamic power lost from charging and discharging the processor's capacitive loads. The second term is the static power lost due to
leakage current. A term which is not included in this equation is the relatively small
amount of leakage due to a short-circuit which momentarily occurs every time a gate
switches. Many modern microcontrollers' data sheets give the static power as a constant value and the dynamic power consumption in W/MHz. Since there
is sometimes a difference between the manufacturer’s stated power consumption and
the actual power consumption, a literature study was made to find actual energy
consumption measurements of the different microcontrollers. The dynamic power
increases linearly with frequency, whereas the static power consumption remains
constant regardless of the frequency. Therefore, a processor runs more efficiently
at a higher frequency because the static power consumption contributes a smaller
percentage of the overall power consumption.
ARM Microcontrollers
There are generally two instruction sets that ARM microcontrollers can use: (1)
ARM code with 32-bit instructions and (2) Thumb code with 16-bit instructions.
In ARM mode all instructions are 32 bits long, and all 16 registers are addressable
from the instruction set. Many instructions support conditional execution. Many
of the data-processing instructions can apply a shift to one of their operands at no extra cost in cycles; e.g., the instruction add r0, r1, r1, lsl #3 is equivalent to r0 = r1 + (r1 << 3). Most data processing instructions use one destination
register and two operands.
In Thumb mode all instructions are 16 bits long, and most of the instructions
can only address the 8 low registers (r0 to r7). In this mode the only conditional
instruction is the branch instruction, no shift operation can be embedded with a
data-processing instruction, and most data instructions take only two operands,
where the destination register is also one of the two sources.
The benefits of working in Thumb mode are that the code size is much smaller and that it costs less energy to fetch an instruction, because moving data from memory to the processor costs energy. The drawback is that Thumb code executes slower than ARM code: Thumb has a limited instruction set, with access to fewer registers and no embedding of shifts into the data-processing instructions.
ATMega128L
The Atmel AVR ATMega128L [22] is found on a number of different Wireless Sensor
Nodes like the MICA2, MICAz and MICA2DOT. It has a RISC processor with a
modified Harvard architecture, a maximum clock speed of 8MHz, a word size of 8
bits, 4 kB SRAM, 128 kB Flash, and 32 general purpose registers. It can perform
1 MIPS per MHz.
A power analysis done by [23] showed that this microcontroller consumes 30 mW when the processor is clocked at 7.37 MHz, and 12 mW when the processor is clocked at 4 MHz. The energy per cycle at 4 MHz is therefore 12 mW/4 MHz = 3 nWs, and the energy per cycle at 7.37 MHz is 30 mW/7.37 MHz = 4.07 nWs. Note that for this microcontroller the measured energy per cycle is higher at the higher clock frequency.
MSP430F1611
The Texas Instruments MSP430F1611 [24] is found in a number of Wireless Sensor
Nodes like the TelosB[25], and the Tmote Sky[26]. It has a RISC processor that uses
a Von Neumann architecture with a maximum clock speed of 8 MHz, a word size of 16 bits, 10 kB SRAM, 48 kB Flash, and 12 general purpose registers. It can perform
1 MIPS per MHz.
A power analysis done by [23] showed that this microcontroller has an average
power consumption of 12 mW when clocked at 8 MHz. The energy per cycle when
the processor is clocked at 8 MHz is therefore 12 mW/8 MHz = 1.5 nWs.
ARM7TDMI
The ARM7TDMI [27] is not found on any WSNs that we are aware of. It has a RISC
processor with a Von Neumann architecture, a three-stage pipeline, a maximum
clock speed of 133 MHz, a word size of 32 bits, 32 kB SRAM, 512 kB Flash, supports
ARM and Thumb instructions, and has support for a long multiply instruction which
produces a result of 64-bits. It has 31 general-purpose 32-bit registers, and 6 status
registers. It can perform 60 MIPS at 59.8 MHz.
ARM claims that the typical power consumption in 0.18 µm technology is 0.25 mW/MHz at 1.8 V, and in 0.13 µm technology is 60 µW/MHz at 1.2 V [28]. The energy per cycle in 0.13 µm is therefore 60 µW/1 MHz = 0.06 nWs.
To make a fair energy consumption comparison with the two other ARM chips, we scale the energy values so that they are representative of a 90 nm technology size. Let us assume that the ARM microcontroller runs at 1 V on 90 nm technology. We use Eq. 2.11 to scale the power consumption of the 0.13 µm technology to 90 nm. This leads to a power consumption of 0.029 mW/MHz, and an energy per cycle of 0.029 nWs. This estimated value is given for reference only and is not used in the actual energy estimations in Chapter 4.
PXA271
The PXA271 microcontroller uses the ARMv5TE [29] architecture, and is found on
the Imote2 [30] WSN. It has a RISC processor, with a word size of 32 bits, 32 kB
SRAM, 512 kB Flash, supports ARM and Thumb instructions, and has support for
a long multiply instruction which produces a result of 64-bits. The PXA271 uses
dynamic voltage scaling and frequency scaling in order to provide different modes of
energy efficient operation. It has a Harvard architecture, and supports clock speeds
between 13MHz and 540MHz.
Some models also have a Wireless MMX coprocessor which supports 43 SIMD
instructions and also provides 16 extra 64-bit registers.
The Imote2 data sheet [30] states that at a clock frequency of 13 MHz and an
input voltage of 0.9 V, the current consumption is 33 mA. The power consumption
is therefore 29.7 mW, which leads to an energy per cycle of 29.7 mW/13 MHz =
2.285 nWs.
ARM Cortex-M3
The ARM Cortex-M3 [31] has a 32-bit word size with 13 general purpose registers. It has a RISC processor with a Harvard architecture, a 3-stage pipeline, a maximum clock speed of 150 MHz, up to 200 kB SRAM, up to 1 MB Flash, supports only Thumb instructions, and has support for a long multiply instruction which produces a 64-bit result.
It can perform 1.25 MIPS/MHz and has a dynamic power consumption of 149 µW/MHz in 0.18 µm, 32 µW/MHz in 90 nm and 7 µW/MHz in 40 nm technology. Therefore, in 90 nm technology it has an energy consumption per cycle of 32 µW/1 MHz = 0.032 nWs.
ARM Cortex-M0+
The ARM Cortex-M0+ [6] has a 32-bit word size that uses the ARMv6-M architecture
(which is a subset of the ARMv7-M with fewer instructions), 13 general purpose
registers, and always runs in Thumb mode. It has a RISC processor with a Von
Neumann architecture with a two-stage pipeline in which the fetch, decode and
execute operations are split over two cycles. It has a dynamic power consumption of
11 µW/MHz and a static power requirement of 0.95 µW in 90 nm technology. It has
16 kB of SRAM, and the amount of flash memory is manufacturer dependent. It
has a short 32-bit hardware multiply instruction which executes in a single cycle,
but only outputs the low 32 bits. It is capable of performing 1.08 DMIPS/MHz.
Freescale is using the Cortex-M0+ in a wireless sensor node called the MKW01Z128
[7].
The two-stage pipeline is illustrated in Fig. 2.3. By splitting the decode stage
over two cycles, the processor is able to complete an instruction in two cycles.
Section 4.2 describes a measurement of the power consumption of the ARM
Cortex-M0+. The microcontroller consumes 600 µW at a clock speed of 48 MHz.
The energy consumption per cycle is therefore 600µW/48MHz = 12.5pWs.
Figure 2.3: The two-stage pipeline inside the ARM Cortex-M0+. The decode stage
is split over two cycles, which allows the processor to complete an instruction in two
cycles. This image is based on an image found in [1].
Discussion
A comparison of the energy consumption per cycle is shown in Fig. 2.4. It needs
to be mentioned that some of these microcontrollers have different technology sizes. The technology size of the three microcontrollers found on the WSNs (ATMega128L, MSP430, PXA271) is unknown, and therefore their energy consumption was determined from measurements. The energy consumption values for the ARM Cortex-M3 and the ARM Cortex-M0+ were taken from their data sheets. The ARM Cortex-M3 and the ARM Cortex-M0+ both use a technology size of 90 nm, whereas the ARM7TDMI uses a technology size of 0.13 µm.
The ATMega128L is the least efficient microcontroller with 4.07 nWs per cycle. The MSP430 is the second least efficient with 1.5 nWs per cycle, with the PXA271 following closely at 1.25 nWs per cycle. The ARM7TDMI is the third most efficient microcontroller with 0.06 nWs per cycle, and the ARM Cortex-M3 is the second most efficient with 0.032 nWs per cycle. The ARM Cortex-M0+ is the most efficient per cycle at 0.0125 nWs, roughly a factor 4.8 better than the ARM7TDMI and a factor 2.6 better than the ARM Cortex-M3.
As the technology size decreases, the dynamic power consumption decreases due to a lower gate capacitance together with a lowered input voltage. However, a decrease in technology size comes with an increase in leakage current [32]. Generally speaking, the microcontrollers with the smallest technology size have the lowest power consumption.
2.6 Conclusion
We introduced finite fields and described how they are used in elliptic curves. Elliptic curves were introduced over binary fields as well as prime fields, together with some mathematical background on field arithmetic and elliptic curve arithmetic, and the basic workings of the point multiplication operation. Key strengths of different key lengths were compared to those of RSA and symmetric key cryptography. Elliptic Curve Cryptography was introduced together with two protocols which use it. The target microcontroller was described, together with the microcontrollers used in implementations found in the literature and in software libraries. Finally, the choice of ECC was motivated.
[Figure 2.4: Energy per cycle (nWs) per platform: ATMega128L (MICA2 & MICAz) 4.07, MSP430 (TelosB) 1.5, PXA271 (Imote2) 1.25, ARM7TDMI (0.13 µm) 0.06, ARM Cortex-M3 (90 nm) 0.032, ARM Cortex-M0+ (90 nm) 0.0125.]
Chapter 3
Implementation
Point multiplication is the core operation of any ECC protocol. Two major choices of underlying curve type exist, binary or prime, and both have a major impact on the operations carried out in the ECC protocol. As such, we will first compare binary curves and prime curves with each other in order to determine which curve will lead to the faster implementation of a point multiplication. Next, we will provide the algorithms which were actually used inside the implementation and (in some cases) compare their performance with each other on the target platform. Subsequently, the algorithms are analysed to determine the optimal algorithms, and the optimal values for certain parameters. Finally, we will present the initial implementation and discuss the optimizations which were done in order to increase the performance.
This leads to a reduced number of point additions and doublings. The bit-length
for the underlying field is 224 because this provides 112-bit security.
Table 3.1: Prime field simulation. ∗ECE is an abbreviation for efficiently computable endomorphism.

Hybrid MUL
Operation     Amount   Note
Reads         112      2n + 2n^2
Writes        49       n^2
32×32 MULs    49       n^2
Total         1988

Squaring
Operation     Amount   Note
SQR Cost      1670     85% of Hybrid MUL

Totals
Operation     Amount   Note
NAF           4.86 M   (m/3)A + mD
ECE∗          3.25 M   NAF × 77%
shift operations. Each shift operation requires 2n shift and n add instructions. We estimate that each 233-bit field multiplication requires 1208 read, 752 write, 745 XOR and 315 shift instructions, for a total of 11350 cycles per field multiplication. From the results in [33] it was observed that the number of cycles is reduced by a factor of 2.3 if the López-Dahab (LD) method with a window of size 4 is used. Therefore, with a factor 2.3 reduction in cost, it was estimated that the LD, w = 4 method will require 4935 cycles for a field multiplication.
Using the TNAF approach we estimate that a point multiplication will require
approximately 3.07M cycles. Table 3.2 shows the estimates for the field multiplication
Table 3.2: Binary field simulation.

Left-to-Right Comb
Operation                 Amount   Note
Reads                     2967     (n^2/2)W + (n + 1)(W − 1)
Writes                    1312     (n^2/2)W + (n + 1)W
n-word additions          2048     (n^2/2)W
(n + 1)-word shifts       744      W − 1
Left-to-Right Comb Total  11350
LD, w = 4                 4935     (Left-to-Right Comb Total)/2.3

Totals
Operation   Amount   Note
TNAF        3.07 M   (m/3)A
3.1.3 Conclusion
In order to determine which curve will lead to the fastest execution time, efficient algorithms were selected to estimate the cost of performing a point multiplication on curves with different underlying fields. Even though we tried to make the estimate as accurate as possible, it is very difficult to make an accurate estimate without actually implementing the algorithms. The estimates show that both binary and prime fields will lead to similar execution times, with binary fields requiring 3.07 M cycles and prime fields requiring 3.25 M cycles for a point multiplication. This is consistent with results found in the literature [15, 34, 35], where the execution times for a point multiplication on either a prime or a binary field are similar.
A disadvantage of using prime fields is that a large number of multiply instructions is required, because the target effectively offers only a 16-bit multiplication (its hardware multiplier returns just the low 32 bits of a product). This has the further disadvantage of requiring a large number of carry handling operations.
Prime fields require a large number of multiply and addition instructions whereas binary fields require a large number of shift and XOR instructions. If we look at the energy measurement results in Chapter 4 we can see that both the shift and XOR instructions require approximately 8% less energy than the multiply and addition instructions. Therefore an implementation which is based on a binary field will require less energy than an implementation based on a prime field with the same
cycle count, because the instructions required to perform the arithmetic on the binary field require less energy.
We chose to use the standard SEC [36] binary Koblitz curve over F_{2^233} called sect233k1. This curve allows all doublings to be replaced by squarings through the use of the Frobenius map. Polynomial squaring is very cheap on binary curves, and a fast reduction method is available because the reduction polynomial is a trinomial. The irreducible polynomial is

f(z) = z^233 + z^74 + 1.

The curve

E : y^2 + xy = x^3 + ax^2 + b

over F_{2^m} is defined by

a = 0, b = 1.

The base point in uncompressed form is:

G = 04 017232BA 853A7E73 1AF129F2 2FF41495 63A419C2 6BF50A4C
9D6EEFAD 612601DB 537DECE8 19B7F70F 555A67C4 27A8CD9B
F18AEB9B 56E0C110 56FAE6A3
3.2 Actual algorithms used
Splitting the input parameters involves a trade-off between the overhead of the splitting and recombination steps and the reduction of memory accesses inside the multiplication routine. By increasing the number of splits, the overhead (and number of cycles) from the splitting and recombining steps increases, whereas the number of memory accesses inside the multiplication operation decreases. One needs to find the sweet spot which leads to the best results for the target architecture.
Processing the input parameters as a whole has the disadvantage of requiring
a larger amount of intermediate values which leads to a larger number of memory
accesses inside the multiplication routine. The advantage of this method is that there
is no overhead for splitting and recombination as the whole input block is processed
in a single function.
In the remainder of this section we describe the different approaches to performing field multiplications in more detail. First the basis of the divide-and-conquer method is introduced, followed by two potential algorithmic enhancements: the operand-scanning method and the Karatsuba-Ofman method. Thereafter the effect of combining the divide-and-conquer approach with a windowed multiplication method is discussed. Subsequently, three different methods are discussed to process the input parameters as a whole.
Divide-and-conquer
The divide-and-conquer method is based on the principle of splitting the input
parameters, performing the multiplications on the smaller input parameters, and
then combining the results of the multiplications. Suppose that n = 2l, x = x_1 2^l + x_0 and y = y_1 2^l + y_0. Then

x · y = x_1 y_1 2^{2l} + (x_1 y_0 + x_0 y_1) 2^l + x_0 y_0.
Operand-Scanning
The order in which the multiplication operations are performed can have an impact
on the overall performance of the multiplication. Fig. 3.1 shows three different
methods for performing field multiplications when the input parameters are split into
three equal parts. The first method is the classic row-wise multiplication method
in which the summands are sorted from left-to-right. The second method is the
operand-scanning method in which the summands are sorted in a manner where the algorithm moves through the multiplier x_i, starting with x_0, and multiplies x_i with every multiplicand y_j. The third method is a type of optimized operand-scanning method which follows the same approach, moving through the multiplier x_i, starting with x_0, and multiplying x_i with every multiplicand y_j; however, instead of performing the multiplication
operations each inside its own function, the multiplication operations for all terms
with the same x_i are interleaved inside a single function. In other words, if a split of 3 is performed on the input parameters, then the three terms (x_2·y_2 + x_2·y_1 2^l + x_2·y_0 2^{2l}), (x_1·y_0 2^l + x_1·y_1 2^{2l} + x_1·y_2 2^{3l}), and (x_0·y_2 2^{2l} + x_0·y_1 2^{3l} + x_0·y_0 2^{4l}) are each multiplied
inside a single function. In Fig. 3.1 the three multiplication functions are indicated
with P1, P2, and P3. Inside each function the multiplication of xi with yj are
interleaved in a manner in which the accesses to xi are minimized. This has the
advantage of minimizing the number of accesses to read xi at the cost of an increase
in code size because the multiplication function is larger.
If combined with a windowed multiplication algorithm, the total number of loops inside the multiplication routine is reduced at the cost of storing all the precomputation tables simultaneously in memory. All three precomputation tables need to be available at the same time because the multiplicand now has three parts: y_0, y_1, and y_2.
Figure 3.1: The input parameters are split into three equal parts and the following
multiplication methods are used: a) Schoolbook multiplication b) Operand-scanning
c) Optimized operand-scanning.
Karatsuba-Ofman
The Karatsuba-Ofman method reduces the complexity of the operand-scanning form from O(n^2) down to O(n^{log_2 3}). Assume we have the input parameters x and y which are both n bits long, split as x = x_1 2^{n/2} + x_0 and y = y_1 2^{n/2} + y_0. Then

x · y = x_1 y_1 2^n + ((x_1 + x_0)(y_1 + y_0) − x_1 y_1 − x_0 y_0) 2^{n/2} + x_0 y_0,

which requires three half-size multiplications instead of four.
[Figure 3.2: Splitting the 233-bit input parameters a) symmetrically into two parts of equal length, b) asymmetrically on the 32-bit word size of the processor.]
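A one-level C sketch of the Karatsuba-Ofman idea for 64-bit binary polynomials is shown below; in F_2 the additions in the combination step become XORs, and the bit-serial clmul32 stands in for whatever base-case multiplier is actually used:

#include <stdint.h>

/* Bit-serial 32x32 carry-less multiply (schoolbook base case). */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= (uint64_t)a << i;
    return r;
}

/* One Karatsuba level for 64-bit binary polynomials: three 32x32
 * multiplications instead of four, with XOR-based recombination. */
void kmul64(uint32_t c[4], uint64_t x, uint64_t y)
{
    uint32_t x0 = (uint32_t)x, x1 = (uint32_t)(x >> 32);
    uint32_t y0 = (uint32_t)y, y1 = (uint32_t)(y >> 32);
    uint64_t lo  = clmul32(x0, y0);
    uint64_t hi  = clmul32(x1, y1);
    uint64_t mid = clmul32(x0 ^ x1, y0 ^ y1) ^ lo ^ hi;
    c[0] = (uint32_t)lo;
    c[1] = (uint32_t)(lo >> 32) ^ (uint32_t)mid;
    c[2] = (uint32_t)hi ^ (uint32_t)(mid >> 32);
    c[3] = (uint32_t)(hi >> 32);
}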
The storage requirements for the lookup table of the windowed field multiplication
methods is given in Eq. 3.6. While computing the lookup table the input polynomial
is shifted left by w −1 bits. If the entire most significant word of the input polynomial
is used (i.e. it contains a number which is large enough to occupy the entire word),
a left shift of the most significant polynomial will cause an overflow into the next
word, requiring an extra word to store the result of the left shift operation. The
most significant input polynomial will always be completely populated with bits
and will always require a larger lookup table for an asymmetrical split. Fig. 3.3
illustrates this process. The possibility exists that the least significant polynomial does not require a large lookup table, because its most significant word contains a number which occupies at most W − (w − 1) bits. A large lookup table requires more computations to generate the lookup table, more memory, and additional table lookups while performing the actual multiplication. This method is used in a simulation in section 3.3, and in an implementation in section 3.4.3.
[Figure 3.3: An asymmetric split leaves the most significant polynomial completely populated with bits; the left shift by w − 1 during table generation therefore overflows into an extra word.]
For an asymmetrical split, the total size of the lookup tables is given by
  2^w (n/d) + (d − 1) 2^w ((n/d) + 1),   if (W − m mod W) ≥ w − 1
  2^w (n + 1),                           if (W − m mod W) < w − 1     (3.2)
where d is the number of splits, w is the window size and n is the number of words
in the input parameters. The term (W − m mod W ) indicates the number of bits
in the most significant word which are not currently being used. Here we can see
that the amount of memory required to store the lookup tables also increase as the
parameter d increases.
For an asymmetrical split, the total number of accesses to the lookup table is
given by
  ⌈W/w⌉ (n^2 + n(d − 1)),   if (W − m mod W) ≥ w − 1
  ⌈W/w⌉ (n^2 + nd),         if (W − m mod W) < w − 1     (3.3)
where d is the number of splits, w is the window size and n is the number of words in
the input parameters. Here we can see that the amount of memory accesses increase
with the parameter d.
40
Actual algorithms used
For a symmetrical split, the total size of the lookup tables is given by
  2^w n,         if ((W − m mod W)/d) ≥ w − 1
  2^w (n + 1),   if ((W − m mod W)/d) < w − 1     (3.4)
where d is the number of splits, w is the window size and n is the number of words
in the input parameters.
For a symmetrical split, the total number of accesses to the lookup table is given by

  ⌈W/w⌉ n^2,         if ((W − m mod W)/d) ≥ w − 1
  ⌈W/w⌉ (n + 1)^2,   if ((W − m mod W)/d) < w − 1     (3.5)
where d is the number of splits, w is the window size and n is the number of words
in the input parameters.
For the remainder of this text, assume that the conditions for a small lookup
table in Eq. 3.2, 3.3, 3.4, and 3.5 are satisfied. This is a valid assumption given that
m = 233, W = 32, and d ≤ 2.
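As a quick check of these formulas under the stated assumptions: m = 233 and W = 32 give n = 8 and W − m mod W = 32 − 9 = 23, so for d = 2 and w = 4 both small-table conditions hold (23 ≥ 3 and 23/2 ≥ 3). Eq. 3.2 then gives 2^4 · 4 + 1 · 2^4 · 5 = 144 words for the asymmetrical lookup tables, against 2^4 · 8 = 128 words for the symmetrical table (Eq. 3.4), and Eq. 3.3 gives ⌈32/4⌉ · (8^2 + 8) = 576 table accesses for the asymmetrical split, against ⌈32/4⌉ · 8^2 = 512 for the symmetrical one (Eq. 3.5).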
A split of more than two was considered but was found to be impractical due to
the following reasons:
• The symmetrical split method has the benefit of reducing the number of
memory operations in the multiplication, at the expense of having to split
the input parameters, and having to shift the results of the multiplications
before recombination. Each additional split leads to an increasingly larger
number of shift operations, due to the increased number of splits inside the input polynomial as well as an increased number of shift operations on the multiplication results. For a split of more than two, the benefits of reducing the number of memory accesses inside the multiplication are outweighed by the disadvantages of having to split the input parameters and recombine the multiplication results.
• The asymmetrical split method has the benefit of reducing the number of
memory operations in the multiplication, at the expense of requiring a larger
lookup table for the most significant input parameters. Each additional split
leads to an increased number of large lookup tables (Eq. 3.2), more computations
to generate the lookup tables, as well as significantly more accesses to the
larger lookup tables (Eq. 3.3) during the multiplication. When performing a
first-level split by a factor of two, the number of intermediate values inside the multiplication routine is already reduced to an amount which can be kept completely inside the target platform's registers. Because each extra split leads to significant extra overhead from the larger lookup tables, it is less advantageous to perform a split by a factor of more than two.
The López-Dahab method speeds up the comb multiplication by scanning x with w bits at a time and performing a table lookup, thereby reducing the number of outer loop iterations to only ⌈W/w⌉. The lookup table is computed as T(u) ← u(z)y(z) for all polynomials u(z) of degree lower than w. The number of words required to store the lookup table is given by:
  2^w (n + 1),   if degree(y) > nW − (w − 1)
  2^w n,         if degree(y) ≤ nW − (w − 1)     (3.6)
If the most significant word of y occupies at most W − (w − 1) bits then the lookup table fits into 2^w n words. This is because, while generating the lookup table, y gets shifted by up to w − 1 bits, which can cause the polynomial to overflow into the next word.
Algorithm 15 shows the López-Dahab field multiplication algorithm. For F_{2^m} the two inputs x and y are both n words long and the result is 2n words long. This method is used in a simulation in section 3.3, and in an implementation in section 3.4.2.
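A C sketch of this windowed method for w = 4 follows (illustrative; the table layout and loop structure are simplified relative to the optimized register-allocated versions discussed later):

#include <stdint.h>
#define N   8                     /* words per operand (F_2^233) */
#define WSZ 4                     /* window size w               */

/* T[u] holds u(z)*y(z) for all 16 polynomials u of degree < 4; since
 * y may be shifted by up to w-1 = 3 bits, each entry gets N+1 words. */
void f2m_mul_ld(uint32_t c[2 * N], const uint32_t x[N], const uint32_t y[N])
{
    uint32_t T[16][N + 1];
    for (int u = 0; u < 16; u++) {
        for (int i = 0; i <= N; i++) T[u][i] = 0;
        for (int b = 0; b < WSZ; b++)
            if ((u >> b) & 1) {                /* T[u] ^= y << b */
                uint32_t carry = 0;
                for (int i = 0; i < N; i++) {
                    uint32_t w = (y[i] << b) | carry;
                    carry = b ? (y[i] >> (32 - b)) : 0;
                    T[u][i] ^= w;
                }
                T[u][N] ^= carry;
            }
    }
    for (int i = 0; i < 2 * N; i++) c[i] = 0;
    for (int k = (32 / WSZ) - 1; k >= 0; k--) {   /* 8 outer iterations */
        for (int j = 0; j < N; j++) {
            uint32_t u = (x[j] >> (WSZ * k)) & 0xF;
            for (int l = 0; l <= N; l++)
                c[l + j] ^= T[u][l];
        }
        if (k != 0) {                              /* c(z) <- c(z) * z^4 */
            uint32_t carry = 0;
            for (int i = 0; i < 2 * N; i++) {
                uint32_t next = c[i] >> 28;
                c[i] = (c[i] << WSZ) | carry;
                carry = next;
            }
        }
    }
}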
Algorithm 18: López-Dahab multiplication with fixed registers in F_{2^m} for n = 8 and s = 8.
Input: x(z) = x[0 · · · n − 1], y(z) = y[0 · · · n − 1]
Output: c(z) = c[0 · · · 2n − 1] = x(z)y(z)
Note: v denotes a vector of n − 1 memory addresses and n + 1 registers
(c[0], c[1], c[2], r_0, r_1, · · · , r_n, c[12], c[13], c[14], c[15])
1: Compute T1(u) ← u(z)y(z) for all polynomials u(z) of degree lower than w
2: Compute T2(u) ← u(z)(y(z) · z^w) for all polynomials u(z) of degree lower than w
3: c[0 · · · 2n − 1] ← 0
4: for j ← ⌈W/s⌉ − 1 downto 0 do
5:   for k ← 0 to n − 1 do
6:     u ← (x[k] ≫ (s · j)) & 0xF
7:     for l ← 0 to n − 1 do
8:       v[l + k] ← v[l + k] ⊕ T1[u][l]
9:     end for
10:    u ← (x[k] ≫ (s · j + w)) & 0xF
11:    for l ← 0 to n − 1 do
12:      v[l + k] ← v[l + k] ⊕ T2[u][l]
13:    end for
14:  end for
15:  if (j ≠ 0) then
16:    c(z) ← c(z) · z^s
17:  end if
18: end for
19: return c
Compared with the standard method, this variant requires more table lookup operations as well as the storage of two lookup tables in memory. If we assume that n = 8 (e.g. F_{2^233}) then we can store the n + 1 most frequently used words inside registers while the remaining n − 1 words are stored inside memory. By analysing the standard López-Dahab algorithm for w = 4 it can be determined that c[3 · · · 12] are the most frequently used n + 1 elements; these are therefore stored inside the registers. c[0 · · · 2] and c[13 · · · 15] are the least frequently used n − 1 elements inside c and are therefore stored inside memory. This method is used in a simulation in section 3.3. The algorithm is listed as algorithm 18.
3.2.3 Squaring
Squaring is done by using a lookup table which contains the precomputed squares of
all 8-bit inputs as 16-bit outputs. The lookup table requires 256 16-bit entries, which
requires a total of 4 kB of memory. Algorithm 20 shows this method. This method is
used in an implementation in section 3.4.2.
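As an illustration, here is a short C sketch of this table-based squaring (a sketch only; in the actual implementation the table is a precomputed global constant, cf. version 1.01 in section 3.4.3):

#include <stdint.h>

#define N 8   /* words per field element */

/* T[d] spreads the 8 bits of d to even bit positions:
 * (d7..d0) -> (0 d7 0 d6 ... 0 d0), i.e. T[d] = d(z)^2 for byte d. */
static uint16_t T[256];

static void sqr_table_init(void) {
    for (int d = 0; d < 256; d++) {
        uint16_t t = 0;
        for (int b = 0; b < 8; b++)
            if (d & (1 << b)) t |= (uint16_t)1 << (2 * b);
        T[d] = t;
    }
}

/* c(z) = a(z)^2 before reduction: squaring in F_2[z] only interleaves
 * a zero bit between consecutive bits of a, one table lookup per byte. */
void poly_sqr(uint32_t c[2 * N], const uint32_t a[N]) {
    for (int i = 0; i < N; i++) {
        uint32_t v = a[i];
        c[2 * i]     = (uint32_t)T[v & 0xFF]
                     | ((uint32_t)T[(v >> 8) & 0xFF] << 16);
        c[2 * i + 1] = (uint32_t)T[(v >> 16) & 0xFF]
                     | ((uint32_t)T[v >> 24] << 16);
    }
}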
It can be observed that the squaring operation is always followed by modular
reduction, which leads to redundant memory operations. These redundant memory
operations can be reduced by interleaving the modular reduction and squaring
functions. The lower half of the output of the squaring operation is kept inside the
registers, while the upper half is expanded and then immediately reduced. The upper
half of the elements therefore does not need to be stored first and reduced later.
This saves at least 8 write and 8 read operations from memory. This method is
shown in algorithm 21. This method is used in a simulation in section 3.3, and in an
implementation in section 3.4.2.
3.2.4 Inversion
The Extended Euclidean Algorithm for polynomials is used to perform inversion and
is shown in algorithm 10. This method is used in an implementation in section 3.4.2.
One optimization to this algorithm is to avoid the swap operation (algorithm 22)
by means of code duplication, thereby eliminating a large number of expensive
memory operations. A further optimization, not indicated in algorithm 22, is to keep
indices to the most significant nonzero words of u and v. These indices are then used
to quickly calculate the degree of the polynomials and to reduce the number of
operations inside the variable field shift function. This method is used in an
implementation in section 3.4.3.
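A small C sketch of the index-based degree calculation (illustrative names; it assumes the index top is maintained to point at the most significant nonzero word, and it avoids a CLZ instruction since ARMv6-M has none):

#include <stdint.h>

/* Degree of the polynomial in u[0..n-1], given an index 'top' that
 * points at its most significant nonzero word: only the bits of that
 * single word need to be scanned instead of all n words. */
static int poly_degree(const uint32_t *u, int top) {
    uint32_t w = u[top];
    int bit = 31;
    while (bit > 0 && !(w & (1u << bit)))  /* linear scan; no CLZ on ARMv6-M */
        bit--;
    return 32 * top + bit;
}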
Algorithm 21: Polynomial squaring with reduction (with word length W = 32).
Input: A polynomial a(z) of degree at most m − 1
Output: c(z) = a(z)^2
Note: A[i] ← (Bi,3, Bi,2, Bi,1, Bi,0) where Bi,∗ is a byte
R(r7, r8, r10, r9, r11) = {r7 = r7 ⊕ (r11 ≪ 23), r8 = r8 ⊕ (r11 ≫ 9), r10 = r10 ⊕ (r11 ≪ 1), r9 = r9 ⊕ (r11 ≫ 31)}
1: Precomputation. For each byte d = (d7, · · · , d1, d0), compute the 16-bit quantity T(d) = (0, d7, 0, d6, · · · , 0, d1, 0, d0)
2: r2 ← (T(B1,1), T(B1,0)) {C[2]}
3: r3 ← (T(B1,3), T(B1,2)) {C[3]}
4: r4 ← (T(B2,1), T(B2,0)) {C[4]}
5: r5 ← (T(B2,3), T(B2,2)) {C[5]}
6: r6 ← (T(B3,1), T(B3,0)) {C[6]}
7: r7 ← (T(B3,3), T(B3,2)) {C[7]}
8: r8 ← (T(B4,1), T(B4,0)) {C[8]}
9: r10 ← (T(B5,1), T(B5,0)) {C[10]}
10: r9 ← (T(B5,3), T(B5,2)) {C[11]}
11: r11 ← (T(B7,3), T(B7,2)) {C[15]}
12: R(r7, r8, r10, r9, r11) {Reduce r11/C[15], r11 free}
13: R(r3, r4, r6, r7, r9) {Reduce r9/C[11], r9 free}
14: r9 ← (T(B4,3), T(B4,2)) {C[9]}
15: r11 ← (T(B7,1), T(B7,0)) {C[14]}
16: R(r6, r7, r9, r10, r11) {Reduce r11/C[14], r11 free}
17: R(r2, r3, r5, r6, r10) {Reduce r10/C[10], r10 free}
18: r10 ← (T(B6,3), T(B6,2)) {C[13]}
19: R(r5, r6, r8, r9, r10) {Reduce r10/C[13]}
20: r2 = r2 ⊕ (r9 ≫ 9), r4 = r4 ⊕ (r9 ≪ 1), r5 = r5 ⊕ (r9 ≫ 31) {Reduce r9/C[9]}
21: r10 ← (T(B6,1), T(B6,0)) {C[12]}
22: R(r4, r5, r7, r8, r10) {Reduce r10/C[12]}
23: r3 = r3 ⊕ (r8 ≪ 1), r4 = r4 ⊕ (r8 ≫ 31)
24: C[4] = r4, C[5] = r5, C[6] = r6
25: r4 ← (T(B0,1), T(B0,0)) {C[0]}
26: r5 ← (T(B0,3), T(B0,2)) {C[1]}
27: r5 = r5 ⊕ (r9 ≪ 23), r4 = r4 ⊕ (r8 ≪ 23), r5 = r5 ⊕ (r8 ≫ 9)
28: C[1] = r5
29: r10 = (r7 ≫ 9)
30: r4 = r4 ⊕ r10
31: C[0] = r4
32: r2 = r2 ⊕ (r10 ≪ 10)
33: C[2] = r2
34: r3 = r3 ⊕ (r10 ≫ 22)
35: C[3] = r3
36: r7 = r7 & 0x1FF
37: C[7] = r7
38: return c
execution time.
First the López-Dahab algorithm and its derivatives are analyzed. Next, our
proposed López-Dahab with fixed registers algorithm is analyzed with
different values for the window size parameter (w) in order to determine the optimal
value for this parameter on the target platform. Finally, we analyze some divide-
and-conquer approaches in combination with the López-Dahab with fixed registers
algorithm.
To quantify how well any given algorithm or method performs, a cycle count
estimation, or cost, is calculated for each method. This is done with a weighted sum
of all the instruction counts. Since the memory read and write instructions both take
2 cycles to complete, and all other instructions take only 1 cycle, we assign a
weight of two to all memory read and write instructions and a weight of
one to all other instructions. The cost can be seen as the theoretical lower limit on the
number of cycles required to perform the operation on the target platform.
It needs to be mentioned that, for the sake of simplicity, the AND instruction was
not added to the cost analysis. In the analysis of the LD algorithm the effect of the
AND operation on the cycle count is estimated to be 2.6%, so it has a
relatively small impact on the total cycle count for this estimation.
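Written out, the cost model is simply

\[
\text{cost} = 2\,(N_{\mathrm{read}} + N_{\mathrm{write}}) + N_{\mathrm{XOR}} + N_{\mathrm{shift}},
\]

so, for example, the rotating-register instruction counts quoted below (816 reads, 368 writes, 809 XORs, 315 shifts) give 2(816 + 368) + 809 + 315 = 3492 cycles.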
The actual execution time will typically be higher than the estimated cost due to
having to move values between the high registers and the low registers, because the
target platform's Thumb instruction set can only perform most data-processing
operations on the low registers (r0–r7).
If W = 32, w = 4, n = 8 we obtain 816 read, 368 write, 809 XOR and 315
shift instructions. The total cost is 3492 cycles and the memory consumption is
4 kB for the precomputation table. Compared to the standard LD we see a
drastic reduction in the number of memory operations, due to the rotating register
scheme which minimizes the storing of intermediate values in memory. We also see
a slight rise in the number of XOR instructions, due to the extra XOR instruction
on line 10 of algorithm 16. A summary of the estimated execution times is shown
in Table 3.7.
Table 3.3: Analysis of the read instructions required to perform different field
multiplications.
Table 3.4: Analysis of the write instructions required to perform different field
multiplications.
Table 3.5: Analysis of the XOR instructions required to perform different field
multiplications.
time spent generating the lookup table goes up, while the time spent performing
the multiplication goes down. The memory requirements for the lookup table also
increase as the parameter w increases. The standard López-Dahab with rotating
registers requires a precomputation table of size 2^w n (if the input polynomial is
shorter than nW − w + 2 bits).
Fig. 3.4 and Table 3.8 show the cycle counts for a field multiplication using the
López-Dahab method with fixed registers for different values of the parameter w.
We can see that the fastest cycle count of 2904 cycles is obtained for w = 4 with a
memory consumption of 4 kB. We will therefore use w = 4.
Table 3.6: Analysis of the shift instructions required to perform different field
multiplications.
Table 3.7: Analysis of the costs associated with different field multiplications on
the target platform. This assumes that n = 8, W = 32 and w = 4. The sum column
represents the weighted sum of all the instructions with a weight of two assigned to
the reads and writes, and a weight of one assigned to the XORs and shifts.
Table 3.8: The memory requirements and cycle counts associated with different
values of the parameter w for the López-Dahab with fixed registers algorithm.
The cycle count includes precomputation and multiplication.
Figure 3.4: Cost estimation for different values of w (cycle count versus window size w).
required, which leads to increased storage requirements, and the inner loop is more
complex with lookups from multiple lookup tables. It is assumed that the parameter
s is chosen so that s/w is an integer; e.g. if w = 4 then s could be 4, 8, 16, 32, etc.
Assume that n = 8 and that n + 1 of the 2n intermediate values (C[0 · · · 2n − 1]) are
stored inside registers. We require ⌈s/w⌉ precomputation tables to be generated,
each requiring n reads to obtain the values of y, 2^w n writes (assuming y is short),
2^w n XOR instructions and n(2^w − 2) shifts to generate the table. The inner loop
requires ⌈s/w⌉n reads from the lookup tables T, ⌈s/w⌉ reads from x[k], ⌈s/w⌉n
XOR and ⌈s/w⌉ shift instructions, for each of the n⌈W/s⌉ loop iterations. The outer
loop requires ⌈s/w⌉2(n − 1) read and ⌈s/w⌉2(n − 1) write instructions to and from
C[0 · · · 2] and C[12 · · · 15], for each of the ⌈W/s⌉ loop iterations. The logical shift of
c(z) computed at the intermediary stage requires n − 1 reads, n − 1 writes, 4n − 3
shifts and 2n − 1 XOR instructions, for each of the ⌈W/s⌉ − 1 loop iterations. This
leads to a total of ⌈W/s⌉(⌈s/w⌉(n² + 3n − 2)) + ⌈W/s⌉(n − 1) + ⌈s/w⌉n − n + 1 reads,
⌈W/s⌉(3n − 3) + n(⌈s/w⌉ · 2^w − 1) + 1 writes, ⌈s/w⌉ · 2^w n + ⌈W/s⌉(⌈s/w⌉n² + 2n −
1) − 2n + 1 XOR and ⌈W/s⌉(6n − 3) + 2n(2^w − 2) + 3 shift instructions. A summary
of the instruction estimates is shown in Tables 3.3, 3.4, 3.5 and 3.6.
If W = 32, w = 4, n = 8 we obtain 725 reads, 333 writes, 813 XOR and 247 shift
operations, which leads to a total estimated cost of 3176 cycles and a precomputation
table of 8 kB. This method is slightly slower than the LD with fixed registers
with w = 4, s = 4, and it requires twice the amount of memory for storing the
precomputation table. A summary of the estimated execution times is shown in
Table 3.7.
Divide-and-Conquer methods
The divide-and-conquer methods are analyzed in combination with the LD with
fixed registers algorithm. The divide-and-conquer methods all split the input
parameters into smaller parts and then perform the multiplications on these
smaller parts. This leads to a reduced requirement for storing the intermediate
values inside the multiplication algorithm, as well as a potentially reduced storage
requirement for the precomputation table, because the precomputation table now
only needs to be computed on a polynomial which is a fraction of the original length.
In order to get an accurate estimate of the number of instructions required
for the divide-and-conquer approach, a model is developed for four different
variants of the divide-and-conquer algorithms. These are:
Symmetrical Hybrid
A symmetrical split is performed on the two input parameters x and y to divide
them into two equal parts of 117 bits each, so that each part fits inside 4 words
(a sketch of this split is shown below). The LD with fixed registers algorithm
with w = 4 is used to perform the multiplications. If the input polynomial is split
into two equal-sized parts then the number of intermediary results inside the
multiplication algorithm is reduced from 2n (for the full-sized multiplication) to n.
All the intermediates can therefore be stored in registers, which
leads to a reduced number of accesses to the memory.
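The split itself is not free, because the cut at bit 117 is not word aligned. A C sketch (illustrative names, assuming bits above 232 of x are zero):

#include <stdint.h>

/* Split a 233-bit x into x0 (low 117 bits) and x1 (high 116 bits),
 * each fitting into 4 words. The cut at bit 117 = 3*32 + 21 is not
 * word aligned, so the high half must be shifted down: this is the
 * pre-shift overhead that the symmetrical method pays. */
void sym_split(uint32_t x0[4], uint32_t x1[4], const uint32_t x[8]) {
    x0[0] = x[0]; x0[1] = x[1]; x0[2] = x[2];
    x0[3] = x[3] & 0x1FFFFF;                    /* low 21 bits of word 3 */
    for (int i = 0; i < 4; i++)                 /* bits 117..232, shifted down */
        x1[i] = (x[3 + i] >> 21) | (x[4 + i] << 11);
}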
A pre-shift operation is required in order to prepare the input data for the
precomputation and multiplication functions. A post-shift operation is
required to shift the results from the multiplications to the correct bit positions for
combination into the final result. The pre-shift requires n/2 + 1 reads, n/2 writes, n
XOR and 2n shift instructions. The post-shift requires 2n + 3 reads, 4n + 3 writes,
5n XOR and 6n shift instructions. Table generation is performed twice to generate
two tables, requiring n/2 reads, 2^w(n/2) writes (assuming y is short), 2^w(n/2) XOR
instructions and (n/2)(2^w − 2) shifts, for each of the 2 table generations. The inner loop
requires n/2 reads from T[u][l], 1 read from x[k], n/2 XOR and 1 shift instruction,
for each of the 2⌈W/w⌉ loop iterations. The logical shift of c(z) computed at the
intermediary stage requires n − 1 XOR and 2n − 1 shift instructions, for each of the
4⌈W/w − 1⌉ loop iterations. This leads to a total number of (3.5n + 4) + ⌈W/w⌉(n² + 2)
If W = 32, w = 4, n = 8 we obtain 672 read, 167 write, 884 XOR and 612 shift
instructions. This leads to a total estimated cost of 3174 cycles. The number of
memory instructions has been reduced compared to the original López-Dahab with
fixed registers algorithm, at the cost of an increased number of XOR instructions.
The advantage of this method is that the precomputation tables do not all need to
be stored simultaneously in memory. This leads to the precomputation tables only
requiring a modest 2 kB of memory. A summary of the estimated execution times
is shown in Table 3.7.
Asymmetrical Hybrid
An asymmetrical split is performed on the two input parameters x and y, followed
by an LD with fixed registers multiplication on the smaller input parameters.
The split into two is done on the word size of the processor: x0 and y0
are both 128 bits long whereas x1 and y1 are both 105 bits long, and each part fits
into 4 words. Two lookup tables need to be computed: one for y0 and the other
for y1. When y0 is shifted by w − 1 it no longer fits inside 4 words, and another
word needs to be assigned to hold the values of the lookup table. Therefore the
lookup table of y0 contains 2^w(n + 1) words and the lookup table for y1 contains
only 2^w n words. Four multiplications are required, with precomputation tables of
two different lengths.
In total this method requires the following operations: 2 precomputations,
2 field multiplications with a large lookup table and 2 field multiplications with
a smaller lookup table. The biggest disadvantage of this method comes from the
larger precomputation table, which requires 2^w extra words to be computed during
the precomputation phase, stored in memory, and accessed again during the
multiplication.
In order to perform the asymmetrical split, the input parameters x and y are
both split into two: one part is 128 bits (four words) long and the other part is
105 bits long. The precomputation has to be executed on both these polynomials,
but the 128-bit polynomial needs a lookup table of size 2^w(n + 1) whereas the
shorter polynomial only needs a lookup table of size 2^w n. Therefore, two different
multiplication algorithms are needed to efficiently access the two precomputation
tables of different lengths. A sketch of the word-aligned split is shown below.
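By contrast with the symmetrical case, the word-aligned split needs no shifting at all, which is exactly what the asymmetrical method buys (again a sketch with illustrative names):

#include <stdint.h>

/* Asymmetrical, word-aligned split: x0 = low 128 bits (words 0..3),
 * x1 = high 105 bits (words 4..7). No pre-shift is needed. */
void asym_split(uint32_t x0[4], uint32_t x1[4], const uint32_t x[8]) {
    for (int i = 0; i < 4; i++) {
        x0[i] = x[i];
        x1[i] = x[4 + i];
    }
}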
To generate the short precomputation table requires n/2 reads, 2^w(n/2) writes,
2^w(n/2) XOR and (n/2)(2^w − 2) shift instructions. To generate the long precomputation
table requires n/2 + 1 reads, 2^w(n/2 + 1) writes, 2^w(n/2 + 1) XOR and (n/2 + 1)(2^w − 2)
shift instructions. The multiplication with the short precomputation table requires, for
the inner loop, n/2 reads from T[u][l], 1 read for x[k], n/2 XOR and 1 shift instruction,
for each of the n⌈W/w⌉ loop iterations; the logical shift of c(z) computed at the
intermediary stage requires n − 1 XOR and 2n − 1 shift instructions, for each of the
2⌈W/w − 1⌉ loop iterations; and the output of the multiplication requires 2n writes.
The multiplication with the long precomputation table requires, for the inner loop,
n/2 + 1 reads from T[u][l], 1 read for x[k], n/2 + 1 XOR and 1 shift instruction,
for each of the n⌈W/w⌉ loop iterations; the logical shift of c(z) computed at the
intermediary stage requires n XOR and 2n − 1 shift instructions, for each of the
⌈W/w − 1⌉ loop iterations; and the output of the multiplication requires 2n writes.
A summary of the instruction estimates is shown in Tables 3.9, 3.10, 3.11 and 3.12.
The recombination of the results from the four multiplications requires 2n read
and 2n XOR instructions. In total this method requires ⌈W/w⌉(n² + 3n) + 3n + 1
reads, (n + 1)2^w + 4n writes, (n + 1)2^w + ⌈W/w⌉(n² + 5n − 2) − 4n + 2 XOR and
⌈W/w⌉(8n) + n(2w − 10) + 2w + 2 shift instructions. If we set W = 32, w = 4, n = 8
this method requires 729 reads, 176 writes, 930 XOR and 506 shift instructions. This
leads to a total estimated cost of 3246 cycles. A summary of the estimated execution
times is shown in Table 3.13.
n/2 XOR and 1 shift instruction, for each of the 3(n/2)⌈W/w⌉ loop iterations.
The logical shift of c(z) computed at the intermediary stage requires n − 1 XOR
and 2n − 1 shift instructions, for each of the 3⌈W/w − 1⌉ loop iterations. This
leads to a total of 16n + 4 + n⌈W/w⌉((3/4)n + 3/2) reads, (3/2)2^w n + 7.5n writes,
3 · 2^w(n/2) + 10n − 3 + ⌈W/w⌉(n² + 4n − 4) XOR and 3wn + n − 1 + ⌈W/w⌉(10n − 4)
shift instructions. A summary of the instruction estimates is shown in Tables 3.9,
3.10, 3.11 and 3.12.
If W = 32, w = 4, n = 8 this leads to 660 read, 252 write, 1005 XOR and 711
shift instructions, with a cost of 3540 cycles and a memory consumption of 2.75 kB.
A summary of the estimated execution times is shown in Table 3.13.
and 2(n + 2) + 1 shift instructions, for each of the 2⌈W/w − 1⌉ loop iterations. The
write back and recombine step requires n − 2 read, 2n + 4 write and n − 2 XOR
instructions. A summary of the instruction estimates is shown in Tables 3.9, 3.10,
3.11 and 3.12.
This leads to a total of ⌈W/w⌉(n² + 2n + 6) + 2n − 5 read, 2^w(n + 1) + 6⌈W/w⌉ + 2n
write, 2^w(n + 1) + ⌈W/w⌉(n² + 2n + 2) − n − 4 XOR and n · 2w + 2w + ⌈W/w⌉(5n +
10) − 6n − 12 shift instructions. If W = 32, w = 4, n = 8 this leads to 699 read, 208
write, 788 XOR and 412 shift instructions and a cost of 3014 cycles. A summary of
the estimated execution times is shown in Table 3.13.
Table 3.9: Analysis of the read instructions required to perform different field
multiplications.
Table 3.10: Analysis of the write instructions required to perform different field
multiplications.
Table 3.11: Analysis of the XOR instructions required to perform different field
multiplications.
Conclusion
Table 3.14 shows a summary of all the different field multiplication approaches which
were analysed.
Table 3.12: Analysis of the shift instructions required to perform different field
multiplications.
Table 3.13: Analysis of the instruction count required to perform different field
multiplications. This assumes that n = 8, W = 32, w = 4. The sum column
represents the estimated cycles and is calculated with the weighted sum of all the
instructions, with a weight of two assigned to the reads and writes, and a weight of
one assigned to the XORs and shifts.
The standard LD method was found to be the slowest algorithm due to very
frequent memory accesses. The LD with rotating registers is an improvement
over the standard LD method, with a cycle count of 3492 cycles. The proposed
method (LD with fixed registers) leads to the fastest execution time on the
target platform with 2904 cycles and a modest requirement of 4 kB for the lookup
table. From this we can see that the LD with fixed registers has a performance
increase of 20.24% over the LD with rotating registers method.
For the divide-and-conquer methods, we stated that we would only consider a split
by a factor of two. The Symmetrical Hybrid and Asymmetrical Hybrid methods
both have very similar run-times, but the asymmetric method has a smaller code size
because the code for the multiplication can be re-used for all four multiplications, and
the memory requirements for its lookup tables are low because both y0 and y1 are
short polynomials. The Symmetrical Karatsuba with LD method was found to be
the slowest of all the divide-and-conquer methods. This method is often cited in the
literature, but it led to very poor results in our analysis due to the complexity of
recombining the results of the multiplications and having to compute three
precomputation tables instead of two. The Asymmetrical Hybrid with Operand
Scanning method was found to be the fastest divide-and-conquer method. It saves
on memory reads by grouping the same multiplicands together, thereby reducing the
number of memory accesses inside the main loop. Its only drawback is that it requires
one small lookup table and one large lookup table, which leads to a total memory
consumption of 4.5 kB.
Table 3.14: A summary of the memory and cycle count costs associated with all the
different analysed methods for n = 8, W = 32, w = 4. The sum column represents
the weighted sum of all the instructions with a weight of two assigned to the reads
and writes, and a weight of one assigned to the XORs and shifts.
because we do not have so many free registers due to the required table lookups for
the squaring operation.
3.3.3 Squaring
The polynomial squaring operation is always followed by reduction and which leads to
a large amount of memory overhead if the two operations are called as two separate
functions.
The standard squaring operation (Algorithm 20) followed by a standard fast
reduction (Algorithm 19) is analyzed first. The squaring operation requires n reads
from the input parameter a[i], 2n reads for the lookups in table T, 2n writes to the
result vector c, and 3n shifts to obtain u1, u2 and u3. For n = 8 the squaring operation
requires a total of 24 read, 16 write and 24 shift instructions. The reduction operation
requires 4 read, 4 write, 4 XOR and 4 shift instructions in each of its 8 loop iterations;
outside the loop it performs 5 read, 4 write, 3 XOR and 3 shift instructions. The
reduction therefore requires in total 37 read, 36 write, 35 XOR and 35 shift
instructions. The cost of performing the squaring and reduction operations
non-interleaved is 61 read, 52 write, 35 XOR and 59 shift operations. This leads to
an estimated cost of 288 cycles.
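For reference, a minimal C sketch of the word-level fast reduction, assuming the K-233 trinomial f(z) = z^233 + z^74 + 1 (an assumption consistent with the shift constants 23, 9, 1, 31 and the 0x1FF mask appearing in Algorithm 21):

#include <stdint.h>

/* Reduce a 16-word product c modulo z^233 + z^74 + 1; the result is
 * left in c[0..7]. Each high word t folds down twice, because
 * z^233 = z^74 + 1 (mod f). */
void mod_reduce_233(uint32_t c[16]) {
    for (int i = 15; i >= 8; i--) {
        uint32_t t = c[i];
        c[i - 8] ^= t << 23;  c[i - 7] ^= t >> 9;   /* the z^0 image  */
        c[i - 5] ^= t << 1;   c[i - 4] ^= t >> 31;  /* the z^74 image */
    }
    uint32_t t = c[7] >> 9;   /* bits 233..255 remaining in word 7 */
    c[0] ^= t;
    c[2] ^= t << 10;  c[3] ^= t >> 22;
    c[7] &= 0x1FF;            /* keep bits 0..8 of word 7 */
}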
For the interleaved squaring and reduction algorithm (Algorithm 21) the number
of read operations from the input parameter A[i] is counted as follows: only a single
read is required from A when two consecutive lines access the same element of A
(e.g. line 3 and line 4 both access A[1]); otherwise, two read operations are required
(e.g. line 9 and line 15 both access A[5] and therefore require two reads). This
algorithm requires a total of 11 reads from the input parameter A, 16 reads for table
lookups from T, 8 write, 35 XOR and 59 shift instructions. This leads to a total of
27 read, 8 write, 35 XOR and 59 shift operations and a total estimated cost of
164 cycles, which is a 75.6% improvement in runtime over the non-interleaved
squaring and reduction operations.
3.3.4 Inversion
The extended Euclidean inversion algorithm (Algorithm 10) has a non-deterministic
run-time. Therefore we will only analyse the number of instructions inside the inner
loop. Let us assume that the function which determines the degree is implemented
and optimized such that it only requires 1 read. One loop iteration then requires
2 reads for the two degree functions, n reads and n writes for each of the two swap
functions, and 2n reads, n writes, n XOR and n shifts for each of lines 8 and 9. This
leads to a total of 6n + 2 reads, 4n writes, 2n XOR and 2n shift instructions per loop
iteration in the worst-case scenario where the if-statement is taken.
Algorithm 22 reduces the memory operations by performing some code duplication
in order to avoid the two swap operations of Algorithm 10. One loop iteration then
requires 2 reads for the two deg functions, and 2n reads, n writes, n XOR and n
shifts for lines 6 and 7, or lines 12 and 13 (depending on which branch was taken).
3.4 Optimizations
In the following we discuss all the optimizations which were performed, from the base
implementation to the final optimized implementation. First, we will describe the
setup which was used to profile the code; second, we discuss the base implementation;
third, we will discuss the effect of each of the optimizations which were done.
3.4.1 Profiling
In order to be able to measure the performance of a given algorithm a profiling tool
is used to measure the execution time of a routine. The profiling was done by means
of the SysTick timer which is a 24-bit timer that runs at the same clock speed as
the ARM Cortex-M0+ and counts down from a specified value to zero. To profile a
function, the timer is read at the start and the end of a function and the execution
time of the function is found by calculating the difference between the two values.
The duration spent inside each function is added to a global variable and another
global variable is used to record the number of times that the function was called.
The execution time for each function is printed to the debugging console after the
profiled code has finished executing.
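A minimal sketch using the CMSIS SysTick registers (the function and variable names, and the device header, are illustrative assumptions, not the project's code):

#include <stdint.h>
#include "device.h"  /* assumption: CMSIS device header providing SysTick */

extern void field_mul(uint32_t *c, const uint32_t *a, const uint32_t *b);

static volatile uint32_t mul_cycles;  /* accumulated cycles    */
static volatile uint32_t mul_calls;   /* number of invocations */

void profile_init(void) {
    SysTick->LOAD = 0x00FFFFFF;                 /* full 24-bit range   */
    SysTick->VAL  = 0;                          /* clear current value */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk  /* run from core clock */
                  | SysTick_CTRL_ENABLE_Msk;    /* start counting down */
}

void field_mul_profiled(uint32_t *c, const uint32_t *a, const uint32_t *b) {
    uint32_t start = SysTick->VAL;
    field_mul(c, a, b);                         /* routine under test  */
    uint32_t end = SysTick->VAL;
    /* SysTick counts down; the mask handles a single wrap-around. */
    mul_cycles += (start - end) & 0x00FFFFFF;
    mul_calls++;
}

This measures routines shorter than 2^24 cycles; the per-call overhead of the two timer reads is the small bias discussed below.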
Before starting the optimization process, random test vectors were generated and
the same vectors were used throughout the optimization process. The parameter k
can therefore be seen as a constant throughout the optimization process. All the
profiling information reported here was obtained by performing a single point
multiplication with the same k for both the fixed and random point multiplication
functions. One problem with this method is that certain functions, like the inversion
and point multiplication, have a variable run-time which depends on the scalar k. To
get a more accurate estimate, multiple inversion operations should be performed with
randomly generated input vectors and the average execution time should be reported.
Another problem with this method is that the profiling code adds a small amount
of overhead every time it is executed. No attempt was made to compensate for this
effect; the actual runtime is slightly faster when the profiling is removed.
Yet another problem with this method is that some variation exists in the number
of reported cycles. It sometimes happens that if one block of code (Block A)
is followed by another block (Block B) which is being profiled, and the
code in Block A is altered without making any alterations to Block B, the number
of cycles reported for Block B by the profiler changes. This probably happens due to
compiler optimizations which lead to the profiling instructions not being executed
in the intended position (in this case immediately before and after Block B). This is
probably the reason that some functions show fluctuating values even though no
algorithmic changes were made to them (the reduction optimization values in
Table 3.15 are a good example of this).
3.4.3 Optimizations
The general approach followed during optimization was to first create a base imple-
mentation in C, and then profile the code to identify potential bottlenecks. Next, the
number of operations inside the routines with the highest cycle counts was reduced
by minimizing the number of elements being processed. The number of function
calls was reduced by interleaving the functions as much as possible. Each time an
optimization attempt was made, the code was profiled in order to determine whether
the optimization was successful. Subsequently, we analysed a number of different
field multiplication algorithms in order to identify the fastest one for the target
platform. Once we identified the fastest algorithm, we implemented it in C and
then in assembly. Finally, we implemented many of the field arithmetic functions
in assembly.
In the following we discuss the effect of all optimizations which were done.
Table 3.15 shows the cycle counts of individual functions during the optimization
process. Please note that the squaring and multiply cycle counts include a field
reduction operation. Table 3.16 shows the number of times the individual functions
were executed during the optimization process. These tables should be used together
in order to get a better understanding of what is happening inside the algorithms.
A saving of
one cycle in a function that gets called many times will lead to large total savings.
Profiling info for version 1.00 revealed that the squaring operation consumed
9236 cycles per operation, or 48% of the total execution time. The literature states
that the field multiplication should take approximately 70%–95% of the total execution
time, so we concluded that there is clearly room for optimization in the squaring
function. The total runtime for a random point multiplication was found to be
17.6 M cycles.
In version 1.01 an improvement was made to the polynomial squaring: the
squaring table was wrongly being allocated on the stack and populated inside the
polynomial squaring function, which caused a huge overhead. To fix this problem
the lookup table was declared as a global constant with fixed values. This reduced
the runtime of the squaring to only 500 cycles (an improvement of 1847%). Field
multiplication was reduced from 6843 cycles to 4337 cycles per operation (an
improvement of 57.8%) by using shorter vectors for the precomputation table, which
also requires fewer add operations. A large number of initializations was also
removed from many of the functions, which further improved the runtime. The total
runtime for a random point multiplication decreased to 7.72 M cycles, which is an
improvement of 127.9% over the previous version.
In version 1.02 improvements were made to the overflow handling in the shift
functions; a more efficient shifting method was used which doesn't require any
temporary variables. The method used for testing whether a point is at infinity was
improved by declaring the point at infinity as a global constant and comparing it
with the point being tested. The old method declared a new point, assigned the
values of the point at infinity to it, and then compared it with the test point. The
new method leads to a more efficient test, and the global constant also allows for a
more efficient way to assign a point to infinity. Checking whether a point is at infinity
is performed inside the Point Addition operation. The precomputation table
generation was improved by implementing a shift which operates on n words instead
of 2n. The total runtime for a random point multiplication decreased to 6.62 M
cycles, which is an improvement of 16.6% over the previous version. A sketch of the
infinity test is shown below.
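A sketch of the improved test (the field layout and the representation of infinity as (1 : 0 : 0) in LD projective coordinates are assumptions for illustration):

#include <stdint.h>
#include <string.h>

typedef struct { uint32_t x[8], y[8], z[8]; } point_t;

/* The point at infinity as a global constant: no per-call
 * initialization, and assigning a point to infinity is one copy.
 * (1 : 0 : 0) is the assumed LD-projective representation. */
static const point_t POINT_AT_INFINITY = { .x = {1}, .y = {0}, .z = {0} };

static int is_infinity(const point_t *p) {
    /* compare directly against the constant instead of building a
     * fresh infinity point on every call */
    return memcmp(p, &POINT_AT_INFINITY, sizeof(point_t)) == 0;
}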
In version 1.03 the inverse function was improved by reducing the number
of 'if' statements inside the loop and by improving the variable shift function to
allow a different input and output vector. The old variable shift function required
the input and output parameters to be the same, which led to many n-word copy
operations inside the main loop. A more efficient way to calculate the degree of a
polynomial was also introduced for the inverse function. Two indices are introduced
to keep track of the position of the most significant nonzero words of the polynomials
u and v, and they are updated at the end of each loop iteration. This means that
every time the degree function is called we do not need to scan all the elements from
the most significant to the least significant to find the first nonzero word; the index
points to the correct word, so all that remains is to find the most significant bit
inside that word. By performing all these optimizations the inverse function was
improved by 232%. The total runtime for a random point multiplication decreased
to 5.96 M cycles, which is an improvement of 11.07% over the previous version.
as a test to see if it behaves well inside the project. The reduction is still done in a
separate function. The total runtime for a random point multiplication decreased
to 3.71 M cycles, which is an improvement of 3.77% over the previous version.
In version 1.10 the Symmetrical Hybrid method is implemented in C. The
cycle counts given in the table are for the complete multiplication process, including
the overhead of splitting the input parameters, performing four multiplications,
recombining the results from the multiplications, and generating two precomputation
tables. Some optimizations were done inside the multiplication routine to take
advantage of the fact that the variable y is only 117 bits long (see below for a further
explanation): a large number of cycles can be saved by using a switch statement
inside the outer loop to only access the most significant word of y if it contains a
value. The runtime of the multiplication routine improved by 0.5% compared to
the previous version. The total runtime for a random point multiplication decreased
to 3.66 M cycles, which is an improvement of 3.01% over the previous version.
In version 1.11 the Asymmetrical Hybrid method is implemented in C, and
optimizations were again done inside the multiplication routine to take advantage
of the fact that the variable y is only 117 bits long, using the same switch-statement
technique (see below for a further explanation). The cycle counts in Table 3.15
include the time needed to calculate two precomputation tables and four
multiplications. We can see that the precomputation tables take more cycles to
generate, because the asymmetrical split leads to one large (2^w(n/2 + 1))
precomputation table and one smaller (2^w(n/2)) one. The runtime of the
multiplication routine increased by 2.16% compared to the previous version. The
total runtime for a random point multiplication increased to 3.70 M cycles, an
increase of 1.08% over the previous version.
In version 1.12 an implementation of the LD with fixed registers (Algo-
rithm 17) was done in C. Local variables were used to try to simulate the registers of
the CPU. We can see that there is a slightly larger cycle count for the multiplication
compared to many of the other versions. The goal of this implementation was to
provide a reference against which to compare the intermediate results of the assembly
implementation. The total runtime for a random point multiplication increased to
3.81 M cycles, an increase of 2.88% over the previous version.
In version 1.13 the LD with fixed registers is implemented in assembly.
The reduction algorithm is also interleaved with the multiplication routine in order
to take advantage of the results which are still stored inside the registers. This
leads to a significant decrease in the number of redundant memory operations. Since
the reduction operation and field multiplication are interleaved, the reduction was
removed from the profiling info, which is noted in Table 3.16. Some modifications
were made to the precomputation table generation which lead to fewer reads and a
42.7% improvement in its cycle count. The value of the link register (LR) is pushed
onto the stack in order to increase the number of free registers to 14. The runtime
of the multiplication routine decreased by
55.47% compared to the previous version. The total runtime for a random
point multiplication decreased to 3.01 M cycles, which is an improvement of
26.6% over the previous version.
In version 1.14 the squaring is interleaved with the reduction algorithm to
reduce the number of redundant memory accesses (Algorithm 21). Note that field
reduction is now interleaved with both the multiplication algorithm and the
squaring function; the cycle count and call count for the reduction are therefore
zero in Tables 3.15 and 3.16, and the execution time of the reduction is no longer
reported separately in the profiling information. The squaring function's cycle count
improved by 17.4%. The total runtime for a random point multiplication decreased
to 2.97 M cycles, which is an improvement of 1.35% over the previous version.
In version 1.15 the squaring function from version 1.14 is implemented in
assembly. We can see from the results that there is only a 5% improvement in the
cycle count from writing this routine in assembly. The value of the link register (LR)
is pushed onto the stack in order to increase the number of free registers. The total
runtime for a random point multiplication decreased to 2.91 M cycles, which is
an improvement of 2.06% over the previous version.
In version 1.16 the precomputation tables are generated inside the same assem-
bly function which performs the multiplication. This leads to a 3.6% improvement in
the multiplication function's execution time. The total runtime for a random point
multiplication decreased to 2.81 M cycles, which is an improvement of 3.55%
over the previous version.
In version 1.17 we show the cycle count for a fixed-point multiplication. Here
a window parameter of w = 6 is used and a precomputation table of 16 points
is generated just after the microcontroller has started up. This leads to a greatly
improved execution time for multiplying a scalar k with the generator G. The total
runtime for a fixed point multiplication is 1.96 M cycles, which is an improvement
of 43.4% over the random point multiplication timing of version 1.16. We can see
in Table 3.16 that the numbers of point additions and point subtractions differ from
those of the random point multiplication. This is due to the different window size
parameter of the fixed point multiplication.
In the following discussion, the optimizations which performed better or worse than
expected are analysed. The LD with fixed registers C implementation (version
1.12) performs worse than the LD with rotating registers C implementation.
This is probably partly due to the C compiler having difficulty keeping all the
intermediate values inside the registers. There is still some room for optimization of
the LD with fixed registers C implementation because it still performs a number
of initializations.
The assembly implementation of the plain square function had no performance
improvement over the C implementation. However, the assembly implementation of
the squaring function interleaved with reduction has a performance improvement of
5% over the C implementation.
In the LD method a large number of cycles can be saved by using a switch
statement inside the outer loop to only access the most significant word of x if it
Table 3.15: The cycle counts of the different functions during optimization.
Ver. Total Mul Pre Mul Add Sqr Inv Red P. Add
1.00a 17.6M 6843 16009 165 9236 847k 182 162460
1.01 7.72M 4337 11244 103 500 847k 182 123168
1.02 6.62M 2387 8796 100 500 886k 182 79906
1.03 5.96M 2387 8796 94 500 267k 182 79906
1.04 5.17M 2387 8796 94 510 267k 182 79906
1.05 4.75M 1718 8127 69 510 238k 182 70631
1.06 4.14M 999 6722 60 456 230k 182 59358
1.07b 4.05M 1069 6485 61 456 230k 182 57487
1.08 3.85M 1090 5887 59 473 191k 201 52725
1.09 3.71M 1054 5592 57 492 189k 194 50311
1.10c 3.66M 800 5563 57 492 189k 194 50103
1.11d 3.70M 870 5683 57 492 189k 194 50977
1.12e 3.81M 1106 5964 57 492 189k 197 53343
1.13f 3.01M 775 3836 57 492 189k 197 36206
1.14g 2.97M 775 3835 57 419 139k 0 35877
1.15h 2.91M 762 3805 57 399 139k 0 35456
1.16 2.81M 675 3672 68 395 139k 0 34426
1.17i 1.96M 675 3672 68 395 139k 0 34426
a Base implementation: LD
b Algorithmic change: LD with rotating registers
c Algorithmic change: Symmetrical Hybrid
d Algorithmic change: Asymmetrical Hybrid
e Algorithmic change: LD with fixed registers in C
f Algorithmic change: LD with fixed registers in assembly
g Algorithmic change: Interleaved squaring with reduction in C
h Algorithmic change: Interleaved squaring with reduction in assembly
i Fixed-point multiplication
contains a value. Assume that x contains a 117-bit value; the input parameter then
fits inside 3 words and 21 bits. In line 5 of algorithm 15 the parameter u is calculated
by taking w bits of parameter x, from the most significant part to the least
significant. If W = 32, w = 4, then the first two times x[3] is read the value is zero,
because x[3] only contains 21 useful bits. The table lookup only needs to be
performed when u > 0, because T[0] contains only zeros. This optimization is
applicable to all the variants of the LD method.
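In code, the guard is a simple test around the inner XOR pass (a sketch with illustrative names; u is the w-bit window and T[0] is the all-zero table entry):

#include <stdint.h>

#define N   4   /* words in the (117-bit) half operand */
#define WIN 4   /* window size w                       */

/* One scan pass of the LD inner loop with the zero-window guard:
 * when u == 0 the lookup and the whole XOR pass are skipped, which
 * the empty high windows of a short operand trigger "for free". */
static void ld_pass(uint32_t *c, const uint32_t x[N],
                    const uint32_t T[1 << WIN][N + 1], int j) {
    for (int k = 0; k < N; k++) {
        uint32_t u = (x[k] >> (WIN * j)) & ((1u << WIN) - 1);
        if (u != 0)                         /* guard: skip zero windows */
            for (int l = 0; l <= N; l++)
                c[k + l] ^= T[u][l];
    }
}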
Table 3.17 shows the cycle counts of the implemented field multiplication routines,
compared with the results from the simulations. The cycle counts for the C and
assembly implementations include a field reduction (which accounts for approximately
180 cycles), whereas the results from the simulation do not include a reduction
step. The simulated values are generally much faster than the C implementations.
However, the assembly implementation of the LD with fixed registers has a cycle
Table 3.16: The number of times the different functions were executed during
optimization.
Ver. Mul Pre Mul Add Sqr Reduce Point Add Point Sub
1.00 370 370 4911 921 1291 46 42
1.01 370 370 4911 921 1291 46 42
1.02 370 370 4911 921 1291 46 42
1.03 370 370 4911 921 1291 46 42
1.04 370 370 4911 921 1291 46 42
1.05 370 370 4911 921 1291 46 42
1.06 370 370 841 921 1291 46 42
1.07 370 370 841 921 1291 46 42
1.08 370 370 389 921 1291 46 42
1.09 370 370 389 921 1291 46 42
1.10 740 1480 389 921 1291 46 42
1.11 740 1480 389 921 1291 46 42
1.12 370 370 389 921 1291 46 42
1.13 370 370 389 921 921 46 42
1.14 370 370 389 921 0 46 42
1.15 370 370 389 921 0 46 42
1.16 370 370 389 921 0 46 42
1.17 274 274 290 867 0 34 60
count that is very close to the simulated value. The rows inside the table are sorted
according to their simulated speeds. We can see that the C implementations all
have roughly the same execution times, except for the LD with fixed registers
algorithm, which is the second slowest of the C implementations. This is because
there is still some room for optimization in the C implementation of the LD with
fixed registers algorithm.
Compared to version 1.00, version 1.16 is 526% faster, and version 1.17 is 798%
faster.
3.4.4 Memory
Table 3.18 gives a summary of all the different items which contribute to the memory
consumption of the implementation. The generator point is stored in memory,
which requires 768 bytes. The reduction polynomial requires 256 bytes. The field
multiplication precomputation table requires 4 kB of memory. The polynomial
squaring precomputation table requires 256 16-bit entries, which totals 4 kB of
memory. The TNAF table for random point multiplication requires 8 precomputed
points to be stored in memory, a total of 192 words or 6 kB. The TNAF table for
fixed point multiplication requires 16 precomputed points to be stored in memory, a
total of 384 words or 12 kB. Point Addition uses seven 16-element 32-bit vectors
with a total memory usage of 3.5 kB. Point Double uses two 16-element 32-bit
vectors with a total memory usage of 1 kB.
Table 3.18: Memory consumption of the implementation.
Operation                               Memory
Generator point                         0.768 kB
Reduction polynomial                    0.256 kB
Field multiplication precomputation     4 kB
Polynomial squaring lookup              4 kB
TNAF random point multiplication table  6 kB
TNAF fixed point multiplication table   12 kB
Point Addition                          3.5 kB
Point Double                            1 kB
Total during Point Addition             30.5 kB
Total during Point Double               28 kB
3.5 Conclusion
We described the model which was used to select a curve that is a good match for
the architecture of the target platform. The model showed that binary Koblitz
curves lead to a slightly faster implementation with a lower power consumption
than prime curves, because binary curve arithmetic uses instructions which require
less energy on the target platform.
We discussed a number of different existing field multiplication algorithms, and
attempted to improve these algorithms to find an even better match for
our architecture. In general two different types of algorithms were explored:
(1) algorithms based on splitting the input parameters, and (2) algorithms which
process the input parameters as a whole. For the split input parameter algorithms
we combined the LD algorithm with a symmetrical split, an asymmetrical split, an
operand-scanning form, and the Karatsuba-Ofman approach. For the whole input
parameter processing algorithms we tried to find optimizations to the LD and LD with
rotating registers algorithms. A new field multiplication algorithm was introduced,
the LD with fixed registers algorithm, which was found to be the fastest field
multiplication method on the target platform.
Chapter 4
Results
This chapter is dedicated to comparing the results obtained from the proposed
implementation with results from software libraries and the literature. First
we describe the system that we designed to measure the power consumption
of the target, and give the results of our energy measurements. Next,
we provide a description of our results and compare them with the results from
implementations found in the literature and in software libraries. Finally, we use the
cycle estimation models from Chapter 3 to estimate how well the different
field multiplication methods perform for different processor word lengths.
We performed energy measurements of individual instructions on the target
platform. The aim is to determine the energy consumption of each individual
instruction in order to make decisions about which instructions and algorithms will
lead to a more energy efficient implementation.
In a typical sampling scenario the test code is executed multiple times. Just
before the test code starts to run, the trigger wire goes high. The moment after the
test code has terminated, the trigger wire goes low. A timer is then started which
waits 100 ms before executing the next test code.
The instruction measurements are performed by placing 112 copies of the same
instruction, performed on different registers (to avoid a pipeline stall), in a function
which is then executed 10k times before making the trigger go low and waiting
100 ms by means of the timer. The reason for repeating the instruction so many
times is to be able to take a measurement with the Arduino, which is running at the
slow sampling frequency of 21 kHz. A sketch of this measurement loop is shown below.
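A sketch of one such sampling run (the GPIO and delay helpers are hypothetical placeholders, not the project's code):

#include <stdint.h>

extern void trigger_high(void);  /* hypothetical GPIO helpers     */
extern void trigger_low(void);
extern void delay_ms(uint32_t ms);
extern void test_block(void);    /* 112 copies of one instruction */

/* The burst (10k calls of a 112-instruction block) must be long
 * enough to be visible at the Arduino's 21 kHz sampling rate. */
void measure_instruction(void) {
    trigger_high();
    for (uint32_t i = 0; i < 10000; i++)
        test_block();
    trigger_low();
    delay_ms(100);               /* settle time before the next test */
}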
Table 4.1: The energy used per cycle for different instructions. The clock frequency
is 48MHz.
Instruction Energy [pJ]
LDR 10.98
NOP 11.56
LSR 12.05
MUL 12.14
LSL 12.21
XOR 12.43
STR 12.58
ADD 13.45
with the Extended Euclidean algorithm, and squaring is done using the table-based
method. The easy-to-understand arithmetic option is selected, but this could be
replaced with calls to the GMP library. However, we were unable to get the GMP
library cross-compiled for the ARM Cortex-M0+ and were therefore unable to test
this feature.
The RELIC implementation was measured to have an average power consumption
of 600 µW while performing a random point multiplication, and 600.5 µW while
performing a fixed point multiplication. A random point multiplication takes 117.1 ms,
and therefore consumes 70.26 µJ per operation. A fixed point multiplication takes
115.7 ms, and therefore consumes 69.48 µJ per operation. These results are listed in
Tables 4.2 and 4.3.
Our proposed implementation was measured to have an average power consump-
tion of 577.2 µW for a random point multiplication, and 519.6 µW for a fixed-point
multiplication. A random point multiplication takes 59.18 ms, and therefore con-
sumes 34.16 µJ per operation. A fixed point multiplication takes 39.70 ms, and
therefore consumes 20.63 µJ per operation. These results are listed in Tables 4.2
and 4.3.
Fig. 4.1 shows a measurement of the energy consumption of the ADD instruction
on the target platform. Fig. 4.2 shows a measurement of the energy consumption
of ten fixed point multiplication operations performed using the LD with fixed
registers on the ARM Cortex-M0+. Ten identical multiplications are performed
with identical keys. Notice the large variation of the peaks due to a number of
different instructions which are used by the point multiplication routine.
Figure 4.1: Energy consumption of the ADD instruction on the target platform.
First we will give a short overview of several implementations that were found in
literature, followed by a number of implementations from software libraries; finally
we will present the results of the proposed implementation and make a number of
comparisons to the previously mentioned implementations.
Figure 4.2: Energy consumption of ten fixed point multiplication operations on the ARM Cortex-M0+ (power versus time).
Most implementations found in the literature target sensor platforms based on the
ATMega128L and the TelosB (which contains the MSP430). Only a small number
of implementations were found in the literature for ARM microcontrollers like the
IMote2 (which contains the ARMv5TE based PXA271), the ARM7TDMI and the
ARM Cortex-M3. As far as we know there are no published implementations for the
ARM Cortex-M0+.
Both the ARM7TDMI and the PXA271 are more powerful platforms than the
ARMv6-M based ARM Cortex-M0+. The ARMv6-M only supports execution of
Thumb instructions and can therefore never enter ARM mode (see section 2.5
for an explanation of ARM mode and Thumb mode), in which shift instructions
can be embedded within data-processing instructions. Both the PXA271 and the
ARM7TDMI support instructions in ARM mode. The ARMv6-M also has a much
smaller instruction set than both the ARM7TDMI and the PXA271. The PXA271
has a special set of wireless MMX (wMMX) SIMD instructions which operate
on the registers of the processor.
Table 4.3 shows the execution times of point multiplications in milliseconds.
Some of the execution times are for fixed-point multiplications and others are for
random-point multiplications. The amount of energy consumed to perform the point
multiplication is also indicated. In the cases where the amount of energy consumed
was not given in the authors' results, it is estimated using the typical energy
consumption values for the target platform from section 2.5. Since there is a large
variation in energy consumption depending on the type of instruction used, the
energy values should be used as a rough guide
only. Table 4.4 shows the execution times for modular multiplication and squaring
operations in cycles. The field multiplication and squaring operations are the two
most important operations for elliptic curve cryptography.
Table 4.3: Timings for point multiplications. All timings are given in milliseconds
and energy is given in millijoules (mJ). The ATMega128L runs at 7.37 MHz except
where indicated otherwise, the MSP430 runs at 8.192 MHz, the ARM7TDMI runs at
80 MHz, and the ARM Cortex-M0+ runs at 48 MHz.
Table 4.4: Timings in cycles for modular multiplication and modular squaring on
different platforms.
because they do not have a variable-length shift instruction on their target device.
Their optimal configuration for prime curves uses Hybrid multiplication, and for
binary polynomial multiplication they used the Karatsuba-Ofman multiplication
algorithm. For binary fields they used binary Koblitz curves, as no expensive doubling
operations are required. For their fixed point multiplication in prime fields they did
some precomputation with the Comb method and w = 4. Their point multiplications
in prime fields were found to be faster than in binary fields.
Kargl et al. made software implementations on the ATMega128L for prime,
binary and optimal extension fields [34]. For multiplication in the binary field they
use the LD method with w = 4. They use a sparse reduction polynomial for fast
reduction in binary fields. They found that the field arithmetic of prime fields is
faster than that of binary fields. For prime fields they use the Jacobian coordinate
system and for binary fields they use the López-Dahab coordinate system. To secure
against simple timing-based side-channel attacks they use the Montgomery-ladder
algorithm, which executes faster on binary fields than on prime fields.
Uhsadel et al. made a software implementation on the ATMega128L for the prime
secp160r1 curve in [42]. They implemented the hybrid multiplication and found an
optimized way to perform carry-bit handling to achieve a fast field multiplication.
Seo et al. were able to make a sub 1-second implementation for the binary
Koblitz sect163k1 curve on the ATMega128L in [40]. Their implementation is based
upon the TinyECC library, optimized for Koblitz curves. Their fastest
implementation used wTNAF with w = 5. They used the left-to-right comb method
with window width w = 4 for the field multiplication, adapting the algorithm
somewhat to reduce the number of redundant memory accesses. They used fast
reduction due to a sparse reduction polynomial. They use the LD coordinate
system and they store their precomputed points in affine coordinates.
Aranha et al. describe the fastest known point multiplication on the ATMega128L
in [37]. For field multiplication they propose an optimization to the LD algorithm
called LD with rotating registers. Redundant memory accesses are reduced by
interleaving the multiplication with the reduction operation. For modular squaring
they use a 16-byte lookup table, and the reduction is interleaved with the squaring
so that the upper half of the words produced by squaring doesn't need
to be written to memory. Inversion is done with the Extended Euclidean algorithm,
and they use a fast reduction algorithm due to the sparse reduction polynomial.
For multiplication with a random point they used the 4-TNAF method for Koblitz
curves and the López-Dahab [49] method for binary curves; this point multiplication
method by López and Dahab does not make use of precomputation and has a constant
cycle count. For multiplying the generator, they employ the same 4-TNAF method
for Koblitz curves with 16 precomputed points. For generic curves they employ the
Comb method. These results were put into a software library called the RELIC
toolkit [50]. Their implementations were optimized in assembly.
S. Erdem [43] made several implementations on binary fields of different orders
for the ARM7TDMI, using the operand-scanning method combined with the
López-Dahab (LD) method with a window of width w = 4. They achieved a polynomial
multiplication in 4359 cycles (224-bit), 5398 cycles (256-bit) and 6551 cycles (288-bit).
Reduction was done in a separate step, achieved in 348 cycles (224-bit),
389 cycles (256-bit) and 430 cycles (288-bit).
P. Szczechowiak et al. [44, 51] made an implementation on a binary curve which
uses an underlying field of order 2^271. Field multiplication is done using LD with
a window of w = 8 and 2-bit scanning. By doing this they drastically decreased the
number of loop iterations, and they further unrolled all the loops. Shifting c(z) by 8
in the intermediary stage was not necessary because two pointers were used that
allowed the load and store operations to access the appropriate bytes; this is possible
because the architecture supports non-word-aligned read and write operations. The
authors also exploited the multiple load and store operations of the ARM
microcontroller.
Großschädl et al. made a software implementation on the ATMega128L on a
160-bit prime curve in [39]. They used an optimal prime field, which leads to even
faster reductions than NIST prime fields. Another advantage of their curve is that
it has efficiently computable endomorphisms, which leads to a reduced number of
addition and doubling operations.
B. Oliveira et al. [45] made an implementation for the PXA27x with an underlying
binary field of order 2^271. For multiplication they used LD with rotating
registers and w = 4. For squaring they used a 16-byte lookup table which stores
the squares of all 4-bit polynomials. They also performed immediate reduction of the
upper half of the words instead of writing them to memory for later reduction. All
their finite field arithmetic was implemented in assembly, and since a bitwise shift
is free for a data-processing instruction, they performed all bitwise shifts inside the
XOR instructions. They achieved modular squaring in 187 cycles, multiplication
in 2025 cycles, and multiplication using the wireless MMX (wMMX) instructions in
1411 cycles.
SEC curves. It has support for ARM and is a C-only library with no optimizations
in assembly. It supports the ECDH and ECDSA protocols, and some of its timings
are listed in Table 4.3.
The RELIC toolkit [50] is an open-source cryptographic library. It currently
forms the basis for both TinyPBC and the current speed record holder for
the ATMega128L (Aranha et al.). It supports many different architectures including
ARM, ATMega128L, MSP430, x86 and x86_64. It has support for elliptic curves
and pairings over binary and prime fields, and further supports hyperelliptic
curves. This library has no published timings, but we were able to compile it for
the ARM Cortex-M0+ and we made some timing and energy measurements for
point multiplication, which are shown in Table 4.3.
The total accumulated timings for kP and kG were found by taking a randomly
generated 233-bit scalar and multiplying it with the base point G. Fig. 4.3, Fig. 4.4
and Table 4.6 show a breakdown of the running times for both a random and a fixed
point multiplication. The TNAF representation operation is the process of converting
an input scalar into its TNAF representation. The square operation includes both
the field squaring and point squaring operations, followed by the reduction operation.
The inverse operation is only called once, when the LD coordinates are converted into
affine coordinates. The Support functions term describes all the functions which were
not profiled in the code, which includes: testing if a point is at infinity, obtaining a
copy of the point at infinity, checking if a field element contains only zeros, copying
the values from one field element to another, initialization of a field element, and
adding two polynomials.
There are two important differences between the processes inside the fixed-point
multiplication and the random point multiplication: (1) The TNAF-based fixed-point
multiplication's precomputation table is generated from the base point G just after
the microcontroller has started up. We can see from Fig. 4.3 that 14.2% of the total
cycle count is used to perform the precomputation table generation for w = 4; the
timing for w = 6 would be substantially higher. (2) The window size parameter for
the TNAF table is set to w = 6 for kG and w = 4 for kP. The larger window size
for kG leads to a TNAF representation which contains fewer elements, because each
element now refers to a table lookup index which is 4 bits long instead of the 3-bit
value for kP. This leads to a lower number of multiplication operations, which
explains the lower number of accumulated cycles for the multiply and multiply
precomputation operations. The data in Table 4.6 can be used to determine the
percentage of execution time spent inside the RELIC toolkit. The RELIC toolkit is
used to perform all the TNAF transformations; for kP this accounts for 20.5% of the
execution time and for kG it accounts for 10% of the execution time.
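
The effect of the window size can be made concrete with a standard estimate from the TNAF literature; the figures below are our back-of-the-envelope illustration, not measured values. The expected density of nonzero digits in a width-w TNAF of an m-bit scalar is roughly 1/(w+1), so the expected number of point additions is

\[
  A(w) \approx \frac{m}{w+1}, \qquad
  A(4) \approx \frac{233}{5} \approx 47, \qquad
  A(6) \approx \frac{233}{7} \approx 33 .
\]

The ratio 33/47 ≈ 0.70 is consistent with the lower accumulated Multiply count for kG in Table 4.6 (821 178 versus 1 108 890 cycles, a ratio of 0.74).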
Table 4.6: Total accumulated timings per operation for random and fixed-point
multiplication. All values are in cycles.

Operation               |        kP |        kG
------------------------|-----------|----------
TNAF Representation     |   178 135 |   185 926
TNAF Precomputation     |   398 387 |         0
Multiply                | 1 108 890 |   821 178
Multiply Precomputation |   249 750 |   184 950
Square                  |   362 379 |   342 294
Inversion               |   139 936 |   139 656
Support functions       |   377 350 |   376 392
Total                   | 2 814 827 | 1 864 470
Figure 4.3: A breakdown of the execution time for a random point multiplication
(TNAF Representation 6%, TNAF Precomputation 14%, Multiply 40%, Multiply
Precomputation 9%, Square 13%, Inverse 5%, Support functions 13%).
Figure 4.4: A breakdown of the execution time for a fixed point multiplication
(TNAF Representation 9%, Square 17%, Inverse 7%, Multiply 40%, Multiply
Precomputation 9%, Support functions 18%).
The implementation that requires the lowest number of cycles to perform
a modular squaring and a field multiplication operation for 128-bit security was
by TinyPBC on the PXA271. In this implementation they made heavy use of the
target's ability to embed shift instructions inside data-processing instructions,
which is not possible on the ARM Cortex-M0+. The implementation that requires
the lowest number of cycles to perform a modular multiplication for 112-bit security
is our proposed implementation with 3672 cycles. The implementation that requires
the lowest number of cycles to perform a modular multiplication for 80-bit security
is by Kargl et al. [34] with 2593 cycles.
4.4 Analysis of the effect of different processor word sizes

Since field multiplication is the most expensive algorithm with regards to execution
time in an elliptic curve cryptographic system, we will focus our analysis exclusively
on field multiplication. We will evaluate how well the field multiplication methods
from Chapter 3 perform on processors with different word sizes, which we hope will
ease the process of algorithm selection for such processors.
During the analysis the following assumptions were made:
• n + 1 registers are available for the rotating register scheme in the LD with
rotating registers method;
• n + 1 registers are used in the fixed register scheme in the LD with fixed
registers method;
• n + 1 registers are used for the fixed register scheme in the LD with fixed
registers with s = 8 method;
• all the intermediate computed values can be stored inside the available registers
for the divide-and-conquer methods;
• the Operand-scanning method stores only C[0] and C[10] inside memory; all
other intermediate values are kept in the registers.
First we will determine the optimal value of the window size parameter w in
relation to the word size W for each method, in order to achieve the fastest possible
run-time for each multiplication method. The word lengths which are analysed are
8, 16, 32, 64 and 128 bits. The number of words required to store an element of the
underlying field is found with n = ⌈m/W⌉, which leads to a different value of n for
each word size. Next, we will use the optimal values for w to estimate the cycle
counts for the different word lengths. These cycle count estimates can then be used
to determine which method provides the fastest execution time for a given word
length.
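
The resulting values of n, which also appear in the legends of Fig. 4.5, can be checked with a few lines of C (an illustrative snippet of ours, not part of the implementation):

#include <stdio.h>

/* n = ceil(m / W): words needed to store an m-bit field element. */
int main(void)
{
    const int m = 233;
    const int W[] = { 8, 16, 32, 64, 128 };
    for (int i = 0; i < 5; i++) {
        int n = (m + W[i] - 1) / W[i];   /* integer ceiling division */
        printf("W = %3d -> n = %2d\n", W[i], n);
    }
    return 0;   /* prints n = 30, 15, 8, 4 and 2 */
}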
Fig. 4.5 contains a sub-figure for each method for which the optimal parameter of w
is being determined, and each curve in a sub-figure represents a different word size.
We found that the optimal value for w varies between the different methods;
Table 4.7 shows the optimal values of the window size parameter w for each
algorithm and word size.
4.4.2 Analysis

The relative performance of the multiplication methods developed in Chapter 3 is
analysed for different processor word sizes. For each processor word size, we use the
optimal value of the window size parameter w as determined above. Fig. 4.6 shows
the results of these estimations.
Asymmetric Methods
The methods which make use of an asymmetrical split are all quite effective for a
small word size, but their efficiency relative to the symmetrical methods decreases as
the word size is increased. The reason for this performance decrease is the overhead
associated with the larger precomputation table for the most significant polynomial.
Consider equations 3.3 and 3.5: when the word size W increases, the parameter n
decreases, and as n decreases, the number of memory accesses relative to the
symmetrical split method increases. Table 4.8 shows the number of memory accesses
to the lookup tables for both the asymmetrical and symmetrical methods in F2233.
This table clearly shows that as the word size W increases, the overhead due to
memory accesses to the lookup table for the asymmetric split increases. For the
asymmetric methods the number of read, write and XOR instructions decreases as
W increases, but the number of shift instructions increases with W. This effect is
especially noticeable for the Asymmetric Operand-scanning with LD method, where
the shift instructions increase at a much higher rate than for the Asymmetrical
Hybrid method.
Symmetric Methods
The performance of the methods that make use of a symmetrical split improves
(relative to the asymmetrical methods) as the processor's word size is increased.
Eq. 3.7 shows that the overhead due to the pre- and post-shifting operations reduces
as the number of words is decreased, while the number of read, write, shift and XOR
instructions decreases as W increases.
The Symmetrical Hybrid method performs well above average for most word
sizes: it is second best for W = 128 and W = 64, and third best for W = 32.
The Symmetrical Karatsuba with LD method does not perform very well for a
small word size, but it shines for a large word size. The reduced value of n lowers the
impact of the overhead from the third lookup table as well as the overhead due to
the pre- and post-processing. This method performs below average when W ≤ 32,
average for W = 64 and above average for W = 128. Even though it performs better
for a larger word size, the Symmetrical Hybrid method always outperforms it.
Table 4.9 shows the number of memory operations on the lookup table for
the Symmetrical Hybrid and the Symmetrical Karatsuba methods. Recall that
the Karatsuba method only requires three multiplications, at the cost of having to
compute three lookup tables. We can see that the Karatsuba method always requires
between 9% and 10% fewer accesses to the lookup table than the Symmetrical Hybrid
method.
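
For reference, the factor-two Karatsuba-Ofman split over F_2 that yields the three multiplications is the standard one. Writing A = A_1 z^k + A_0 and B = B_1 z^k + B_0 (our notation, not necessarily that of Chapter 3), and noting that additions over F_2 are XORs:

\[
  A \cdot B \;=\; A_1 B_1\, z^{2k}
  \;\oplus\; \bigl[ (A_0 \oplus A_1)(B_0 \oplus B_1)
      \oplus A_0 B_0 \oplus A_1 B_1 \bigr] z^{k}
  \;\oplus\; A_0 B_0 .
\]

Only the three products A_0 B_0, A_1 B_1 and (A_0 ⊕ A_1)(B_0 ⊕ B_1) are needed, which is why three lookup tables have to be computed.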
Table 4.10 shows the overhead which is associated with the Asymmetrical and
Symmetrical Hybrid methods for different word sizes. The Asymmetrical overhead
field is calculated as the difference between the asymmetrical table accesses and the
symmetrical table accesses. The Symmetrical overhead field is calculated with
Eq. 3.7, whose ⌈W/w⌉ term grows when the word size is increased. Table 4.11 also
shows that the number of memory accesses to intermediate values for the
Symmetrical Hybrid, Asymmetrical Hybrid and Symmetrical Karatsuba methods is
always zero, because all the intermediate values are stored inside registers.
The reader is reminded that the models for the different methods were constructed
under the assumption that n + 1 registers are available to store intermediate results.
This assumption is valid for W = 8 [37] and W = 32, but will quite likely be invalid
for processors with W ≥ 64. As W increases, the number of words n required to
store a large number decreases, and therefore the number of registers required to
hold it also decreases. For a processor with a word size of W = 64, only ⌈233/64⌉ = 4
registers are required to store a 233-bit number; if the processor has a total of 13
working registers, then it might be possible to store all the intermediate values of
the field multiplication inside the processor's registers.
Fig. 4.6 shows that there is an improvement of approximately 50% in the number
of cycles each time the word size is doubled. The likely reason is the following: upon
doubling the word length, the number of iterations in the inner loop of the LD-based
algorithms is effectively halved; this gain is weighed against the overhead due to
precomputation and splitting of the input parameters, which does not shrink as
quickly, although there are also fewer table lookups.
4.5 Conclusion
We introduced a system to make energy consumption measurements of the ARM
Cortex-M0+ on the KL25Z development board. Results from our measurements
show that there is a large variation (up to 22.5%) in the energy consumption
of different types of instructions. We compared the energy consumption of our
implementation with other work and found that it beats all the other known
implementations on other platforms by a factor of 7.4. Our timings for field
multiplication, field squaring and point multiplication also compare favourably to
those of other implementations.
We made an analysis of the effect of different word lengths on the field multi-
plication methods discussed in Chapter 3, and we determined that LD with fixed
registers will provide the best performance on all word sizes for F2233 . If it is
desired to reduce memory consumption for the precomputation tables then the
Asymmetrical Hybrid method is recommended for W ≤ 16, and the Symmetrical
Hybrid method for W ≥ 32. In a multi-processing architecture with 2, 4, or more
available cores the Symmetrical Hybrid and Asymmetrical Hybrid methods will
provide the best performance; however, if only three processor cores are available,
then the Symmetrical Karatsuba with LD method should be considered because
each of the three multiplication operations can be assigned to a different core, which
could lead to a faster runtime than performing four multiplications (as would be the
case with the other splitting methods) in groups of two.
Figure 4.5: Cycle estimates as a function of the window size w for each multiplication
method (sub-figures include "Cycle estimates for LD with rotating registers vs window
size", "Cycle estimates for LD fixed, 8-bit scanning vs window size" and "Cycle
estimates for Symmetrical Hybrid scanning vs window size"). Each sub-figure plots
the estimated cycle count against w = 1, ..., 8 with one curve per word size: W = 8
(n = 30), W = 16 (n = 15), W = 32 (n = 8), W = 64 (n = 4) and W = 128 (n = 2).

Figure 4.6: Estimated cycle counts for word sizes of 8, 16, 32, 64 and 128 bits.
Table 4.8: Memory operations on the lookup tables required for the Asymmetrical
Hybrid, Symmetrical Hybrid and Symmetrical Karatsuba methods by using a factor
two split in F2233 .
Table 4.9: Memory operations on the lookup tables required for the Symmetrical
Hybrid vs Symmetrical Karatsuba methods by using a factor two split in F2233 .
Table 4.10: Memory overhead for Symmetrical and Asymmetrical methods in F2233 .
Symmetrical overhead is the cost of performing the pre and post shift operations
(Eq. 3.7). Asymmetrical overhead is calculated from the additional memory accesses
due to the larger lookup table.
Table 4.11: Memory operations on the intermediate values stored inside memory
in F2233. The Symm/Asymm column indicates that neither the Asymmetrical
Hybrid nor the Symmetrical Hybrid method uses any intermediate values stored
in memory.
Chapter 5
General Conclusion
In this dissertation we were faced with the task of implementing a PKC algorithm on
the ARM Cortex-M0+, an ultra low-power microcontroller that has been on the
market for less than a year. We chose to make an ECC implementation, and we
performed a careful selection of parameters in order to obtain an implementation
that matches the architecture of our platform well. We designed a system to perform
energy measurements of the ARM Cortex-M0+, which we used to measure our
implementation's energy consumption as well as the energy consumption of different
instructions on the processor. We found that there is a variation of up to 22.5% in
the energy consumption of different instructions. We decided to use a curve with
security equivalent to a 112-bit symmetric encryption algorithm. In order to
determine which curve best matches our architecture, we created two models (one
for binary curves and one for prime curves) to simulate a point multiplication on
each curve. We selected binary Koblitz curves in F2233 because we estimated that
these curves would lead to a faster implementation: the wTNAF method replaces
the point doubling operations with squarings, and binary field arithmetic does not
require the ADD instruction, which consumes a minimum of 6.9% more energy than
any other instruction on the ARM Cortex-M0+. The target platform only has a
short multiply instruction, which would be used frequently in a prime curve
implementation and would have caused a lot of overhead.
We discussed a number of different existing field multiplication algorithms, and
attempted to improve them to find an even better match for our architecture. In
general two different types of algorithms were explored: (1) algorithms based on
splitting the input parameters, and (2) algorithms which process the input parameters
as a whole. For the split input parameter algorithms we combined the LD algorithm
with a symmetrical split, an asymmetrical split, an operand-scanning form, and the
Karatsuba-Ofman approach. For the whole input parameter algorithms we tried to
find optimizations for the LD and LD with rotating registers algorithms. A new
field multiplication algorithm called López-Dahab with fixed registers was
introduced, which is an optimization of the LD algorithm and leads to a performance
improvement of 20.24% over the LD with rotating registers algorithm. We
created instruction count models for all the different methods in order to select the
fastest algorithm.
We made an implementation with our LD with fixed registers field multiplication
algorithm that uses the wTNAF point multiplication method, with w = 4 for
random point multiplication and w = 6 for fixed point multiplication. The RELIC
toolkit was used to perform the TNAF precomputation and the TNAF transformation
of the scalar k. All the elliptic curve arithmetic and field arithmetic was implemented
in C and assembly. On average our implementation of the random point multiplication
requires 2 814 827 cycles and 36.6 µJ, whereas our fixed point multiplication requires
1 864 470 cycles and 24.6 µJ. The RELIC toolkit was used to make an implementation
on the ARM Cortex-M0+, which requires an average of 5 621 045 cycles and 72.5 µJ
for a random point multiplication, and 5 553 828 cycles and 71.6 µJ for a fixed point
multiplication. Our implementation is therefore 1.99 times faster in cycle count than
the RELIC implementation for random point multiplication, and 2.98 times faster
for fixed point multiplication.
We used the instruction count models for all the algorithms to compare the
different algorithms when the word size is changed. We found that the LD with
fixed registers algorithm always performs better than the other algorithms,
regardless of the word size.
5.1 Future work

A faster algorithm could be implemented for the simultaneous point multiplication
routine used by the ECDSA signature verification.
The RELIC toolkit can be used in conjunction with the GMP library, which
allows for even more efficient calculations. It is possible to compile the GMP library
for the ARM Cortex-M0+, followed by recompiling the RELIC toolkit with the
option to use the GMP library, which should lead to faster execution times.
Implementing a constant-time point multiplication algorithm based on the LD
with fixed registers method would lead to an implementation which is resistant
to simple power-consumption based side-channel attacks; a sketch of one such
building block follows.
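
A typical building block of such an implementation is a branch-free conditional swap, sketched below under our own naming; it is not part of the thesis code.

#include <stdint.h>

/* Swap two n-word field elements iff bit == 1, without branching:
 * the executed instruction sequence and memory access pattern are
 * identical for bit == 0 and bit == 1.                             */
static void cswap(uint32_t *a, uint32_t *b, int n, uint32_t bit)
{
    uint32_t mask = (uint32_t)0 - bit;   /* 0xFFFFFFFF iff bit == 1 */
    for (int i = 0; i < n; i++) {
        uint32_t t = mask & (a[i] ^ b[i]);
        a[i] ^= t;
        b[i] ^= t;
    }
}

A Montgomery-ladder style loop can then select between its two working points with cswap instead of an if statement, keeping the power trace independent of the scalar bits.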
More comparisons with other libraries can be made for the ARM Cortex-M0+.
It should be possible to make an implementation on the target platform using the
MIRACL library, as well as the OpenECC library [53].
All the formulas for the instruction counts in Chapter 3 could be altered to include
a parameter for the number of available registers on the target processor. Currently,
some of the formulas assume that n + 1 registers are available, which might be an
invalid assumption for processors with W ≥ 64. If all the formulas were developed
with a parameter for the register count, this would also have allowed us to estimate
the performance of these algorithms for different security parameters.
Bibliography
[2] ECRYPT, II, “Yearly Report on Algorithms and Keysizes (2011-2012), D.SPA.20
Rev. 1.0,” tech. rep., ICT-2007-216676, 2011.
[9] A. K. Lenstra, “Key Lengths,” Handbook of Information Security, vol. 2, pp. 617–
635, 2004.
[11] H. Orman and P. Hoffman, “Determining Strengths For Public Keys Used For
Exchanging Symmetric Keys, RFC 3766.” http://www.ietf.org/rfc/rfc3766.txt,
2004. Accessed: 2013-06-06.
[20] P. Barrett, “Implementing the Rivest Shamir and Adleman public key encryption
algorithm on a standard digital signal processor,” in Advances in Cryptology–
CRYPTO, vol. 86, pp. 311–323, Springer, 1987.
[26] Moteiv Corporation, “Tmote sky - ultra low power IEEE 802.15.4 compliant
wireless sensor module.” http://insense.cs.st-andrews.ac.uk/files/2013/04/tmote-sky-datasheet.pdf,
2006. Accessed: 2013-06-06.
[28] D. G. Chinnery and K. Keutzer, “Closing the power gap between ASIC and
custom: an ASIC perspective,” in Proceedings of the 42nd annual Design
Automation Conference, pp. 275–280, ACM, 2005.
[33] Çetin Kaya Koç and C. Paar, eds., Cryptographic Hardware and Embedded
Systems - CHES 2000, Second International Workshop, Worcester, MA, USA,
August 17-18, 2000, Proceedings, vol. 1965 of Lecture Notes in Computer Science,
Springer, 2000.
[34] A. Kargl, S. Pyka, and H. Seuschek, “Fast arithmetic on ATmega128 for elliptic
curve cryptography,” International Association for Cryptologic Research Eprint
archive, 2008.
[40] S. C. Seo, D.-G. Han, H. C. Kim, and S. Hong, “TinyECCK: Efficient Elliptic
Curve Cryptography Implementation over GF(2^m) on 8-Bit Micaz Mote,” IEICE
Transactions on Information and Systems, vol. E91-D, pp. 1338–1347, May 2008.
[49] J. López and R. Dahab, “Fast multiplication on elliptic curves over GF(2^m)
without precomputation,” in Cryptographic Hardware and Embedded Systems,
pp. 316–327, Springer, 1999.
[53] X. Fan, “An Open Source Library for Elliptic Curve Cryptography.”
http://openecc.org/, 2013. Accessed: 2013-05-28.