Keccak implementation overview

Version 3.1

Guido Bertoni¹
Joan Daemen¹
Michaël Peeters²
Gilles Van Assche¹
Ronny Van Keer¹

http://keccak.noekeon.org/
Contents

1 General aspects
  1.1 Specifications summary
  1.2 Bit and byte numbering conventions
    1.2.1 Some justification for our choice
  1.3 Operation count
  1.4 Memory
2 Implementation techniques
  2.1 Bit interleaving
  2.2 The lane complementing transform
  2.3 Extending the state for smoother scheduling
  2.4 Plane-per-plane processing
    2.4.1 Early parity
    2.4.2 Combining with bit interleaving
  2.5 Efficient in-place implementations
    2.5.1 Combining with bit interleaving
3 Software
  3.1 PC and high-end platforms
    3.1.1 Using SIMD instructions
    3.1.2 SIMD instructions and tree hashing
    3.1.3 Batch or tree hashing on a graphics processing unit
  3.2 Small 32-bit platforms
    3.2.1 Implementation on an ARM Cortex-M3
  3.3 Small 8-bit platforms
    3.3.1 Estimation on an Intel 8051 processor
    3.3.2 Implementation on an Atmel AVR processor
4 Hardware
  4.1 Introduction
  4.2 High-speed core
  4.3 Variants of the high-speed core
    4.3.1 Keccak[r = 1024, c = 576]
    4.3.2 Keccak[r = 40, c = 160]
  4.4 Low-area coprocessor
    4.4.1 Keccak[r = 1024, c = 576]
    4.4.2 Keccak[r = 40, c = 160]
  4.5 FPGA implementations
    4.5.1 High-speed core
Introduction
Acknowledgments
We wish to thank (in no particular order) Joachim Strömbergson for useful comments on our
FPGA implementation, Joppe Bos for reporting a bug in the optimized implementation, all
people who contributed to implementations or benchmarks of Keccak in hardware or software,
Virgile Landry Nguegnia Wandji for his work on DPA-resistant Keccak implementations,
Joris Delclef and Jean-Louis Modave for kindly lending us fast hardware, Daniel Otte
and Christian Wenzel-Benner for their support on embedded platforms, Renaud Bauvin for
his implementation in Python and Francesco Regazzoni for fruitful discussions.
Chapter 1
General aspects
In this chapter, we first briefly review the Keccak specifications formally defined in [10]. We
then detail the bit and byte numbering conventions, followed by general statements about
the operation count and memory usage of Keccak.
Keccak-f[b](A)
  for i in 0 ... nr − 1
    A = Round[b](A, RC[i])
  return A
Round[b](A, RC)

  θ step:
    C[x] = A[x, 0] ⊕ A[x, 1] ⊕ A[x, 2] ⊕ A[x, 3] ⊕ A[x, 4],   ∀x in 0 ... 4
    D[x] = C[x − 1] ⊕ ROT(C[x + 1], 1),                        ∀x in 0 ... 4
    A[x, y] = A[x, y] ⊕ D[x],                                  ∀(x, y) in (0 ... 4, 0 ... 4)

  ρ and π steps:
    B[y, 2x + 3y] = ROT(A[x, y], r[x, y]),                     ∀(x, y) in (0 ... 4, 0 ... 4)

  χ step:
    A[x, y] = B[x, y] ⊕ ((NOT B[x + 1, y]) AND B[x + 2, y]),   ∀(x, y) in (0 ... 4, 0 ... 4)

  ι step:
    A[0, 0] = A[0, 0] ⊕ RC

  return A
Here the following conventions are in use. All the operations on the indices are done
modulo 5. A denotes the complete permutation state array and A[x, y] denotes a particular
lane in that state. B[x, y], C[x] and D[x] are intermediate variables. The symbol ⊕ denotes
the bitwise exclusive OR, NOT the bitwise complement and AND the bitwise AND operation.
Finally, ROT(W, r) denotes the bitwise cyclic shift operation, moving the bit at position i into
position i + r (modulo the lane size).
The constants r[x, y] are the cyclic shift offsets and are specified in the following table.
The constants RC[i] are the round constants. The following table specifies their values in
hexadecimal notation for lane size 64. For smaller sizes they must be truncated.
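The round pseudocode above translates almost line by line into C. The following sketch is our own illustration, not code from the reference implementations: one round of Keccak-f[1600] on a state of 25 uint64_t lanes indexed as A[x + 5y], with the standard Keccak-f[1600] shift offsets; the round constant is passed in as a parameter.

```c
#include <stdint.h>

/* Rotate a 64-bit lane left by r positions; requires 0 < r < 64. */
#define ROT64(w, r) (((w) << (r)) | ((w) >> (64 - (r))))

/* Cyclic shift offsets r[x][y] of Keccak-f[1600]. */
static const unsigned R[5][5] = {
    { 0, 36,  3, 41, 18},   /* x = 0 */
    { 1, 44, 10, 45,  2},   /* x = 1 */
    {62,  6, 43, 15, 61},   /* x = 2 */
    {28, 55, 25, 21, 56},   /* x = 3 */
    {27, 20, 39,  8, 14}    /* x = 4 */
};

/* One round of Keccak-f[1600]; A[x + 5*y] is lane (x, y), RC the round constant. */
void keccakf1600_round(uint64_t A[25], uint64_t RC)
{
    uint64_t C[5], D[5], B[25];
    int x, y;

    /* theta: column parities, theta-effect, then XOR into every lane */
    for (x = 0; x < 5; x++)
        C[x] = A[x] ^ A[x + 5] ^ A[x + 10] ^ A[x + 15] ^ A[x + 20];
    for (x = 0; x < 5; x++)
        D[x] = C[(x + 4) % 5] ^ ROT64(C[(x + 1) % 5], 1);
    for (x = 0; x < 5; x++)
        for (y = 0; y < 5; y++)
            A[x + 5*y] ^= D[x];

    /* rho and pi: B[y, 2x + 3y] = ROT(A[x, y], r[x, y]) */
    for (x = 0; x < 5; x++)
        for (y = 0; y < 5; y++) {
            unsigned rot = R[x][y];
            uint64_t t = rot ? ROT64(A[x + 5*y], rot) : A[x + 5*y];
            B[y + 5*((2*x + 3*y) % 5)] = t;
        }

    /* chi: A[x, y] = B[x, y] ^ ((NOT B[x+1, y]) AND B[x+2, y]) */
    for (y = 0; y < 5; y++)
        for (x = 0; x < 5; x++)
            A[x + 5*y] = B[x + 5*y] ^ ((~B[(x+1)%5 + 5*y]) & B[(x+2)%5 + 5*y]);

    /* iota */
    A[0] ^= RC;
}
```

As a sanity check against the pseudocode: on the all-zero state, θ, ρ, π and χ leave every lane zero, so after one round only lane (0, 0) holds the round constant.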
We obtain the Keccak[r, c] sponge function, with parameters capacity c and bitrate r, if
we apply the sponge construction to Keccak-f[r + c] and perform specific padding on the
message input. The following pseudocode is restricted to the case of messages that span a
whole number of bytes and where the bitrate r is a multiple of the lane size.
Keccak[r, c](M)

  Padding:
    P = M || 0x01 || 0x00 || ... || 0x00
    P = P ⊕ 0x00 || ... || 0x00 || 0x80

  Initialization:
    S[x, y] = 0,  ∀(x, y) in (0 ... 4, 0 ... 4)

  Absorbing phase:
    for every block Pi in P
      S[x, y] = S[x, y] ⊕ Pi[x + 5y],  ∀(x, y) such that x + 5y < r/w
      S = Keccak-f[r + c](S)

  Squeezing phase:
    Z = empty string
    while output is requested
      Z = Z || S[x, y],  ∀(x, y) such that x + 5y < r/w
      S = Keccak-f[r + c](S)
  return Z
Here S denotes the state as an array of lanes. The padded message P is organized as
an array of blocks Pi, themselves organized as arrays of lanes. The || operator denotes byte
string concatenation.
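For byte-aligned messages, the padding above appends the byte 0x01, fills with zero bytes up to the block boundary, and XORs 0x80 into the last byte of the block. A minimal sketch (function name and byte-oriented interface are our own):

```c
#include <stddef.h>
#include <string.h>

/* Pad the last, partial message block of len bytes (len < rate_bytes)
 * into block[] of rate_bytes, per the pseudocode above:
 * append 0x01, fill with 0x00, then XOR 0x80 into the final byte.
 * When len == rate_bytes - 1, the two pad bytes merge into 0x81. */
void keccak_pad_block(unsigned char *block, size_t rate_bytes,
                      const unsigned char *msg_tail, size_t len)
{
    memcpy(block, msg_tail, len);
    memset(block + len, 0, rate_bytes - len);
    block[len] ^= 0x01;
    block[rate_bytes - 1] ^= 0x80;
}
```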
2. The bytes within a CPU word are numbered from zero onwards, with the byte at offset i
being the coefficient of 256^i when a word needs to be represented as an integer ("little-endian"
convention).
The consequence of the first convention is that, as in ρ, the operation consisting of moving
bits of coordinates (x, y, z) to (x, y, z + δ mod w) becomes a rotation "to the left" by δ
positions if the lane is in a CPU word.
The consequence of the two conventions jointly is that the bytes constituting an input block
do not have to be shuffled before being XORed into the state when the state is represented as
an array of lanes on a little-endian processor. Similarly, the bytes constituting an output block
can be taken as is from the state in the same representation.
Note that the SHA-3 API defined by NIST [28] follows the opposite convention for bit
numbering, and this is the reason behind the formal bit reordering described in [11].
An example of the formal bit reordering can be found in the files
KeccakSpongeIntermediateValues_*.txt in [8].
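The byte-numbering convention can be made concrete with a portable lane load (a sketch of our own): byte i is the coefficient of 256^i, so on a little-endian CPU this reduces to a plain 8-byte load and input blocks can be XORed into the state without shuffling.

```c
#include <stdint.h>

/* Assemble a 64-bit lane from 8 message bytes, byte i weighted by 256^i
 * ("little-endian" convention). On a little-endian CPU this is equivalent
 * to a direct 8-byte load, so no byte shuffling is needed. */
static uint64_t load64_le(const unsigned char *b)
{
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)
        w |= (uint64_t)b[i] << (8 * i);
    return w;
}
```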
Table 1.1: Performance for bitrate r relative to the default r = 1024 (with c = 1600 − r)

    r      c    Relative performance
   576   1024       ÷1.778
   832    768       ÷1.231
  1024    576       1
  1088    512       ×1.063
  1152    448       ×1.125
  1216    384       ×1.188
  1280    320       ×1.250
  1344    256       ×1.312
  1408    192       ×1.375
For an output length n smaller than or equal to the bitrate, the squeezing phase does not
imply any additional processing. However, if more output bits are needed, the additional
number of calls to Keccak-f for an n-bit output is ⌈n/r⌉ − 1.
When evaluating Keccak, the processing time is dominantly spent in the evaluation of
Keccak-f. To a good approximation, the throughput of Keccak for long messages is therefore
proportional to r for a given permutation width b. For instances with r + c = 1600, we will
often write performance figures for the default bitrate r = 1024. To estimate the performance
for another bitrate, Table 1.1 provides the performance relative to the default bitrate. This
is valid for long messages; for short messages, the processing time is determined by the
required number of calls to Keccak-f (e.g., one when l ≤ r − 2).
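The call count above can be made explicit in a small sketch of our own: padding adds at least two bits, so absorbing an l-bit message takes ⌈(l + 2)/r⌉ calls to Keccak-f, and an n-bit output adds ⌈n/r⌉ − 1 squeezing calls.

```c
/* Number of Keccak-f evaluations to hash an l-bit message to an n-bit
 * output with bitrate r: ceil((l+2)/r) absorbing calls (padding adds at
 * least 2 bits) plus ceil(n/r) - 1 extra squeezing calls. */
unsigned long keccak_f_calls(unsigned long l, unsigned long n, unsigned long r)
{
    unsigned long absorb  = (l + 2 + r - 1) / r;
    unsigned long squeeze = (n + r - 1) / r;   /* total squeezes incl. the first */
    return absorb + (squeeze > 0 ? squeeze - 1 : 0);
}
```

For example, with r = 1024 a message of l = r − 2 bits fits in one call, while one more bit forces a second block.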
On a platform supporting operations on w-bit words, each lane can be mapped to such
a word. If the platform supports only smaller, m-bit words, each lane has to be mapped to
w/m such words. There are different kinds of mappings. As long as the same mapping is
applied to all lanes, each bitwise Boolean operation on a lane is translated into w/m instructions
on CPU words. The most straightforward mapping is to take as CPU word i the lane bits
with z = mi ... m(i + 1) − 1. In that case, the w-bit lane rotations need to be implemented using a
number of shifts and bitwise Boolean instructions. Another possible mapping, which translates
the w-bit rotations into a series of m-bit CPU word rotation instructions, is introduced in
Section 2.1.
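With the straightforward mapping, a w-bit rotation costs several shifts and Boolean operations on the smaller words. A sketch of our own for w = 64 on two 32-bit words, where lo holds lane bits 0..31 and hi holds bits 32..63:

```c
#include <stdint.h>

/* Rotate left by n (1 <= n <= 31) a 64-bit lane stored as two 32-bit
 * words under the straightforward mapping: lo = bits 0..31, hi = bits
 * 32..63. For n >= 32, first swap lo and hi and reduce n by 32. */
void rot64_on_32(uint32_t *lo, uint32_t *hi, unsigned n)
{
    uint32_t l = *lo, h = *hi;
    *lo = (l << n) | (h >> (32 - n));   /* low word receives the carry from hi */
    *hi = (h << n) | (l >> (32 - n));   /* high word receives the carry from lo */
}
```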
In terms of lane operations, the evaluation of Keccak-f[b] uses
• 76nr XORs,
• 25nr ANDs and 25nr NOTs, and
• 29nr w-bit rotations.
For Keccak-f[1600] specifically, this becomes
• 1824 XORs,
• 600 ANDs and 600 NOTs, and
• 696 64-bit rotations.
Some instruction sets offer an instruction that combines the AND and NOT operations
into one operation as in χ, i.e., c ← (NOT a) AND b. If such an instruction is not available on
a CPU, almost 80% of the NOT operations can be removed by applying a lane complementing
transform as explained in Section 2.2, turning a subset of the AND operations into OR
operations.
In [9], we provide a simple implementation (called Simple) that maps a lane to a CPU
word of 8, 16, 32 or 64 bits, and hence is suitable for Keccak-f[200], Keccak-f[400],
Keccak-f[800] and Keccak-f[1600].
1.4 Memory
In terms of memory usage, Keccak has no feedforward loop, as opposed to many other
constructions, and the message block can be directly XORed into the state. This can benefit
applications in which the message is formatted on the fly or does not need to be kept after
being hashed. This also applies where the hashing API has to implement a message queue.
In general a message queue must be allocated, which can be avoided for sponge functions
or similar constructions.
The amount of working memory is limited to the state, the round number and some
extra working memory for θ and χ. Five w-bit words of extra working memory allow the
implementation of θ to compute the XOR of the sheets, while they can hold the five lanes of
a plane when χ is being computed.
An efficient memory-constrained implementation technique is presented in Section 2.5.
Examples of implementations that use a limited amount of memory can be found in the
Compact, Compact8, Inplace and Inplace32BI implementations in [9]. Some of these
implementations also provide a message-queue API (Init, Update and Final),
for which no extra memory needs to be allocated.
Chapter 2
Implementation techniques
In this chapter, we introduce techniques that can be used to optimize software and hardware
implementations, namely bit interleaving, lane complementing, plane-per-plane processing
and efficient in-place evaluation.
may present the input in interleaved form and use the output in this form too. The resulting
function will only differ from Keccak by the order of bits in its input and output.
In hashing applications, the output is usually kept short and some overhead is caused by
applying the bit-interleaving transform to the input. Such a transform can be implemented
using shifts and bitwise operations. For implementing Keccak-f[1600] on a 32-bit CPU, we
must distribute the 64 bits of a lane to two 32-bit words. On some platforms, look-up tables
can speed up this process, although the gain is not always significant.
Several examples of implementations making use of bit interleaving can be found in [9],
notably the Reference32BI and Simple32BI implementations. Also, the Optimized32
implementation can be set to use look-up tables by defining the UseInterleaveTables
symbol in KeccakF-1600-opt32-settings.h. Intermediate values to help debug a bit-interleaved
implementation can be found in the file KeccakPermutationIntermediateValues32BI.txt
in [8]. Finally, KeccakTools can generate code that makes use of bit
interleaving for any lane size and any (smaller) CPU word size [7].
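The factor-2 transform can be sketched as follows (an illustrative bit-by-bit version of our own; optimized implementations use shift-and-mask networks or the look-up tables mentioned above): the 64-bit lane is split into a word of even-indexed bits and a word of odd-indexed bits, after which a 64-bit rotation by 1 costs a single 32-bit rotation plus a word swap.

```c
#include <stdint.h>

/* Factor-2 bit interleaving: bit j of *even is lane bit 2j,
 * bit j of *odd is lane bit 2j+1. */
void interleave2(uint64_t lane, uint32_t *even, uint32_t *odd)
{
    uint32_t e = 0, o = 0;
    for (int j = 0; j < 32; j++) {
        e |= (uint32_t)((lane >> (2*j))     & 1) << j;
        o |= (uint32_t)((lane >> (2*j + 1)) & 1) << j;
    }
    *even = e;
    *odd  = o;
}

/* A 64-bit rotation by 1 in the interleaved representation:
 * the new odd word is the old even word, and the new even word is
 * the old odd word rotated by 1 -- one 32-bit rotation, no shifts. */
void rot1_interleaved(uint32_t *even, uint32_t *odd)
{
    uint32_t e = *even, o = *odd;
    *even = (o << 1) | (o >> 31);
    *odd  = e;
}
```

More generally, a 64-bit rotation by an even amount 2k becomes two 32-bit rotations by k, and by an odd amount 2k+1 a word swap plus rotations by k+1 and k.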
The equation for the bits of A[1] now becomes A[1] = a[1] ⊕ (ā[2] ∧ a[3]), where ā[2] denotes
the complemented representation of a[2]. This results in the cancellation of two NOT
operations and A[0] being represented by its complement Ā[0]. Similarly, representing a[4]
by its complement cancels two more NOT operations.
In the computation of the bits of A[4] the NOT operation cannot be avoided without introducing
NOT operations in other places. We do however have two options: one can choose between
computing A[4] and its complement Ā[4]. In each of the two cases a NOT operation must be
performed on either a[0] or on a[1]. These can be used to compute Ā[0] rather than A[0], or
Ā[1] rather than A[1], respectively, adding another degree of freedom for the implementer.
In the output some lanes are represented by their complement and the implementer can
choose from 4 output patterns. In short, representing lanes a[2] and a[4] by their complement
reduces the number of NOT operations from 5 to 1 and replaces 2 or 3 AND operations by
OR operations. It is easy to see that complementing any pair of lanes a[i]
and a[(i + 2) mod 5] will result in the same reduction. Moreover, this is also the case when
complementing all lanes except a[i] and a[(i + 2) mod 5]. This results in 10 possible input
patterns in total.
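The reduction can be checked concretely. In the sketch below (our own illustration of the transform), χ on one plane is evaluated both in the standard way and with a[2] and a[4] entering in complemented form; the variant uses a single NOT, two of the ANDs become ORs, and the outputs A[0] and A[4] come out complemented.

```c
#include <stdint.h>

/* Standard chi on one plane: A[x] = a[x] ^ (~a[x+1] & a[x+2]). */
void chi_plane(const uint64_t a[5], uint64_t A[5])
{
    for (int x = 0; x < 5; x++)
        A[x] = a[x] ^ (~a[(x + 1) % 5] & a[(x + 2) % 5]);
}

/* Lane-complemented chi: the input has a[2] and a[4] complemented,
 * the output has A[0] and A[4] complemented. One NOT remains,
 * and two ANDs are replaced by ORs. */
void chi_plane_complemented(const uint64_t a[5], uint64_t A[5])
{
    A[0] = a[0] ^ (a[1] | a[2]);    /* complement of the standard A[0] */
    A[1] = a[1] ^ (a[2] & a[3]);
    A[2] = a[2] ^ (a[3] | a[4]);
    A[3] = a[3] ^ (a[4] & a[0]);
    A[4] = a[4] ^ (~a[0] & a[1]);   /* the single remaining NOT; complemented */
}
```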
Clearly, this can be applied to all 5 planes of the state, each with its own input and output
patterns. We apply a complementing pattern (or mask) p at the input of χ and choose for each
plane an output pattern resulting in P. This output mask P propagates through the linear
steps θ, ρ, π (and ι) as a symmetric difference pattern, to result in yet another (symmetric)
mask p′ = π(ρ(θ(P))) at the input of χ of the next round. We have looked for couples of
masks (p, P) that are round-invariant, i.e., with p = π(ρ(θ(P))), and found one that complements
the lanes in the following 6 (x, y) positions at the output of χ or input of θ:
P : {(1, 0), (2, 0), (3, 1), (2, 2), (2, 3), (0, 4)}.
The Optimized32 and Optimized64 implementations can be set to use lane complementing
by defining the UseBebigokimisa symbol in KeccakF-1600-opt*-settings.h [9].
KeccakTools can generate code that makes use of lane complementing for any lane size [7].
may compute sets of rows in parallel by making use of the bitwise Boolean instructions and
cyclic shifts. The simplest such grouping is by taking all the rows in a plane. Here ρ consists
of five cyclic shifts and π requires addressing. One can also apply bit interleaving and group
all rows in a plane with even or odd z coordinates.
The computation of the θ-effect from a state can be done bit by bit. For a given column,
one can initialize its column parity to zero and accumulate the bits of the column into it one
by one. This can again be grouped: lane by lane or using bit interleaving. This computation
can however also be integrated in a single phase together with the computation of the round
output, as in Section 2.4.1. After χ has been applied to a row, one accumulates each of the
bits into the corresponding column parity. After all rows of a slice have been computed, the
corresponding column parities will have their correct value. These column parities must then
simply be combined into the θ-effect. Note that this scheduling implies storage for the
θ-effect of the round input and storage for the θ-effect of the round output that is being
computed.
The five bits of the round input used in the computation of a row of the round output are
not used in the computation of any other row. So the registers containing these round input
bits may be overwritten by the computed round output bits. This allows reducing the working
memory for storing intermediate computation results to the working state, the θ-effect of the
round input and possibly the column parities being computed. This is interesting in
software implementations on platforms with restricted RAM. It comes at the expense of more
complex addressing, as the position of state bits depends on the round number.
for a fixed y. Then, the plane y at the output of the round is obtained as above,
without forgetting to XOR E[0, 0] with the round constant to implement ι. Note that, since
y is fixed, the variable B is a plane and can be indexed only by its x coordinate. To get the
parities of the columns before θ, denoted C[·], the simplest solution is to compute them in
a first phase of the round. All the steps are summarized in Algorithm 1. Note that all lane
addresses can be fixed by loop unrolling. The output variables E[·, ·] can be used as input
for the next round.
Most software implementations available actually use plane-per-plane processing [9].
The code generated with KeccakTools' KeccakFCodeGen::genCodePlanePerPlane()
follows this idea too [7]. The code generation can be combined with lane complementing
if desired.
When processing a plane y as in Algorithm 1, χ needs five lanes coming from the coordinates
M−1(x, y)T = (x + 3y, x)T for all x ∈ Z5 at the input of the round. This determines a line
with slope 1 going through (0, 2y) (input variety). The five lanes at the output of χ are per
construction on a line with slope 0 going through (0, y) (output variety). To avoid overwriting
other variables, we must store the five lanes of the output variety in the memory location
that held the input variety. We choose that the relationship between the lane coordinates and
the memory location is represented by a linear matrix N over Z5. At the end of the first
round, the lane at (x, y)T is stored at location N(x, y)T. To satisfy the mapping of the output
variety onto the input variety, N must be of the form

  ( 1 a )
  ( 1 b )

with a + b = 2. We choose to set a = 0 so as to leave the x coordinate unchanged during the
mapping, hence

  N = ( 1 0 )
      ( 1 2 ).
At the beginning of the second round, the input variety appears rotated by N in memory,
so the lane at (x, y)T is stored at location N^2(x, y)T at the end of the second round. In general,
at round i, the lanes with coordinates (x, y)T are stored at location N^i(x, y)T.
The input of χ with coordinates (x, y)T requires the lane with coordinates M−1(x, y)T before
π, which is located at coordinates N^i M−1(x, y)T in memory. Notice that N^i M−1(x, y)T =
(x + 3y, . . . )T, so there is a shift in the x coordinate inherent to the lane-per-lane processing.
This can be explicitly compensated for by multiplying all coordinates by

  MN = ( 1 2 )
       ( 0 1 )

and integrating this shift before or after χ, as χ is translation-invariant. This leads to a simpler
expression: the input of χ with coordinates (x + 2y, y)T requires the lane with coordinates
N(x, y)T before π, which is located at coordinates N^{i+1}(x, y)T in memory. The resulting
algorithm can be found in Algorithm 4.
Note the following properties of N:
• The matrix N has order 4, so after every 4 rounds the variable A will contain the state
without transposition. Unrolling 4 rounds implies that all the coordinates can be
constants.
Finally, note that Algorithm 4 can also be combined with the early parity, lane complementing
and bit interleaving techniques, the latter being detailed below.
The Inplace and Inplace32BI implementations explicitly use the technique presented in
this section on Keccak-f[1600], one using 64-bit rotations and the other using bit interleaving
for 32-bit rotations [9]. KeccakTools' KeccakFCodeGen::genCodeInPlace()
can generate code for 4 unrolled rounds using this in-place technique, with factor-2 interleaving
or without it [7].
coordinate ζ when necessary, with ζ ∈ Zs and s = w/m the interleaving factor. All the
operations on the x, y (resp. ζ) coordinates are considered modulo 5 (resp. modulo s).
While the memory location of a given lane varies from round to round according to the
N^i(x, y)T rule, the ζ coordinate of words varies as well. This is because five words are
fetched and processed from a given set of five memory locations, and the resulting five
words must be stored into the same five locations for the implementation to be in-place.
Let us first illustrate this with an example using interleaving with factor s = 4 (e.g., to compute
Keccak-f[1600] with 16-bit operations) and using the notations of Algorithm 3. At the
beginning of the first round i = 0, words are stored in memory locations with the identical
coordinates. Assume that we want to process the quarter-plane y = 2 and ζ = 0. At some
point, we need to compute the word B[0], i.e., word (x, y, ζ) = (0, 2, 0) at the input of χ. To
do so, we need the word A[1, 0, 3] since (x′, y′)T = (1, 0)T = M−1(x, y)T and r[1, 0] = 1, so
ζ′ = ζ − 1 = 3 (mod s). At the beginning of the round, word (1, 0, 3) is stored in memory
at (1, 0, 3). We will need to reuse the memory locations that we just processed. So, to respect
the N^i(x, y)T rule, all words belonging to lane (1, y) = (1, 2) have to be stored at memory
locations (1, 0, ·) since N(1, 2)T = (1, 0)T. In particular, word (1, y, ζ) = (1, 2, 0) needs to
be stored at memory location (1, 0, 3) since it is the only memory location free at this point
within (1, 0, ·). So the third coordinate of the word (0 in this example) can become different
from the third coordinate of the memory location storing it (3 in this example). For the other
words of the lane, there will be a constant shift between the third coordinates.
In general, the relationship between the third coordinate of words and that of the memory
location depends on the rotation constants. We model this relationship as follows. At the
beginning of round i, word ζ of lane (x, y)T is stored at memory location (x′, y′, ζ + O(i, x′, y′))
with (x′, y′)T = N^i(x, y)T. Here O(i, x, y) is viewed as a property of the memory cells (x, y, ·):
it tells by how many positions the words are shifted.
Using the notations of Algorithm 3, word (x′, y′, ζ′) is read in order to evaluate output
word (x, y, ζ). Word (x′, y′, ζ′) is located in memory at (x′′, y′′, ζ′ + O(i, x′′, y′′)), with
(x′′, y′′)T = N^i(x′, y′)T, which can now be reused. Hence some output word within the
part of the plane being processed (·, y, ζ) is stored at the same memory position (x′′, y′′, ζ′ +
O(i, x′′, y′′)) at the end of round i. Looking only at the third coordinate, we use memory
location ζ′ + O(i, x′′, y′′) = ζ − r[x′, y′] + O(i, x′′, y′′) (mod s) to store word ζ at the
beginning of round i + 1. Hence,

  O(i, x, y) = − ∑_{j=0}^{i−1} r[N^{−j}(x, y)T]  (mod s).

The resulting algorithm is displayed in Algorithm 5. Note that O(i, 0, 0) = 0 for all rounds
i, so the last line (evaluation of ι) can be simplified.
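The word-offset table can be tabulated directly from this formula. The sketch below (our own illustration) computes O(i, x, y) for the factor-2 interleaved case s = 2, using N^{−1} = (1 0; 2 3) over Z5 (the inverse of N = (1 0; 1 2)) and the standard Keccak-f rotation offsets.

```c
/* Cyclic shift offsets r[x][y] of Keccak-f. */
static const unsigned R[5][5] = {
    { 0, 36,  3, 41, 18},
    { 1, 44, 10, 45,  2},
    {62,  6, 43, 15, 61},
    {28, 55, 25, 21, 56},
    {27, 20, 39,  8, 14}
};

/* Apply N^{-1} = (1 0; 2 3) over Z5, the inverse of N = (1 0; 1 2). */
static void apply_Ninv(unsigned *x, unsigned *y)
{
    unsigned nx = *x % 5;
    unsigned ny = (2 * *x + 3 * *y) % 5;
    *x = nx;
    *y = ny;
}

/* O(i, x, y) = - sum_{j=0}^{i-1} r[N^{-j}(x, y)]  (mod s). */
unsigned word_offset(unsigned i, unsigned x, unsigned y, unsigned s)
{
    unsigned sum = 0;
    for (unsigned j = 0; j < i; j++) {
        sum += R[x][y];          /* j = 0 uses (x, y) itself */
        apply_Ninv(&x, &y);      /* move to the next term of the sum */
    }
    return (s - sum % s) % s;
}
```

As the text notes, r[0, 0] = 0 and N^{−1} fixes (0, 0), so O(i, 0, 0) = 0 for every round.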
    end for
    for x = 0 to 4 do
      Let (x′′, y′′)T = N^{i+1}(x, y)T
      A[x′′, y′′, ζ + O(i + 1, x′′, y′′)] = B[x] ⊕ ((NOT B[x + 1]) AND B[x + 2])
    end for
  end for
  A[0, 0, ζ + O(i + 1, 0, 0)] = A[0, 0, ζ + O(i + 1, 0, 0)] ⊕ RCs[i, ζ] for ζ = 0 to s − 1
• The matrix N and its powers leave the set x + y = 0 unchanged, i.e., the set x + y = 0
is an eigenspace with eigenvalue 1.
• Except for the points in the set x + y = 0, the points N^i(x, y)T stay in column x, take
all the positions such that y ̸= −x and come back to their initial position after a cycle
of length 4.
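These properties of N can be checked mechanically, as in this small sketch of our own:

```c
/* Apply N = (1 0; 1 2) over Z5: (x, y) -> (x, x + 2y). */
static void apply_N(unsigned *x, unsigned *y)
{
    unsigned nx = *x % 5;
    unsigned ny = (*x + 2 * *y) % 5;
    *x = nx;
    *y = ny;
}
```

Iterating apply_N over all 25 positions confirms that the first coordinate never changes, that every position returns to itself after 4 applications (N has order 4), and that the points with x + y = 0 are fixed after a single application (eigenspace with eigenvalue 1).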
Chapter 3
Software
In [9], we provide an implementation suitable for 64-bit CPUs called Optimized64. The
code uses only plain C instructions, without assembly or SIMD instructions. If needed,
we have applied lane complementing to reduce the number of NOTs. The operations in
the round function have been expanded in macros generated by KeccakTools [7]. We
have tried to interleave lines that apply to different variables to enable pipelining, while
grouping sets of lines that use a common precomputed value to avoid reloading the registers
too often. The order of operations is centered on the evaluation of χ on each plane,
preceded by the appropriate XORs for θ and rotations for ρ, and accumulating the parity
of the sheets for θ in the next round.
• For x86-64, the fastest option is to unroll 24 rounds (#define Unrolling 24) and to
use lane complementing (#define UseBebigokimisa).
• For IA64, unrolling 6 rounds and not using lane complementing give the fastest re-
sults.
enabling the absorption of long messages at fewer than 10 cycles/byte. Furthermore, one may
consider the case r = 1088 and C = c = 512, for which the claimed security level is 2^256. While
losing the power of two for the rate, the final node needs to absorb only one block (DC < r) and
the overhead remains reasonable: one extra evaluation of Keccak-f per message. This also
benefits long messages, for which the number of cycles per byte is further reduced to about 9
cycles/byte.
For 32-bit platforms in general, good starting points are the following.
For small 8-bit platforms in general, the target Compact8 in [9] is a good starting point for
implementing Keccak-f[1600] on an 8-bit processor. Also, the target Simple can be used to
instantiate a simple yet optimized implementation of Keccak-f[200], mapping a lane to a
byte (see file Keccak-simple.c and set cKeccakB to 200 in Keccak-simple-settings.h),
although this implementation was not optimized for size.
Table 3.1: Performance estimates on the Intel 8051, per evaluation of Keccak-f[1600] (24 rounds)

  Step                Performance
  θ                    34560 cycles
  ρ                    20808 cycles
  π                        0 cycles
  χ                    30048 cycles
  ι                      384 cycles
  Word re-ordering     15000 cycles
  Total               100800 cycles
the exception of 4-bit rotations, which can be done by swapping the two hex digits (nibbles)
of each byte). However, using efficient memory transfer instructions like XCH, which exchanges
the accumulator with a byte in memory, it is actually possible to reduce the average cost of a
rotation to only 4.3 cycles/byte.
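The byte-oriented rotation discussed here can be sketched in portable C (our own illustration, not 8051 assembly): a 64-bit lane stored as 8 bytes is rotated by first moving whole bytes (the offset divided by 8) and then shifting the array by the remaining 0..7 bits.

```c
#include <stdint.h>

/* Rotate an 8-byte lane (byte i = bits 8i..8i+7) left by n bit positions,
 * as an 8-bit platform would: whole-byte moves for n/8 positions, then a
 * bitwise shift by the remaining n%8 bits across the byte array. */
void rot_lane_bytes(uint8_t lane[8], unsigned n)
{
    uint8_t tmp[8];
    unsigned bytes = (n / 8) % 8, bits = n % 8;
    int i;

    for (i = 0; i < 8; i++)                 /* byte-level rotation */
        tmp[(i + bytes) % 8] = lane[i];
    if (bits == 0) {
        for (i = 0; i < 8; i++) lane[i] = tmp[i];
        return;
    }
    for (i = 0; i < 8; i++)                 /* remaining 1..7 bit shift */
        lane[i] = (uint8_t)((tmp[i] << bits) | (tmp[(i + 7) % 8] >> (8 - bits)));
}
```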
The performance estimates for Keccak, including the details for each step, are given in
Table 3.1, for 24 rounds. One cycle refers to the number of controller clock oscillation periods,
which is 12 in the case of our selected variant. It must be noted that these figures are the result
of a best-effort paper exercise. Figures for an actual implementation might vary, in particular
if it uses specific manufacturer improvements available on the platform.
Chapter 4
Hardware
4.1 Introduction
Thanks to the symmetry and simplicity of its round function, Keccak allows trading off area
for speed and vice versa. Different architectures reflect different trade-offs. The two
architectures we have investigated and implemented reflect the two ends of the spectrum:
a high-speed core and a low-area coprocessor.
We have coded our architectures in VHDL for implementation in ASIC and FPGA [4]. For
more details on the VHDL code, refer to the readme.txt file in the VHDL directory.
It should be noted that during the design of Keccak, particular effort was put into facilitating
hardware implementation. The round function is based only on simple Boolean expressions,
and there is no need for adders or S-boxes with complex logic (typically used in many
cryptographic primitives). Avoiding these complex sub-blocks allows a very short critical
path and hence very high clock frequencies. Another beneficial aspect of Keccak is that,
unless intentionally limited, a general architecture implementing Keccak-f and the sponge
construction can easily support all variants (rates, capacities) and use cases (MAC, MGF,
KDF, PRG) for a given lane size.
computation. The CPU can program a direct memory access (DMA) transfer of chunks of
the message to be hashed, and the CPU can be assigned to a different task while the core is
computing the hash.
The architecture of the high-speed core design is depicted in Figure 4.1. It is based on a
plain instantiation of the combinational logic for computing one Keccak-f round, which is
used iteratively.
The core is composed of three main components: the round function, the state register
and the input/output buffer. The use of the input/output buffer allows decoupling the core
from a typical bus used in a system-on-chip (SoC).
In the absorbing phase, the I/O buffer allows the simultaneous transfer of the input
through the bus and the computation of K - f for the previous input block. Similarly, in
the squeezing phase it allows the simultaneous transfer of the output through the bus and
the computation of K - f for the next output block.
These buses typically come in widths of 8, 16, 32, 64 or 128 bits. We have decided to
fix the bus width to the lane size w of the underlying Keccak-f permutation. This limits the
throughput of the sponge engine to w bits per cycle. This imposes a restriction only if r/w (i.e.,
the rate expressed in number of lanes) is larger than the number of rounds of the underlying
Keccak-f.
In a first phase, the high-speed core was coded in VHDL. Test benches for Keccak-f
and the hash function are provided, together with C code allowing the generation of test
vectors for the test benches. We were able to introduce the lane size as a parameter, allowing
us to generate VHDL for all the lane sizes supported by Keccak.
These first VHDL implementations have been tested on different FPGAs by J. Strömbergson
[31], highlighting some possible improvements and problems with the tools available
from FPGA vendors. We have improved the VHDL code to solve these problems, and this
The critical path of the core is 1.8 nanoseconds, of which 1.1 nanoseconds are spent in the
combinatorial logic of the round function. This results in a maximum clock frequency of
555 MHz and a throughput of 1.23 Gbit/s. The area needed for running the core at this
frequency is 6.5 kgates, composed of 3 kgates for the round function, 3.1 kgates for the state
register and control logic, and fewer than 400 gates for the I/O buffer.
An alternative without a separate I/O buffer saves about 400 gates and decreases the
throughput to 0.96 Gbit/s at 555 MHz.
• First the sheet parities are computed, and the 5 resulting lanes are stored in a dedicated
area of the memory.
• The second phase consists of computing the θ transformation, reading all the lanes of
the state and XORing each with the corresponding sheet parities. After a lane is computed
in this way, it is rotated according to ρ and written to the position defined by π. The
intermediate state is then completely stored in the buffer B.
• The last step is to compute χ and add the round constant (ι) to the lane in position
(0, 0). To do this, the coprocessor reads 3 lanes of a plane from the intermediate
state, computes χ and writes the result to the buffer A, then reads another element of the
intermediate state and writes the next χ output, and so on for the 5 elements of the plane.
The computation of one round of the Keccak-f permutation takes 215 clock cycles. Out
of these, 55 are bubbles, where the core is computing internally and not transferring data to
or from the memory.
In a variant with memory words half the lane size, the number of clock cycles doubles, but
only for the read and write transfers, not for the bubbles. In such an implementation
one round of Keccak-f requires 375 clock cycles.
The buffer A, where the input of the permutation is written and where the output of the
permutation is written at the end of the computation, has the size of the state (25 times
the lane size), while the memory space for storing temporary values is 1.2 times the size
of the state.
Table 4.2: Performance estimation of the high-speed core of Keccak[r = 1024, c = 576] on
different FPGAs, with the available resources given in brackets.
The low-area coprocessor has been coded in VHDL and simulated using ModelSim. Like
the core described in Section 4.2, the coprocessor has been synthesized using ST 130 nm
technology.
Table 4.3: Performance estimation of the high-speed core of Keccak[r = 40, c = 160] on
different FPGAs, with the available resources given in brackets.
Table 4.4: Performance estimation of the low-area coprocessor of Keccak[r = 1024, c = 576]
on different FPGAs, with the available resources given in brackets.
Table 4.5: Performance estimation of the low-area coprocessor of Keccak[r = 40, c = 160] on
different FPGAs, with the available resources given in brackets.
Chapter 5
Protection against side-channel attacks
If the input to Keccak includes secret data or keys, side-channel attacks may pose a threat.
In this chapter, we report on Keccak implementations that offer a high level of resistance
against power analysis by using the technique of masking (secret sharing).
5.1 Introduction
Sponge functions, Keccak among them, can be used in a wide range of modes covering
the full range of symmetric cryptography functions; we refer to [10, 3] for examples. This
includes functions that take a secret key as argument. Other functions do not take a
secret key but take as input data that should remain secret, such as pseudorandom sequence
generators or commit-challenge-response protocols. If such functionality is desired on
devices to which an adversary has some kind of physical or logical access, protection against
side-channel and fault attacks is appropriate [2].
Side-channel and fault attacks do not exploit an inherent weakness of
an algorithm, but rather characteristics of the implementation. For their security, cryptographic
primitives inevitably rely on the fact that an adversary does not have access to intermediate
computation results. As a consequence, even partial knowledge of intermediate
computation results can give a complete breakdown of security, e.g., by allowing computation
of the key. However, actual implementations may leak information on these results
via characteristics such as computation time, power consumption or electromagnetic radiation.
Another condition for the security of cryptographic primitives is that they are executed
without faults. An implementation that makes faults, or can be manipulated into making faults,
may be completely insecure.
We concentrate here on countermeasures against power analysis and electromagnetic
analysis that make use of the algebraic properties of the step functions of Keccak. In the
remainder of this chapter we speak only of power analysis, implicitly including electromagnetic
analysis. The main difference between power analysis and electromagnetic analysis
is that in the latter the adversary can make more sophisticated measurements, and that
the physical and electronic countermeasures differ. The countermeasures at the algorithmic
level are, however, the same for both.
As far as timing attacks are concerned, it is straightforward to implement Keccak in such
a way that its execution time is independent of the input it processes, both in software and
in hardware. Moreover, Keccak-f does not make use of large lookup tables, so cache-timing
attacks pose no problem. The interleaving method of Section 2.1 can be implemented
with table lookups, which depend on the input message only. Of course, it can also be
implemented without any table at all.
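For illustration, a table-free conversion can be built from constant-time delta swaps (a standard perfect-unshuffle network; the code and mask constants below are a generic sketch, not taken from the Keccak reference code, and the convention that even bits land in the low half is our choice):

```c
#include <stdint.h>

/* Split a 64-bit lane into its even-indexed bits (low 32-bit word) and
   odd-indexed bits (high word) without any lookup table, using
   constant-time delta swaps. */
uint64_t to_interleaved(uint64_t x)
{
    uint64_t t;
    t = (x ^ (x >>  1)) & 0x2222222222222222ULL; x ^= t ^ (t <<  1);
    t = (x ^ (x >>  2)) & 0x0C0C0C0C0C0C0C0CULL; x ^= t ^ (t <<  2);
    t = (x ^ (x >>  4)) & 0x00F000F000F000F0ULL; x ^= t ^ (t <<  4);
    t = (x ^ (x >>  8)) & 0x0000FF000000FF00ULL; x ^= t ^ (t <<  8);
    t = (x ^ (x >> 16)) & 0x00000000FFFF0000ULL; x ^= t ^ (t << 16);
    return x;
}

/* Bit-by-bit reference version, for checking the branch-free one. */
uint64_t to_interleaved_ref(uint64_t x)
{
    uint64_t even = 0, odd = 0;
    for (int i = 0; i < 32; i++) {
        even |= ((x >> (2*i)) & 1) << i;
        odd  |= ((x >> (2*i + 1)) & 1) << i;
    }
    return even | (odd << 32);
}
```

Both routines process the lane in constant time, with no data-dependent memory accesses.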
Protection against fault attacks can be achieved by countermeasures that are independent
of the cryptographic primitive being protected: fault detection at the software level, at the
hardware level, and by performing computations multiple times and verifying that the results
are equal. Particularly good sources of information on side-channel attacks and countermeasures
are the proceedings of the yearly Cryptographic Hardware and Embedded Systems (CHES)
conferences (http://www.chesworkshop.org/) and the textbook [29]. A good source of
information on fault attacks and countermeasures are the proceedings of the yearly Fault
Diagnosis and Tolerance in Cryptography (FDTC) workshops (http://conferenze.dei.polimi.it/
FDTC10/index.html).
The general setup of power (and electromagnetic) analysis is that the attacker obtains one or
more traces of the measured power consumption. If a single trace suffices to mount
an attack, one speaks of simple power analysis (SPA). However, the dependence of the
signal on the variables being computed is typically small and obscured by noise. This can be
compensated for by taking many traces, each one representing an execution of the cryptographic
primitive with different input values. These traces are then subjected to statistical
methods to retrieve the key information. Such attacks are called differential power analysis
(DPA) [22]. An important aspect of these attacks is that the traces must be aligned: they
must be combined in the time domain such that corresponding computation steps coincide
between the different traces.
In DPA one distinguishes between first-order and higher-order DPA. In first-order DPA, the
attacker is limited to considering a single time offset of the traces. In m-th order DPA the attacker
may incorporate up to m time offsets in the analysis. Higher-order attacks are in principle
more powerful, but also much harder to implement [29].
In correlation power analysis (CPA) [12] one exploits the fact that the power consumption
may be correlated to the value of bits (or bitwise differences of bits) being processed at any
given moment: there is a difference in the expected value of the power consumption. In
short, one exploits this by taking many traces and partitioning them into two subsets: in one
set, a particular bit, the target bit, is systematically equal to 0, and in the other it is equal
to 1. One then adds the traces in each of the two sets and subtracts the results, giving the
compound trace. If at any given time the power consumption is correlated to the target
bit, one sees a high value in the compound trace. One can use this to retrieve key information
by taking a target bit that depends on part of the key and trying different partitions based
on hypotheses for that part of the key. If wrong key guesses result in partitions where
the bits of intermediate results are more or less balanced, the compound trace of the correct
key guess will stand out. Note that if the power consumption is correlated to bit values
or differences, it is also correlated to the Hamming weights of words or Hamming distances
between words.
Later, more advanced ways to measure the distance between distributions were introduced.
In particular, mutual information analysis (MIA) [19] is a generalization of CPA:
instead of just exploiting the fact that different bit values may result in different
expected power consumption values, it can exploit any difference between the distributions
of the power consumption for a bit being 0 or 1. So in short, when the
power consumption distributions for a bit equal to 0 or 1 have equal mean values but different
shapes, CPA will not work, while MIA may still be able to distinguish the two distributions.
Countermeasures against power analysis can be applied at several levels:
Transistor level Logical gates and circuits are built in such a way that the information leakage
is reduced;
Platform level The platform supports features such as irregular clocking (clock jitter), random
insertion of dummy cycles and addition of noise to the power consumption;
Program level The order of operations can be randomized, or dummy instructions can be
inserted randomly, to make the alignment of traces more difficult;
Algorithmic level The operations of the cryptographic algorithm are computed in such a
way that the information leakage is reduced;
Protocol level The protocol is designed such that it limits the number of computations an
attacker can conduct with a given key.
cryptographic primitive is relevant. One countermeasure of this type is secret
sharing (or masking), and Keccak is particularly well suited for it.
In the nonlinear step χ, each state word is updated as

x[i] ← x[i] + (x[i+1] + 1) x[i+2],    (5.1)

with the index i taken modulo 5 along a plane.
With the native variable split into two shares a and b such that x = a + b, the corresponding computation on the shares becomes

a[i] ← a[i] + (a[i+1] + 1) a[i+2] + a[i+1] b[i+2]
b[i] ← b[i] + (b[i+1] + 1) b[i+2] + b[i+1] a[i+2].    (5.2)
To achieve independence from native variables, the order in which the operations are executed
is important. If the expressions are evaluated left to right, it can be shown that all
intermediate variables are independent of native variables. The computation of all terms
except the rightmost one involves only variables of a single share, so independence
from x is automatic there. For the addition of the mixed term to the intermediate result of the
computation, the presence of a[i] (or b[i]) as a linear term in the intermediate variable results
in independence.
At the algorithm level, the effect of introducing two shares in the computation
of the Keccak-f round function is rather simple. The linear part λ can be executed on the
two shares separately, roughly doubling the workload. In the nonlinear step mapping χ, the
computation of a state word according to Equation (5.1), taking an XOR, an AND and a NOT
instruction, is replaced by the computation of the shares according to Equations (5.2), taking in
total 4 XOR, 4 AND and 2 NOT instructions. The addition of the round constants ι and the addition
of input blocks can be performed on one share only.
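The two-share computation of Equations (5.2) can be sketched lane-wise in C. A C compiler gives no guarantee on evaluation order, so this sketch only demonstrates the correctness of the shared computation, not side-channel security:

```c
#include <stdint.h>

/* Masked chi on one plane of five lanes: the native value x[i] is
   carried as two shares, x[i] = a[i] ^ b[i].  Each share is updated
   following Eq. (5.2); evaluated left to right, every intermediate
   result stays independent of the native variable. */
void chi_two_shares(uint64_t a[5], uint64_t b[5])
{
    uint64_t na[5], nb[5];
    for (int i = 0; i < 5; i++) {
        int i1 = (i + 1) % 5, i2 = (i + 2) % 5;
        na[i] = a[i] ^ (~a[i1] & a[i2]) ^ (a[i1] & b[i2]);
        nb[i] = b[i] ^ (~b[i1] & b[i2]) ^ (b[i1] & a[i2]);
    }
    for (int i = 0; i < 5; i++) { a[i] = na[i]; b[i] = nb[i]; }
}
```

XORing the two output shares yields exactly χ applied to a ⊕ b, which is what makes the sharing transparent to the rest of the computation.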
As the order of execution is important, it is not sufficient to write a program in C or
some other high-level language and compile it: the compiler may optimize away the desired
order. One option is to compile and inspect the machine code afterwards, but the method that
provides the highest level of assurance is to program in assembly language. It is, however,
not sufficient to check only the sequencing of instructions. In general, the operations on two
shares of the same variable are preferably executed in registers physically isolated from each
other. More particularly, the program shall be such that at any given moment there is no
register (or bus) content or transition that is correlated to a native variable. For example,
if the number of shares is two and a register containing x0 is loaded with x1, the power
consumption can depend on the number of bits switched (e.g., on the Hamming weight of
x0 ⊕ x1) and is likely to leak information on x. This can be solved by setting the register to
zero in between. Another example is the simultaneous processing of two shares of a variable
in different parts of the CPU. As the power consumption at any given moment depends on
all processing going on, it depends on both shares, and hence there may be a dependence
on the native variable. Clearly, care must be taken when attempting to build side-channel-resistant
implementations. If sufficient care is taken, this provides provable resistance
against first-order DPA. Higher-order DPA is in principle still possible, but as explained in
[17, Appendix A] it is very sensitive to noise and clock jitter. On smart cards, the addition of
noise, dummy cycles and clock jitter is typically supported by the platform.
With three shares a, b and c such that x = a + b + c, the nonlinear step can be computed as

a[i] ← b[i] + (b[i+1] + 1) b[i+2] + b[i+1] c[i+2] + c[i+1] b[i+2]
b[i] ← c[i] + (c[i+1] + 1) c[i+2] + c[i+1] a[i+2] + a[i+1] c[i+2]    (5.3)
c[i] ← a[i] + (a[i+1] + 1) a[i+2] + a[i+1] b[i+2] + b[i+1] a[i+2].
Clearly, the computation of each share takes as input only components of the other two shares,
and this provides provable security against first-order CPA, even in the presence of glitches.
The three lines of Equation (5.3) are equivalent and differ only in the input and output
shares. We denote the computation of the nonlinear step χ′ resulting in one share, given
the two other shares, by, e.g., a ← χ′(b, c).
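A minimal C sketch of χ′ over one plane, with the convention a ← χ′(b, c) of Equation (5.3). By construction the three output shares recombine to χ applied to a ⊕ b ⊕ c:

```c
#include <stdint.h>

/* chi' of Eq. (5.3): computes one output share of a plane from the
   other two input shares, so no term ever mixes all three shares. */
void chi_prime(uint64_t out[5], const uint64_t s[5], const uint64_t t[5])
{
    for (int i = 0; i < 5; i++) {
        int i1 = (i + 1) % 5, i2 = (i + 2) % 5;
        out[i] = s[i] ^ (~s[i1] & s[i2]) ^ (s[i1] & t[i2]) ^ (t[i1] & s[i2]);
    }
}
```

A round then computes a ← χ′(b, c), b ← χ′(c, a) and c ← χ′(a, b) on the old share values.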
In the following subsections we present two architectures that make use of three shares.
Other architectures can be derived based on different partitionings or sequences of the
computational steps composing the round. For instance, a low-area coprocessor like those
presented in Section 4.4 can be protected using these secret-sharing techniques.
Figure 5.1: Protected architecture computing one round in one clock cycle.
cycle | R0              | R1       | R2       | R3
------+-----------------+----------+----------+----------
  0   | -               | -        | -        | -
  1   | input, A        | -        | -        | -
  2   | input, B        | λ(A+Din) | -        | -
  3   | input, C        | λ(B)     | λ(A+Din) | -
  4   | χ′(R2,R1), C′   | λ(C)     | λ(B)     | λ(A+Din)
  5   | χ′(R2,R1), A′   | λ(C′)    | λ(C)     | λ(A+Din)
  6   | χ′(R2,R3), B′   | λ(A′)    | λ(C′)    | λ(C)
  7   | χ′(R2,R1), B′′  | λ(B′)    | λ(A′)    | λ(C′)
  8   | χ′(R2,R1), C′′  | λ(B′′)   | λ(B′)    | λ(C′)
  9   | χ′(R2,R3), A′′  | λ(C′′)   | λ(B′′)   | λ(B′)
 10   | χ′(R2,R1), A′′′ | λ(A′′)   | λ(C′′)   | λ(B′′)
 11   | χ′(R2,R1), B′′′ | λ(A′′′)  | λ(A′′)   | λ(B′′)
 12   | χ′(R2,R3), C′′′ | λ(B′′′)  | λ(A′′′)  | λ(A′′)
as inputs. The three initial values of the shares are indicated as A, B and C; the values after
the first round as A′, B′ and C′, after the second round as A′′, and so on. At clock cycle zero the
registers do not yet contain any relevant data, as indicated with a -. Instead of loading the
three shares in parallel, as done in the previous architecture, one share per clock cycle is
loaded into register R0 during the three initial clock cycles.
The content of R1 is in general the result of λ applied to R0, with Din enabled and
XORed at the input of λ(R0) before an initial Keccak-f round.
The content of R0 is the result of χ′ applied either to the pair R2 and R1 or to R2 and R3.
While in the one-cycle-round architecture ι was applied to only one of the three shares, here
it is applied to all three. This does not change the result of the computation and
simplifies the control logic.
The registers R2 and R3 maintain the values required for the computation
of χ′. In the second clock cycle B is loaded into R0, and R1 receives λ(A). In the third clock
cycle C is loaded into R0, R1 receives λ(B), and λ(A) is moved from R1 to R2. In the fourth
clock cycle no more shares need to be loaded; instead, R0 receives χ′ applied to the contents
of R2 and R1, which means λ(A) and λ(B), so that R0 now contains the share C′ of
the first round output. R1 receives λ(C), λ(B) is moved from R1 to R2, and λ(A) is moved from R2 to
R3. In the fifth clock cycle R0 receives χ′ applied to the contents of R2 and R1, being λ(B) and
λ(C), and hence contains the share A′ of the first round output. R1 receives λ(C′), λ(C) is moved
from R1 to R2, while λ(A) remains in R3. Since shares A′ and C′ of the second round input
have already been computed, in the next clock cycle share B′ must be computed; this requires
λ(C) and λ(A), which are then in R2 and R3, respectively. This completes the description
of a round: from the next clock cycle onwards there is a loop of three clock cycles in which
the register contents alternate exactly as needed for computing one round.
Using this circuit, a Keccak-f computation takes 3 initial clock cycles for loading the shares
plus 3 clock cycles per round, i.e., 72 cycles for the 24 rounds of Keccak-f[1600].
Figure 5.2: Protected architecture computing one round in three clock cycles.
depends on the native variable being stored and hence is vulnerable to MIA. In this section,
we explain that the introduction of noise has a much stronger impact on the feasibility of this
attack for a two-share or three-share implementation than for an unmasked one, and that the
qualitative difference is similar to the one between first-order and higher-order DPA. Furthermore,
we argue that a masked implementation has by construction a higher algorithmic
noise level than an unmasked one.
With three shares, a native bit equal to 0 (resp. 1) can be represented as 000, 011, 101 or
110 (resp. 001, 010, 100 or 111). If the three shares are processed simultaneously, the power
consumption can leak the Hamming weight of the shares, which is 0 or 2 for a native
bit 0, and 1 or 3 for a native bit 1. Clearly, these two distributions are different and can be
distinguished.
We start the discussion by constructing a simple model where the power consumption
is equal to the Hamming weight plus an additive Gaussian noise of standard deviation σ,
expressed in the same unit as the impact of the Hamming weight on the power consumption.
In this model, the distribution of the power consumption for three shares is as in Figure 5.3(c),
with σ = 0.2. Similarly, we can look at an unmasked implementation, where the power
consumption is 0 or 1 plus the Gaussian noise, and at masking with two shares. In this last
case, the Hamming weight is 0 or 2 for a native bit 0 (represented as 00 or 11) or 1 for a
native bit 1 (represented as 01 or 10). The two cases can be found in Figures 5.3(a) and 5.3(b),
respectively.
Figure 5.3: Distribution of the power consumption for a simple model. The solid line shows
the distribution for a bit with native value 0 and the dashed line for a bit 1. Sub-figures (a),
(b) and (c) show the case of one, two and three shares, respectively, with a noise level of
σ = 0.2. Sub-figures (d), (e) and (f) follow the same model with σ = 2.
Following this model, we compute the number of samples needed to distinguish
the distribution for a native bit 0 from the distribution for a native bit 1. We follow
the same reasoning as in [17, Appendix A]. The number z of samples needed to distinguish
one distribution from the other is inversely proportional to the Kullback-Leibler divergence
between the two distributions [23].
In this model, the scaling of z as a function of σ is different for one, two or three shares.
For one share, z ≈ 2σ² samples are needed to distinguish a native bit 0 from a native bit 1,
whereas about z² samples are needed for the same noise level when two shares are used,
and this number grows to about z³ samples for three-share masking.
Hence, the difference between the one-share, two-share and three-share implementations
is qualitatively the same as that between first-order, second-order and third-order
DPA.
The real-world behavior is likely to differ from this simple model. Nevertheless, we ex-
pect a significantly higher sensitivity to noise for three shares than for one. Qualitatively,
the three pairs of distributions are different. For one share, the mean is different for native
bits 0 and 1. For two shares, the two distributions have the same mean but a different vari-
ance. For three shares, the two distributions have the same mean and variance; they differ
only starting from their third moment. Figures 5.3(d), 5.3(e) and 5.3(f) illustrate this with the
simple model and a higher noise level σ = 2.
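The moment-by-moment statement can be checked exactly by enumerating the share encodings of a native bit (noise-free, uniformly distributed shares; this enumeration is ours, not taken from the text):

```c
#include <math.h>

static int bit_count(int v)
{
    int c = 0;
    while (v) { c += v & 1; v >>= 1; }
    return c;
}

/* Mean Hamming weight over the 2^(m-1) encodings of one native bit
   into m shares (the encodings with matching parity). */
static double hw_mean(int m, int bit)
{
    double sum = 0.0;
    int count = 0;
    for (int enc = 0; enc < (1 << m); enc++) {
        int hw = bit_count(enc);
        if ((hw & 1) == bit) { sum += hw; count++; }
    }
    return sum / count;
}

/* k-th central moment of the Hamming weight for the same encodings. */
static double hw_central_moment(int m, int bit, int k)
{
    double mu = hw_mean(m, bit), acc = 0.0;
    int count = 0;
    for (int enc = 0; enc < (1 << m); enc++) {
        int hw = bit_count(enc);
        if ((hw & 1) != bit) continue;
        double p = 1.0;
        for (int j = 0; j < k; j++) p *= (double)hw - mu;
        acc += p;
        count++;
    }
    return acc / count;
}
```

For two shares the means coincide but the variances differ (1 versus 0); for three shares mean and variance coincide and only the third central moments differ (-0.75 versus +0.75), matching the figure.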
So far, we have assumed that the three levels of masking are subject to the same noise
level σ. However, masking by itself introduces noise, as m − 1 shares are randomly and
independently chosen for every execution of the algorithm. The very dependence of the
power consumption on the bits being processed that an attacker exploits turns against her, as
it becomes an additional source of noise due to the randomization. For instance, in the one-cycle-one-round
implementation of Keccak-f[1600] with three shares, the noise due to the
Hamming weight of 3200 random bits must be compared to the small number of unmasked
bit values that a differential power analysis attempts to recover.
Bibliography
[2] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, Note on side-channel attacks and
their countermeasures, Comment on the NIST Hash Competition Forum, May 2009, http:
//keccak.noekeon.org/NoteSideChannelAttacks.pdf.
[8] ——, Known-answer and Monte Carlo test results, 2011, available from http://keccak.
noekeon.org/.
[12] E. Brier, C. Clavier, and F. Olivier, Correlation power analysis with a leakage model, CHES
(M. Joye and J.-J. Quisquater, eds.), Lecture Notes in Computer Science, vol. 3156,
Springer, 2004, pp. 16–29.
[17] J. Daemen, M. Peeters, and G. Van Assche, Bitslice ciphers and power analysis attacks, Fast
Software Encryption 2000 (B. Schneier, ed.), Lecture Notes in Computer Science, vol.
1978, Springer, 2000, pp. 134–149.
[18] G. Gielen and J. Figueras (eds.), 2004 design, automation and test in Europe conference and
exposition (DATE 2004), 16-20 February 2004, Paris, France, IEEE Computer Society, 2004.
[19] B. Gierlichs, L. Batina, P. Tuyls, and B. Preneel, Mutual information analysis, CHES (E. Os-
wald and P. Rohatgi, eds.), Lecture Notes in Computer Science, vol. 5154, Springer,
2008, pp. 426–442.
[20] S. Guilley, P. Hoogvorst, Y. Mathieu, R. Pacalet, and J. Provost, CMOS structures suitable
for secured hardware, in Gielen and Figueras [18], pp. 1414–1415.
[22] P. C. Kocher, J. Jaffe, and B. Jun, Differential power analysis, Advances in Cryptology –
Crypto ’99 (M. Wiener, ed.), Lecture Notes in Computer Science, vol. 1666, Springer,
1999, pp. 388–397.
[23] S. Kullback and R. A. Leibler, On information and sufficiency, Ann. Math. Statist. 22 (1951),
no. 1, 79–86, http://projecteuclid.org/euclid.aoms/1177729694.
[24] S. Mangard, N. Pramstaller, and E. Oswald, Successfully attacking masked AES hardware
implementations, CHES (J. R. Rao and B. Sunar, eds.), Lecture Notes in Computer Science,
vol. 3659, Springer, 2005, pp. 157–171.
[25] S. Nikova, V. Rijmen, and M. Schläffer, Secure hardware implementation of nonlinear func-
tions in the presence of glitches, ICISC (P. J. Lee and J. H. Cheon, eds.), Lecture Notes in
Computer Science, vol. 5461, Springer, 2008, pp. 218–234.
[27] NIST, Announcing request for candidate algorithm nominations for a new cryptographic hash
algorithm (SHA-3) family, Federal Register Notices 72 (2007), no. 212, 62212–62220, http:
//csrc.nist.gov/groups/ST/hash/index.html.
[28] ——, ANSI C cryptographic API profile for SHA-3 candidate algorithm submissions, revision
5, February 2008, available from http://csrc.nist.gov/groups/ST/hash/sha-3/
Submission_Reqs/crypto_API.html.
[29] S. Mangard, E. Oswald, and T. Popp, Power analysis attacks: revealing the secrets of smart
cards, Springer-Verlag, 2007.
[30] G. Sevestre, Keccak tree hashing on GPU, using Nvidia Cuda API, 2010, http://sites.
google.com/site/keccaktreegpu/.
[31] J. Strömbergson, Implementation of the Keccak hash function in FPGA devices, http://www.
strombergson.com/files/Keccak_in_FPGAs.pdf.
[32] K. Tiri and I. Verbauwhede, A logic level design methodology for a secure DPA resistant ASIC
or FPGA implementation, in Gielen and Figueras [18], pp. 246–251.
[33] C. Wenzel-Benner and J. Gräf, XBX: keeping two eyes on embedded hashing, http://xbx.
das-labor.org/, accessed 21 December 2010.