Keccak implementation overview

Version 3.1

Guido Bertoni¹
Joan Daemen¹
Michaël Peeters²
Gilles Van Assche¹
Ronny Van Keer¹

http://keccak.noekeon.org/
Contents

1 General aspects
  1.1 Specifications summary
  1.2 Bit and byte numbering conventions
    1.2.1 Some justification for our choice
  1.3 Operation count
  1.4 Memory
2 Implementation techniques
  2.1 Bit interleaving
  2.2 The lane complementing transform
  2.3 Extending the state for smoother scheduling
  2.4 Plane-per-plane processing
    2.4.1 Early parity
    2.4.2 Combining with bit interleaving
  2.5 Efficient in-place implementations
    2.5.1 Combining with bit interleaving
3 Software
  3.1 PC and high-end platforms
    3.1.1 Using SIMD instructions
    3.1.2 SIMD instructions and tree hashing
    3.1.3 Batch or tree hashing on a graphics processing unit
  3.2 Small 32-bit platforms
    3.2.1 Implementation on an ARM Cortex-M3
  3.3 Small 8-bit platforms
    3.3.1 Estimation on an Intel 8051 processor
    3.3.2 Implementation on an Atmel AVR processor
4 Hardware
  4.1 Introduction
  4.2 High-speed core
  4.3 Variants of the high-speed core
    4.3.1 Keccak[r = 1024, c = 576]
    4.3.2 Keccak[r = 40, c = 160]
  4.4 Low-area coprocessor
    4.4.1 Keccak[r = 1024, c = 576]
    4.4.2 Keccak[r = 40, c = 160]
  4.5 FPGA implementations
    4.5.1 High-speed core
Introduction
Acknowledgments
We wish to thank (in no particular order) Joachim Strömbergson for useful comments on our
FPGA implementation, Joppe Bos for reporting a bug in the optimized implementation, all
people who contributed to implementations or benchmarks of Keccak in hardware or software,
Virgile Landry Nguegnia Wandji for his work on DPA-resistant Keccak implementations,
Joris Delclef and Jean-Louis Modave for kindly lending us fast hardware, Daniel Otte
and Christian Wenzel-Benner for their support on embedded platforms, Renaud Bauvin for
his implementation in Python and Francesco Regazzoni for fruitful discussions.
Chapter 1
General aspects
In this chapter, we first briefly review the Keccak specifications formally defined in [10]. We
then detail the bit and byte numbering conventions, followed by general statements about
the operation count and memory usage of Keccak.
Keccak-f[b](A)
  for i in 0 ... nr − 1
    A = Round[b](A, RC[i])
  return A
Round[b](A, RC)

  θ step:
    C[x] = A[x, 0] ⊕ A[x, 1] ⊕ A[x, 2] ⊕ A[x, 3] ⊕ A[x, 4],   ∀x in 0 ... 4
    D[x] = C[x − 1] ⊕ ROT(C[x + 1], 1),                        ∀x in 0 ... 4
    A[x, y] = A[x, y] ⊕ D[x],                                  ∀(x, y) in (0 ... 4, 0 ... 4)

  ρ and π steps:
    B[y, 2x + 3y] = ROT(A[x, y], r[x, y]),                     ∀(x, y) in (0 ... 4, 0 ... 4)

  χ step:
    A[x, y] = B[x, y] ⊕ ((NOT B[x + 1, y]) AND B[x + 2, y]),   ∀(x, y) in (0 ... 4, 0 ... 4)

  ι step:
    A[0, 0] = A[0, 0] ⊕ RC

  return A
Here the following conventions are in use. All the operations on the indices are done
modulo 5. A denotes the complete permutation state array and A[x, y] denotes a particular
lane in that state. B[x, y], C[x] and D[x] are intermediate variables. The symbol ⊕ denotes
the bitwise exclusive OR, NOT the bitwise complement and AND the bitwise AND operation.
Finally, ROT(W, r) denotes the bitwise cyclic shift operation, moving the bit at position i into
position i + r (modulo the lane size).
The constants r[x, y] are the cyclic shift offsets and are specified in the following table.
The constants RC[i] are the round constants. The following table specifies their values in
hexadecimal notation for lane size 64. For smaller sizes they must be truncated.
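The round pseudocode above translates almost line by line into C. The following sketch is our own illustration, not code from the reference implementations: one round of Keccak-f[1600] on a state of 25 uint64_t lanes indexed as A[x + 5y], with the standard Keccak-f[1600] shift offsets; the round constant is passed in as a parameter.

```c
#include <stdint.h>

/* Rotate a 64-bit lane left by r positions; requires 0 < r < 64. */
#define ROT64(w, r) (((w) << (r)) | ((w) >> (64 - (r))))

/* Cyclic shift offsets r[x][y] of Keccak-f[1600]. */
static const unsigned R[5][5] = {
    { 0, 36,  3, 41, 18},   /* x = 0 */
    { 1, 44, 10, 45,  2},   /* x = 1 */
    {62,  6, 43, 15, 61},   /* x = 2 */
    {28, 55, 25, 21, 56},   /* x = 3 */
    {27, 20, 39,  8, 14}    /* x = 4 */
};

/* One round of Keccak-f[1600]; A[x + 5*y] is lane (x, y), RC the round constant. */
void keccakf1600_round(uint64_t A[25], uint64_t RC)
{
    uint64_t C[5], D[5], B[25];
    int x, y;

    /* theta: column parities, theta-effect, then XOR into every lane */
    for (x = 0; x < 5; x++)
        C[x] = A[x] ^ A[x + 5] ^ A[x + 10] ^ A[x + 15] ^ A[x + 20];
    for (x = 0; x < 5; x++)
        D[x] = C[(x + 4) % 5] ^ ROT64(C[(x + 1) % 5], 1);
    for (x = 0; x < 5; x++)
        for (y = 0; y < 5; y++)
            A[x + 5*y] ^= D[x];

    /* rho and pi: B[y, 2x + 3y] = ROT(A[x, y], r[x, y]) */
    for (x = 0; x < 5; x++)
        for (y = 0; y < 5; y++) {
            unsigned rot = R[x][y];
            uint64_t t = rot ? ROT64(A[x + 5*y], rot) : A[x + 5*y];
            B[y + 5*((2*x + 3*y) % 5)] = t;
        }

    /* chi: A[x, y] = B[x, y] ^ ((NOT B[x+1, y]) AND B[x+2, y]) */
    for (y = 0; y < 5; y++)
        for (x = 0; x < 5; x++)
            A[x + 5*y] = B[x + 5*y] ^ ((~B[(x+1)%5 + 5*y]) & B[(x+2)%5 + 5*y]);

    /* iota */
    A[0] ^= RC;
}
```

As a sanity check against the pseudocode: on the all-zero state, θ, ρ, π and χ leave every lane zero, so after one round only lane (0, 0) holds the round constant.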
We obtain the Keccak[r, c] sponge function, with parameters capacity c and bitrate r, if
we apply the sponge construction to Keccak-f[r + c] and perform specific padding on the
message input. The following pseudocode is restricted to the case of messages that span a
whole number of bytes and where the bitrate r is a multiple of the lane size.
Keccak[r, c](M)

  Padding:
    P = M || 0x01 || 0x00 || ... || 0x00
    P = P ⊕ 0x00 || ... || 0x00 || 0x80

  Initialization:
    S[x, y] = 0,  ∀(x, y) in (0 ... 4, 0 ... 4)

  Absorbing phase:
    for every block Pi in P
      S[x, y] = S[x, y] ⊕ Pi[x + 5y],  ∀(x, y) such that x + 5y < r/w
      S = Keccak-f[r + c](S)

  Squeezing phase:
    Z = empty string
    while output is requested
      Z = Z || S[x, y],  ∀(x, y) such that x + 5y < r/w
      S = Keccak-f[r + c](S)
  return Z
Here S denotes the state as an array of lanes. The padded message P is organized as
an array of blocks Pi, themselves organized as arrays of lanes. The || operator denotes byte
string concatenation.
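For byte-aligned messages, the padding above appends the byte 0x01, fills with zero bytes up to the block boundary, and XORs 0x80 into the last byte of the block. A minimal sketch (function name and byte-oriented interface are our own):

```c
#include <stddef.h>
#include <string.h>

/* Pad the last, partial message block of len bytes (len < rate_bytes)
 * into block[] of rate_bytes, per the pseudocode above:
 * append 0x01, fill with 0x00, then XOR 0x80 into the final byte.
 * When len == rate_bytes - 1, the two pad bytes merge into 0x81. */
void keccak_pad_block(unsigned char *block, size_t rate_bytes,
                      const unsigned char *msg_tail, size_t len)
{
    memcpy(block, msg_tail, len);
    memset(block + len, 0, rate_bytes - len);
    block[len] ^= 0x01;
    block[rate_bytes - 1] ^= 0x80;
}
```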
2. The bytes within a CPU word are numbered from zero onwards, with the byte at offset i
being the coefficient of 256^i when a word needs to be represented as an integer ("little-endian"
convention).
The consequence of the first convention is that, as in ρ, the operation consisting of moving
bits of coordinates (x, y, z) to (x, y, z + δ mod w) becomes a rotation "to the left" by δ
positions if the lane is in a CPU word.
The consequence of the two conventions jointly is that the bytes constituting an input block
do not have to be shuffled before being XORed into the state when the state is represented as
an array of lanes on a little-endian processor. Similarly, the bytes constituting an output block
can be taken as is from the state in the same representation.
Note that the SHA-3 API defined by NIST [28] follows the opposite convention for bit
numbering, and this is the reason behind the formal bit reordering described in [11].
An example of the formal bit reordering can be found in the files
KeccakSpongeIntermediateValues_*.txt in [8].
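The byte-numbering convention can be made concrete with a portable lane load (a sketch of our own): byte i is the coefficient of 256^i, so on a little-endian CPU this reduces to a plain 8-byte load and input blocks can be XORed into the state without shuffling.

```c
#include <stdint.h>

/* Assemble a 64-bit lane from 8 message bytes, byte i weighted by 256^i
 * ("little-endian" convention). On a little-endian CPU this is equivalent
 * to a direct 8-byte load, so no byte shuffling is needed. */
static uint64_t load64_le(const unsigned char *b)
{
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)
        w |= (uint64_t)b[i] << (8 * i);
    return w;
}
```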
Table 1.1: Performance for bitrate r relative to the default r = 1024 (with c = 1600 − r)

    r      c    Relative performance
   576   1024       ÷1.778
   832    768       ÷1.231
  1024    576       1
  1088    512       ×1.063
  1152    448       ×1.125
  1216    384       ×1.188
  1280    320       ×1.250
  1344    256       ×1.312
  1408    192       ×1.375
For an output length n smaller than or equal to the bitrate, the squeezing phase does not
imply any additional processing. However, if more output bits are needed, the additional
number of calls to Keccak-f for an n-bit output is ⌈n/r⌉ − 1.
When evaluating Keccak, the processing time is dominantly spent in the evaluation of
Keccak-f. To a good approximation, the throughput of Keccak for long messages is therefore
proportional to r for a given permutation width b. For instances with r + c = 1600, we will
often write performance figures for the default bitrate r = 1024. To estimate the performance
for another bitrate, Table 1.1 provides the performance relative to the default bitrate. This
is valid for long messages; for short messages, the processing time is determined by the
required number of calls to Keccak-f (e.g., one when l ≤ r − 2).
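The call count above can be made explicit in a small sketch of our own: padding adds at least two bits, so absorbing an l-bit message takes ⌈(l + 2)/r⌉ calls to Keccak-f, and an n-bit output adds ⌈n/r⌉ − 1 squeezing calls.

```c
/* Number of Keccak-f evaluations to hash an l-bit message to an n-bit
 * output with bitrate r: ceil((l+2)/r) absorbing calls (padding adds at
 * least 2 bits) plus ceil(n/r) - 1 extra squeezing calls. */
unsigned long keccak_f_calls(unsigned long l, unsigned long n, unsigned long r)
{
    unsigned long absorb  = (l + 2 + r - 1) / r;
    unsigned long squeeze = (n + r - 1) / r;   /* total squeezes incl. the first */
    return absorb + (squeeze > 0 ? squeeze - 1 : 0);
}
```

For example, with r = 1024 a message of l = r − 2 bits fits in one call, while one more bit forces a second block.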
On a platform supporting operations on w-bit words, each lane can be mapped to such
a word. If the platform supports only smaller, m-bit words, each lane has to be mapped to
w/m such words. There are different kinds of mappings. As long as the same mapping is
applied to all lanes, each bitwise Boolean operation on a lane is translated into w/m instructions
on CPU words. The most straightforward mapping is to take as CPU word i the lane bits
with z = mi ... m(i + 1) − 1. In that case, the w-bit lane rotations need to be implemented using a
number of shifts and bitwise Boolean instructions. Another possible mapping, which translates
the w-bit rotations into a series of m-bit CPU word rotation instructions, is introduced in
Section 2.1.
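With the straightforward mapping, a w-bit rotation costs several shifts and Boolean operations on the smaller words. A sketch of our own for w = 64 on two 32-bit words, where lo holds lane bits 0..31 and hi holds bits 32..63:

```c
#include <stdint.h>

/* Rotate left by n (1 <= n <= 31) a 64-bit lane stored as two 32-bit
 * words under the straightforward mapping: lo = bits 0..31, hi = bits
 * 32..63. For n >= 32, first swap lo and hi and reduce n by 32. */
void rot64_on_32(uint32_t *lo, uint32_t *hi, unsigned n)
{
    uint32_t l = *lo, h = *hi;
    *lo = (l << n) | (h >> (32 - n));   /* low word receives the carry from hi */
    *hi = (h << n) | (l >> (32 - n));   /* high word receives the carry from lo */
}
```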
In terms of lane operations, the evaluation of Keccak-f[b] uses
• 76nr XORs,
• 25nr ANDs and 25nr NOTs, and
• 29nr w-bit rotations.
For Keccak-f[1600] specifically, this becomes
• 1824 XORs,
• 600 ANDs and 600 NOTs, and
• 696 64-bit rotations.
Some instruction sets offer an instruction that combines the AND and NOT operations
into one operation as in χ, i.e., c ← (NOT a) AND b. If such an instruction is not available on
a CPU, almost 80% of the NOT operations can be removed by applying a lane complementing
transform as explained in Section 2.2, turning a subset of the AND operations into OR
operations.
In [9], we provide a simple implementation (called Simple) that maps a lane to a CPU
word of 8, 16, 32 or 64 bits, and hence is suitable for Keccak-f[200], Keccak-f[400],
Keccak-f[800] and Keccak-f[1600].
1.4 Memory
In terms of memory usage, Keccak has no feedforward loop, as opposed to many other
constructions, and the message block can be directly XORed into the state. This can benefit
applications in which the message is formatted on the fly or does not need to be kept after
being hashed. This also applies where the hashing API has to implement a message queue.
In general a message queue must be allocated, which can be avoided for sponge functions
or similar constructions.
The amount of working memory is limited to the state, the round number and some
extra working memory for θ and χ. Five w-bit words of extra working memory allow the
implementation of θ to compute the XOR of the sheets, while they can hold the five lanes of
a plane when χ is being computed.
An efficient memory-constrained implementation technique is presented in Section 2.5.
Examples of implementations that use a limited amount of memory can be found in the
Compact, Compact8, Inplace and Inplace32BI implementations in [9]. Some of these
implementations also provide a message-queue API (Init, Update and Final),
for which no extra memory needs to be allocated.
Chapter 2
Implementation techniques
In this chapter, we introduce techniques that can be used to optimize software and hardware
implementations, namely bit interleaving, lane complementing, plane-per-plane processing
and efficient in-place evaluation.
may present the input in interleaved form and use the output in this form too. The resulting
function will only differ from Keccak by the order of bits in its input and output.
In hashing applications, the output is usually kept short and some overhead is caused by
applying the bit-interleaving transform to the input. Such a transform can be implemented
using shifts and bitwise operations. For implementing Keccak-f[1600] on a 32-bit CPU, we
must distribute the 64 bits of a lane to two 32-bit words. On some platforms, look-up tables
can speed up this process, although the gain is not always significant.
Several examples of implementations making use of bit interleaving can be found in [9],
notably the Reference32BI and Simple32BI implementations. Also, the Optimized32
implementation can be set to use look-up tables by defining the UseInterleaveTables
symbol in KeccakF-1600-opt32-settings.h. Intermediate values to help debug a bit-interleaved
implementation can be found in the file KeccakPermutationIntermediateValues32BI.txt
in [8]. Finally, KeccakTools can generate code that makes use of bit
interleaving for any lane size and any (smaller) CPU word size [7].
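The factor-2 transform can be sketched as follows (an illustrative bit-by-bit version of our own; optimized implementations use shift-and-mask networks or the look-up tables mentioned above): the 64-bit lane is split into a word of even-indexed bits and a word of odd-indexed bits, after which a 64-bit rotation by 1 costs a single 32-bit rotation plus a word swap.

```c
#include <stdint.h>

/* Factor-2 bit interleaving: bit j of *even is lane bit 2j,
 * bit j of *odd is lane bit 2j+1. */
void interleave2(uint64_t lane, uint32_t *even, uint32_t *odd)
{
    uint32_t e = 0, o = 0;
    for (int j = 0; j < 32; j++) {
        e |= (uint32_t)((lane >> (2*j))     & 1) << j;
        o |= (uint32_t)((lane >> (2*j + 1)) & 1) << j;
    }
    *even = e;
    *odd  = o;
}

/* A 64-bit rotation by 1 in the interleaved representation:
 * the new odd word is the old even word, and the new even word is
 * the old odd word rotated by 1 -- one 32-bit rotation, no shifts. */
void rot1_interleaved(uint32_t *even, uint32_t *odd)
{
    uint32_t e = *even, o = *odd;
    *even = (o << 1) | (o >> 31);
    *odd  = e;
}
```

More generally, a 64-bit rotation by an even amount 2k becomes two 32-bit rotations by k, and by an odd amount 2k+1 a word swap plus rotations by k+1 and k.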
The equation for the bits of A[1] now becomes A[1] = a[1] ⊕ (ā[2] ∧ a[3]), where ā[2] denotes
the complemented representation of a[2]. This results in the cancellation of two NOT
operations and A[0] being represented by its complement Ā[0]. Similarly, representing a[4]
by its complement cancels two more NOT operations.
In the computation of the bits of A[4] the NOT operation cannot be avoided without introducing
NOT operations in other places. We do however have two options: one can choose between
computing A[4] and its complement Ā[4]. In each of the two cases a NOT operation must be
performed on either a[0] or on a[1]. These can be used to compute Ā[0] rather than A[0], or
Ā[1] rather than A[1], respectively, adding another degree of freedom for the implementer.
In the output some lanes are represented by their complement and the implementer can
choose from 4 output patterns. In short, representing lanes a[2] and a[4] by their complement
reduces the number of NOT operations from 5 to 1 and replaces 2 or 3 AND operations by
OR operations. It is easy to see that complementing any pair of lanes a[i]
and a[(i + 2) mod 5] will result in the same reduction. Moreover, this is also the case when
complementing all lanes except a[i] and a[(i + 2) mod 5]. This results in 10 possible input
patterns in total.
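The reduction can be checked concretely. In the sketch below (our own illustration of the transform), χ on one plane is evaluated both in the standard way and with a[2] and a[4] entering in complemented form; the variant uses a single NOT, two of the ANDs become ORs, and the outputs A[0] and A[4] come out complemented.

```c
#include <stdint.h>

/* Standard chi on one plane: A[x] = a[x] ^ (~a[x+1] & a[x+2]). */
void chi_plane(const uint64_t a[5], uint64_t A[5])
{
    for (int x = 0; x < 5; x++)
        A[x] = a[x] ^ (~a[(x + 1) % 5] & a[(x + 2) % 5]);
}

/* Lane-complemented chi: the input has a[2] and a[4] complemented,
 * the output has A[0] and A[4] complemented. One NOT remains,
 * and two ANDs are replaced by ORs. */
void chi_plane_complemented(const uint64_t a[5], uint64_t A[5])
{
    A[0] = a[0] ^ (a[1] | a[2]);    /* complement of the standard A[0] */
    A[1] = a[1] ^ (a[2] & a[3]);
    A[2] = a[2] ^ (a[3] | a[4]);
    A[3] = a[3] ^ (a[4] & a[0]);
    A[4] = a[4] ^ (~a[0] & a[1]);   /* the single remaining NOT; complemented */
}
```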
Clearly, this can be applied to all 5 planes of the state, each with its own input and output
patterns. We apply a complementing pattern (or mask) p at the input of χ and choose for each
plane an output pattern resulting in P. This output mask P propagates through the linear
steps θ, ρ, π (and ι) as a symmetric difference pattern, to result in yet another (symmetric)
mask p′ = π(ρ(θ(P))) at the input of χ of the next round. We have looked for couples of
masks (p, P) that are round-invariant, i.e., with p = π(ρ(θ(P))), and found one that complements
the lanes in the following 6 (x, y) positions at the output of χ or input of θ:
P : {(1, 0), (2, 0), (3, 1), (2, 2), (2, 3), (0, 4)}.
The Optimized32 and Optimized64 implementations can be set to use lane complementing
by defining the UseBebigokimisa symbol in KeccakF-1600-opt*-settings.h [9].
KeccakTools can generate code that makes use of lane complementing for any lane size [7].
may compute sets of rows in parallel by making use of the bitwise Boolean instructions and
cyclic shifts. The simplest such grouping is by taking all the rows in a plane. Here ρ consists
of five cyclic shifts and π requires addressing. One can also apply bit interleaving and group
all rows in a plane with even or odd z coordinates.
The computation of the θ-effect from a state can be done bit by bit. For a given column,
one can initialize its column parity to zero and accumulate the bits of the column into it one
by one. This can again be grouped: lane by lane or using bit interleaving. This computation
can however also be integrated in a single phase together with the computation of the round
output, as in Section 2.4.1. After χ has been applied to a row, one accumulates each of the
bits into the corresponding column parity. After all rows of a slice have been computed, the
corresponding column parities will have their correct value. These column parities must then
simply be combined into the θ-effect. Note that this scheduling implies storage for the
θ-effect of the round input and storage for the θ-effect of the round output that is being
computed.
The five bits of the round input used in the computation of a row of the round output are
not used in the computation of any other row. So the registers containing these round input
bits may be overwritten by the computed round output bits. This allows reducing the working
memory for storing intermediate computation results to the working state, the θ-effect of the
round input and possibly the column parities being computed. This is interesting in
software implementations on platforms with restricted RAM. It comes at the expense of more
complex addressing, as the position of state bits depends on the round number.
for a fixed y. Then, the plane y at the output of the round is obtained as above,
without forgetting to XOR E[0, 0] with the round constant to implement ι. Note that, since
y is fixed, the variable B is a plane and can be indexed only by its x coordinate. To get the
parities of the columns before θ, denoted C[·], the simplest solution is to compute them in
a first phase of the round. All the steps are summarized in Algorithm 1. Note that all lane
addresses can be fixed by loop unrolling. The output variables E[·, ·] can be used as input
for the next round.
Most software implementations available actually use plane-per-plane processing [9].
The code generated with KeccakTools' KeccakFCodeGen::genCodePlanePerPlane()
follows this idea too [7]. The code generation can be combined with lane complementing
if desired.
When processing a plane y as in Algorithm 1, χ needs five lanes coming from the coordinates
M−1(x, y)T = (x + 3y, x)T for all x ∈ Z5 at the input of the round. This determines a line
with slope 1 going through (0, 2y) (input variety). The five lanes at the output of χ are per
construction on a line with slope 0 going through (0, y) (output variety). To avoid overwriting
other variables, we must store the five lanes of the output variety in the memory location
that held the input variety. We choose that the relationship between the lane coordinates and
the memory location is represented by a linear matrix N over Z5. At the end of the first
round, the lane at (x, y)T is stored at location N(x, y)T. To satisfy the mapping of the output
variety onto the input variety, N must be of the form

  ( 1 a )
  ( 1 b )

with a + b = 2. We choose to set a = 0 so as to leave the x coordinate unchanged during the
mapping, hence

  N = ( 1 0 )
      ( 1 2 ).
At the beginning of the second round, the input variety appears rotated by N in memory,
so the lane at (x, y)T is stored at location N^2(x, y)T at the end of the second round. In general,
at round i, the lanes with coordinates (x, y)T are stored at location N^i(x, y)T.
The input of χ with coordinates (x, y)T requires the lane with coordinates M−1(x, y)T before
π, which is located at coordinates N^i M−1(x, y)T in memory. Notice that N^i M−1(x, y)T =
(x + 3y, . . . )T, so there is a shift in the x coordinate inherent to the lane-per-lane processing.
This can be explicitly compensated for by multiplying all coordinates by

  MN = ( 1 2 )
       ( 0 1 )

and integrating this shift before or after χ, as χ is translation-invariant. This leads to a simpler
expression: the input of χ with coordinates (x + 2y, y)T requires the lane with coordinates
N(x, y)T before π, which is located at coordinates N^{i+1}(x, y)T in memory. The resulting
algorithm can be found in Algorithm 4.
Note the following properties of N:
• The matrix N has order 4, so after every 4 rounds the variable A will contain the state
without transposition. Unrolling 4 rounds implies that all the coordinates can be
constants.
Finally, note that Algorithm 4 can also be combined with the early parity, lane complementing
and bit interleaving techniques, the latter being detailed below.
The Inplace and Inplace32BI implementations explicitly use the technique presented in
this section on Keccak-f[1600], one using 64-bit rotations and the other using bit interleaving
for 32-bit rotations [9]. KeccakTools' KeccakFCodeGen::genCodeInPlace()
can generate code for 4 unrolled rounds using this in-place technique, with factor-2 interleaving
or without it [7].
coordinate ζ when necessary, with ζ ∈ Zs and s = w/m the interleaving factor. All the
operations on the x, y (resp. ζ) coordinates are considered modulo 5 (resp. modulo s).
While the memory location of a given lane varies from round to round according to the
N^i(x, y)T rule, the ζ coordinate of words varies as well. This is because five words are
fetched and processed from a given set of five memory locations, and the resulting five
words must be stored into the same five locations for the implementation to be in-place.
Let us first illustrate this with an example using interleaving with factor s = 4 (e.g., to compute
Keccak-f[1600] with 16-bit operations) and using the notations of Algorithm 3. At the
beginning of the first round i = 0, words are stored in memory locations with the identical
coordinates. Assume that we want to process the quarter-plane y = 2 and ζ = 0. At some
point, we need to compute the word B[0], i.e., word (x, y, ζ) = (0, 2, 0) at the input of χ. To
do so, we need the word A[1, 0, 3] since (x′, y′)T = (1, 0)T = M−1(x, y)T and r[1, 0] = 1, so
ζ′ = ζ − 1 = 3 (mod s). At the beginning of the round, word (1, 0, 3) is stored in memory
at (1, 0, 3). We will need to reuse the memory locations that we just processed. So, to respect
the N^i(x, y)T rule, all words belonging to lane (1, y) = (1, 2) have to be stored at memory
locations (1, 0, ·) since N(1, 2)T = (1, 0)T. In particular, word (1, y, ζ) = (1, 2, 0) needs to
be stored at memory location (1, 0, 3) since it is the only memory location free at this point
within (1, 0, ·). So the third coordinate of the word (0 in this example) can become different
from the third coordinate of the memory location storing it (3 in this example). For the other
words of the lane, there will be a constant shift between the third coordinates.
In general, the relationship between the third coordinate of words and that of the memory
location depends on the rotation constants. We model this relationship as follows. At the
beginning of round i, word ζ of lane (x, y)T is stored at memory location (x′, y′, ζ + O(i, x′, y′))
with (x′, y′)T = N^i(x, y)T. Here O(i, x, y) is viewed as a property of the memory cells (x, y, ·):
it tells by how many positions the words are shifted.
Using the notations of Algorithm 3, word (x′, y′, ζ′) is read in order to evaluate output
word (x, y, ζ). Word (x′, y′, ζ′) is located in memory at (x′′, y′′, ζ′ + O(i, x′′, y′′)), with
(x′′, y′′)T = N^i(x′, y′)T, which can now be reused. Hence some output word within the
part of the plane being processed (·, y, ζ) is stored at the same memory position (x′′, y′′, ζ′ +
O(i, x′′, y′′)) at the end of round i. Looking only at the third coordinate, we use memory
location ζ′ + O(i, x′′, y′′) = ζ − r[x′, y′] + O(i, x′′, y′′) (mod s) to store word ζ at the
beginning of round i + 1. Hence,

  O(i, x, y) = − ∑_{j=0}^{i−1} r[N^{−j}(x, y)T]  (mod s).

The resulting algorithm is displayed in Algorithm 5. Note that O(i, 0, 0) = 0 for all rounds
i, so the last line (evaluation of ι) can be simplified.
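The word-offset table can be tabulated directly from this formula. The sketch below (our own illustration) computes O(i, x, y) for the factor-2 interleaved case s = 2, using N^{−1} = (1 0; 2 3) over Z5 (the inverse of N = (1 0; 1 2)) and the standard Keccak-f rotation offsets.

```c
/* Cyclic shift offsets r[x][y] of Keccak-f. */
static const unsigned R[5][5] = {
    { 0, 36,  3, 41, 18},
    { 1, 44, 10, 45,  2},
    {62,  6, 43, 15, 61},
    {28, 55, 25, 21, 56},
    {27, 20, 39,  8, 14}
};

/* Apply N^{-1} = (1 0; 2 3) over Z5, the inverse of N = (1 0; 1 2). */
static void apply_Ninv(unsigned *x, unsigned *y)
{
    unsigned nx = *x % 5;
    unsigned ny = (2 * *x + 3 * *y) % 5;
    *x = nx;
    *y = ny;
}

/* O(i, x, y) = - sum_{j=0}^{i-1} r[N^{-j}(x, y)]  (mod s). */
unsigned word_offset(unsigned i, unsigned x, unsigned y, unsigned s)
{
    unsigned sum = 0;
    for (unsigned j = 0; j < i; j++) {
        sum += R[x][y];          /* j = 0 uses (x, y) itself */
        apply_Ninv(&x, &y);      /* move to the next term of the sum */
    }
    return (s - sum % s) % s;
}
```

As the text notes, r[0, 0] = 0 and N^{−1} fixes (0, 0), so O(i, 0, 0) = 0 for every round.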
    end for
    for x = 0 to 4 do
      Let (x′′, y′′)T = N^{i+1}(x, y)T
      A[x′′, y′′, ζ + O(i + 1, x′′, y′′)] = B[x] ⊕ ((NOT B[x + 1]) AND B[x + 2])
    end for
  end for
  A[0, 0, ζ + O(i + 1, 0, 0)] = A[0, 0, ζ + O(i + 1, 0, 0)] ⊕ RCs[i, ζ] for ζ = 0 to s − 1
• The matrix N and its powers leave the set x + y = 0 unchanged, i.e., the set x + y = 0
is an eigenspace with eigenvalue 1.
• Except for the points in the set x + y = 0, the points N^i(x, y)T stay in column x, take
all the positions such that y ̸= −x and come back to their initial position after a cycle
of length 4.
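These properties of N can be checked mechanically, as in this small sketch of our own:

```c
/* Apply N = (1 0; 1 2) over Z5: (x, y) -> (x, x + 2y). */
static void apply_N(unsigned *x, unsigned *y)
{
    unsigned nx = *x % 5;
    unsigned ny = (*x + 2 * *y) % 5;
    *x = nx;
    *y = ny;
}
```

Iterating apply_N over all 25 positions confirms that the first coordinate never changes, that every position returns to itself after 4 applications (N has order 4), and that the points with x + y = 0 are fixed after a single application (eigenspace with eigenvalue 1).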
Chapter 3
Software
In [9], we provide an implementation suitable for 64-bit CPUs called Optimized64. The
code uses only plain C instructions, without assembly or SIMD instructions. If needed,
we have applied lane complementing to reduce the number of NOTs. The operations in
the round function have been expanded in macros generated by KeccakTools [7]. We
have tried to interleave lines that apply to different variables to enable pipelining, while
grouping sets of lines that use a common precomputed value to avoid reloading the registers
too often. The order of operations is centered on the evaluation of χ on each plane,
preceded by the appropriate XORs for θ and rotations for ρ, and accumulating the parity
of the sheets for θ in the next round.
• For x86-64, the fastest option is to unroll 24 rounds (#define Unrolling 24) and to
use lane complementing (#define UseBebigokimisa).
• For IA64, unrolling 6 rounds and not using lane complementing give the fastest re-
sults.
enabling the absorption of long messages at fewer than 10 cycles/byte. Furthermore, one may
consider the case r = 1088 and C = c = 512, for which the claimed security level is 2^256. While
losing the power of two for the rate, the final node needs to absorb only one block (DC < r) and
the overhead remains reasonable: one extra evaluation of Keccak-f per message. This also
benefits long messages, for which the number of cycles per byte is further reduced to about 9
cycles/byte.
For 32-bit platforms in general, good starting points are the following.
For small 8-bit platforms in general, the target Compact8 in [9] is a good starting point for
implementing Keccak-f[1600] on an 8-bit processor. Also, the target Simple can be used to
instantiate a simple yet optimized implementation of Keccak-f[200], mapping a lane to a
byte (see file Keccak-simple.c and set cKeccakB to 200 in Keccak-simple-settings.h),
although this implementation was not optimized for size.
Table 3.1: Performance estimates on the Intel 8051, per evaluation of Keccak-f[1600] (24 rounds)

  Step                Performance
  θ                    34560 cycles
  ρ                    20808 cycles
  π                        0 cycles
  χ                    30048 cycles
  ι                      384 cycles
  Word re-ordering     15000 cycles
  Total               100800 cycles
the exception of 4-bit rotations, which can be done by swapping the two hex digits (nibbles)
of each byte). However, using efficient memory transfer instructions like XCH, which exchanges
the accumulator with a byte in memory, it is actually possible to reduce the average cost of a
rotation to only 4.3 cycles/byte.
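The byte-oriented rotation discussed here can be sketched in portable C (our own illustration, not 8051 assembly): a 64-bit lane stored as 8 bytes is rotated by first moving whole bytes (the offset divided by 8) and then shifting the array by the remaining 0..7 bits.

```c
#include <stdint.h>

/* Rotate an 8-byte lane (byte i = bits 8i..8i+7) left by n bit positions,
 * as an 8-bit platform would: whole-byte moves for n/8 positions, then a
 * bitwise shift by the remaining n%8 bits across the byte array. */
void rot_lane_bytes(uint8_t lane[8], unsigned n)
{
    uint8_t tmp[8];
    unsigned bytes = (n / 8) % 8, bits = n % 8;
    int i;

    for (i = 0; i < 8; i++)                 /* byte-level rotation */
        tmp[(i + bytes) % 8] = lane[i];
    if (bits == 0) {
        for (i = 0; i < 8; i++) lane[i] = tmp[i];
        return;
    }
    for (i = 0; i < 8; i++)                 /* remaining 1..7 bit shift */
        lane[i] = (uint8_t)((tmp[i] << bits) | (tmp[(i + 7) % 8] >> (8 - bits)));
}
```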
The performance estimates for Keccak, including the details for each step, are given in
Table 3.1, for 24 rounds. One cycle refers to the number of controller clock oscillation periods,
which is 12 in the case of our selected variant. It must be noted that these figures are the result
of a best-effort paper exercise. Figures for an actual implementation might vary, in particular
if it uses specific manufacturer improvements available on the platform.
Chapter 4
Hardware
4.1 Introduction
Thanks to the symmetry and simplicity of its round function, Keccak allows trading off area
for speed and vice versa. Different architectures reflect different trade-offs. The two
architectures we have investigated and implemented reflect the two ends of the spectrum:
a high-speed core and a low-area coprocessor.
We have coded our architectures in VHDL for implementation in ASIC and FPGA [4]. For
more details on the VHDL code, refer to the readme.txt file in the VHDL directory.
It should be noted that during the design of Keccak, particular effort was put into facilitating
hardware implementation. The round function is based only on simple Boolean expressions,
and there is no need for adders or S-boxes with complex logic (typically used in many
cryptographic primitives). Avoiding these complex sub-blocks allows a very short critical
path and hence very high clock frequencies. Another beneficial aspect of Keccak is that,
unless intentionally limited, a general architecture implementing Keccak-f and the sponge
construction can easily support all variants (rates, capacities) and use cases (MAC, MGF,
KDF, PRG) for a given lane size.
computation. The CPU can program a direct memory access (DMA) transfer of chunks of
the message to be hashed, and the CPU can be assigned to a different task while the core is
computing the hash.
The architecture of the high-speed core design is depicted in Figure 4.1. It is based on a
plain instantiation of the combinational logic for computing one Keccak-f round, which is
used iteratively.
The core is composed of three main components: the round function, the state register
and the input/output buffer. The use of the input/output buffer allows decoupling the core
from a typical bus used in a system-on-chip (SoC).
In the absorbing phase, the I/O buffer allows the simultaneous transfer of the input
through the bus and the computation of K - f for the previous input block. Similarly, in
the squeezing phase it allows the simultaneous transfer of the output through the bus and
the computation of K - f for the next output block.
These buses typically come in widths of 8, 16, 32, 64 or 128 bits. We have decided to
fix the bus width to the lane size w of the underlying Keccak-f permutation. This limits the
throughput of the sponge engine to w bits per cycle. This imposes a restriction only if r/w (i.e.,
the rate expressed in number of lanes) is larger than the number of rounds of the underlying
Keccak-f.
In a first phase, the high-speed core was coded in VHDL. Test benches for Keccak-f
and the hash function are provided, together with C code allowing the generation of test
vectors for the test benches. We were able to introduce the lane size as a parameter, allowing
us to generate VHDL for all the lane sizes supported by Keccak.
These first VHDL implementations have been tested on different FPGAs by J. Strömbergson
[31], highlighting some possible improvements and problems with the tools available
from FPGA vendors. We have improved the VHDL code to solve these problems, and this
The critical path of the core is 1.8 nanoseconds, of which 1.1 nanoseconds are spent in the
combinatorial logic of the round function. This results in a maximum clock frequency of
555 MHz and a throughput of 1.23 Gbit/s. The area needed for running the core at this
frequency is 6.5 kgates, composed of 3 kgates for the round function, 3.1 kgates for the state
register and control logic, and fewer than 400 gates for the I/O buffer.
An alternative without a separate I/O buffer saves about 400 gates and decreases the
throughput to 0.96 Gbit/s at 555 MHz.
• First the sheet parities are computed, and the 5 resulting lanes are stored in a dedicated
area of the memory.
• The second phase consists of computing the θ transformation, reading all the lanes of
the state and XORing each with the corresponding sheet parities. After a lane is computed
in this way, it is rotated according to ρ and written to the position defined by π. The
intermediate state is then completely stored in the buffer B.
• The last step is to compute χ and add the round constant (ι) to the lane in position
(0, 0). To do this, the coprocessor reads 3 lanes of a plane from the intermediate
state, computes χ and writes the result to the buffer A, then reads another element of the
intermediate state and writes the next χ output, and so on for the 5 elements of the plane.
The computation of one round of the Keccak-f permutation takes 215 clock cycles. Out
of these, 55 are bubbles, where the core is computing internally and not transferring data to
or from the memory.
In a variant with memory words half the lane size, the number of clock cycles doubles, but
only for the read and write transfers, not for the bubbles. In such an implementation
one round of Keccak-f requires 375 clock cycles.
The buffer A, where the input of the permutation is written and where the output of the
permutation is written at the end of the computation, has the size of the state (25 times
the lane size), while the memory space for storing temporary values is 1.2 times the size
of the state.
Table 4.2: Performance estimation of the high-speed core of Keccak[r = 1024, c = 576] on
different FPGAs, with the available resources given in brackets.
The low-area coprocessor has been coded in VHDL and simulated using ModelSim. Like
the core described in Section 4.2, the coprocessor has been synthesized using ST 130 nm
technology.
Table 4.3: Performance estimation of the high-speed core of Keccak[r = 40, c = 160] on
different FPGAs, with the available resources given in brackets.
Table 4.4: Performance estimation of the low-area coprocessor of Keccak[r = 1024, c = 576]
on different FPGAs, with the available resources given in brackets.
Table 4.5: Performance estimation of the low-area coprocessor of Keccak[r = 40, c = 160] on
different FPGAs, with the available resources given in brackets.
Chapter 5
Protection against side-channel attacks
If the input to Keccak includes secret data or keys, side-channel attacks may pose a threat.
In this chapter, we report on Keccak implementations that offer a high level of resistance
against power analysis by using the technique of masking (secret sharing).
5.1 Introduction
Sponge functions, Keccak among them, can be used in a wide range of modes covering
the full range of symmetric cryptography functions; we refer to [10, 3] for examples. This
includes functions that take a secret key as argument. Other functions do not take a
secret key but take as input data that should remain secret, such as pseudorandom sequence
generators or commit-challenge-response protocols. If such functionality is desired on
devices to which an adversary has some kind of physical or logical access, protection against
side-channel and fault attacks is appropriate [2].
Side-channel and fault attacks do not exploit an inherent weakness of
an algorithm, but rather characteristics of the implementation. For their security, cryptographic
primitives inevitably rely on the fact that an adversary does not have access to intermediate
computation results. As a consequence, even partial knowledge of intermediate
computation results can give a complete breakdown of security, e.g., by allowing computation
of the key. However, actual implementations may leak information on these results
via characteristics such as computation time, power consumption or electromagnetic radiation.
Another condition for the security of cryptographic primitives is that they are executed
without faults. An implementation that makes faults, or can be manipulated into making faults,
may be completely insecure.
We concentrate here on countermeasures against power analysis and electromagnetic
analysis that make use of the algebraic properties of the step functions of Keccak. In the
remainder of this chapter we speak only of power analysis, implicitly including electromagnetic
analysis. The main difference between power analysis and electromagnetic analysis
is that in the latter the adversary can make more sophisticated measurements, and that
the physical and electronic countermeasures differ. The countermeasures at the algorithmic
level are, however, the same for both.
As far as timing attacks are concerned, it is straightforward to implement Keccak in such
a way that its execution time is independent of the input it processes, both in software and
in hardware. Moreover, Keccak-f does not make use of large lookup tables, so cache-timing
attacks pose no problem. The interleaving method of Section 2.1 can be implemented
with table lookups, which depend on the input message only. Of course, it can also be
implemented without any table at all.
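For illustration, a table-free conversion can be built from constant-time delta swaps (a standard perfect-unshuffle network; the code and mask constants below are a generic sketch, not taken from the Keccak reference code, and the convention that even bits land in the low half is our choice):

```c
#include <stdint.h>

/* Split a 64-bit lane into its even-indexed bits (low 32-bit word) and
   odd-indexed bits (high word) without any lookup table, using
   constant-time delta swaps. */
uint64_t to_interleaved(uint64_t x)
{
    uint64_t t;
    t = (x ^ (x >>  1)) & 0x2222222222222222ULL; x ^= t ^ (t <<  1);
    t = (x ^ (x >>  2)) & 0x0C0C0C0C0C0C0C0CULL; x ^= t ^ (t <<  2);
    t = (x ^ (x >>  4)) & 0x00F000F000F000F0ULL; x ^= t ^ (t <<  4);
    t = (x ^ (x >>  8)) & 0x0000FF000000FF00ULL; x ^= t ^ (t <<  8);
    t = (x ^ (x >> 16)) & 0x00000000FFFF0000ULL; x ^= t ^ (t << 16);
    return x;
}

/* Bit-by-bit reference version, for checking the branch-free one. */
uint64_t to_interleaved_ref(uint64_t x)
{
    uint64_t even = 0, odd = 0;
    for (int i = 0; i < 32; i++) {
        even |= ((x >> (2*i)) & 1) << i;
        odd  |= ((x >> (2*i + 1)) & 1) << i;
    }
    return even | (odd << 32);
}
```

Both routines process the lane in constant time, with no data-dependent memory accesses.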
Protection against fault attacks can be achieved by countermeasures that are independent
of the cryptographic primitive being protected: fault detection at the software level, at the
hardware level, and by performing computations multiple times and verifying that the results
are equal. Particularly good sources of information on side-channel attacks and countermeasures
are the proceedings of the yearly Cryptographic Hardware and Embedded Systems (CHES)
conferences (http://www.chesworkshop.org/) and the textbook [29]. A good source of
information on fault attacks and countermeasures are the proceedings of the yearly Fault
Diagnosis and Tolerance in Cryptography (FDTC) workshops (http://conferenze.dei.polimi.it/
FDTC10/index.html).
The general setup of power (and electromagnetic) analysis is that the attacker obtains one or
more traces of the measured power consumption. If a single trace suffices to mount
an attack, one speaks of simple power analysis (SPA). However, the dependence of the
signal on the variables being computed is typically small and obscured by noise. This can be
compensated for by taking many traces, each one representing an execution of the cryptographic
primitive with different input values. These traces are then subjected to statistical
methods to retrieve the key information. Such attacks are called differential power analysis
(DPA) [22]. An important aspect of these attacks is that the traces must be aligned: they
must be combined in the time domain such that corresponding computation steps coincide
between the different traces.
In DPA one distinguishes between first-order and higher-order DPA. In first-order DPA, the
attacker is limited to considering a single time offset of the traces. In m-th order DPA the attacker
may incorporate up to m time offsets in the analysis. Higher-order attacks are in principle
more powerful, but also much harder to implement [29].
In correlation power analysis (CPA) [12] one exploits the fact that the power consumption
may be correlated to the value of bits (or bitwise differences of bits) being processed at any
given moment: there is a difference in the expected value of the power consumption. In
short, one exploits this by taking many traces and partitioning them into two subsets: in one
set, a particular bit, the target bit, is systematically equal to 0, and in the other it is equal
to 1. One then adds the traces in each of the two sets and subtracts the results, giving the
compound trace. If at any given time the power consumption is correlated to the target
bit, one sees a high value in the compound trace. One can use this to retrieve key information
by taking a target bit that depends on part of the key and trying different partitions based
on hypotheses for that part of the key. If wrong key guesses result in partitions where
the bits of intermediate results are more or less balanced, the compound trace of the correct
key guess will stand out. Note that if the power consumption is correlated to bit values
or differences, it is also correlated to the Hamming weights of words or Hamming distances
between words.
Later, more advanced ways to measure the distance between distributions were introduced.
In particular, mutual information analysis (MIA) [19] is a generalization of CPA:
instead of just exploiting the fact that different bit values may result in different
expected power consumption values, it can exploit any difference between the distributions
of the power consumption for a bit being 0 or 1. So in short, when the
power consumption distributions for a bit equal to 0 or 1 have equal mean values but different
shapes, CPA will not work, while MIA may still be able to distinguish the two distributions.
Countermeasures against power analysis can be applied at several levels:
Transistor level Logical gates and circuits are built in such a way that the information leakage
is reduced;
Platform level The platform supports features such as irregular clocking (clock jitter), random
insertion of dummy cycles and addition of noise to the power consumption;
Program level The order of operations can be randomized, or dummy instructions can be
inserted randomly, to make the alignment of traces more difficult;
Algorithmic level The operations of the cryptographic algorithm are computed in such a
way that the information leakage is reduced;
Protocol level The protocol is designed such that it limits the number of computations an
attacker can conduct with a given key.
cryptographic primitive is relevant. One countermeasure of this type is secret
sharing (or masking), and Keccak is particularly well suited for it.
In the nonlinear step χ, each state word is updated as

x[i] ← x[i] + (x[i+1] + 1) x[i+2],    (5.1)

with the index i taken modulo 5 along a plane.
With the native variable split into two shares a and b such that x = a + b, the corresponding computation on the shares becomes

a[i] ← a[i] + (a[i+1] + 1) a[i+2] + a[i+1] b[i+2]
b[i] ← b[i] + (b[i+1] + 1) b[i+2] + b[i+1] a[i+2].    (5.2)
To achieve independence from native variables, the order in which the operations are executed
is important. If the expressions are evaluated left to right, it can be shown that all
intermediate variables are independent of native variables. The computation of all terms
except the rightmost one involves only variables of a single share, so independence
from x is automatic there. For the addition of the mixed term to the intermediate result of the
computation, the presence of a[i] (or b[i]) as a linear term in the intermediate variable results
in independence.
At the algorithm level, the effect of introducing two shares in the computation
of the Keccak-f round function is rather simple. The linear part λ can be executed on the
two shares separately, roughly doubling the workload. In the nonlinear step mapping χ, the
computation of a state word according to Equation (5.1), taking an XOR, an AND and a NOT
instruction, is replaced by the computation of the shares according to Equations (5.2), taking in
total 4 XOR, 4 AND and 2 NOT instructions. The addition of the round constants ι and the addition
of input blocks can be performed on one share only.
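The two-share computation of Equations (5.2) can be sketched lane-wise in C. A C compiler gives no guarantee on evaluation order, so this sketch only demonstrates the correctness of the shared computation, not side-channel security:

```c
#include <stdint.h>

/* Masked chi on one plane of five lanes: the native value x[i] is
   carried as two shares, x[i] = a[i] ^ b[i].  Each share is updated
   following Eq. (5.2); evaluated left to right, every intermediate
   result stays independent of the native variable. */
void chi_two_shares(uint64_t a[5], uint64_t b[5])
{
    uint64_t na[5], nb[5];
    for (int i = 0; i < 5; i++) {
        int i1 = (i + 1) % 5, i2 = (i + 2) % 5;
        na[i] = a[i] ^ (~a[i1] & a[i2]) ^ (a[i1] & b[i2]);
        nb[i] = b[i] ^ (~b[i1] & b[i2]) ^ (b[i1] & a[i2]);
    }
    for (int i = 0; i < 5; i++) { a[i] = na[i]; b[i] = nb[i]; }
}
```

XORing the two output shares yields exactly χ applied to a ⊕ b, which is what makes the sharing transparent to the rest of the computation.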
As the order of execution is important, it is not sufficient to write a program in C or
some other high-level language and compile it: the compiler may optimize away the desired
order. One option is to compile and inspect the machine code afterwards, but the method that
provides the highest level of assurance is to program in assembly language. It is, however,
not sufficient to check only the sequencing of instructions. In general, the operations on two
shares of the same variable are preferably executed in registers physically isolated from each
other. More particularly, the program shall be such that at any given moment there is no
register (or bus) content or transition that is correlated to a native variable. For example,
if the number of shares is two and a register containing x0 is loaded with x1, the power
consumption can depend on the number of bits switched (e.g., on the Hamming weight of
x0 ⊕ x1) and is likely to leak information on x. This can be solved by setting the register to
zero in between. Another example is the simultaneous processing of two shares of a variable
in different parts of the CPU. As the power consumption at any given moment depends on
all processing going on, it depends on both shares, and hence there may be a dependence
on the native variable. Clearly, care must be taken when attempting to build side-channel-resistant
implementations. If sufficient care is taken, this provides provable resistance
against first-order DPA. Higher-order DPA is in principle still possible, but as explained in
[17, Appendix A] it is very sensitive to noise and clock jitter. On smart cards, the addition of
noise, dummy cycles and clock jitter is typically supported by the platform.
With three shares a, b and c such that x = a + b + c, the nonlinear step can be computed as

a[i] ← b[i] + (b[i+1] + 1) b[i+2] + b[i+1] c[i+2] + c[i+1] b[i+2]
b[i] ← c[i] + (c[i+1] + 1) c[i+2] + c[i+1] a[i+2] + a[i+1] c[i+2]    (5.3)
c[i] ← a[i] + (a[i+1] + 1) a[i+2] + a[i+1] b[i+2] + b[i+1] a[i+2].
Clearly, the computation of each share takes as input only components of the other two shares,
and this provides provable security against first-order CPA, even in the presence of glitches.
The three lines of Equation (5.3) are equivalent and differ only in the input and output
shares. We denote the computation of the nonlinear step χ′ resulting in one share, given
the two other shares, by, e.g., a ← χ′(b, c).
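A minimal C sketch of χ′ over one plane, with the convention a ← χ′(b, c) of Equation (5.3). By construction the three output shares recombine to χ applied to a ⊕ b ⊕ c:

```c
#include <stdint.h>

/* chi' of Eq. (5.3): computes one output share of a plane from the
   other two input shares, so no term ever mixes all three shares. */
void chi_prime(uint64_t out[5], const uint64_t s[5], const uint64_t t[5])
{
    for (int i = 0; i < 5; i++) {
        int i1 = (i + 1) % 5, i2 = (i + 2) % 5;
        out[i] = s[i] ^ (~s[i1] & s[i2]) ^ (s[i1] & t[i2]) ^ (t[i1] & s[i2]);
    }
}
```

A round then computes a ← χ′(b, c), b ← χ′(c, a) and c ← χ′(a, b) on the old share values.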
In the following subsections we present two architectures that make use of three shares.
Other architectures can be derived based on different partitionings or sequences of the
computational steps composing the round. For instance, a low-area coprocessor like those
presented in Section 4.4 can be protected using these secret-sharing techniques.
Figure 5.1: Protected architecture computing one round in one clock cycle.
cycle | R0              | R1       | R2       | R3
------+-----------------+----------+----------+----------
  0   | -               | -        | -        | -
  1   | input, A        | -        | -        | -
  2   | input, B        | λ(A+Din) | -        | -
  3   | input, C        | λ(B)     | λ(A+Din) | -
  4   | χ′(R2,R1), C′   | λ(C)     | λ(B)     | λ(A+Din)
  5   | χ′(R2,R1), A′   | λ(C′)    | λ(C)     | λ(A+Din)
  6   | χ′(R2,R3), B′   | λ(A′)    | λ(C′)    | λ(C)
  7   | χ′(R2,R1), B′′  | λ(B′)    | λ(A′)    | λ(C′)
  8   | χ′(R2,R1), C′′  | λ(B′′)   | λ(B′)    | λ(C′)
  9   | χ′(R2,R3), A′′  | λ(C′′)   | λ(B′′)   | λ(B′)
 10   | χ′(R2,R1), A′′′ | λ(A′′)   | λ(C′′)   | λ(B′′)
 11   | χ′(R2,R1), B′′′ | λ(A′′′)  | λ(A′′)   | λ(B′′)
 12   | χ′(R2,R3), C′′′ | λ(B′′′)  | λ(A′′′)  | λ(A′′)
as inputs. The three initial values of the shares are indicated as A, B and C; the values after
the first round as A′, B′ and C′, after the second round as A′′, and so on. At clock cycle zero the
registers do not yet contain any relevant data, as indicated with a -. Instead of loading the
three shares in parallel, as done in the previous architecture, one share per clock cycle is
loaded into register R0 during the three initial clock cycles.
The content of R1 is in general the result of λ applied to R0, with Din enabled and
XORed at the input of λ(R0) before an initial Keccak-f round.
The content of R0 is the result of χ′ applied either to the pair R2 and R1 or to R2 and R3.
While in the one-cycle-round architecture ι was applied to only one of the three shares, here
it is applied to all three. This does not change the result of the computation and
simplifies the control logic.
The registers R2 and R3 maintain the values required for the computation
of χ′. In the second clock cycle B is loaded into R0, and R1 receives λ(A). In the third clock
cycle C is loaded into R0, R1 receives λ(B), and λ(A) is moved from R1 to R2. In the fourth
clock cycle no more shares need to be loaded; instead, R0 receives χ′ applied to the contents
of R2 and R1, which means λ(A) and λ(B), so that R0 now contains the share C′ of
the first round output. R1 receives λ(C), λ(B) is moved from R1 to R2, and λ(A) is moved from R2 to
R3. In the fifth clock cycle R0 receives χ′ applied to the contents of R2 and R1, being λ(B) and
λ(C), and hence contains the share A′ of the first round output. R1 receives λ(C′), λ(C) is moved
from R1 to R2, while λ(A) remains in R3. Since shares A′ and C′ of the second round input
have already been computed, in the next clock cycle share B′ must be computed; this requires
λ(C) and λ(A), which are then in R2 and R3, respectively. This completes the description
of a round: from the next clock cycle onwards there is a loop of three clock cycles in which
the register contents alternate exactly as needed for computing one round.
Using this circuit, a Keccak-f computation takes 3 initial clock cycles for loading the shares
plus 3 clock cycles per round, i.e., 72 cycles for the 24 rounds of Keccak-f[1600].
Figure 5.2: Protected architecture computing one round in three clock cycles.
depends on the native variable being stored and hence is vulnerable to MIA. In this section,
we explain that the introduction of noise has a much stronger impact on the feasibility of this
attack for a two-share or three-share implementation than for an unmasked one, and that the
qualitative difference is similar to the one between first-order and higher-order DPA. Furthermore,
we argue that a masked implementation has by construction a higher algorithmic
noise level than an unmasked one.
With three shares, a native bit equal to 0 (resp. 1) can be represented as 000, 011, 101 or
110 (resp. 001, 010, 100 or 111). If the three shares are processed simultaneously, the power
consumption can leak the Hamming weight of the shares, which is 0 or 2 for a native
bit 0, and 1 or 3 for a native bit 1. Clearly, these two distributions are different and can be
distinguished.
We start the discussion by constructing a simple model where the power consumption
is equal to the Hamming weight plus an additive Gaussian noise of standard deviation σ,
expressed in the same unit as the impact of the Hamming weight on the power consumption.
In this model, the distribution of the power consumption for three shares is as in Figure 5.3(c),
with σ = 0.2. Similarly, we can look at an unmasked implementation, where the power
consumption is 0 or 1 plus the Gaussian noise, and at masking with two shares. In this last
case, the Hamming weight is 0 or 2 for a native bit 0 (represented as 00 or 11) or 1 for a
native bit 1 (represented as 01 or 10). The two cases can be found in Figures 5.3(a) and 5.3(b),
respectively.
Figure 5.3: Distribution of the power consumption for a simple model. The solid line shows
the distribution for a bit with native value 0 and the dashed line for a bit 1. Sub-figures (a),
(b) and (c) show the case of one, two and three shares, respectively, with a noise level of
σ = 0.2. Sub-figures (d), (e) and (f) follow the same model with σ = 2.
Following this model, we compute the number of samples needed to distinguish
the distribution for a native bit 0 from the distribution for a native bit 1. We follow
the same reasoning as in [17, Appendix A]. The number z of samples needed to distinguish
one distribution from the other is inversely proportional to the Kullback-Leibler divergence
between the two distributions [23].
In this model, the scaling of z as a function of σ is different for one, two or three shares.
For one share, z ≈ 2σ² samples are needed to distinguish a native bit 0 from a native bit 1,
whereas about z² samples are needed for the same noise level when two shares are used,
and this number grows to about z³ samples for three-share masking.
Hence, the difference between the one-share, two-share and three-share implementations
is qualitatively the same as that between first-order, second-order and third-order
DPA.
The real-world behavior is likely to differ from this simple model. Nevertheless, we ex-
pect a significantly higher sensitivity to noise for three shares than for one. Qualitatively,
the three pairs of distributions are different. For one share, the mean is different for native
bits 0 and 1. For two shares, the two distributions have the same mean but a different vari-
ance. For three shares, the two distributions have the same mean and variance; they differ
only starting from their third moment. Figures 5.3(d), 5.3(e) and 5.3(f) illustrate this with the
simple model and a higher noise level σ = 2.
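The moment-by-moment statement can be checked exactly by enumerating the share encodings of a native bit (noise-free, uniformly distributed shares; this enumeration is ours, not taken from the text):

```c
#include <math.h>

static int bit_count(int v)
{
    int c = 0;
    while (v) { c += v & 1; v >>= 1; }
    return c;
}

/* Mean Hamming weight over the 2^(m-1) encodings of one native bit
   into m shares (the encodings with matching parity). */
static double hw_mean(int m, int bit)
{
    double sum = 0.0;
    int count = 0;
    for (int enc = 0; enc < (1 << m); enc++) {
        int hw = bit_count(enc);
        if ((hw & 1) == bit) { sum += hw; count++; }
    }
    return sum / count;
}

/* k-th central moment of the Hamming weight for the same encodings. */
static double hw_central_moment(int m, int bit, int k)
{
    double mu = hw_mean(m, bit), acc = 0.0;
    int count = 0;
    for (int enc = 0; enc < (1 << m); enc++) {
        int hw = bit_count(enc);
        if ((hw & 1) != bit) continue;
        double p = 1.0;
        for (int j = 0; j < k; j++) p *= (double)hw - mu;
        acc += p;
        count++;
    }
    return acc / count;
}
```

For two shares the means coincide but the variances differ (1 versus 0); for three shares mean and variance coincide and only the third central moments differ (-0.75 versus +0.75), matching the figure.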
So far, we have assumed that the three levels of masking are subject to the same noise
level σ. However, masking by itself introduces noise, as m − 1 shares are randomly and
independently chosen for every execution of the algorithm. The very dependence of the
power consumption on the bits being processed that an attacker exploits turns against her, as
it becomes an additional source of noise due to the randomization. For instance, in the one-cycle-one-round
implementation of Keccak-f[1600] with three shares, the noise due to the
Hamming weight of 3200 random bits must be compared to the small number of unmasked
bit values that a differential power analysis attempts to recover.
Bibliography
[2] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, Note on side-channel attacks and
their countermeasures, Comment on the NIST Hash Competition Forum, May 2009, http:
//keccak.noekeon.org/NoteSideChannelAttacks.pdf.
[8] ——, Known-answer and Monte Carlo test results, 2011, available from http://keccak.
noekeon.org/.
[12] E. Brier, C. Clavier, and F. Olivier, Correlation power analysis with a leakage model, CHES
(M. Joye and J.-J. Quisquater, eds.), Lecture Notes in Computer Science, vol. 3156,
Springer, 2004, pp. 16–29.
[17] J. Daemen, M. Peeters, and G. Van Assche, Bitslice ciphers and power analysis attacks, Fast
Software Encryption 2000 (B. Schneier, ed.), Lecture Notes in Computer Science, vol.
1978, Springer, 2000, pp. 134–149.
[18] G. Gielen and J. Figueras (eds.), 2004 design, automation and test in Europe conference and
exposition (DATE 2004), 16-20 February 2004, Paris, France, IEEE Computer Society, 2004.
[19] B. Gierlichs, L. Batina, P. Tuyls, and B. Preneel, Mutual information analysis, CHES (E. Os-
wald and P. Rohatgi, eds.), Lecture Notes in Computer Science, vol. 5154, Springer,
2008, pp. 426–442.
[20] S. Guilley, P. Hoogvorst, Y. Mathieu, R. Pacalet, and J. Provost, CMOS structures suitable
for secured hardware, in Gielen and Figueras [18], pp. 1414–1415.
[22] P. C. Kocher, J. Jaffe, and B. Jun, Differential power analysis, Advances in Cryptology –
Crypto ’99 (M. Wiener, ed.), Lecture Notes in Computer Science, vol. 1666, Springer,
1999, pp. 388–397.
[23] S. Kullback and R. A. Leibler, On information and sufficiency, Ann. Math. Statist. 22 (1951),
no. 1, 79–86, http://projecteuclid.org/euclid.aoms/1177729694.
[24] S. Mangard, N. Pramstaller, and E. Oswald, Successfully attacking masked AES hardware
implementations, CHES (J. R. Rao and B. Sunar, eds.), Lecture Notes in Computer Science,
vol. 3659, Springer, 2005, pp. 157–171.
[25] S. Nikova, V. Rijmen, and M. Schläffer, Secure hardware implementation of nonlinear func-
tions in the presence of glitches, ICISC (P. J. Lee and J. H. Cheon, eds.), Lecture Notes in
Computer Science, vol. 5461, Springer, 2008, pp. 218–234.
[27] NIST, Announcing request for candidate algorithm nominations for a new cryptographic hash
algorithm (SHA-3) family, Federal Register Notices 72 (2007), no. 212, 62212–62220, http:
//csrc.nist.gov/groups/ST/hash/index.html.
[28] ——, ANSI C cryptographic API profile for SHA-3 candidate algorithm submissions, revision
5, February 2008, available from http://csrc.nist.gov/groups/ST/hash/sha-3/
Submission_Reqs/crypto_API.html.
[29] S. Mangard, E. Oswald, and T. Popp, Power analysis attacks: revealing the secrets of smart
cards, Springer-Verlag, 2007.
[30] G. Sevestre, Keccak tree hashing on GPU, using Nvidia Cuda API, 2010, http://sites.
google.com/site/keccaktreegpu/.
[31] J. Strömbergson, Implementation of the Keccak hash function in FPGA devices, http://www.
strombergson.com/files/Keccak_in_FPGAs.pdf.
[32] K. Tiri and I. Verbauwhede, A logic level design methodology for a secure DPA resistant ASIC
or FPGA implementation, in Gielen and Figueras [18], pp. 246–251.
[33] C. Wenzel-Benner and J. Gräf, XBX: keeping two eyes on embedded hashing, http://xbx.
das-labor.org/, accessed 21 December 2010.