Compact and Low-Latency FPGA-Based Number Theoreti
Compact and Low-Latency FPGA-Based Number Theoreti
Article
Compact and Low-Latency FPGA-Based Number Theoretic
Transform Architecture for CRYSTALS Kyber Postquantum
Cryptography Scheme
Binh Kieu-Do-Nguyen 1,2,3 , Nguyen The Binh 1,2 , Cuong Pham-Quoc 1,2, * , Huynh Phuc Nghi 1,2 ,
Ngoc-Thinh Tran 1,2 , Trong-Thuc Hoang 3 and Cong-Kha Pham 3
1 Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly
Thuong Kiet St., Dist. 10, Ho Chi Minh City 740050, Vietnam; binh@vlsilab.ee.uec.ac.jp (B.K.-D.-N.);
binh.nguyen288@hcmut.edu.vn (N.T.B.); nghihp@hcmut.edu.vn (H.P.N.); tnthinh@hcmut.edu.vn (N.-T.T.)
2 Computer Engineering Department, Vietnam National University—Ho Chi Minh City (VNU-HCM), Thu
Duc, Ho Chi Minh City 700000, Vietnam
3 Department of Computer and Network Engineering, University of Electro-Communications (UEC),
Tokyo 182-8585, Japan; hoangtt@uec.ac.jp (T.-T.H.); phamck@uec.ac.jp (C.-K.P.)
* Correspondence: cuongpham@hcmut.edu.vn
Abstract: In the modern era of the Internet of Things (IoT), especially with the rapid development of
quantum computers, the implementation of postquantum cryptography algorithms in numerous termi-
nals allows them to defend against potential future quantum attack threats. Lattice-based cryptography
can withstand quantum computing attacks, making it a viable substitute for the currently prevalent
classical public-key cryptography technique. However, the algorithm’s significant time complexity
places a substantial computational burden on the already resource-limited chip in the IoT terminal. In
lattice-based cryptography algorithms, the polynomial multiplication on the finite field is well known
as the most time-consuming process. Therefore, investigations into efficient methods for calculating
polynomial multiplication are essential for adopting these quantum-resistant lattice-based algorithms on
a low-profile IoT terminal. Number theoretic transform (NTT), a variant of fast Fourier transform (FFT),
Citation: Kieu-Do-Nguyen, B.; is a technique widely employed to accelerate polynomial multiplication on the finite field to achieve a
Nguyen, T.B.; Pham-Quoc, C.; Nghi,
subquadratic time complexity. This study presents an efficient FPGA-based implementation of number
H.P.; Tran, N.-T.; Hoang, T.-T.; Pham,
theoretic transform for the CRYSTAL Kyber, a lattice-based public-key cryptography algorithm. Our
C.-K. Compact and Low-Latency
hybrid design, which supports both forward and inverse NTT, is able run at high frequencies up to 417
FPGA-Based Number Theoretic
Transform Architecture for
MHz on a low-profile Artix7-XC7A100T and achieve a low latency of 1.10 µs while achieving state-of-
CRYSTALS Kyber Postquantum the-art hardware efficiency, consuming only 541-LUTs, 680 FFs, and four 18 Kb BRAMs. This is made
Cryptography Scheme. Information possible thanks to the newly proposed multilevel pipeline butterfly unit architecture in combination
2024, 15, 400. https://doi.org/ with employing an effective coefficient accessing pattern.
10.3390/info15070400
Keywords: field-programmable gate array; postquantum cryptography; number theoretic transform;
Academic Editor: Luca Ardito
CRYSTALS Kyber; lightweight design
Received: 8 May 2024
Revised: 26 June 2024
Accepted: 8 July 2024
Published: 11 July 2024 1. Introduction
Security is a crucial concern in computer systems for safeguarding confidential data
from the ever-changing landscape of cyber attacks. The conventional encryption methods,
Copyright: © 2024 by the authors.
which depend on intricate mathematical issues that pose challenges for classical computers
Licensee MDPI, Basel, Switzerland. to answer, effectively fulfill their purpose of safeguarding sensitive data from malevolent
This article is an open access article entities. Quantum computers, which rely on the principles of quantum physics, can disrupt
distributed under the terms and the equilibrium between defenders and attackers by significantly accelerating some cal-
conditions of the Creative Commons culations. By implementing Shor’s algorithm [1] and Grover’s algorithm [2] on quantum
Attribution (CC BY) license (https:// computers, standard public key cryptography (PKC) systems such as RSA [3] and elliptic
creativecommons.org/licenses/by/ curve cryptography (ECC) [4] may be vulnerable. Therefore, replacing them with postquan-
4.0/). tum cryptography (PQC) is necessary. CRYSTALS Kyber is a cryptographic algorithm
commonly referred to as Kyber. It is documented in the source [5]. The designation of Kyber
as a postquantum key encapsulation mechanism (KEM) protocol is due to its classification
within the family of lattice-based cryptography (LBC). The fundamental challenge that
Kyber addresses is module learning with error (M-LWE). Kyber is a cryptographic system
built on lattices and relies on the M-LWE problem. Kyber’s security is based on the premise
that addressing learning with error (LWE) problems is challenging, even for quantum com-
puters. The hardness of the material contributes to its resistance to quantum cryptanalysis.
Generally, systems based on the M-LWE offer enhanced security [6] and improved perfor-
mance compared to LWE schemes [7]. In this method, polynomials are typically defined
over the ring Zq [ x ]/( x n + 1), with the value n indicating the size of the public-key matrix
and the security levels. Kyber provides three sets of parameters: Kyber-512, Kyber-768,
and Kyber-1024. These factors define the dimensions of the mathematical entities employed
and impact the algorithm’s security and efficiency. The Kyber-768 algorithm provides an
optimal equilibrium between robustness and effectiveness.
Performing polynomial multiplication within the ring is an expensive calculation, both in
terms of hardware and software implementation. This holds true not only for the Kyber cryp-
tography algorithm but also for other cryptographic algorithms. Polynomial multiplication
can be computed using several algorithms, including the schoolbook algorithm [8–10], Karat-
suba algorithm [11,12], Toom–Cook algorithm [13,14], and the number theoretictTransform
(NTT) algorithm [15–17]. Of these algorithms, the schoolbook algorithm is the most basic
and straightforward. However, it has a quadratic complexity of O(n2 ), where n represents
the length of the polynomials. The Karatsuba algorithm employs the “divide-and-conquer”
approach by splitting the original polynomial into two sections, resulting in a computa-
tional complexity of O(n1.58 ). The Toom–Cook algorithm is an extension of the Karatsuba
algorithm. The Toom–Cook algorithm differs from the Karatsuba algorithm by dividing
the original polynomial into k parts. This division results in a complexity of O(nr ), where
r = logk (2k − 1). The discrete Fourier transform (DFT) can be employed for polynomial
multiplication. However, the computational complexity of directly computing the DFT is
O(n2 ), which is comparable to that of the schoolbook technique.
Over the past few decades, numerous efficient algorithms for calculating the DFT
have been introduced, which are currently referred to as the fast Fourier transform (FFT).
The concept of the FFT was initially formulated by Carl Friedrich Gauss in an unpublished
manuscript in 1805 [18]. The FFT gained significant recognition in 1965 when Cooley
and Tukey independently introduced the Cooley–Tukey method, a rapid technique for
DFT. The NTT is a specific instance of DFT that operates on a finite field [19]. Similarly,
the majority of FFT approaches can be utilized for NTT, leading to the development
of efficient algorithms for number theoretic transformations. From an implementation
perspective, the FFT and NTT are the most efficient techniques for calculating polynomial
multiplication of large degrees. This is because they have a quasilinear complexity of
O(n log n), which gives them a major advantage over other multiplication algorithms.
However, due to the fact that the FFT is carried out in the complex domain, the floating point
operations involved in the computation may introduce inaccuracies in rounding precision
when multiplying two polynomials with integer coefficients. Conversely, the calculations
involved in NTT computing exclusively utilize integers, circumventing any potential
precision issues.
Most of the polynomials used in lattice-based schemes are created using integer coeffi-
cients. It is evident that NTT is particularly well suited for implementing these schemes.
Most schemes that utilize lattices with algebraic structures tend to favor NTT-based multi-
plication because of its numerous benefits, in addition to its quasilinear complexity and
utilization of integer arithmetic. Nevertheless, certain lattice-based techniques are unable
to utilize NTT directly due to the parameter requirements imposed by NTT. NTT utilizes
negative wrapped convolution (NWC). The current settings of Kyber, which is the sole
KEM specified by the National Institute of Standards and Technology (NIST), only partially
Information 2024, 15, 400 3 of 15
satisfy the condition q ≡ 1 (mod n), therefore preventing the full utilization of the NTT
based on the number theoretic cryptography approach.
The process of encrypting and decrypting in CRYSTALS Kyber is significantly de-
pendent on performing polynomial multiplication within a certain mathematical ring.
In order to address these problems, Kyber employs NTT to carry out a crucial operation
with optimal efficiency. NTT functions as an optimization tool. It converts polynomials
from their standard form into an alternative representation that allows for significantly
faster multiplication. Using NTT greatly enhances the velocity and effectiveness of poly-
nomial multiplication in CRYSTALS Kyber. This results in accelerated key generation,
encapsulation, and decapsulation procedures. The term “improve NTT” is used to enhance
the overall performance of the CRYSTALS Kyber PQC scheme. In this study, we introduced
a new NTT architecture to expedite the polynomial matrix–vector multiplication in Kyber.
The supplied accelerator achieves reduced latency and resource consumption compared to
current works on the same platform.
2. Background Knowledge
FFT traditionally operates on the vector of complex numbers and can be utilized
to perform fast convolution with subquadratic time complexity. Fortunately, the same
algorithm can be applied on an n − 1-degree polynomial with coefficients in any finite field
Zq with the nth root of unity to achieve fast multiplication. The variant of the FFT that
operates on the finite field is called the NTT.
For CRYSTALS Kyber version 2 and later, the parameters are q = 3329 and n = 256,
and the polynomial is operated on the ring Z3329 [ x ]/( x256 + 1). However, since Z3329
only has up to 256 primitive roots of unity, namely, ζ = 17, an NTT-based multiplication
can only give us the result modulo x256 − 1. Therefore, the modulus is first factored as
x256 + 1 = ∏127 2
i =0 ( x − ζ
2br7 (i )+1 ) and the 256-coefficient polynomial is transformed to a
vector of 128 degree-1 polynomial modulo x2 − ζ 2br7 (i)+1 . The pointwise multiplication
process now becomes “degree-1-wise” multiplication.
Furthermore, the traditional iterative FFT algorithm (Cooley–Tukey) requires bit-
reversal preprocessing on both the forward and inverse transform; this introduces non-
negligible impacts on hardware utilization and performance. Therefore, in the reference
implementation of CRYSTALS Kyber, the forward transform (forward NTT) is performed
using the Cooley–Tukey algorithm and outputs the result in reverse order, which then
is used to perform pointwise multiplication in the same reversed order, and finally the
Gentleman–Sande algorithm, which takes in reversed order vector, is used. The final result
is output in natural order, as expected. The forward and inverse NTT algorithms are
described in Algorithm 1 and Algorithm 2, respectively.
Information 2024, 15, 400 5 of 15
where k is an integer from 1 to 127; the table of ζ is precomputed and stored in ROM.
Similarly, the butterfly calculation of the inverse NTT (INTT) is defined in Equation (2).
(
r [i ] = (r [i + len] + r [i ])2−1 mod q
(2)
r [i + len] = (r [i + len] − r [i ])ζ [k]2−1 mod q
The most significant bottleneck of the butterfly unit is the modular multiplication of
ζ [k]. If we perform this operation naïvely, the steps involve a 12 × 12 multiplication, which
is most efficiently achieved by using a DSP on FPGA and then somehow reducing this
24-bit product mod q.
On the reference implementation of Kyber, which is optimized for AVX2 (software)
version 2.0, this is achieved by performing the NTT and PWM in Montgomery form, replac-
ing all modular multiplications with Montgomery multiplications, and then converting
back to normal form at the INTT stage.
In this work, we developed two solutions with different hardware tradeoffs to perform
this particular modular multiplication efficiently.
t = y[11 : 9]ζ [k ]29 + y[8 : 6]ζ [k]26 + y[5 : 3]ζ [k]23 + y[2 : 0]ζ [k] (3)
Every term in Equation (3) has 1024 possible outcomes that are 12-bit each. Therefore,
we can precompute all these terms and store them as four 1024 × 12 ROMs indexed by
{i [0 : 2], k}, namely, luts9, luts6, luts3, and luts0, with the value, as described in Equation (4).
Equation (3) now becomes Equation (5).
luts9[{i [0 : 2], k}] := i × ζ [k]29 mod q
luts6[{i [0 : 2], k}] := i × ζ [k]26 mod q
(4)
luts3[{i [0 : 2], k}] := i × ζ [k]23 mod q
luts0[{i [0 : 2], k}] := i × ζ [k]20 mod q
t = luts9[{y[11 : 9], k}] + luts9[{y[8 : 6], k }] + luts9[{y[5 : 3], k}] + luts0[{y[2 : 0], k}] (5)
Information 2024, 15, 400 7 of 15
This method requires either four single-port 2 KB ROMs per butterfly unit or four
dual-port 2 KB ROMs per two butterfly units. Since Xilinx’s BRAM does have the capability
of being dual-port ROMs, this work used the latter forms to achieve higher throughput
and the maximum utilization of Xilinx’s BRAMs, where the two butterfly units share the
same ROMs as the look-up table. In our testing, the synthesis tool was able to automatically
fuse two identical single-port ROMs inside two instances of the butterfly unit into a single
dual-port ROM without any explicit RTL code change.
( x + y )2 ( x − y )2 ( x2 + 2xy + y2 ) − ( x2 − 2xy + y2 )
− = = xy (6)
4 4 4
Information 2024, 15, 400 8 of 15
Note that the depth of the ROM can be reduced by four by performing preprocessing
and postprocessing. The first reduction factor of 2 is achieved by taking advantage of
the symmetry of squares x2 = (− x )2 ; in the modular field, this means n2 = (q − n)2 =
min(n, q − n)2 mod q. Thus, we only need to look up the quarter square of min(n, q − n),
which is only up to (q − 1)/2 and is 11 bits wide.
The second reduction factor of 2 is achieved by further realizing the fact that the table
saves both the even (2n)2 /4 and odd (2n + 1)2 /4 entries; the odd entries (2n + 1)2 /4
can be computed using the even entries (2n)2 /4 by (2n + 1)2 /4 = (2n)2 /4 + n + 4−1 .
Therefore, the tables only need to store the even quarter squares, x2 for x up to q/4 (10-bit);
to compute x2 4−1 mod q, we only look up the truncated number n = x [10 : 1] and then
conditionally adding n + 4−1 if x is odd, as shown in Equation (9).
(
lutx2 mod q ( n ) : = n2 mod q, 0 ≤ n < 210
(9)
x [10 : 0]2 4−1 mod q = lutx2 mod q ( x [10 : 1]) + x [0]( x [10 : 1] + 4−1 ) mod q
Finally, with the preprocessing and postprocessing in place, we only need a single 1024
× 12 dual-port ROM to store the quarter squares, which fits perfectly in a single 2 KB true
dual-port BRAM of Xilinx FPGA. This variant of the butterfly unit is implemented as shown
in Figure 4. The modular multiplication is handled by a quarter square mul-mod submod-
ule, implemented as whon in Figure 5. The wuarter square dual-port ROM stores the value
of lutx2 mod q described in Equation (9). For simplicity, all trivial reductions modulo q are
omitted. This method requires a single true dual-port 2 KB ROM per butterfly unit.
The zeta[k] multiplier submodule can be any modular multiplication module, such as
the ones in Figure 3 or Figure 5. With intt_sel signal, when asserted, the butterfly unit acts
as an inverse NTT butterfly unit, as in Equation (10).
Similarly, when intt_sel is deasserted, the butterfly unit acts as a forward NTT butterfly
unit, as in Equation (11).
xout
= xin + yin ζ [k]
yout = xin − yin ζ [k] (11)
intt_sel =0
Information 2024, 15, 400 10 of 15
Note that the multiplications by the inverse of 2 modulo q in Equation (10), a2−1 ,
can simply be achieved by realizing that, if a is even, then a2−1 ≡ a ≫ 1 (mod q),
where a ≫ 1 is the result of the unsigned binary presentation of a shifted right by 1.
It can also be defined as a ≫ 1 := ⌊ 2a ⌋. When a is odd, then a + q is even, and thus
a2−1 ≡ a2−1 + q2−1 ≡ ( a + q)2−1 ≡ ( a + q) ≫ 1 (mod q).
Therefore, all modular multiplications by 2−1 are efficient, as shown in Equation (12).
a2−1 mod q = ( a + qa[0]) ≫ 1 mod q (12)
4. Experimental Results
4.1. Experimental Setup
To verify the functionality and the performance of the NTT core under ideal conditions,
we created a test bench with the core top as the unit test and interfaced with it using an
AXI-like handshake mechanism. For the input side, after reasserting reset, we provided
the core with a continuous stream of input data as the coefficient of the polynomial on
which we needed the core to operate. Eventually, the core was finished and a data_out_-
valid signal was asserted to provide a stream of output data as the transformed result.
The operation latency was as the number of cycles between the first in_data_valid from the
test bench assertion up until the last assertion of out_data_valid (the previous result data).
Furthermore, the result was then checked with the expected result to make sure the core
was working correctly; this was particularly handy when we were in the process of rapidly
tweaking the core to fix the violated path to achieve higher frequency.
Information 2024, 15, 400 11 of 15
The top verification module is shown in Figure 8. A UART transceiver was used
receive the test data and send back the result, which was then checked against the expected
result. The other end of the UART communication was a desktop computer running a
UART test driver software that transmitted test polynomials randomly and then checked for
the FPGA’s NTT result for any error. The test driver algorithm is described in Algorithm 3.
wall-clock time was omitted for simplicity since it differs from NTT by a few clock cycles at
most. The cycles reported for “Both NTT and INTT” designs are in NTT / INTT format
(i.e., 126/127 means 126 clock cycles for NTT and 127 clock cycles for INTT), unless both of
them are the same or the work did not report it clearly. The ATP metric, where a smaller
value indicates better hardware utilization efficiency, is calculated as the product of time
in µs and the respective resource utilized, for instance, LUT ATP := Time (µs) × LUTs and
FF ATP := Time (µs) × FFs. For the methods that have a configurable number of parallel
butterfly units, the notation [c] ×n indicates the n-parallel butterfly units design of the
method [c].
In Table 1, for the NTT-only cores, the α core uses 429 LUTs, 538 FFs, and 4 BRAMs
with two parallel butterfly units (unroll factor 2), achieving a hardware efficiency of 441.87
LUT ATP and 554.14 FF ATP with 459 cycles of latency. The β core, with a single butterfly,
using only 379 LUTs, 414 FFs, with 910 cycles latency, achieved 792.11 LUT ATP and 865.25
FF ATP, which is worse than α but only uses a single BRAM as the ROM quarter square
look-up table for the butterfly unit. Note that it is possible to not unroll the design α and
run with just a single butterfly unit; however, this still requires four BRAMs due to the
nature of the approach, that is, four ROMs shared between two butterflies via two different
ports. In the β design, each butterfly has a dedicated ROM; hence, there are more rooms
to scale in the quarter square method. And, finally, the hybrid core γ, which supports
both NTT and INTT and is based on the α design, only requires slightly more resources at
541 LUTs (595.10 ATP), 680 FFs (748.00 ATP,) and two-cycle latency penalty, where both
INTT and NTT take 461 cycles to finish.
For the NTT standalone cores, although a recent method [23] (FNTT) successfully
eliminated the usage of any DSP or BRAM, the approach heavily hinders the LUT and FF
utilization and frequency, which can only run at about 100 MHz even on the higher-end
Virtex-7 platform. Our α design is able to achieve more than 10× lower latency and more
Information 2024, 15, 400 13 of 15
than 200× smaller ATPs than [23] (FNTT). Furthermore, even the β design has almost 7×
better latency and more than 100× better ATPs than the FNTT method [23] with the cost of
only a single additional 18 Kb BRAM.
For the hybrid design (NTT- and INTT-capable) comparison, thanks to the highly
pipelined nature of the butterfly unit, our γ design can run at a higher frequency (417 MHz)
on the Artix-7 platform than the other methods, including [22] (300 MHz), which uses a
much more recent and faster Zynq Ultrascale+ platform. The newly proposed modular
multiplication methods can achieve better latency (1.10 µs) than all other methods except
the fastest ×16 configuration in [28] (0.40 µs). However, [28] uses nearly 17.57× more LUTs
and 17.5× more BRAMs than ours while being only 2.75× faster. Therefore, we achieved a
6.39× more efficient LUT ATP and 1.44× more efficient FF ATP than [28] ×16.
Compared to [26], which employs the same parallel butterflies strategy, our method
achieved 3.71× and 1.37× better LUT and FF ATP in the ×1 configuration and 5.10× and
1.60× better LUT and FF ATP in the ×2 configuration.
Our design also achieved about 2× better LUT ATP and 1.5× better FF ATP than [21,27].
In [27], the authors used two fewer BRAMs but at the cost of four DSPs and much higher
LUTs and FFs usage, and latency was higher. The same apply for [22,28], which use fewer
BRAMs at the expense of significantly more LUTs/FFs and two DSPs; hence, these methods
are less efficient than our work in LUT and FF ATP.
In [24], the authors did not report the exact maximum frequency that their design is
able to run. Assuming that it could run at the same frequency as our γ design at 417 MHz,
which is a highly unlikely scenario given the nearly 5000 LUTs and 35 BRAMs utilized, the
method in [24] would have 3.66× better latency than γ while using 9.18× more LUTs and
2.38× more FFs and thus would be more efficient in terms of FFs ATP but significantly
worse in term of LUTs ATP.
5. Conclusions
In this paper, we proposed a new NTT architecture with a unique data access pattern
along with a reorder buffer, making it suitable for a single, shallow-depth, simple and dual-
port SRAM; in the FPGA architecture, this was mapped to abundant LUT-RAM resources.
This allowed us to unroll the inner loop of the NTT by a factor of two and beyond (×4, ×8)
to achieve lower latency without using significantly more resources in the control path.
Furthermore, we proposed two new look-up table-based butterfly units, namely, multipli-
cationless and quarter square multiplication, both with different tradeoffs in LUT/BRAM
usage and eliminate the need for computing the full 12 × 12 product and subsequently the
need for DSP utilization. Thanks to the new butterfly units, our whole design runs at a
higher frequency (417 MHz), has good latency while maintaining excellent hardware usage
efficiency (LUT/FF ATP), and is suitable for embedding in high-performance, lightweight
CRYSTALS Kyber accelerators for more efficient polynomial multiplication.
References
1. Shor, P.W. Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. SIAM J. Sci. Statist. Com-
put. 1997, 26, 1484–1509.
2. Grover, L.K. A Fast Quantum Mechanical Algorithm for Database Search. In Proceedings of the Twenty-Eighth Annual ACM
Symposium on Theory of Computing, Philadelphia, PA, USA, 22–24 May 1996; pp. 212–219.
3. Rivest, R.L.; Shamir, A.; Adleman, L. A Method for Obtaining Digital Signatures and Public-key Cryptosystems. Commun. ACM
1978, 21, 120–126.
4. Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of the Advances in Cryptology (CRYPTO), Santa Barbara, CA,
USA, 18–22 August 1985; pp. 417–426.
5. Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P. .; Seiler, G.; Stehle, D.D. CRYSTALS—Kyber:
A CCA-Secure Module-Lattice-Based KEM. In Proceedings of the European Symposium on Security and Privacy (EuroS&P),
London, UK, 24–26 April 2018; pp. 353–367.
6. Lyubashevsky, V.; Peikert, C.; Regev, O. On Ideal Lattices and Learning with Errors Over Rings. In Proceedings of the Advances
in Cryptology (EUROCRYPT), Santa Barbara, CA, USA, 15–19 August 2010; pp. 1–23.
7. Lindner, R.; Peikert, C. Better Key Sizes (and Attacks) for LWE-Based Encryption. In Proceedings of the Topics in Cryptology
(CT-RSA), San Francisco, CA, USA, 14–18 February 2011; pp. 319–339.
8. Das, M.; Jajodia, B.B. Hardware Design of Optimized Large Integer Schoolbook Polynomial Multiplications on FPGA. In Proceedings
of the International SoC Design Conference (ISOCC), Gangneung-si, Republic of Korea, 19–22 October 2022; pp. 65–66.
9. Zhang, Y.; Cui, Y.; Ni, Z.; Kundi, D.; Liu, D.; Liu, W. A Lightweight and Efficient Schoolbook Polynomial Multiplier for Saber.
In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022;
pp. 2251–2255.
10. Birgani, Y.A.; Timarchi, S.; Khalid, A. Area-Time-Efficient Scalable Schoolbook Polynomial Multiplier for Lattice-Based Cryptog-
raphy. IEEE Trans. Circ. Syst. II Express Briefs (TCAS-II) 2022, 69, 5079–5083.
11. Yang, S.; Liu, D.; Hu, A.; Li, A.; Zhang, J.; Li, X.; Lu, J.; Mo, C. An Instruction-configurable Post-quantum Cryptographic Processor
Towards NTRU. In Proceedings of the Asian Hardware Oriented Security and Trust Symposium (AsianHOST), Singapore, 14–16
December 2022; pp. 1–6.
12. Wong, Z.Y.; Wong, D.C.K.; Lee, W.K.; Mok, K.M.; Yap, W.S.; Khalid, A. KaratSaber: New Speed Records for Saber Polynomial
Multiplication Using Efficient Karatsuba FPGA Architecture. IEEE Trans. Comput. 2023, 72, 1830–1842.
13. Ghosh, A.; Mera, J.M.B.; Karmakar, A.; Das, D.; Ghosh, S.; Verbauwhede, I.; Sen, S. A 334 µW 0.158 mm2 ASIC for Post-Quantum
Key-Encapsulation Mechanism Saber With Low-Latency Striding Toom–Cook Multiplication. IEEE J. Solid-State Circ. 2023,
58, 2383–2398.
14. Wang, J.; Yang, C.; Zhang, F.; Meng, Y.; Su, Y. TCPM: A Reconfigurable and Efficient Toom-Cook-Based Polynomial Multiplier Over
Rings Using a Novel Compressed Postprocessing Algorithm. IEEE Trans. Very Large Scale Integr. (Vlsi) Syst. 2023, 31, 1153–1166.
15. Ye, Z.; Cheung, R.; Huang, K. PipeNTT: A Pipelined Number Theoretic Transform Architecture. IEEE Trans. Circ. Syst. II Express
Briefs (TCAS-II) 2022, 69, 4068–4072.
16. Xin, M.; Mingyong; Xu, C.; Huang, K.; Yu, H.; Yao, H.; Jiang, X.; Liu, D. Implementation of Number Theoretic Transform Unit
for Polynomial Multiplication of Lattice-based Cryptography. In Proceedings of the International Conference on Consumer
Electronics and Computer Engineering (ICCECE), Guangzhou, China, 14–16 January 2022; pp. 323–327.
17. Nguyen, T.H.; Binh, K.D.N.; Pham, C.K.; Hoang, T.T. High-Speed NTT Accelerator for CRYSTAL-Kyber and CRYSTAL-Dilithium.
IEEE Access 2024, 12, 34918–34930.
18. Gauss, C.F. Nachlass: Theoria Interpolationis Methodo Nova Tractata. Carl Friedrich Gauss 1866, 3, 265–303.
19. Pollard, J.M. The Fast Fourier Transform in a Finite Field. Math. Comput. 1971, 25, 365–374.
20. Cooley, J.W.; Tukey, J.W. An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Comput. 1965, 19, 297–301.
21. Zhang, C.; Liu, D.; Liu, X.; Zou, X.; Niu, G.; Liu, B.; Jiang, Q. Towards Efficient Hardware Implementation of NTT for Kyber on
FPGAs. In Proceedings of the 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Republic of Korea,
22–28 May 2021.
22. Ni, Z.; Khalid, A.; Liu, W.; O’Neill, M. Towards a Lightweight CRYSTALS-Kyber in FPGAs: An Ultra-lightweight BRAM-free NTT
Core. In Proceedings of the IEEE International Symposium on Circuits and Systems 2023, Monterey, CA, USA, 21–25 May 2023.
23. Imran, M.; Khan, S.; Khalid, A.; Rafferty, C.; Shah, Y.A.; Pagliarini, S.; Rashid, M.; O’Neill, M. Evaluating NTT/INTT Implement-
ation Styles for Post-Quantum Cryptography. IEEE Embed. Syst. Lett. 2024, early access. https://doi.org/10.1109/LES.2024.34105
16.
24. Yang, H.; Chen, R.; Wang, Q.; Wu, Z.; Peng, W. Hardware Acceleration of NTT-Based Polynomial Multiplication in CRYSTALS-
Kyber. In Information Security and Cryptology; Ge, C., Yung, M., Eds.; Springer: Singapore, 2024; pp. 111–129.
25. Longa, P.; Naehrig, M. Speeding Up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography. Cryptology
ePrint Archive, Paper 2016/504, 2016. Available online: https://eprint.iacr.org/2016/504 (accessed on 7 July 2024. )
26. Niasar, M.B.; Azarderakhsh, R.; Kermani, M.M. Instruction-Set Accelerated Implementation of CRYSTALS-Kyber. IEEE Trans.
Circ. Syst. I Regul. Pap. (TCAS-I) 2021, 68, 4648–4659.
Information 2024, 15, 400 15 of 15
27. Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. High-Speed NTT-based Polynomial Multiplication Accelerator for
Post-Quantum Cryptography. In Proceedings of the 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), Lyngby,
Denmark, 14–16 June 2021.
28. Yaman, F.; Mert, A.C.; Öztürk, E.; Savas, E. A Hardware Accelerator for Polynomial Multiplication Operation of CRYSTALS-
KYBER PQC Scheme. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble,
France, 1–5 February 2021.
29. Xing, Y.; Li, S. A Compact Hardware Implementation of CCA-Secure Key Exchange Mechanism CRYSTALS-KYBER on FPGA.
IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 328–356
30. Matteo, S.D.; Sarno, I.; Saponara, S. CRYPHTOR: A Memory-Unified NTT-Based Hardware Accelerator for Post-Quantum
CRYSTALS Algorithms. IEEE Access 2024, 12, 25501–25511. https://doi.org/10.1109/ACCESS.2024.3367109.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.