Design and Implementation of a Fast and Scalable NTT-Based Polynomial Multiplier Architecture

Paper 2019/109

Design and Implementation of a Fast and Scalable NTT-Based Polynomial Multiplier Architecture

Ahmet Can Mert, Erdinc Ozturk, and Erkay Savas

Abstract

In this paper, we present an optimized FPGA implementation of a novel, fast and highly parallelized NTT-based polynomial multiplier architecture, which proves to be effective as an accelerator for lattice-based homomorphic cryptographic schemes. As input-output (I/O) operations are as time-consuming as NTT operations during homomorphic computations in a host processor/accelerator setting, instead of achieving the fastest NTT implementation possible on the target FPGA, we focus on a balanced time performance between the NTT and I/O operations. Even with this goal, we achieved the fastest NTT implementation in literature, to the best of our knowledge. For proof of concept, we utilize our architecture in a framework for Fan-Vercauteren (FV) homomorphic encryption scheme, utilizing a hardware/software co-design approach, in which NTT operations are offloaded to the accelerator while the rest of operations in the FV scheme are executed in software running on an off-the-shelf desktop computer. Specifically, our framework is optimized to accelerate Simple Encrypted Arithmetic Library (SEAL), developed by the Cryptography Research Group at Microsoft Research, for the FV encryption scheme, where forward and inverse NTT operations are utilized extensively for large degree polynomial multiplications. The hardware part of the proposed framework targets XILINX VIRTEX-7 FPGA device, which communicates with its software part via a PCIe connection. Offloading forward/inverse NTT and coefficient multiplication operations on FPGA, taking into account the time expended on I/O operations, the proposed framework achieves almost x11 latency speedup for the offloaded operations compared to their pure software implementations. With careful pipelining, overlapping I/O operations with actual polynomial multiplication computations, and assuming one of the operands for the polynomial multiplication operation is already inside the FPGA (valid assumption for encrypt/decrypt operations for homomorphic applications), we achieved a throughput of almost 800k polynomial multiplications per second, for polynomials of degree 1024 with 32-bit coefficients.

Metadata

Available format(s): PDF
Category: Implementation
Publication info: Preprint. MINOR revision.
Keywords: Number Theoretic Transform Large-Degree Polynomial Multiplication Fan-Vercauteren SEAL FPGA
Contact author(s): acmert @ sabanciuniv edu
erdinco @ sabanciuniv edu
erkays @ sabanciuniv edu
History: 2019-02-05: received
Short URL: https://ia.cr/2019/109
License: CC BY

BibTeX

@misc{cryptoeprint:2019/109,
      author = {Ahmet Can Mert and Erdinc Ozturk and Erkay Savas},
      title = {Design and Implementation of a Fast and Scalable {NTT}-Based Polynomial Multiplier Architecture},
      howpublished = {Cryptology {ePrint} Archive, Paper 2019/109},
      year = {2019},
      url = {https://eprint.iacr.org/2019/109}
}