Fixed-Point CNN For FPGA
Abstract — Modern mobile neural networks with a reduced number of weights and parameters do a good job with image classification tasks, but even they may be too complex to be implemented in an FPGA for video processing tasks. The article proposes a neural network architecture for the practical task of recognizing images from a camera, which has several advantages in terms of speed. This is achieved by reducing the number of weights, moving from floating-point to fixed-point arithmetic, and through a number of hardware-level optimizations associated with storing weights in blocks, a shift register, and an adjustable number of convolutional blocks that work in parallel. The article also proposes methods for adapting an existing dataset to a different task. As the experiments show, the proposed neural network copes well with real-time video processing even on cheap FPGAs.

Keywords—Neural network hardware; Field programmable gate arrays; Fixed-point arithmetic; 2D convolution

I. INTRODUCTION

Recent research in artificial neural networks has demonstrated their ability to perform well on a wide range of tasks [1], [2]. Most modern neural network architectures for computer vision include convolutional layers and are thus called convolutional neural networks (CNNs). They have high computational requirements. However, there is a compelling need for the use of deep convolutional neural networks on mobile devices and in embedded systems. This is particularly important for video processing in, for example, autonomous cars and medical devices [3].

The following properties of many modern high-performing CNN architectures make their hardware implementation feasible:

• high regularity: all commonly used layers have a similar structure (Conv3x3, Conv1x1, MaxPooling, FullyConnected, GlobalAvgPooling);
• typically small size of convolutional filters: 3 × 3;
• ReLU activation function (comparison of the value with zero): easier to compute than the previously used Sigmoid and Tanh functions.

Due to this high regularity, the size of the network can be easily varied, for example, by changing the number of convolutional blocks. In the case of field programmable gate arrays (FPGAs), this makes it possible to program the network on different types of FPGAs, providing different processing speeds. For example, implementing a higher number of convolutional blocks on an FPGA can directly lead to a speed-up in processing.

A related direction in neural network research considers adapting NNs for use on mobile devices [4]. Mobile networks typically have a reduced number of weights and require a relatively small number of arithmetic operations. However, they are still executed at the software level and use floating-point calculations. For some tasks, such as real-time video analysis that requires processing 30 frames per second, mobile networks may still not be fast enough without further optimization. In order to use an already trained neural network in a mobile device, a set of optimizations can be applied to speed up computation. There exist a number of approaches to do so, including weight compression and computation using low-bit data representations. Since hardware requirements for neural networks keep increasing, there is a need for the design and development of specialized hardware blocks for use in ASICs and FPGAs. The speed-up can be achieved by the following:

• hardware implementation of the convolution operation, which is faster than software convolution;
• using fixed-point arithmetic instead of floating-point calculations;
• reducing the network size while preserving the performance;
• modifying the structure of the network architecture while preserving the same level of performance and decreasing the footprint of the hardware implementation and the stored weights.

For example, Zhang C. et al. [6] quantitatively analyzed computing throughput and required memory bandwidth for various CNNs using optimization techniques such as loop tiling and transformation. This allowed their implementation to achieve a peak performance of 61.62 GFLOPS. Qiu J. et al. [5] proposed an FPGA implementation of pre-trained deep neural networks from the VGG family. They used dynamic-precision quantization with a 48-bit data representation and singular value decomposition to reduce the size of fully-connected layers,
II. METHODS

A. Implementation requirements

To demonstrate our approach, we implement a solution for the problem of recognizing handwritten digits received from a camera in real time. The results are displayed on an electronic LED screen. The minimal speed of digit recognition should exceed 30 FPS, that is, the neural network should be able to process a single image in 33 ms. The resulting hardware implementation should be ready for transfer to a separate custom VLSI device for mass production.

B. Hardware specifications

We use the compact development board DE0-Nano due to the following reasons:

Fig. 2. Scheme for connecting camera and screen modules to the DE0-Nano board. Pins with the same name are connected. Pins marked as 'x' remain unconnected.

C. Dataset preparation

The MNIST dataset for handwritten digit recognition [9] is widely used in the computer vision community. However, it is not well suited for training a neural network in our application, since it differs greatly from the camera images (Fig. 3). Major differences include:

• MNIST images are light digits over a dark background, opposite to those from the camera feed;
• the camera produces color images, while MNIST is grayscale;
• the size of a MNIST image is 28 × 28 pixels, while the camera image size is 320 × 240 pixels;
• unlike the centrally placed digits and homogeneous background in MNIST images, digits in camera images can be shifted and slightly rotated, sometimes with noise in the background;
• MNIST does not have a separate class of images without digits.

In order to use MNIST images for training a neural network, on-the-fly data augmentation is used. This method implies that during the creation of the next mini-batch for training, a set of different filters is arbitrarily applied to each image. This technique is used to easily increase the dataset size, as well as to bring the images to the required form, as in our case.

The following filter set was used for augmenting MNIST images (a sketch of such a pipeline is given below):

• color inversion;
• random rotation of up to 10 degrees in both directions;
• random expansion or reduction of an image by 4 pixels;
• random variation of image intensity (from 0 to 80);
• adding random noise from 0% to 10%.

Optionally, images from the camera can be mixed into mini-batches.
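As an illustration of this augmentation scheme, a minimal Python/NumPy sketch is given below. It assumes 28 × 28 images with values in [0, 1]; the function names, the SciPy-based rotation and zoom, and the exact parameter handling are our own choices, not taken from the original training code.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def augment(img, rng=np.random):
    """Apply the MNIST-adaptation filters listed above to one 28x28 image in [0, 1]."""
    # color inversion: camera digits are dark on a light background
    img = 1.0 - img
    # random rotation of up to 10 degrees in either direction
    img = rotate(img, rng.uniform(-10, 10), reshape=False, mode='nearest')
    # random expansion or reduction by up to 4 pixels, then pad/crop back to 28x28
    size = 28 + rng.randint(-4, 5)
    scaled = zoom(img, size / 28.0)
    canvas = np.ones((28, 28)) * scaled.max()          # light background fill
    h, w = min(28, scaled.shape[0]), min(28, scaled.shape[1])
    canvas[:h, :w] = scaled[:h, :w]
    img = canvas
    # random intensity shift (0..80 on the 0..255 scale) and 0..10% additive noise
    img = img + rng.uniform(0, 80) / 255.0
    img = img + rng.uniform(0, 0.1) * rng.rand(28, 28)
    return np.clip(img, 0.0, 1.0)

def minibatches(images, labels, batch_size=32, rng=np.random):
    """Yield mini-batches with the filters applied on the fly; camera frames could be mixed in here."""
    while True:
        idx = rng.choice(len(images), batch_size, replace=False)
        yield np.stack([augment(images[i], rng) for i in idx]), labels[idx]
```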
The major disadvantage of this model is the number of weights, whose total size exceeds the FPGA capacity. Besides, the exchange with external memory imposes additional time costs. Moreover, this model involves a "bias" term, which also has to be stored, requires additional processing blocks, and tends to accumulate error if implemented in a fixed-point representation. Therefore, we propose a further modification of this architecture that we call the Low Weights Digit Detector (LWDD).

At the layer input is a two-dimensional matrix (the original picture) of size 28 × 28 with values from [0; 1). It is also known that if a ∈ [−1, 1] and b ∈ [−1, 1], then a · b ∈ [−1, 1]. For a 3 × 3 convolution, the value of a certain pixel (i, j) in the second layer can be calculated as follows:

x2(i, j) = Σ (k = −1..1) Σ (l = −1..1) w(k, l) · x1(i + k, j + l)    (2)

where x1 is the input image, x2 is the output of the convolution, and w is the 3 × 3 weight matrix.
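To make equation (2) and the value-range argument concrete, the following sketch computes one output pixel as a 9-element dot product in fixed point. The number of fractional bits and all names are illustrative assumptions, not parameters of the actual design.

```python
import numpy as np

FRAC_BITS = 11                       # illustrative number of fractional bits
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Quantize a real value from [-1, 1] to a signed fixed-point integer."""
    return int(round(x * SCALE))

def conv3x3_pixel(image_fx, weights_fx, i, j):
    """Value of pixel (i, j) in the next layer, as in equation (2):
    a sum of 9 products of weights and the neighbouring input pixels."""
    acc = 0
    for k in range(-1, 2):
        for l in range(-1, 2):
            acc += weights_fx[k + 1][l + 1] * image_fx[i + k][j + l]
    # one product of two fixed-point numbers carries 2*FRAC_BITS fractional bits,
    # so the accumulated sum is rescaled once at the end
    return acc >> FRAC_BITS

# toy example: a 28x28 input with values in [0, 1) and a 3x3 kernel in [-1, 1]
rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernel = rng.uniform(-1, 1, (3, 3))
image_fx = [[to_fixed(v) for v in row] for row in image]
kernel_fx = [[to_fixed(v) for v in row] for row in kernel]
print(conv3x3_pixel(image_fx, kernel_fx, 10, 10) / SCALE)   # close to the float result
```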
to happen when the predicted value is compared with the prediction by the software implementation, rather than with the true image label. To validate the "accuracy" of predictions, we run all test images through both the floating-point software implementation and the fixed-point software implementation (or the Verilog benchmark) and then compare predictions. The ratio of mismatches to the total number of tests is a measure of "inaccuracy" for the given width of weights and intermediate results. We choose the bit width at which the number of errors is 0.

When using fixed-point calculations with convolution blocks, two different strategies are possible:

• rounding after each elementary operation of addition and multiplication;
• calculation with full accuracy and rounding at the very end of the convolution operation.

Two experiments are carried out to determine the more effective approach. To achieve zero difference from the floating-point model, the number storage requires 17 bits in the case of rounding after each operation, and only 12 bits in the case of rounding at the end.

Rounding after each operation slightly increases performance, but significantly increases memory overhead. Therefore, it is advantageous to perform rounding after the convolution block.
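A minimal software sketch of this experiment is given below: the same 9-element sum is computed with rounding after every elementary operation and with a single rounding at the end, and the "inaccuracy" is the share of test images whose fixed-point prediction differs from the floating-point one. All function names are hypothetical.

```python
def quantize(x, frac_bits):
    """Round a real value to the nearest representable fixed-point value."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def conv9_round_each(pixels, weights, frac_bits):
    """Strategy 1: round after every multiplication and addition."""
    acc = 0.0
    for p, w in zip(pixels, weights):
        acc = quantize(acc + quantize(p * w, frac_bits), frac_bits)
    return acc

def conv9_round_once(pixels, weights, frac_bits):
    """Strategy 2: accumulate with full precision, round once at the end."""
    return quantize(sum(p * w for p, w in zip(pixels, weights)), frac_bits)

def mismatch_rate(predict_float, predict_fixed, test_images):
    """'Inaccuracy' as defined above: share of images where the fixed-point
    prediction differs from the floating-point prediction (not from the label)."""
    wrong = sum(predict_float(img) != predict_fixed(img) for img in test_images)
    return wrong / len(test_images)
```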
there is downloaded through the controller to the small memory unit for further use. In the hardware realization, not all layers of the neural network under test are used; some of them are replaced by other functions. For example, there is no ZeroPadding layer; instead, a module for intermediate image edge detection is applied, which allows reducing chip memory usage. The GlobalMaxPooling layer is replaced by a function of the Convolution layer that immediately obtains the GlobalMaxPooling result by finding the largest value in the intermediate image. The rest of the layers are implemented as separate modules. Since the Convolution and Dense layers can use convolutional blocks for calculations, both of them have access to these blocks. The modules contain the ReLU activation function, which is used as needed. In the last layer, the Softmax activation function is applied. It is implemented as a traditional Maximum, because the position of the neuron with the maximum value is the same for both functions. To implement the neural network, a specialized Convolution block is used, which performs a 3 × 3 convolution in one clock cycle. This block computes a scalar product of vectors and contains 9 multiplications and 8 additions. The same block is used for calculations in the fully connected Dense layer by splitting the entire set of additions and multiplications into blocks of 9 neurons.
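The reuse of the same 9-wide scalar-product block for the Dense layer can be modeled as follows; this is our own simplified reading of the splitting into groups of nine, not the actual hardware description.

```python
def scalar_product_9(xs, ws):
    """Model of the Convolution block: 9 multiplications and 8 additions per call."""
    assert len(xs) == 9 and len(ws) == 9
    return sum(x * w for x, w in zip(xs, ws))

def dense_neuron(inputs, weights):
    """One fully connected neuron computed by repeatedly reusing the 9-wide block."""
    # pad the input vector up to a multiple of 9 with zeros
    pad = (-len(inputs)) % 9
    xs = list(inputs) + [0] * pad
    ws = list(weights) + [0] * pad
    acc = 0
    for i in range(0, len(xs), 9):
        acc += scalar_product_9(xs[i:i + 9], ws[i:i + 9])   # one block invocation per group
    return acc
```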
To increase the performance, a number of techniques are applied that make it possible to reduce the number of cycles required for one image classification.

1) Increasing the number of convolution blocks: If there is enough free space in the FPGA, we can improve the performance by increasing the number of convolution blocks, thereby multiplying throughput. Consider the second convolutional block in the proposed neural network LWDD. There are 4 images of size 28 × 28 at the layer input, and 16 blocks of weights are given. To calculate the set of outputs for this layer, also consisting of 4 images, we have to perform four multiplications of the same set of pixels by different sets of weights. If there is only one convolutional block, this takes at least 4 cycles, but if there are 4 such blocks, then only one clock cycle is needed; thus, the Convolution layer calculation speeds up 4 times.
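A back-of-the-envelope model of this speed-up, under the simplifying assumption that one convolution block multiplies one 3 × 3 pixel window by one weight set per clock cycle:

```python
from math import ceil

def window_cycles(n_weight_sets, n_conv_blocks):
    """Clock cycles to multiply one 3x3 pixel window by all weight sets,
    when n_conv_blocks scalar-product units run in parallel."""
    return ceil(n_weight_sets / n_conv_blocks)

# second LWDD layer from the text: the same window is multiplied by 4 weight sets
for blocks in (1, 2, 4):
    print(blocks, "block(s):", window_cycles(4, blocks), "cycle(s) per window")
# 1 block -> 4 cycles, 4 blocks -> 1 cycle, i.e. the layer speeds up about 4x
```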
2) Shift register: To perform an elementary convolution operation, we have to get the values of 9 neighboring pixels from an input image, then the next 9 pixels, 6 of which have already been received in the previous step (see Fig. 5). To shorten the time needed to fetch the necessary data, a shift register is developed that accepts new data at its input and at the same time "pushes out" old data. Thus, each step requires only 3 new values instead of 9.
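A software analogue of the shift register, assuming a row-wise scan of the image: six of the nine window values are kept between neighbouring positions, so only three new pixels are shifted in per step. The data layout is a simplification of the hardware, intended only to show the reuse.

```python
from collections import deque

def sliding_windows(image):
    """Yield 3x3 windows along a row triple, shifting in 3 new pixels per step."""
    rows, cols = len(image), len(image[0])
    for top in range(rows - 2):
        # the "shift register": three columns of three pixels each
        window = deque(maxlen=3)
        for col in range(cols):
            new_column = (image[top][col], image[top + 1][col], image[top + 2][col])
            window.append(new_column)          # pushes out the oldest column automatically
            if len(window) == 3:
                yield top, col - 2, window     # window covers columns col-2..col

# usage: after the first two columns, each step loads only 3 fresh values, not 9
img = [[r * 10 + c for c in range(6)] for r in range(4)]
for i, j, win in sliding_windows(img):
    pass  # feed the 9 values of `win` to the convolution block here
```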
3) Storing all data for one Convolution operation at the same address: When we fetch the data necessary for calculations, one clock cycle is used for each value. Therefore, in order to reduce the time spent on loading the required data, as well as for convenience of access, prior to putting them into the internal memory of the FPGA, the data are stacked in blocks of 9 values, after which they are accessible at a single address. With such a memory arrangement, we can perform the extraction of weights in one clock cycle and, thus, speed up calculations for the convolutional and fully connected layers. An example is shown in Fig. 6.
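The memory arrangement described above can be sketched as packing nine fixed-point words into one wide memory word, so that a single read returns a complete 3 × 3 block. The 12-bit width below is only an example value.

```python
WIDTH = 12                      # example bit width of one fixed-point weight
MASK = (1 << WIDTH) - 1

def pack9(values):
    """Concatenate nine WIDTH-bit words into one 9*WIDTH-bit memory word."""
    word = 0
    for v in values:
        word = (word << WIDTH) | (v & MASK)
    return word

def unpack9(word):
    """Split one wide memory word back into nine WIDTH-bit values."""
    values = [(word >> (WIDTH * i)) & MASK for i in range(9)]
    return list(reversed(values))

weights = [1, 2, 3, 4, 5, 6, 7, 8, 4095]
assert unpack9(pack9(weights)) == weights
# one "address" now yields a full 3x3 weight block in a single read
```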
III. PERFORMANCE RESULTS

The proposed design is successfully implemented in an FPGA. Details on the number of logic cells and memory usage are given in Table I. These numerical results demonstrate the low hardware requirements of the proposed model architecture. Moreover, for this implementation, the depth of the neural network can be further increased without exhausting the resources of this specific hardware.

In this implementation, input images are processed in real time, and the original image is displayed along with the result. Classification of one image requires about 230 thousand clock cycles, and we achieve an overall processing speed with a large margin over 150 frames/sec.

If performance is insufficient and spare logic cells are available, we can speed up calculations by adding convolutional blocks that perform computations in parallel. Table II shows the number of clock cycles required to process one frame using different numbers of convolutional blocks. Table III shows the total resources required for the entire FPGA-based project implementation for different weight dimensions and different numbers of convolution blocks. Missing values denote the cases that Quartus could not synthesize due to the lack of FPGA resources. Source code for both the software and hardware implementations, as well as a video demonstrating real-time digit classification from a mobile camera video feed, is available on GitHub [11].

TABLE II. CLOCK CYCLES PER CONVOLUTION BLOCK

Convolution blocks     | Clock cycles per 1 frame | Processing speed-up
1 convolution block    | 236746                   | -
2 convolution blocks   | 125320                   | 1.89
4 convolution blocks   | 67861                    | 3.49

TABLE III. TOTAL RESOURCE USAGE FOR THE PROJECT

Weight dimensions | Convolutional blocks | Logical cells | Memory | Embedded M9 elements | Critical path delay [ns] | Max. FPS
11 bit            | 1                    | 3750          | 232111 | 25                   | 21.84                    | 193
11 bit            | 2                    | 4710          | 309727 | 41                   | 22.628                   | 352
11 bit            | 4                    | 6711          | 464959 | 77                   | 23.548                   | 625
12 bit            | 1                    | 3876          | 253212 | 25                   | 24.181                   | 174
12 bit            | 2                    | 4905          | 337884 | 41                   | 24.348                   | 327
12 bit            | 4                    | 10064         | 589148 | 77                   | -                        | -
13 bit            | 1                    | 3994          | 274313 | 25                   | 22.999                   | 183
13 bit            | 2                    | 5124          | 366041 | 41                   | 25.044                   | 318
13 bit            | 4                    | 8437          | 54949  | 77                   | -                        | -
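The "Max. FPS" column of Table III is consistent with the critical path delay (in nanoseconds) and the cycle counts of Table II; the small helper below reproduces the first row as a sanity check.

```python
def max_fps(critical_path_ns, cycles_per_frame):
    """Maximum frame rate implied by the critical path delay and the cycle count."""
    f_max_hz = 1.0 / (critical_path_ns * 1e-9)      # maximum clock frequency
    return f_max_hz / cycles_per_frame

# first row of Table III: 11-bit weights, 1 convolution block
print(round(max_fps(21.84, 236746)))    # ~193 frames per second, matching the table
```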
IV. CONCLUSIONS

In this work, we propose a design and an implementation of an FPGA-based CNN with fixed-point calculations that allows achieving the exact performance of the corresponding software implementation on the live handwritten digit recognition problem. Due to the reduced number of parameters, we avoid common issues with memory bandwidth. The suggested method can be implemented on a very basic set of FPGAs, but it is also scalable for use on FPGAs with a large number of logical cells. Additionally, we demonstrate how existing open datasets can be modified in order to better adapt them to real-life applications. Finally, in order to promote the reproducibility of results, facilitate open-scientific development, and enable collaborative validation, the source code, documentation, and all results from this study are made available online. There are many possible ways to improve the performance of hardware implementations of neural networks. While we explored and implemented some of them in this work, only relatively shallow neural networks were considered, without additional architectural features such as skip connections. Implementing even deeper networks with multiple dozens of layers is problematic, since all the layer weights would not fit into the FPGA memory and would require the use of external RAM, which can lead to a decrease in performance. Moreover, due to the large number of layers, error accumulation will increase and will require a wider bit range to store fixed-point weight values. In the future, we plan an FPGA-based implementation of specialized lightweight neural network architectures that are currently successfully used on mobile devices. This will allow using the same hardware implementation for different tasks by fine-tuning the architecture with pre-trained weights.
ACKNOWLEDGMENT

Research has been conducted with the financial support from the Russian Science Foundation (grant 17-19-01645).

REFERENCES

[1] G. Huang et al., "Densely connected convolutional networks," in CVPR, 2017.
[2] L.-C. Chen et al., "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[3] A. Shvets, A. Rakhlin, A. A. Kalinin, and V. Iglovikov, "Automatic instrument segmentation in robot-assisted surgery using deep learning," arXiv preprint arXiv:1803.01207, 2018.
[4] M. Sandler et al., "Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation," arXiv preprint arXiv:1801.04381, 2018.
[5] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 26–35. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847265
[6] C. Zhang et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '15. New York, NY, USA: ACM, 2015, pp. 161–170. [Online]. Available: http://doi.acm.org/10.1145/2684746.2689060
[7] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 16–25. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847276
[8] J. Duarte et al., "Fast inference of deep neural networks in FPGAs for particle physics," arXiv preprint arXiv:1804.06913, 2018.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[10] R. A. Solovyev, A. G. Kustov, V. S. Ruhlov, A. N. Schelokov, and D. V. Puzyrkov, "Hardware implementation of a CNN in FPGA based on fixed point calculations," Izvestiya SFedU. Engineering Sciences, July 2017 (in Russian).
[11] "Verilog generator of neural net digit detector for FPGA," GitHub. [Online]. Available: https://github.com/ZFTurbo/Verilog-Generator-of-Neural-Net-Digit-Detector-for-FPGA