Fixed-Point CNN For FPGA
Abstract — Modern mobile neural networks with a reduced number of weights and parameters do a good job with image classification tasks, but even they may be too complex to be implemented in an FPGA for video processing tasks. The article proposes a neural network architecture for the practical task of recognizing images from a camera, which has several advantages in terms of speed. This is achieved by reducing the number of weights, moving from floating-point to fixed-point arithmetic, and through a number of hardware-level optimizations associated with storing weights in blocks, a shift register, and an adjustable number of convolutional blocks that work in parallel. The article also proposes methods for adapting an existing dataset to a different task. As the experiments show, the proposed neural network copes well with real-time video processing even on cheap FPGAs.

Keywords—Neural network hardware; Field programmable gate arrays; Fixed-point arithmetic; 2D convolution

I. INTRODUCTION

Recent research in artificial neural networks has demonstrated their ability to perform well on a wide range of tasks [1], [2]. Most modern neural network architectures for computer vision include convolutional layers and are thus called convolutional neural networks (CNNs). They have high computational requirements. However, there is a compelling need for the use of deep convolutional neural networks on mobile devices and in embedded systems. This is particularly important for video processing in, for example, autonomous cars and medical devices [3].

The following properties of many modern high-performing CNN architectures make their hardware implementation feasible:

• high regularity: all commonly used layers have a similar structure (Conv3x3, Conv1x1, MaxPooling, FullyConnected, GlobalAvgPooling);
• typically small size of convolutional filters: 3 × 3;
• ReLU activation function (comparison of the value with zero): easier to compute than the previously used Sigmoid and Tanh functions.

Due to this high regularity, the size of the network can be easily varied, for example, by changing the number of convolutional blocks. In the case of field programmable gate arrays (FPGAs), this makes it possible to program the network on different types of FPGAs, providing different processing speeds. For example, implementing a higher number of convolutional blocks on an FPGA can directly lead to a speed-up in processing.

A related direction in neural network research considers adapting NNs for use on mobile devices [4]. Mobile networks typically have a reduced number of weights and require a relatively small number of arithmetic operations. However, they are still executed at the software level and use floating-point calculations. For some tasks, such as real-time video analysis that requires processing 30 frames per second, mobile networks may still not be fast enough without further optimization. In order to use an already trained neural network in a mobile device, a set of optimizations can be applied to speed up computation. There exist a number of approaches to do so, including weight compression and computation using low-bit data representations. Since hardware requirements for neural networks keep increasing, there is a need for the design and development of specialized hardware blocks for use in ASICs and FPGAs. The speed-up can be achieved by the following:

• hardware implementation of the convolution operation, which is faster than software convolution;
• using fixed-point arithmetic instead of floating-point calculations;
• reducing the network size while preserving the performance;
• modifying the structure of the network architecture while preserving the same level of performance and decreasing the footprint of the hardware implementation and the stored weights.

For example, Zhang C. et al. [6] quantitatively analyzed computing throughput and required memory bandwidth for various CNNs using optimization techniques such as loop tiling and transformation. This allowed their implementation to achieve a peak performance of 61.62 GFLOPS. Qiu J. et al. [5] proposed an FPGA implementation of pre-trained deep neural networks from the VGG family. They used dynamic-precision quantization with a 48-bit data representation and singular value decomposition to reduce the size of fully-connected layers,
II. METHODS

A. Implementation requirements

To demonstrate our approach, we implement a solution for the problem of recognizing handwritten digits received from a camera in real time. The results are displayed on an electronic LED screen. The minimal speed of digit recognition should exceed 30 FPS, that is, the neural network should be able to process a single image in 33 ms. The resulting hardware implementation should be ready for transfer to a separate custom VLSI device for mass production.

B. Hardware specifications

We use the compact development board DE0-Nano due to the following reasons:

Fig. 2. Scheme for connecting camera and screen modules to the DE0-Nano board. Pins with the same name are connected. Pins marked as 'x' remain unconnected.

C. Dataset preparation

The MNIST dataset for handwritten digit recognition [9] is widely used in the computer vision community. However, it is not well suited for training a neural network in our application, since it differs greatly from the camera images (Fig. 3). Major differences include:

• MNIST images are light digits over a dark background, opposite to those from the camera feed;
• the camera produces color images, while MNIST is grayscale;
• the size of a MNIST image is 28 × 28 pixels, while the camera image size is 320 × 240 pixels;
• unlike the centrally placed digits and homogeneous background in MNIST images, digits in camera images can be shifted and slightly rotated, sometimes with noise in the background;
• MNIST does not have a separate class of images without digits.

In order to use MNIST images for training a neural network, on-the-fly data augmentation is used. This method implies that during the creation of the next mini-batch for training, a set of different filters is arbitrarily applied to each image. This technique is used to easily increase the dataset size, as well as to bring the images to the required form, as in our case.

The following filter set was used for augmenting MNIST images (a sketch of such a pipeline is given below):

• color inversion;
• random rotation of up to 10 degrees in both directions;
• random expansion or reduction of an image by 4 pixels;
• random variation of image intensity (from 0 to 80);
• adding random noise from 0% to 10%.

Optionally, images from the camera can be mixed into mini-batches.
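As an illustration of this augmentation scheme, a minimal Python/NumPy sketch is given below. It assumes 28 × 28 images with values in [0, 1]; the function names, the SciPy-based rotation and zoom, and the exact parameter handling are our own choices, not taken from the original training code.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def augment(img, rng=np.random):
    """Apply the MNIST-adaptation filters listed above to one 28x28 image in [0, 1]."""
    # color inversion: camera digits are dark on a light background
    img = 1.0 - img
    # random rotation of up to 10 degrees in either direction
    img = rotate(img, rng.uniform(-10, 10), reshape=False, mode='nearest')
    # random expansion or reduction by up to 4 pixels, then pad/crop back to 28x28
    size = 28 + rng.randint(-4, 5)
    scaled = zoom(img, size / 28.0)
    canvas = np.ones((28, 28)) * scaled.max()          # light background fill
    h, w = min(28, scaled.shape[0]), min(28, scaled.shape[1])
    canvas[:h, :w] = scaled[:h, :w]
    img = canvas
    # random intensity shift (0..80 on the 0..255 scale) and 0..10% additive noise
    img = img + rng.uniform(0, 80) / 255.0
    img = img + rng.uniform(0, 0.1) * rng.rand(28, 28)
    return np.clip(img, 0.0, 1.0)

def minibatches(images, labels, batch_size=32, rng=np.random):
    """Yield mini-batches with the filters applied on the fly; camera frames could be mixed in here."""
    while True:
        idx = rng.choice(len(images), batch_size, replace=False)
        yield np.stack([augment(images[i], rng) for i in idx]), labels[idx]
```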
The major disadvantage of this model is the number of weights, whose total size exceeds the FPGA capacity. Besides, the exchange with external memory imposes additional time costs. Moreover, this model involves a "bias" term, which also has to be stored, requires additional processing blocks, and tends to accumulate error if implemented in a fixed-point representation. Therefore, we propose a further modification of this architecture that we call the Low Weights Digit Detector (LWDD).

At the layer input is a two-dimensional matrix (the original picture) of size 28 × 28 with values from [0; 1). It is also known that if a ∈ [−1, 1] and b ∈ [−1, 1], then a · b ∈ [−1, 1]. For a 3 × 3 convolution, the value of a certain pixel (i, j) in the second layer can be calculated as follows:

x2(i, j) = Σ (k = −1..1) Σ (l = −1..1) w(k, l) · x1(i + k, j + l)    (2)

where x1 is the input image, x2 is the output of the convolution, and w is the 3 × 3 weight matrix.
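To make equation (2) and the value-range argument concrete, the following sketch computes one output pixel as a 9-element dot product in fixed point. The number of fractional bits and all names are illustrative assumptions, not parameters of the actual design.

```python
import numpy as np

FRAC_BITS = 11                       # illustrative number of fractional bits
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Quantize a real value from [-1, 1] to a signed fixed-point integer."""
    return int(round(x * SCALE))

def conv3x3_pixel(image_fx, weights_fx, i, j):
    """Value of pixel (i, j) in the next layer, as in equation (2):
    a sum of 9 products of weights and the neighbouring input pixels."""
    acc = 0
    for k in range(-1, 2):
        for l in range(-1, 2):
            acc += weights_fx[k + 1][l + 1] * image_fx[i + k][j + l]
    # one product of two fixed-point numbers carries 2*FRAC_BITS fractional bits,
    # so the accumulated sum is rescaled once at the end
    return acc >> FRAC_BITS

# toy example: a 28x28 input with values in [0, 1) and a 3x3 kernel in [-1, 1]
rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernel = rng.uniform(-1, 1, (3, 3))
image_fx = [[to_fixed(v) for v in row] for row in image]
kernel_fx = [[to_fixed(v) for v in row] for row in kernel]
print(conv3x3_pixel(image_fx, kernel_fx, 10, 10) / SCALE)   # close to the float result
```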
to happen when the predicted value is compared with the prediction by the software implementation, rather than with the true image label. To validate the "accuracy" of predictions, we run all test images through both the floating-point software implementation and the fixed-point software implementation (or the Verilog benchmark) and then compare predictions. The ratio of mismatches to the total number of tests is a measure of "inaccuracy" for the given width of weights and intermediate results. We choose the bit width at which the number of errors is 0.

When using fixed-point calculations with convolution blocks, two different strategies are possible:

• rounding after each elementary operation of addition and multiplication;
• calculation with full accuracy and rounding at the very end of the convolution operation.

Two experiments are carried out to determine the more effective approach. To achieve zero difference from the floating-point model, the number storage requires 17 bits in the case of rounding after each operation, and only 12 bits in the case of rounding at the end.

Rounding after each operation slightly increases performance, but significantly increases memory overhead. Therefore, it is advantageous to perform rounding after the convolution block.
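A minimal software sketch of this experiment is given below: the same 9-element sum is computed with rounding after every elementary operation and with a single rounding at the end, and the "inaccuracy" is the share of test images whose fixed-point prediction differs from the floating-point one. All function names are hypothetical.

```python
def quantize(x, frac_bits):
    """Round a real value to the nearest representable fixed-point value."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def conv9_round_each(pixels, weights, frac_bits):
    """Strategy 1: round after every multiplication and addition."""
    acc = 0.0
    for p, w in zip(pixels, weights):
        acc = quantize(acc + quantize(p * w, frac_bits), frac_bits)
    return acc

def conv9_round_once(pixels, weights, frac_bits):
    """Strategy 2: accumulate with full precision, round once at the end."""
    return quantize(sum(p * w for p, w in zip(pixels, weights)), frac_bits)

def mismatch_rate(predict_float, predict_fixed, test_images):
    """'Inaccuracy' as defined above: share of images where the fixed-point
    prediction differs from the floating-point prediction (not from the label)."""
    wrong = sum(predict_float(img) != predict_fixed(img) for img in test_images)
    return wrong / len(test_images)
```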
there is downloaded through the controller to the small memory unit for further use. In the hardware realization, not all layers of the neural network under test are used; some of them are replaced by other functions. For example, there is no ZeroPadding layer; instead, a module for intermediate image edge detection is applied, which allows reducing chip memory usage. The GlobalMaxPooling layer is replaced by a function of the Convolution layer that immediately obtains the GlobalMaxPooling result by finding the largest value in the intermediate image. The rest of the layers are implemented as separate modules. Since the Convolution and Dense layers can use convolutional blocks for calculations, both of them have access to these blocks. The modules contain the ReLU activation function, which is used as needed. In the last layer, the Softmax activation function is applied. It is implemented as a traditional Maximum, because the position of the neuron with the maximum value is the same for both functions. To implement the neural network, a specialized Convolution block is used, which performs a 3 × 3 convolution in one clock cycle. This block computes a scalar product of vectors and contains 9 multiplications and 8 additions. The same block is used for calculations in the fully connected Dense layer by splitting the entire set of additions and multiplications into blocks of 9 neurons.
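The reuse of the same 9-wide scalar-product block for the Dense layer can be modeled as follows; this is our own simplified reading of the splitting into groups of nine, not the actual hardware description.

```python
def scalar_product_9(xs, ws):
    """Model of the Convolution block: 9 multiplications and 8 additions per call."""
    assert len(xs) == 9 and len(ws) == 9
    return sum(x * w for x, w in zip(xs, ws))

def dense_neuron(inputs, weights):
    """One fully connected neuron computed by repeatedly reusing the 9-wide block."""
    # pad the input vector up to a multiple of 9 with zeros
    pad = (-len(inputs)) % 9
    xs = list(inputs) + [0] * pad
    ws = list(weights) + [0] * pad
    acc = 0
    for i in range(0, len(xs), 9):
        acc += scalar_product_9(xs[i:i + 9], ws[i:i + 9])   # one block invocation per group
    return acc
```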
To increase the performance, a number of techniques are applied that make it possible to reduce the number of cycles required for one image classification.

1) Increasing the number of convolution blocks: If there is enough free space in the FPGA, we can improve the performance by increasing the number of convolution blocks, thereby multiplying throughput. Consider the second convolutional block in the proposed neural network LWDD. There are 4 images of size 28 × 28 at the layer input, and 16 blocks of weights are given. To calculate the set of outputs for this layer, also consisting of 4 images, we have to perform four multiplications of the same set of pixels by different sets of weights. If there is only one convolutional block, this takes at least 4 cycles, but if there are 4 such blocks, then only one clock cycle is needed; thus, the Convolution layer calculation speeds up 4 times.
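A back-of-the-envelope model of this speed-up, under the simplifying assumption that one convolution block multiplies one 3 × 3 pixel window by one weight set per clock cycle:

```python
from math import ceil

def window_cycles(n_weight_sets, n_conv_blocks):
    """Clock cycles to multiply one 3x3 pixel window by all weight sets,
    when n_conv_blocks scalar-product units run in parallel."""
    return ceil(n_weight_sets / n_conv_blocks)

# second LWDD layer from the text: the same window is multiplied by 4 weight sets
for blocks in (1, 2, 4):
    print(blocks, "block(s):", window_cycles(4, blocks), "cycle(s) per window")
# 1 block -> 4 cycles, 4 blocks -> 1 cycle, i.e. the layer speeds up about 4x
```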
2) Shift register: To perform an elementary convolution operation, we have to get the values of 9 neighboring pixels from an input image, then the next 9 pixels, 6 of which have already been received in the previous step (see Fig. 5). To shorten the time needed to fetch the necessary data, a shift register is developed that accepts new data at its input and at the same time "pushes out" old data. Thus, each step requires only 3 new values instead of 9.
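A software analogue of the shift register, assuming a row-wise scan of the image: six of the nine window values are kept between neighbouring positions, so only three new pixels are shifted in per step. The data layout is a simplification of the hardware, intended only to show the reuse.

```python
from collections import deque

def sliding_windows(image):
    """Yield 3x3 windows along a row triple, shifting in 3 new pixels per step."""
    rows, cols = len(image), len(image[0])
    for top in range(rows - 2):
        # the "shift register": three columns of three pixels each
        window = deque(maxlen=3)
        for col in range(cols):
            new_column = (image[top][col], image[top + 1][col], image[top + 2][col])
            window.append(new_column)          # pushes out the oldest column automatically
            if len(window) == 3:
                yield top, col - 2, window     # window covers columns col-2..col

# usage: after the first two columns, each step loads only 3 fresh values, not 9
img = [[r * 10 + c for c in range(6)] for r in range(4)]
for i, j, win in sliding_windows(img):
    pass  # feed the 9 values of `win` to the convolution block here
```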
3) Storing all data for one Convolution operation at the same address: When we fetch the data necessary for calculations, one clock cycle is used for each value. Therefore, in order to reduce the time spent on loading the required data, as well as for convenience of access, prior to putting them into the internal memory of the FPGA, the data are stacked in blocks of 9 values, after which they are accessible at a single address. With such a memory arrangement, we can perform the extraction of weights in one clock cycle and, thus, speed up calculations for the convolutional and fully connected layers. An example is shown in Fig. 6.
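The memory arrangement described above can be sketched as packing nine fixed-point words into one wide memory word, so that a single read returns a complete 3 × 3 block. The 12-bit width below is only an example value.

```python
WIDTH = 12                      # example bit width of one fixed-point weight
MASK = (1 << WIDTH) - 1

def pack9(values):
    """Concatenate nine WIDTH-bit words into one 9*WIDTH-bit memory word."""
    word = 0
    for v in values:
        word = (word << WIDTH) | (v & MASK)
    return word

def unpack9(word):
    """Split one wide memory word back into nine WIDTH-bit values."""
    values = [(word >> (WIDTH * i)) & MASK for i in range(9)]
    return list(reversed(values))

weights = [1, 2, 3, 4, 5, 6, 7, 8, 4095]
assert unpack9(pack9(weights)) == weights
# one "address" now yields a full 3x3 weight block in a single read
```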
III. PERFORMANCE RESULTS

The proposed design is successfully implemented in an FPGA. Details on the number of logic cells and memory usage are given in Table I. These numerical results demonstrate the low hardware requirements of the proposed model architecture. Moreover, for this implementation, the depth of the neural network can be further increased without exhausting the resources of this specific hardware.

In this implementation, input images are processed in real time, and the original image is displayed along with the result. Classification of one image requires about 230 thousand clock cycles, and we achieve an overall processing speed with a large margin over 150 frames/sec.

If performance is insufficient and spare logic cells are available, we can speed up calculations by adding convolutional blocks that perform computations in parallel. Table II shows the number of clock cycles required to process one frame using different numbers of convolutional blocks. Table III shows the total resources required for the entire FPGA-based project implementation for different weight dimensions and different numbers of convolution blocks. Missing values denote the cases that Quartus could not synthesize due to the lack of FPGA resources. Source code for both the software and hardware implementations, as well as a video demonstrating real-time digit classification from a mobile camera video feed, is available on GitHub [11].

TABLE II. CLOCK CYCLES PER CONVOLUTION BLOCK

Convolution blocks     | Clock cycles per 1 frame | Processing speed-up
1 convolution block    | 236746                   | -
2 convolution blocks   | 125320                   | 1.89
4 convolution blocks   | 67861                    | 3.49

TABLE III. TOTAL RESOURCE USAGE FOR THE PROJECT

Weight dimensions | Convolutional blocks | Logical cells | Memory | Embedded M9 elements | Critical path delay [ns] | Max. FPS
11 bit            | 1                    | 3750          | 232111 | 25                   | 21.84                    | 193
11 bit            | 2                    | 4710          | 309727 | 41                   | 22.628                   | 352
11 bit            | 4                    | 6711          | 464959 | 77                   | 23.548                   | 625
12 bit            | 1                    | 3876          | 253212 | 25                   | 24.181                   | 174
12 bit            | 2                    | 4905          | 337884 | 41                   | 24.348                   | 327
12 bit            | 4                    | 10064         | 589148 | 77                   | -                        | -
13 bit            | 1                    | 3994          | 274313 | 25                   | 22.999                   | 183
13 bit            | 2                    | 5124          | 366041 | 41                   | 25.044                   | 318
13 bit            | 4                    | 8437          | 54949  | 77                   | -                        | -
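The "Max. FPS" column of Table III is consistent with the critical path delay (in nanoseconds) and the cycle counts of Table II; the small helper below reproduces the first row as a sanity check.

```python
def max_fps(critical_path_ns, cycles_per_frame):
    """Maximum frame rate implied by the critical path delay and the cycle count."""
    f_max_hz = 1.0 / (critical_path_ns * 1e-9)      # maximum clock frequency
    return f_max_hz / cycles_per_frame

# first row of Table III: 11-bit weights, 1 convolution block
print(round(max_fps(21.84, 236746)))    # ~193 frames per second, matching the table
```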
IV. CONCLUSIONS

In this work, we propose a design and an implementation of an FPGA-based CNN with fixed-point calculations that allows achieving the exact performance of the corresponding software implementation on the live handwritten digit recognition problem. Due to the reduced number of parameters, we avoid common issues with memory bandwidth. The suggested method can be implemented on a very basic set of FPGAs, but it is also scalable for use on FPGAs with a large number of logical cells. Additionally, we demonstrate how existing open datasets can be modified in order to better adapt them to real-life applications. Finally, in order to promote the reproducibility of results, facilitate open-scientific development, and enable collaborative validation, the source code, documentation, and all results from this study are made available online. There are many possible ways to improve the performance of hardware implementations of neural networks. While we explored and implemented some of them in this work, only relatively shallow neural networks were considered, without additional architectural features such as skip connections. Implementing even deeper networks with multiple dozens of layers is problematic, since all the layer weights would not fit into the FPGA memory and would require the use of external RAM, which can lead to a decrease in performance. Moreover, due to the large number of layers, error accumulation will increase and will require a wider bit range to store fixed-point weight values. In the future, we plan an FPGA-based implementation of specialized lightweight neural network architectures that are currently successfully used on mobile devices. This will allow using the same hardware implementation for different tasks by fine-tuning the architecture with pre-trained weights.
ACKNOWLEDGMENT

Research has been conducted with the financial support from the Russian Science Foundation (grant 17-19-01645).

REFERENCES

[1] G. Huang et al., "Densely connected convolutional networks," in CVPR, 2017.
[2] L.-C. Chen et al., "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[3] A. Shvets, A. Rakhlin, A. A. Kalinin, and V. Iglovikov, "Automatic instrument segmentation in robot-assisted surgery using deep learning," arXiv preprint arXiv:1803.01207, 2018.
[4] M. Sandler et al., "Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation," arXiv preprint arXiv:1801.04381, 2018.
[5] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 26–35. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847265
[6] C. Zhang et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '15. New York, NY, USA: ACM, 2015, pp. 161–170. [Online]. Available: http://doi.acm.org/10.1145/2684746.2689060
[7] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 16–25. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847276
[8] J. Duarte et al., "Fast inference of deep neural networks in FPGAs for particle physics," arXiv preprint arXiv:1804.06913, 2018.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[10] R. A. Solovyev, A. G. Kustov, V. S. Ruhlov, A. N. Schelokov, and D. V. Puzyrkov, "Hardware implementation of a CNN in FPGA based on fixed point calculations," Izvestiya SFedU. Engineering Sciences, July 2017 (in Russian).
[11] "Verilog generator of neural net digit detector for FPGA," GitHub. [Online]. Available: https://github.com/ZFTurbo/Verilog-Generator-of-Neural-Net-Digit-Detector-for-FPGA