
Fixed-Point Convolutional Neural Network for Real-Time Video Processing in FPGA

Roman Solovyev, Alexander Kustov, Dmitry Telpukhov, Vladimir Rukhlov
Institute for Design Problems in Microelectronics of Russian Academy of Sciences (IPPM RAS)
Moscow, Russia
zf-turbo@yandex.ru

Alexandr Kalinin
Department of Computational Medicine and Bioinformatics, University of Michigan
Ann Arbor, MI, USA

Abstract — Modern mobile neural networks with a reduced number of weights and parameters do a good job with image classification tasks, but even they may be too complex to be implemented in an FPGA for video processing tasks. The article proposes a neural network architecture for the practical task of recognizing images from a camera, which has several advantages in terms of speed. This is achieved by reducing the number of weights, moving from floating-point to fixed-point arithmetic, and through a number of hardware-level optimizations associated with storing weights in blocks, a shift register, and an adjustable number of convolutional blocks that work in parallel. The article also proposes methods for adapting an existing dataset to a different task. As the experiments show, the proposed neural network copes well with real-time video processing even on cheap FPGAs.

Keywords—Neural network hardware; Field programmable gate arrays; Fixed-point arithmetic; 2D convolution

I. INTRODUCTION

Recent research in artificial neural networks has demonstrated their ability to perform well on a wide range of tasks [1], [2]. Most of the modern neural network architectures for computer vision include convolutional layers and thus are called convolutional neural networks (CNNs). They have high computational requirements. However, there is a compelling need for the use of deep convolutional neural networks on mobile devices and in embedded systems. This is particularly important for video processing in, for example, autonomous cars and medical devices [3].

The following properties of many modern high-performing CNN architectures make their hardware implementation feasible:
• high regularity: all commonly used layers have a similar structure (Conv3x3, Conv1x1, MaxPooling, FullyConnected, GlobalAvgPooling);
• typically small size of convolutional filters: 3 × 3;
• ReLU activation function (comparison of the value with zero): easier to compute than the previously used Sigmoid and Tanh functions.

Due to the high regularity, the size of the network can be easily varied, for example, by changing the number of convolutional blocks. In the case of field programmable gate arrays (FPGAs), this allows programming the network on different types of FPGAs, providing different processing speeds. For example, implementing a higher number of convolutional blocks on an FPGA can directly lead to a speed-up in processing.

A related direction in neural network research considers adapting NNs for use on mobile devices [4]. Mobile networks typically have a reduced number of weights and require a relatively small number of arithmetic operations. However, they are still executed at the software level and use floating-point calculations. For some tasks, such as real-time video analysis that requires processing of 30 frames per second, mobile networks may still not be fast enough without further optimization. In order to use an already trained neural network in a mobile device, a set of optimizations can be applied to speed up computation. There exist a number of approaches to do so, including weight compression and computation using low-bit data representations. Since hardware requirements for neural networks keep increasing, there is a need for the design and development of specialized hardware blocks for use in ASICs and FPGAs. The speed-up can be achieved by the following:
• hardware implementation of the convolution operation, which is faster than software convolution;
• using fixed-point arithmetic instead of floating-point calculations;
• reducing the network size while preserving the performance;
• modifying the structure of the network architecture while preserving the same level of performance and decreasing the footprint of the hardware implementation and the saved weights.

For example, Zhang et al. [6] quantitatively analyzed the computing throughput and required memory bandwidth of various CNNs using optimization techniques such as loop tiling and transformation. This allowed their implementation to achieve a peak performance of 61.62 GFLOPS. Qiu et al. [5] proposed an FPGA implementation of pre-trained deep neural networks from the VGG family. They used dynamic-precision quantization with 48-bit data representation and singular value decomposition to reduce the size of fully-connected layers, which led to a smaller number of weights that had to be passed from the device to the external memory. A higher-level solution is proposed in [7], which considers the use of an OpenCL compiler for deep networks such as AlexNet and VGG. Duarte et al. [8] have recently suggested a protocol for automatic conversion of neural network implementations written in a high-level programming language into an intermediate format (HLS) and then into an FPGA implementation. However, their work is mostly focused on the implementation of fully-connected layers. In this work we propose a design and implementation of an FPGA-based CNN with fixed-point calculations that achieves the exact performance of the corresponding software implementation on the live handwritten digit recognition problem. Due to the reduced number of parameters we avoid common issues with memory bandwidth. The suggested method can be implemented on very basic FPGAs, but is also scalable for use on FPGAs with a large number of logic cells. Additionally, we demonstrate how existing open datasets can be modified in order to better adapt them for real-life applications. Finally, in order to promote the reproducibility of results, facilitate open-scientific development, and enable collaborative validation, we make our source code, documentation, and all results from this study available online.

II. METHODS

A. Implementation requirements

To demonstrate our approach, we implement a solution for the problem of recognizing handwritten digits received from a camera in real time. The results are displayed on an electronic screen. The minimal speed of digit recognition should exceed 30 FPS, that is, the neural network should be able to process a single image in under 33 ms. The resulting hardware implementation should be ready for transfer to a separate custom VLSI device for mass production.

B. Hardware specifications

We use the compact development board DE0-Nano for the following reasons:
• the Intel (Altera) FPGA installed on this board is mass-produced and cheap;
• the Cyclone IV FPGA has rather low performance and a small number of logic cells, so the performance should only increase if the design is re-implemented on most other modern FPGAs;
• the board makes connecting peripherals, such as a camera and a touchscreen, easier;
• the board itself has 32 MB of RAM, which can be used to store the weights of a neural network.

The general scheme of the board and external devices is shown in Fig. 1.

Fig. 1. DE0-Nano development board and external devices

The OV7670 camera module is chosen for image acquisition due to its high quality/price ratio. In this application, high-resolution video is not required, since every image is reduced to the size of 28 × 28 pixels and converted to grayscale. The camera module also has a simple connection mechanism.

A display module with a 320 × 240 resolution TFT screen is chosen as the output device. The display module, driven by a microcontroller, is equipped with a 2.4-inch color touchscreen (18-bit color, 262,144 color variations). It also has a backlight and is convenient to use due to its large viewing angle. The contrast and dynamic properties of the H24TM84A LCD indicator allow displaying video. The LCD controller contains a RAM buffer that lowers the requirements for the device microcontroller. The final scheme for connecting the camera and screen modules to the DE0-Nano board is shown in Fig. 2.

Fig. 2. Scheme for connecting camera and screen modules to DE0-Nano board. Pins with the same name are connected. Pins marked as 'x' remain unconnected.

C. Dataset preparation

The MNIST dataset for handwritten digit recognition [9] is widely used in the computer vision community. However, it is not well suited for training a neural network in our application, since it differs greatly from the camera images (Fig. 3). Major differences include:
• MNIST images are light digits over a dark background, opposite to those from the camera feed;
• the camera produces color images, while MNIST is grayscale;
• the size of an MNIST image is 28 × 28 pixels, while the camera image size is 320 × 240 pixels;
• unlike the centrally placed digits and homogeneous background in MNIST images, digits can be shifted and slightly rotated in camera images, sometimes with noise in the background;
• MNIST does not have a separate class of images without digits.

Fig. 3. Different appearance of (A) an image from the MNIST dataset; and (B) an image from the camera feed

Given that the recognition performance on the MNIST dataset is very high, we reduce the size of images from the camera to 28 × 28 pixels and convert them to grayscale. This helps us to address the following problems:
• there is no significant loss in accuracy, as even in small images digits are still easily recognized by humans;
• color information is excessive for digit recognition;
• noisy images from the camera can be cleaned by reducing the resolution and averaging neighboring pixels.

Since the image transformation is also performed at the hardware level, it is necessary to consider in advance a minimum set of arithmetic functions that can effectively bring an image to the desired form. The suggested algorithm for modifying camera images goes as follows:
1) We crop a central part measuring 224 × 224 pixels from the 320 × 240 image, which subsequently allows an easy transition to the desired image size, since 224 = 28 × 8.
2) Then, the cropped part is converted to a grayscale image. Because of the peculiarities of human visual perception, we take a weighted, rather than simple, average. To facilitate the conversion at the hardware level, the following formula is used:

BW = (8 × G + 5 × R + 3 × B) / 16    (1)

Multiplication by 8 and division by 16 are implemented using shifts.
3) Finally, the 224 × 224 image is split into 8 × 8 blocks. We calculate the average value of each of these blocks, forming the corresponding pixel of the 28 × 28 image.

The resulting algorithm is simple and works very fast at the hardware level.
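Before committing this pipeline to hardware, it is convenient to prototype it in a few lines of NumPy. The sketch below mirrors the three steps above (central crop, shift-friendly grayscale conversion by formula (1), and 8 × 8 block averaging); the function name and array layout are illustrative assumptions rather than the published code.

```python
import numpy as np

def camera_to_network(frame):
    """Reduce a 240x320 RGB frame to the 28x28 grayscale network input."""
    # 1) Central 224x224 crop (224 = 28 * 8).
    y0, x0 = (240 - 224) // 2, (320 - 224) // 2
    crop = frame[y0:y0 + 224, x0:x0 + 224].astype(np.uint16)
    r, g, b = crop[..., 0], crop[..., 1], crop[..., 2]
    # 2) Weighted grayscale, formula (1): BW = (8*G + 5*R + 3*B) / 16.
    #    (g << 3) and (>> 4) model the hardware shifts for *8 and /16.
    gray = ((g << 3) + 5 * r + 3 * b) >> 4
    # 3) Average every 8x8 block into one pixel of the 28x28 image
    #    (the division by 64 is again a plain shift).
    blocks = gray.reshape(28, 8, 28, 8).sum(axis=(1, 3))
    return (blocks >> 6).astype(np.uint8)
```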
In order to use MNIST images for training a neural network, on-the-fly data augmentation is used. This method implies that during the creation of each mini-batch for training, a set of different filters is arbitrarily applied to each image. This technique is used to easily increase the dataset size, as well as to bring the images to the required form, as in our case.

The following filter set was used for augmenting MNIST images:
• color inversion;
• random rotation up to 10 degrees in both directions;
• random expansion or reduction of an image by up to 4 pixels;
• random variation of image intensity (from 0 to 80);
• adding random noise from 0% to 10%.

Optionally, images from the camera can be mixed into the mini-batches.
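A minimal NumPy/SciPy version of this on-the-fly augmentation, applied to each image while a mini-batch is assembled, is sketched below. The parameter ranges follow the list above, while the helper name, the use of scipy.ndimage, and the exact way the intensity shift and noise are applied are our assumptions.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def augment(img, rng=np.random.default_rng()):
    """Randomly distort one 28x28 grayscale MNIST image (values 0..255)."""
    out = 255.0 - img                                  # color inversion
    out = rotate(out, rng.uniform(-10, 10),            # +/-10 degree rotation
                 reshape=False, mode='nearest')
    size = 28 + int(rng.integers(-4, 5))               # expand/reduce by <=4 px
    scaled = zoom(out, size / 28.0)
    pad = abs(size - 28) // 2
    if size >= 28:                                     # crop back to 28x28
        out = scaled[pad:pad + 28, pad:pad + 28]
    else:                                              # pad back to 28x28
        out = np.zeros((28, 28))
        out[pad:pad + size, pad:pad + size] = scaled
    out = out + rng.uniform(0, 80)                     # intensity variation
    out = out + rng.normal(0.0, rng.uniform(0, 0.10) * 255, out.shape)  # noise
    return np.clip(out, 0, 255).astype(np.uint8)
```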
D. CNN architecture design

Despite recent developments in CNN architectures, the essence remains the same: the input size decreases from layer to layer and the number of filters increases. At the end of the network, a set of characteristics is formed that is fed to the classification layer (or layers), and the output neurons indicate the likelihood that the image belongs to a particular class.

The following set of rules for constructing a neural network architecture is proposed to minimize the total number of stored weights (which is critical for mobile systems) and to facilitate the transfer to fixed-point calculations:
• minimize the number of fully connected layers, which consume the major part of the memory for storing weights;
• reduce the number of filters of each convolution layer as much as possible without degrading the classification performance;
• stop using bias, which is important when shifting from floating-point to fixed-point, because adding a constant hinders monitoring the range of values, and the rounding error of the bias tends to accumulate over the layers;
• use a simple activation such as ReLU, since other activations, such as Sigmoid and Tanh, involve division, exponentiation, and other functions that are harder to implement in hardware;
• minimize the number of heterogeneous layers, so that one hardware unit can perform calculations at a large number of flow stages.

Before translating the neural network into hardware, we train it on the prepared dataset and save the software implementation for testing. We create the software implementation using Keras with a TensorFlow backend.
In our previous work, we proposed the VGG Simple neural network [10], which is a lightweight modification of the popular VGG architecture [9]. Despite its high performance, the major disadvantage of this model is the number of weights, whose total size exceeds the FPGA capacity. Besides, the exchange with external memory imposes additional time costs. Moreover, this model involves a "bias" term, which also has to be stored, requires additional processing blocks, and tends to accumulate error when implemented in a fixed-point representation. Therefore, we propose a further modification of this architecture, which we call the Low Weights Digit Detector (LWDD).

First, we remove the large fully connected layers and the bias terms. Then, a GlobalMaxPooling layer is added to the neural network instead of the traditionally used GlobalAvgPooling (found, for example, in ResNet50). The efficiency of these layers is approximately the same, while finding a maximum is computationally much simpler in hardware than calculating a mean value. These changes do not decrease the network performance. The new architecture is shown in Fig. 4. The changes in the network structure reduce the number of weights from 25,000 to approximately 4,500 and make it possible to store all weights in the internal memory of the FPGA. On the modified MNIST dataset with image augmentations, the LWDD neural network achieves 96% accuracy.

Fig. 4. Low Weight Digit Detector (LWDD) neural network architecture
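The exact layer configuration is given in Fig. 4 rather than in the text, so the Keras sketch below only illustrates the stated design rules (bias-free 3 × 3 convolutions, ReLU, GlobalMaxPooling instead of large fully connected layers). The filter counts and the class count are illustrative guesses, not the published LWDD configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lwdd(num_classes=10):
    """Toy LWDD-style model following the rules of Section II.D."""
    # Bias-free 3x3 convolutions with ReLU, per the design rules.
    conv = lambda n: layers.Conv2D(n, 3, padding='same',
                                   use_bias=False, activation='relu')
    return keras.Sequential([
        keras.Input((28, 28, 1)),
        conv(4),
        layers.MaxPooling2D(),
        conv(8),
        layers.MaxPooling2D(),
        conv(16),
        layers.GlobalMaxPooling2D(),      # replaces GlobalAvgPooling
        layers.Dense(num_classes, use_bias=False, activation='softmax'),
    ])

model = build_lwdd()     # an extra "no digit" class can be added if needed
model.summary()          # total weight count stays in the low thousands
```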
E. Fixed-point calculation implementation

In neural networks, calculations are traditionally performed in floating point, either on a GPU (fast) or a CPU (slow), for example using the float32 type. When implemented at the hardware level, floating-point calculations are slower than fixed-point ones due to the difficulty of controlling the mantissa and the exponent during various operations.

Let's consider the first convolutional layer of a neural network, which is the main building block of convolutional architectures. At the layer input is a two-dimensional matrix (the original picture) of size 28 × 28 with values from [0; 1). It is also known that if a ∈ [−1, 1] and b ∈ [−1, 1], then a · b ∈ [−1, 1].

For a 3 × 3 convolution, the value of a given pixel (i, j) in the second layer can be calculated as follows:

y(i, j) = b + Σ_{m=−1..1} Σ_{n=−1..1} w(m, n) · x(i + m, j + n)    (2)

Since the weights w(m, n) and the bias b are known, it is possible to calculate the potential minimum mn and maximum mx of the second layer. Let M = max(|mn|, |mx|). If we divide w(m, n) and b by the value of M, we can guarantee that, for any configuration of the input data, the value on the second layer does not exceed 1. We call M the reduction coefficient of the layer. For the next layer, we use the same principle: the value at the layer input belongs to the interval [−1; 1], so we can repeat our reasoning. For the proposed neural network, after all weight reductions up to the last layer, the position of the maximum of the last neuron is not changed; that is, from the point of view of floating-point calculations, the network works equivalently to the neural network without reductions.
After performing this reduction for each layer, we can move from floating-point to fixed-point calculations, since we know exactly the range of values at each stage of the computation. We use the following notation for an N-bit fixed-point representation: x_b = [x · 2^N], where [·] denotes rounding to the nearest integer.

If z = x + y, then addition can be expressed as: z_b = x_b + y_b = [x · 2^N] + [y · 2^N] = [(x + y) · 2^N] = [z · 2^N].

Multiplication can be expressed as: z' = x_b · y_b = [x · 2^N] · [y · 2^N] = [(x · y) · 2^N · 2^N] = [z_b · 2^N]; that is, we have to divide the multiplication result by 2^N to get the real value, or simply shift it right by N positions.
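These identities are easy to verify with a toy Python model of the N-fractional-bit format (note that Python's >> rounds toward negative infinity, which is one possible hardware rounding choice):

```python
N = 12                                          # fractional bits
to_fx   = lambda x: int(round(x * (1 << N)))    # x_b = [x * 2^N]
from_fx = lambda xb: xb / (1 << N)

xb, yb = to_fx(0.3), to_fx(-0.2)
add = xb + yb              # addition: no correction needed
mul = (xb * yb) >> N       # multiplication: drop the extra 2^N factor

print(from_fx(add))        # ~ 0.1
print(from_fx(mul))        # ~ -0.06
```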
If we sort through all possible input images and focus on the potential minimum and maximum values, we can get very large reduction coefficients, such that the accuracy will rapidly decrease from layer to layer. This can require a large width for the fixed-point representation of the weights and intermediate computational results. To avoid this, we can use all (or a part) of the training set to find the most likely maximum and minimum values in each layer. As our experiments show, using the training set makes it possible to decrease the reduction coefficients. At that, we should scale up the coefficients by a small margin, either focusing on the value of 3σ or increasing the maximum by several percent.

However, under certain conditions, overflow and violation of the calculated range are possible. To address this issue, the hardware implementation requires a detector of such cases and a mechanism for replacing overflowed values with the maximum for the given layer. This can be achieved by minor modifications of the convolutional unit.
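Behaviorally, such an overflow detector is a clamp on the accumulator. The hardware details are not given in the paper, so the limit below (the largest magnitude representable with N fractional bits) is an assumption:

```python
def saturate(acc, n_bits=12):
    """Clamp an accumulator to the representable fixed-point range."""
    limit = (1 << n_bits) - 1        # about +/-1.0 in this format
    return max(-limit, min(limit, acc))
```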
For fixed-point calculations with a limited width of the weights and intermediate results, rounding errors inevitably arise, accumulate from layer to layer, and can lead to "inaccurate" predictions.
A prediction is considered "inaccurate" when it differs from the prediction of the floating-point software implementation, rather than from the true image label. To validate the "accuracy" of predictions, we run all test images through both the floating-point software implementation and the fixed-point software implementation (or the Verilog benchmark) and then compare the predictions. The ratio of mismatches to the total number of tests is the measure of "inaccuracy" for the given width of the weights and intermediate results. We choose the bit width at which the number of errors is 0.

When using fixed-point calculations with convolution blocks, two different strategies are possible:
• rounding after each elementary operation of addition and multiplication;
• calculating with full accuracy and rounding at the very end of the convolution operation.

Two experiments were carried out to determine the most effective approach. To achieve zero difference from the floating-point model, the number storage requires 17 bits in the case of rounding after each operation, and only 12 bits in the case of rounding at the end. Rounding after each operation slightly increases performance but significantly increases the memory overhead. Therefore, it is advantageous to perform rounding after the convolution block.

F. FPGA-based hardware implementation

In the FPGA-based realization, SDRAM is used to store the video frame from the camera. In the SDRAM memory of the DE0-Nano board used in this study, two equal areas are allocated for two frames: the current frame is recorded into the first area, while the previous frame is read from the other memory area. After the output is finished, these areas change their roles. When using the SDRAM memory in this study, we consider two important issues. First, the memory operates at the high frequency of 143 MHz; thus, we face the additional problem of transferring data from the clock domain of the camera to the clock domain of the SDRAM. Second, in order to achieve maximum speed, writing to SDRAM should be performed in whole transactions, or "bursts". A FIFO built directly in FPGA memory is the best way to solve both of these problems. The basic idea is that the camera fills the FIFO at a low frequency, then the SDRAM controller reads the data at a high frequency and immediately writes them to memory in one transaction. Data output to the TFT screen is organized in the same way: data from SDRAM are written to the screen FIFO and then read out at the frequency of 10 MHz. After the FIFO has been cleared, the operation is repeated.

A picture from the camera, after passing through the SDRAM, is displayed on the screen as is, and is also fed to the neural network for recognition through a block that converts the image to grayscale and decreases its resolution. When the neural network operation is finished, the result is also output directly to the screen.

After conversion, the input image is stored in the database, which also stores the weight coefficients of each layer that were calculated and wired in beforehand. As necessary, data from there are downloaded through the controller to a small memory unit for further use. In the hardware realization, not all layers of the neural network under test are used; some of them are replaced by other functions. For example, there is no ZeroPadding layer; instead, a module for intermediate image edge detection is applied, which allows reducing the chip memory usage. The GlobalMaxPooling layer is replaced by a function of the Convolution layer that immediately obtains the GlobalMaxPooling result by finding the largest value in the intermediate image. The rest of the layers are implemented as separate modules. Since the Convolution and Dense layers can use convolutional blocks for calculations, both of them have access to these blocks. The modules contain the ReLU activation function, which is used as needed. In the last layer, the Softmax activation function is applied. It is implemented as a traditional Maximum, because the position of the neuron with the maximum value is the same for both functions. To implement the neural network, a specialized Convolution block is used, which performs a 3 × 3 convolution in one clock cycle. This block computes a scalar product of vectors and contains 9 multiplications and 8 additions. The same block is used for the calculations in the fully connected Dense layer by splitting the entire set of additions and multiplications into blocks of 9 neurons.
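Functionally, this Convolution block is a 9-element dot product, and the same primitive serves the Dense layer once its weights are grouped by 9. A behavioral Python model, with one call corresponding to one clock cycle of the hardware block:

```python
def conv_block(pixels9, weights9):
    """Model of the hardware block: 9 multiplications and 8 additions."""
    assert len(pixels9) == len(weights9) == 9
    return sum(p * w for p, w in zip(pixels9, weights9))

def dense_via_conv_blocks(inputs, weights):
    """Reuse the same block for a fully connected neuron, 9 inputs at a time."""
    total = 0
    for i in range(0, len(inputs), 9):
        chunk_in = list(inputs[i:i + 9])
        chunk_w = list(weights[i:i + 9])
        pad = 9 - len(chunk_in)              # zero-pad the last partial group
        total += conv_block(chunk_in + [0] * pad, chunk_w + [0] * pad)
    return total
```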
Fig. 5. Shift register operation: blue indicates data for the previous convolution operation obtained at the previous step

Fig. 6. Storage of all layer weights as a single block

TABLE I. INFORMATION ON RESOURCES USED ON FPGA AFTER PLACE & ROUTE STAGE

                                            Logic cells        9-bit elements   Internal memory          PLLs
                                            (avail.: 22320)    (avail.: 132)    (avail.: 608256 bit)     (avail.: 4)
Input image converter                       964 (4%)           0 (0%)           0 (0%)                   0 (0%)
Neural network                              4014 (18%)         23 (17%)         285428 (47%)             0 (0%)
Weights database                            0 (0%)             0 (0%)           70980 (12%)              0 (0%)
Storage of intermediate calculation results 1 (<1%)            0 (0%)           214448 (35%)             0 (0%)
Total usage                                 5947 (27%)         23 (17%)         371444 (61%)             2 (50%)

G. Additional optimization of calculations
To increase the performance, a number of techniques are applied that make it possible to reduce the number of clock cycles required for one image classification.

1) Increasing the number of convolution blocks: If there is enough free space in the FPGA, we can improve the performance by increasing the number of convolution blocks, thereby multiplying the throughput. Consider the second convolutional block in the proposed LWDD neural network. There are 4 images of size 28 × 28 at the layer input, and 16 blocks of weights are given. To calculate the set of outputs for this layer, also consisting of 4 images, we have to perform four multiplications of the same set of pixels by different sets of weights. If there is only one convolutional block, this takes at least 4 cycles, but if there are 4 such blocks, then only one clock cycle is needed; thus, the Convolution layer calculation speeds up 4 times.

2) Shift register: To perform an elementary convolution operation, we have to get the values of 9 neighboring pixels from an input image, then the next 9 pixels, 6 of which have already been received in the previous step (see Fig. 5). To shorten the time needed to fetch the necessary data, a shift register is developed that takes new data at its input and at the same time "pushes out" the old data. Thus, each step requires only 3 new values instead of 9.
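In software terms, the shift register keeps a 3 × 3 window whose two older columns are reused from the previous step; the sketch below models this behavior (the class interface is our assumption):

```python
from collections import deque

class WindowShiftRegister:
    """3x3 sliding window: each step pushes in 3 new pixels, drops 3 old."""
    def __init__(self):
        self.cols = deque(maxlen=3)          # at most three 3-pixel columns

    def push_column(self, three_pixels):
        self.cols.append(tuple(three_pixels))   # oldest column falls out

    def window(self):
        """Current 9 pixels for the Convolution block (3 full columns)."""
        return [p for col in self.cols for p in col]
```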
TABLE II. CLOCK CYCLES PER CONVOLUTION BLOCK

                         Clock cycles per frame    Processing speed-up
1 convolution block      236746                    -
2 convolution blocks     125320                    1.89
4 convolution blocks     67861                     3.49

3) Storing all data for one Convolution operation at the same address: When we fetch the data necessary for a calculation, one clock cycle is used for each value. Therefore, in order to reduce the time spent on downloading the required data, as well as for convenience of access, the data are stacked in blocks of 9 items before being put into the internal memory of the FPGA, after which they are accessible at one address. With such a memory arrangement, we can extract the weights in one clock cycle and thus speed up the calculations for the convolutional and fully connected layers. An example is shown in Fig. 6.
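The packing amounts to storing each 3 × 3 kernel as one 9-word record, so that a single address fetch yields all operands of one convolution. A minimal sketch:

```python
import numpy as np

def pack_weights(kernels):
    """Stack 3x3 kernels into rows of 9: one 'address' = one full kernel.

    kernels: array of shape (num_kernels, 3, 3), fixed-point integers.
    """
    return kernels.reshape(len(kernels), 9)

# memory[i] now returns the 9 weights needed by the Convolution block
# in a single read instead of nine separate ones.
```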
TABLE III. TOTAL RESOURCE USAGE FOR THE PROJECT

Weight       Convolutional   Logical   Memory,   Embedded M9K   Critical path   Max.
dimensions   blocks          cells     bit       elements       delay, ns       FPS
11 bit       1               3750      232111    25             21.84           193
11 bit       2               4710      309727    41             22.628          352
11 bit       4               6711      464959    77             23.548          625
12 bit       1               3876      253212    25             24.181          174
12 bit       2               4905      337884    41             24.348          327
12 bit       4               10064     589148    77             -               -
13 bit       1               3994      274313    25             22.999          183
13 bit       2               5124      366041    41             25.044          318
13 bit       4               8437      54949     77             -               -

III. PERFORMANCE RESULTS

The proposed design is successfully implemented in an FPGA. Details on the number of logic cells and the memory usage are given in Table I. These numerical results demonstrate the low hardware requirements of the proposed model architecture. Moreover, for this implementation, the depth of the neural network can be further increased without exhausting the resources of this specific hardware.

In this implementation, input images are processed in real time, and the original image is displayed along with the result. Classification of one image requires about 230 thousand clock cycles, and we achieve an overall processing speed of over 150 frames per second, a large margin over the 30 FPS requirement.

If the performance is insufficient and spare logic cells are available, we can speed up the calculations by adding convolutional blocks that perform computations in parallel. Table II shows the number of clock cycles required to process one frame using different numbers of convolutional blocks. Table III shows the total resources required for the entire FPGA-based project implementation for different weight widths and different numbers of convolution blocks. Missing values denote the cases that Quartus could not synthesize due to the lack of FPGA resources. The source code for both the software and hardware implementations, as well as a video demonstrating the real-time digit classification from a mobile camera video feed, is available on GitHub [11].

IV. CONCLUSIONS

In this work we propose a design and an implementation of an FPGA-based CNN with fixed-point calculations that achieves the exact performance of the corresponding software implementation on the live handwritten digit recognition problem. Due to the reduced number of parameters we avoid common issues with memory bandwidth. The suggested method can be implemented on very basic FPGAs, but is also scalable for use on FPGAs with a large number of logic cells. Additionally, we demonstrate how existing open datasets can be modified in order to better adapt them for real-life applications. Finally, in order to promote the reproducibility of results, facilitate open-scientific development, and enable collaborative validation, the source code, documentation, and all results from this study are made available online. There are many possible ways to improve the performance of hardware implementations of neural networks. While we explored and implemented some of them in this work, only relatively shallow neural networks were considered, without additional architectural features such as skip connections. Implementing even deeper networks with multiple dozens of layers is problematic, since all the layer weights would not fit into the FPGA memory and would require the use of external RAM, which can lead to a decrease in performance. Moreover, due to the large number of layers, error accumulation will increase and will require a wider bit range to store the fixed-point weight values. In the future, we plan an FPGA-based implementation of specialized lightweight neural network architectures that are currently successfully used on mobile devices.
This will allow us to use the same hardware implementation for different tasks by fine-tuning the architecture using pre-trained weights.

ACKNOWLEDGMENT

Research has been conducted with the financial support from the Russian Science Foundation (grant 17-19-01645).

REFERENCES

[1] G. Huang et al., "Densely connected convolutional networks," in CVPR, 2017.
[2] L.-C. Chen et al., "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834-848, 2018.
[3] A. Shvets, A. Rakhlin, A. A. Kalinin, and V. Iglovikov, "Automatic instrument segmentation in robot-assisted surgery using deep learning," arXiv preprint arXiv:1803.01207, 2018.
[4] M. Sandler et al., "Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation," arXiv preprint arXiv:1801.04381, 2018.
[5] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 26-35. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847265
[6] C. Zhang et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '15. New York, NY, USA: ACM, 2015, pp. 161-170. [Online]. Available: http://doi.acm.org/10.1145/2684746.2689060
[7] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 16-25. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847276
[8] J. Duarte et al., "Fast inference of deep neural networks in FPGAs for particle physics," arXiv preprint arXiv:1804.06913, 2018.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[10] R. A. Solovyev, A. G. Kustov, V. S. Ruhlov, A. N. Schelokov, and D. V. Puzyrkov, "Hardware implementation of a CNN in FPGA based on fixed point calculations," Izvestiya SFedU. Engineering Sciences, July 2017 (in Russian).
[11] "Verilog generator of neural net digit detector for FPGA," GitHub. [Online]. Available: https://github.com/ZFTurbo/Verilog-Generator-of-Neural-Net-Digit-Detector-for-FPGA