

Real-Time SSDLite Object Detection on FPGA


Suchang Kim, Student Member, IEEE, Seungho Na, Byeong Yong Kong, Member, IEEE,
Jaewoong Choi, Student Member, IEEE, and In-Cheol Park, Senior Member, IEEE

Abstract— Deep neural network (DNN)-based object detection has been investigated and applied to various real-time applications. However, it is hard to employ the DNNs in embedded systems due to their high computational complexity and deep-layered structure. Although several field-programmable gate array (FPGA) implementations have been presented recently for real-time object detection, they suffer from either low throughput or low detection accuracy. In this article, we propose an efficient computing system for real-time SSDLite object detection on FPGA devices, which includes novel hardware architecture and system optimization techniques. In the proposed hardware architecture, a neural processing unit (NPU) that consists of heterogeneous units, such as band processing, scaling and accumulating, and data fetching and formatting units, is designed to accelerate the DNNs efficiently. In addition, system optimization techniques are presented to improve the throughput further. A task control unit is employed to balance the workload and increase the utilization of heterogeneous units in the NPU, and the object detection algorithm is refined accordingly. The proposed architecture is realized on an Intel Arria 10 FPGA and enhances the throughput by up to 13.6× compared to the state-of-the-art FPGA implementation.
Index Terms— Deep neural network (DNN), field-programmable gate array (FPGA), object detection, real-time applications, very-large-scale integration (VLSI) architecture.

Fig. 1. Object detection and demonstration. (a) Example of object detection. (b) Live demonstration of the proposed object detector running on an FPGA board [41].

I. INTRODUCTION

DEEP neural networks (DNNs) have widely been investigated and proved to have high accuracy in many applications, such as image classification, object detection, speech recognition, and machine translation [12]–[26]. Despite the outstanding performance, however, DNNs are associated with high computational complexity and a deep-layered structure, restricting their applicability to real-time applications such as machine translation and object detection and to energy-constrained devices such as embedded system-on-chips and portable terminals. In real-time applications, high throughput is significant in making the users sense the response immediately and in detecting potential risks promptly. On the other hand, energy efficiency is important in energy-constrained devices.

Object detection is a computer vision task that identifies the class and the location of objects in digital images or videos, as shown in Fig. 1(a). Recently, DNNs have actively been explored for object detection, and diverse network structures have been proposed [22]–[26], [50]–[52]. Among them, the you only look once (YOLO) series [24], [25] and single-shot multibox detection (SSD) [26] have attained prominent detection accuracy while keeping the computation complexity low. To mitigate the computational complexity and the amount of model parameters further, MobileNetV2, instead of VGG, has recently been applied to SSD as a base network, and the convolutions in SSD are replaced with depthwise separable convolutions (DSCs) [21]. The resulting network, called SSDLite, shows a comparable detection accuracy at a relatively lower computational complexity and a smaller amount of model parameters than other networks, making it suitable for high-throughput object detection.

Manuscript received November 16, 2020; revised February 16, 2021 and March 3, 2021; accepted March 4, 2021. This work was supported in part by the National Research Foundation of Korea under Grant NRF-2017R1E1A1A01076992 and in part by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-0-01847) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation). (Corresponding author: In-Cheol Park.)

Suchang Kim, Jaewoong Choi, and In-Cheol Park are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, South Korea (e-mail: sckim@ics.kaist.ac.kr; jwchoi@ics.kaist.ac.kr; icpark@kaist.edu).

Seungho Na is with Anapass Inc., Seoul 08375, South Korea (e-mail: shna@ics.kaist.ac.kr).

Byeong Yong Kong is with the Division of Electrical, Electronic, and Control Engineering, Kongju National University, Cheonan 31080, South Korea (e-mail: bykong@kongju.ac.kr).

Digital Object Identifier 10.1109/TVLSI.2021.3064639


Fig. 2. Overview of the SSDLite object detection.

DNNs have been implemented by utilizing general-purpose graphics processing units (GPGPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). As the GPGPU has numerous parallel computation cores and high memory bandwidth, it is relatively easy to achieve high performance. Moreover, well-defined software frameworks, such as Caffe, TensorFlow, and PyTorch, promote the GPGPU as a powerful tool for realizing DNN-based object detection [17]–[26]. Since the GPGPU consumes large power, however, it is appropriate only for cloud servers or workstations that are not constrained in energy consumption. On the other hand, as the FPGA and the ASIC can achieve high throughput with much less power than the GPGPU, they are appropriate for energy-constrained devices [1]–[11], [44], [45], [48], [49]. Specifically, the FPGA, which consists of programmable logic blocks, DSP blocks, and block RAMs, has gained great attention in accelerating DNNs due to its reconfigurability and low development cost [1]–[5], [44], [45], [48], [49]. In addition, many DNN quantization techniques [27]–[34], [43], which have been developed to reduce the amount of data to access and to allow fixed-point (fxp) arithmetic operations, play a considerable role in enhancing the power efficiency and throughput of an FPGA implementation.

For the implementation of real-time object detection on FPGA devices, binarized network models and their hardware architectures were proposed [1]–[3]. Nguyen et al. [1] quantized the weights of YOLO to binarized ones and used low-bit feature maps, making it possible to store the entire network model and the intermediate feature maps in the block RAM. In addition, all convolutional layers (CLs) were pipelined to achieve high throughput. A lightweight YOLOv2 was proposed in [2], which uses the binarized network model for feature extraction and parallel support vector regression (SVR) for classification and localization. A quantized network model was also proposed in [3], where the hidden layers are binarized and the input and output layers are quantized to eight-bit fxp values to ensure the accuracy. To increase the throughput, an advanced extension called NEON was developed based on the single instruction multiple data (SIMD) architecture. The binarized network model is effective in achieving high throughput, but it requires hard retraining processes to restore the accuracy. The binarization of weights often leads to overquantization that makes it hard to achieve the desired accuracy. Sometimes, a network model and its hardware architecture were codesigned together [4], [5]. In [4], dilated convolutions and convolutions associated with a stride of two were replaced with convolutions with a stride of one in order to share hardware resources, and a dynamic quantization method was used to preserve the accuracy. Fang et al. [5] presented a heterogeneous architecture named Aristotle and a full stack of software tools needed for network quantization, pruning, and FPGA deployment. Although the codesign approach is effective in enhancing the throughput, the repetitive redesign of the network increases the development cost, and the quantization method in [4] induces a drop of accuracy even at the expense of additional hardware resources compared to a simple integer arithmetic unit whose precision is enough to support the dynamic range of each layer. In addition, the workload required in performing the object detection has not been fully analyzed, and the architectures did not consider the balance of tasks, so there is still plenty of room to optimize in implementing a high-throughput object detector on FPGA devices.

In this article, we propose an efficient computing system that involves novel hardware architecture and system optimization techniques for real-time SSDLite object detection on FPGA devices. An efficient neural processing unit (NPU) is proposed, which consists of heterogeneous units such as band processing (BP), scaling and accumulating (SA), and data fetching and formatting (DFF) units. The BP and SA units are optimized for the depthwise CL (DCL) and the pointwise CL (PCL), effectively reducing memory accesses. The DFF unit arranges the data into a form suitable for the BP and SA units and operates in parallel with them, reducing the data formatting latency. In addition, system optimization techniques are devised to enhance the throughput further. A task control unit (TCU) is proposed to balance the workload and improve the utilization of heterogeneous units in the NPU. The detection algorithm is refined to remove the postprocess latency and quantize the feature and parameter representations. The proposed object detector is implemented on FPGA boards, and a live demonstration shown in Fig. 1(b) can be found on YouTube [41], where the camera module captures the image displayed on the tablet and the LCD display module shows the detection results.

The rest of this article is organized as follows. Section II describes SSDLite in detail and the fundamental architecture of object detection. The proposed hardware architecture is described in Section III, and system optimization techniques are explained in Section IV. Section V summarizes and evaluates the characteristics of FPGA


implementations. Finally, concluding remarks are made in Section VI.

Algorithm 1 Baseline Postprocess [42]

II. BACKGROUND
In this section, we first explain the baseline SSDLite object
detection algorithm [21] and then elaborate the primitive
operations in the network and the postprocess. Subsequently,
we analyze the previous architectures in the literature.

A. SSDLite Overview
The baseline SSDLite is based on a feedforward convolu-
tional network and trained with the VOC 2007 data set [37].
It produces a collection of bounding boxes and classification
scores of object instances in those boxes and can suggest up
to 3000 bounding boxes and recognize 20 classes, such as
humans, cars, and buses. As shown in Fig. 2, SSDLite consists
of three processes: preprocess, NN-process, and postprocess.
In the preprocess, a raw image is divided into different chan-
nels according to the color in order to make the image conform
to the input format of the NN-process. In the meantime, all
pixel values are normalized to have a predefined mean and
variance.
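As a concrete picture of this step, a minimal sketch is shown below; the channel layout and the per-channel constants are illustrative assumptions, since the paper does not list them.

```python
import numpy as np

def preprocess(raw, mean, std):
    """Split an H x W x 3 raw image into per-color channel planes and
    normalize every pixel with predefined per-channel statistics."""
    planes = np.transpose(raw.astype(np.float32), (2, 0, 1))   # 3 x H x W
    mean = np.asarray(mean, dtype=np.float32).reshape(3, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(3, 1, 1)
    return (planes - mean) / std
```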
The NN-process is partitioned into feature extraction and
convolution prediction. In the feature extraction, a deep con-
volutional neural network, which consists of the truncated
MobileNetV2 base network and the additional layers following
the base network, extracts features gradually from the input
image. The basic building block of the feature extractor is an
inverted residual block (IRB) composed of two PCLs and one
DCL between the PCLs. By using six multiscale intermediate
feature maps generated by the feature extractor, which is
shown on top of the feature extractor in Fig. 2, the convolution
predictors predict objects. The basic building block of the
convolution predictors is a DSC consisting of a DCL and a PCL. There are 12 DSCs, and two of them are paired. Each pair produces the classification scores and the shape offsets of bounding boxes. The convolution predictors use the six multiscale intermediate feature maps to make 2166, 600, 150, 54, 24, and 6 object proposals.

The postprocess is to obtain the final object detections. It adjusts the bounding boxes by using the shape offsets and performs the nonmaximum suppression to suppress the proposals that are associated with low classification scores or have high intersections with other bounding boxes.

B. Primitive Operations

As described in Section II-A, most layers in SSDLite are DCLs or PCLs, whose operations are shown in Fig. 3. As 2-D kernels and small-sized 3-D kernels are used in the DCL and PCL, SSDLite is effective in reducing model parameters and computation complexity compared to other network models based on the conventional CLs.

In a DCL, a k × k kernel and an h × w input feature map are convolved to form an e × f output feature map. There are d k × k kernels and d h × w input feature maps, and an e × f × d output feature map is generated by conducting 2-D convolution for each pair of the kernel and the input feature map, as shown in Fig. 3(a). The computation in the DCL is defined as

$$o_{xyz} = \sum_{l=0}^{k-1}\sum_{m=0}^{k-1} w_{lmz} \times f_{(sx+l)(sy+m)z} + b_z \qquad (1)$$

where w, f, b, and s denote the kernel, the input feature, the bias, and the stride size, respectively.

In a PCL, an h × w output feature map is generated by convolving an h × w × d input feature map with a 1 × 1 × d kernel. c 1 × 1 × d kernels are convolved with the same h × w × d input feature map to produce an h × w × c output feature map, as shown in Fig. 3(b). The computation in the PCL is defined as

$$o_{xyz} = \sum_{n=0}^{d-1} w_{nz} \times f_{(sx)(sy)n} + b_z. \qquad (2)$$

The primitive operation in the DCL is the channelwise 2-D convolution that does not accumulate in the depth direction, and that in the PCL is to scale 2-D input feature maps and accumulate them in the depth direction. Though both primitive operations are a kind of convolution, their data flows are so different that separate accelerators are required to process the layers efficiently.
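To make the two data flows concrete, the following NumPy sketch implements (1) and (2) directly; the function names, loop structure, and array shapes are illustrative assumptions (stride 1 is assumed for the PCL), not the hardware mapping used in the NPU.

```python
import numpy as np

def depthwise_conv(feat, ker, bias, s=1):
    """Channelwise 2-D convolution of (1): feat is h x w x d, ker is
    k x k x d, bias is d; no accumulation along the depth direction."""
    h, w, d = feat.shape
    k = ker.shape[0]
    e, f = (h - k) // s + 1, (w - k) // s + 1
    out = np.empty((e, f, d))
    for z in range(d):                     # one 2-D convolution per channel
        for x in range(e):
            for y in range(f):
                patch = feat[s*x:s*x+k, s*y:s*y+k, z]
                out[x, y, z] = np.sum(ker[:, :, z] * patch) + bias[z]
    return out

def pointwise_conv(feat, ker, bias):
    """1 x 1 convolution of (2): feat is h x w x d, ker is d x c, bias is c;
    each channel is scaled and the results accumulate along the depth."""
    return np.tensordot(feat, ker, axes=([2], [0])) + bias
```

Note that the depthwise loop never sums across z, whereas the pointwise tensordot sums only across z; this is exactly why the BP and SA units in Section III are specialized separately.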


Fig. 3. Primitive operations in SSDLite. (a) Depthwise convolution. (b) Pointwise convolution.

Fig. 4. Fundamental architecture of object detection.

Fig. 5. Proposed architecture of object detection.

C. Postprocess

The conventional postprocess is formalized in Algorithm 1 [42], where b_i, g_i, and s_{i,j} denote the default box, the bounding-box shape offset of the ith proposal, and the confidence score of the jth class in the ith proposal, respectively. N and M stand for the number of proposals and classes, respectively. Th_conf and Th_IoU represent the threshold values used to screen the confidence scores and the intersection over union (IoU), respectively. The postprocess outputs the final detections D composed of the surviving bounding boxes and their object classes.
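The body of Algorithm 1 is not reproduced in this extraction, so the following Python sketch illustrates the baseline flow it formalizes: per-class confidence thresholding followed by greedy IoU-based nonmaximum suppression over all N proposals. The box format and helper names are assumptions for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def baseline_postprocess(boxes, scores, th_conf=0.01, th_iou=0.45):
    """Hedged sketch of Algorithm 1: boxes are the N adjusted bounding
    boxes and scores is an N x M matrix of class confidences."""
    detections = []
    n, m = scores.shape
    for j in range(m):                      # per class
        cand = [i for i in range(n) if scores[i, j] > th_conf]
        cand.sort(key=lambda i: scores[i, j], reverse=True)
        kept = []
        for i in cand:                      # greedy NMS
            if all(iou(boxes[i], boxes[k]) < th_iou for k in kept):
                kept.append(i)
        detections += [(boxes[i], j, scores[i, j]) for i in kept]
    return detections
```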
D. Previous Architecture

The fundamental architecture of object detection employed in the literature [1]–[5] is shown in Fig. 4. The NPU, which consists of buffers and an array of processing elements (PEs), is used to accelerate the network model. The PE array is programmable and able to support various layers by configuring hyperparameters, such as the sizes of the feature map, kernel, and stride. The NPU is homogeneous in that the same PE array is used to process the entire network model by switching only the data flow. To provide input data in a form suitable to process, the data are formatted by the data router in the PE array. A direct memory access (DMA) unit is responsible for fetching the data from the off-chip memory and forwarding them to the buffers. A host processor controls the PE array, the buffers, and the DMA unit as well as the entire flow of tasks needed in object detection. In addition, the host processor typically executes the postprocess in software, since there are many operations, such as exponential and division functions, that are hard to realize in hardware, and the computational complexity of the postprocessing depends dynamically on the number of objects in the input image.

In the architecture, the workload may be unevenly distributed to the blocks. The host processor has a large control overhead, as it not only schedules the whole set of tasks but also handles the PE array, the buffers, and the DMA unit simultaneously. It configures the DMA unit by calculating and sending the base addresses of the source and destination and the number of data transmitted between the buffers and the off-chip memory; it controls the buffers to feed the data into the PE array; and it configures the data flow of the PE array to make it perform the desired workload. The postprocess also increases the workload of the host processor. On the other hand, the PE array formats the input data before processing them, which prevents the parallel execution of the data formatting and the processing, imposing a high workload on the PE array. This imbalanced workload can significantly degrade the throughput of object detection.

III. PROPOSED ARCHITECTURE

This section first describes the overall architecture of the proposed network computing system, which is composed of a host processor, an NPU, and a TCU. Then, the data path of the NPU is elaborated in detail, and some techniques to optimize the NN-process are explained.

A. Overall Architecture

The architecture of the proposed object detector is shown in Fig. 5. The host processor, which consists of a 32-bit RISC processor, a main memory, and a shared memory, not only controls the entire system through the control channel but also performs the postprocess. The shared memory is connected to the NPU through the high-bandwidth data channel, which is used to deliver the data to be postprocessed in the host processor. The TCU manages the NPU in place of the host processor; it generates fine-grained commands by interpreting coarse-grained commands coming from the host processor and handles the interrupts of the NPU.

To accelerate the NN-process, the NPU consists of a BP unit, an SA unit, and a DFF unit. The BP unit computes conventional 3-D convolution as well as depthwise convolution, whereas the SA unit handles pointwise convolution. Both units


Fig. 6. Structure of the NPU.

include PE arrays and several data buffers to reuse the data and access them rapidly. The PE arrays in the BP and SA units are optimized for the DCL and the PCL, respectively, and the activation function is specially implemented in hardware. The data buffers are connected to the DFF unit through the high-bandwidth data channel. The DFF unit rearranges the data fetched from the off-chip memory into a form suitable for the BP and SA units, and vice versa.

The high-bandwidth data channel is realized separately from the control channel to secure the large data bandwidth required in the system. The inference process necessitates enormous amounts of feature maps and parameters to be stored in and loaded from the off-chip memory, and the final result of the NPU moves to the host processor, being postprocessed there. The high-bandwidth data channels supporting the direct connections among the off-chip memory, the NPU, and the host processor play a significant role in enhancing the data transmission rate drastically.

B. Neural Processing Unit

The structure of the NPU is presented in Fig. 6. In the BP and SA units, input buffers (IBs), weight buffers (WBs), and bias buffers (BBs) are used to store input features, weights, and biases, respectively, and feed the data to the BP and SA blocks. The BP block generates a 2-D output feature map by performing 3 × 3 convolution for a 2-D input feature map, and the SA block generates a 2-D output feature map by conducting scalar-matrix multiplication and matrix addition for a set of 2-D input feature maps. Passing through the activation hardware, the outputs of the BP and SA blocks are stored in the output buffers (OBs). Note that every buffer in the NPU is in fact realized with a pair of registers in order to make it possible to conduct data processing and data fetching concurrently. The switch under a buffer controls which register is connected to the BP and SA block. The multiplexer located after the BB is to handle a CL in the BP unit and a residual block in the SA unit. The 2-to-1 multiplexer in the BP unit is to choose either a bias or the output feature map, and the 3-to-1 multiplexer in the SA unit is to select one among a bias, the output feature map, and the sum of them. The network model of SSDLite has a residual connection in a PCL, which is realized in the SA unit by making the 3-to-1 multiplexer choose the sum of the bias and the output feature.

Fig. 7. Structure of the BP block.

Fig. 7 shows the pipelined structure of the BP block, which is optimized for the processing of the DCL. There are three PE arrays in the block, each of which performs 1-D convolution. Accumulating the 1-D convolution results of the three PE arrays leads to a row of the output feature map of the 3 × 3 convolution. Each PE in a PE array computes the dot product of three inputs and three weights and adds an extra input. The extra input is the output of the 2-to-1 multiplexer in Fig. 6 in the case of PE array 0 and the 1-D convolution result of the previous PE array in the case of PE arrays 1 and 2. One row of the 3 × 3 kernel stays in a PE array and is broadcasted to all the PEs in the PE array, whereas one row of the input feature map is loaded into all three PE arrays simultaneously. A new row of the input feature map is fed into the PE arrays every cycle while the kernel stays, and a row of the output feature map is generated in PE array 2 with a latency of two cycles. In consequence, the BP block generates an output feature map of (n − 2) × T in T + 2 cycles. As the generated output feature map looks like a vertical band of width n − 2, the block is called a band-processing block. An example of generating a 1-D output feature map is also presented in Fig. 7. The first row of the input feature map is convolved at cycle t with the first row of the kernel in PE array 0, and the result is forwarded to PE array 1. The second input row is processed at cycle t + 1 in PE array 1, and the convolution result is added to the forwarded one. To generate a row of the output feature map, the third input row is convolved at cycle t + 2 in PE array 2, and the convolution result is accumulated with the result forwarded from PE array 1. The BP block can support larger kernels by using the filter decomposition technique described in [46].
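As a sanity check of the dataflow above, the following sketch computes one output row of a 3 × 3 convolution as three accumulated 1-D convolutions, mirroring the three PE arrays; it is a behavioral model under assumed sizes, not the RTL.

```python
import numpy as np

def conv1d_row(row, kernel_row):
    """One PE array: every PE takes three inputs and three weights."""
    n = len(row) - 2
    return np.array([row[i:i + 3] @ kernel_row for i in range(n)])

def bp_block_row(rows3, kernel3x3, bias=0.0):
    """A row of the 3x3-convolution output: PE array 0 starts from the bias
    (the 2-to-1 multiplexer), and arrays 1 and 2 add their 1-D results to
    the value forwarded from the previous array, one stage per cycle."""
    acc = bias
    for r in range(3):
        acc = acc + conv1d_row(rows3[r], kernel3x3[r])
    return acc

rows = np.arange(15.0).reshape(3, 5)       # three consecutive input rows
out = bp_block_row(rows, np.ones((3, 3)))  # one output row of width 5 - 2
assert out.shape == (3,)
```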


Fig. 8. Structure of the SA block.

Fig. 8 shows the pipelined structure of the SA block performing scalar-matrix multiplication and matrix addition, which is optimized for the PCL processing. There are multiple PE arrays in the block, each of which performs 1-D scaling and accumulation. In a cycle, the PE array scales the input vector by the weight and accumulates the result to the data coming from either the pipeline register or the output of the 3-to-1 multiplexer in Fig. 6. The parallelism required in a pointwise convolution is in general so large that multiple PE arrays are integrated in the SA block to increase the throughput. As a channel of the 3-D input feature map is scaled by the same weight, the weight is shared with the PE arrays, but a different row of the input feature map is loaded into a PE array. As a result, the SA block takes T cycles in processing an n × k × T input feature map with k PE arrays containing n multipliers each. An example is also shown in Fig. 8. At cycle t, the first row of the input feature map is processed with weight w0 in PE array 0, and the second row with the same weight in PE array 1 at the same cycle. At the next cycle, a new channel of the input feature map and the new weight corresponding to the channel are loaded into the PE arrays, and the processing results are added to the previous results to compute the pointwise convolution incrementally. In addition, the SA block can be used to process fully connected layers. In that case, a vector-matrix multiplication is realized by multiplying one element of the vector by a row of the matrix and accumulating the results of such multiplications.
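A behavioral sketch of this scale-and-accumulate flow is given below; the channel-per-cycle loop stands in for the pipelined PE arrays, and the shapes are assumptions.

```python
import numpy as np

def sa_block(feat, weights, init):
    """Pointwise convolution for one output channel in the SA-block style:
    feat is d x k x T (d input channels, k rows handled by k PE arrays,
    T columns), weights holds one scalar per input channel, and init is
    what the 3-to-1 multiplexer selects (bias, previous output, or sum)."""
    acc = np.array(init, dtype=float)
    for z in range(feat.shape[0]):       # one channel (and weight) per cycle
        acc += weights[z] * feat[z]      # k PE arrays scale rows in parallel
    return acc

x = np.random.rand(8, 4, 16)             # 8 channels of a 4 x 16 band
w = np.random.rand(8)
y = sa_block(x, w, init=np.zeros((4, 16)))
assert np.allclose(y, np.tensordot(w, x, axes=1))
```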
In the BP unit, an input feature map is reused k × k times, where k is the height and width of the kernel, since it is broadcasted to all the PEs that have the weight to be convolved with the features, while the weights are reused h × w times, where h and w are the height and width of the feature map, respectively, since the weights stay in the BP block until the corresponding feature map is completed. In this case, both the kernel and the feature map are reused maximally. In the SA unit, the feature map is reused c times, where c is the number of output channels, since the feature map is loaded into the IB and reused until it is convolved with all the related weights. In this case, the feature map is reused as many times as possible. The reason for maximizing the reuse of the feature map is that the feature map is usually larger than the kernel in PCLs. Due to the efficient BP and SA units, therefore, the proposed architecture reduces a large amount of memory accesses.
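For a sense of scale, the reuse factors translate into off-chip traffic as follows; the sizes below are illustrative assumptions, not taken from the implementation.

```python
# Illustrative reuse arithmetic for one DCL channel (sizes are assumptions).
k, h, w = 3, 32, 64
feature_reuse = k * k     # each input feature contributes to up to 9 outputs
weight_reuse = h * w      # each weight stays in the BP block for the band
# Refetching weights per output would cost h*w*k*k words per channel; holding
# the kernel in the PE arrays costs only k*k words, an h*w-fold reduction.
print(feature_reuse, weight_reuse, (h * w * k * k) // (k * k))  # 9 2048 2048
```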
Fig. 9. (a) Structure of the DFF unit. (b) Role of the data formatter.

The structure of the DFF unit is shown in Fig. 9(a). The controller is a state machine designed to handle the other blocks simultaneously. To load data from the off-chip memory, the data fetcher generates access requests to the off-chip memory by using the information on the address to access, which is stored in the address register, and on the size of the data, which is kept in the counter. The data formatter rearranges the data coming from the data fetcher or the NPU based on the information stored in the configuration register, the data buffer, and the data combiner. The role of the data formatter is exemplified in Fig. 9(b). To load the input features i10, ..., i14, which are placed in two rows of the off-chip memory, the data formatter first reads the two rows into two data buffers, Input0 and Input1, and then makes the desired output by combining the two data buffers as indicated by the configuration register. The configuration register contains the number of input vectors and the indices of the head and tail. When storing the output feature of a layer into the off-chip memory, the data formatter reduces the data in the IB according to the stride information stored in the configuration register.
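The head/tail combination can be pictured with a few lines of Python; the row width and index values below are assumptions chosen to reproduce the i10–i14 example.

```python
def format_input(row0, row1, head, count):
    """Combine two fetched memory rows and cut out 'count' features starting
    at index 'head' of the concatenation, as the data formatter does."""
    combined = list(row0) + list(row1)
    return combined[head:head + count]

mem_row0 = [f"i{k}" for k in range(8)]       # i0 .. i7 (assumed row width 8)
mem_row1 = [f"i{k}" for k in range(8, 16)]   # i8 .. i15
print(format_input(mem_row0, mem_row1, head=10, count=5))
# ['i10', 'i11', 'i12', 'i13', 'i14']
```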
C. Task Control Unit

The TCU is designed to maximize the hardware utilization of the NPU by managing the internal DFF, BP, and SA units and to reduce the control workload of the host processor by generating fine-grained commands from the coarse-grained host commands.

Fig. 10(a) shows the structure of the TCU and the coarse- and fine-grained commands. The coarse-grained command is stored in the coarse-grained command queue, which enables the host processor to continue the remaining NN-process without waiting for the end of the command. The task manager fetches one coarse-grained command at a time from the queue and generates fine-grained commands to be issued to the corresponding internal units by taking into account the processing sizes of the BP and SA units and the size of the input feature map to be processed. Each fine-grained command is stored into one of the three fine-grained queues, waiting for the corresponding unit to be ready to execute it.


Fig. 11. Workload analysis of the baseline system. (a) Workload flow among the units. (b) Breakdowns of workload in the Stratix IV and Arria 10 FPGA implementations.

Fig. 10. (a) Structure of the TCU and commands and (b) examples showing the relationship between the coarse-grained and the fine-grained commands for processing a DCL and a PCL.

The NPU handler checks the status of the internal units and notifies it to the command scheduler. The status is used in the command scheduler to determine when to issue the fine-grained commands to the internal units. As soon as an internal unit is ready to accept a new command, the command scheduler fetches a command from the corresponding queue and issues it to the internal unit.

The coarse-grained command consists of an operation to be performed, the base addresses of features and parameters, and the layer configuration, such as the sizes of features and kernels and the type of activation. There are three coarse-grained commands: conventional convolution, depthwise convolution, and pointwise convolution. The fine-grained command configures the registers in an internal unit so as to perform the operation of a coarse-grained command. Each BP unit has two registers to control the operation to be performed in the unit, and each SA unit has such registers, too. The registers contain the status of the unit, the addresses of the buffers to access, and the information to configure the switches, multiplexers, and activation function. The DFF unit has eight registers to store the status of the unit, the base address of the off-chip memory to be accessed, the configuration of the layer to be processed, and the data format.

Fig. 10(b) shows how a DCL or a PCL is processed with the coarse- and fine-grained commands. Let a coarse-grained command indicate a 2-D convolution for a DCL. In this case, fine-grained commands are generated to make the BP unit perform a part of the 2-D convolution. The maximum size of the part to be processed in the BP unit is predetermined at design time. The BP unit mentioned earlier can deal with an n × (T + 2) input feature map at maximum. Let us consider another coarse-grained command that performs a 3-D convolution for a PCL. In this case, fine-grained commands are generated for the SA unit to perform a part of the 3-D convolution. The SA unit can handle an n × k × T input feature map at maximum at a time. When the size of the input feature map is larger than the maximum size, the BP or SA unit is invoked multiple times in raster scan order to realize a coarse-grained command, as shown in Fig. 10(b).
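The decomposition in Fig. 10(b) is essentially a tiling loop; the sketch below shows one plausible form of it, with the maximum tile sizes passed in as parameters (the field names are invented for illustration).

```python
def decompose(height, width, max_h, max_w):
    """Split a coarse-grained command over a height x width feature map into
    fine-grained commands, each within the unit's maximum tile size, issued
    in raster scan order."""
    for top in range(0, height, max_h):
        for left in range(0, width, max_w):
            yield {"top": top, "left": left,
                   "rows": min(max_h, height - top),
                   "cols": min(max_w, width - left)}

# A BP unit limited to n x (T + 2) tiles sweeps a larger map tile by tile.
commands = list(decompose(height=128, width=160, max_h=32, max_w=66))
```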
IV. SYSTEM OPTIMIZATION TECHNIQUES

This section describes workload optimizations enabled by utilizing the TCU and the heterogeneous units in the NPU and then explains algorithm optimization methods to enhance the throughput.

A. Workload Optimization

Fig. 11(a) shows how the conventional object detection system works, where the internal units of the NPU are directly controlled by the host processor. The host processor controls the overall flow and generates the commands to be performed in the DFF, BP, and SA units. In addition, it prepares several interrupt service routines (ISRs) to cope with the interrupts requested from the NPU and to communicate with the DFF, BP, and SA units. The heterogeneous units generate an interrupt to notify that the requested operation is completed. More specifically, the DFF unit produces an interrupt when data loading or data storing is completed, and the BP and SA units make an interrupt when the requested processing is done. The host processor also checks whether the NPU is idle and then sends the next command when it is. The large workload of the host processor often makes the NPU idle, impeding the efficient flow and limiting the throughput of object detection. Furthermore, it prevents the host processor from executing the postprocess.

The entire workload conducted in the NPU and the host processor is normalized and shown in Fig. 11(b).


Algorithm 2 Refined Postprocess

Fig. 12. Proposed system enhanced by the TCU. (a) Workload flow in the proposed system and (b) resulting throughput improvement in the Stratix IV and Arria 10 implementations.

The two implementations have similar breakdowns of workload. The NPU takes about half of the whole workload, including around 20% for the DFF unit and 30% for the BP and SA units. Due to the separate heterogeneous architecture employed in the NPU, the workload seems to be properly balanced. On the other hand, the host processor takes the other half of the total workload, consisting of around 15% in controlling the flow, 10% in processing the ISRs, 15% in generating DFF commands, and 8% in producing BP and SA commands. The largest workload of the host processor is to generate DFF commands or control the flow, whereas the smallest one is to make BP and SA commands. The ISR workload also accounts for a large portion of the whole workload. In the Arria 10 FPGA implementation, the portions of the NPU and the NPU command workloads decrease compared to the Stratix IV FPGA implementation, as the number of SA blocks and the size of the buffers in the BP and SA units are doubled. Meanwhile, decreasing these workloads increases the portions of the flow control and the ISR workloads.

Fig. 12(a) shows a new flow of workload, which is improved by employing the proposed TCU. The command to the NPU, which has induced a big burden on the host processor, is replaced with the command to the TCU. As described in Section III, the command for the TCU is much coarser-grained than before, relaxing the burden on the host processor significantly. In addition, the ISR processing is moved to the NPU handler in the TCU. The NPU handler is a hard-wired circuit that checks the status of the DFF, BP, and SA units in place of the host processor, relaxing the communication workload with the NPU.
The command queues in the TCU enable the host processor and the task manager to generate commands continuously even if the preceding command is not completed. For example, the host processor continues to generate the second command immediately after it pushes the first command into the queue, while the task manager is processing the first command of the host processor to make fine-grained commands to be executed in the internal units of the NPU. Similarly, the task manager operates without waiting for the interrupts of the NPU. The perpetual operation increases the utilization of the NPU, enhancing the system throughput accordingly. Each queue interrupts the host processor or the task manager only when the command queue is full, and the TCU interrupts the host processor only to make it start the postprocess.
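This decoupling is the classic bounded-queue producer–consumer pattern; a minimal Python analogy is sketched below, with the queue depth of 16 taken from Section V and everything else illustrative.

```python
import queue
import threading

coarse_q = queue.Queue(maxsize=16)   # host -> task manager command queue

def host_processor(num_layers):
    """Producer: keeps issuing coarse-grained commands; it blocks (the
    'queue full' interrupt in hardware) only when the queue is full."""
    for layer in range(num_layers):
        coarse_q.put({"op": "depthwise_conv", "layer": layer})

def task_manager(num_layers):
    """Consumer: expands each coarse-grained command without making the
    producer wait for its completion."""
    for _ in range(num_layers):
        cmd = coarse_q.get()         # fine-grained decomposition would follow
        coarse_q.task_done()

threading.Thread(target=host_processor, args=(10,)).start()
threading.Thread(target=task_manager, args=(10,)).start()
coarse_q.join()                      # all commands consumed
```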
The TCU plays a significant role in achieving a balanced system by relaxing the workload of the host processor and making the DFF, BP, and SA units work in parallel. The performance of a parallel system is in general restricted by


the unit associated with the largest processing time. In the proposed heterogeneous NPU, the DFF unit performs data formatting as well as data fetching in parallel with the BP and SA units, which enhances the overall throughput. As a consequence, the TCU improves the throughput of the Stratix IV and Arria 10 FPGA implementations by about 1.75× and 1.92×, respectively, compared to the baseline system, as shown in Fig. 12(b). The throughput increment of the Arria 10 FPGA implementation is higher than that of the Stratix IV FPGA implementation, since the TCU of the former implementation relaxes the larger part of the entire workload.

Fig. 13. Processing flows of (a) baseline SSDLite and (b) SSDLite with optimization.

B. Algorithm Optimization
The processing flow of SSDLite is shown in Fig. 13, where the gray boxes are processed in software on the host processor and the others are processed in hardware. In the baseline, the postprocess stalls until the NN-process is completed, as shown in Fig. 13(a), since the host processor is busy managing the NN-process and the baseline postprocess waits for all N proposals to be generated, as described in Algorithm 1. The N proposals consist of six groups that are generated by the six convolution predictors. The number of proposals in a group decreases from 2166 to 6, decreasing by a factor of about one-fourth from one group to the next. The postprocess spends most of its time in processing the earlier groups generating a larger number of proposals. In particular, the number of proposals produced in the first group is more than two-thirds of the entire proposals.

We can make the proposals available in the middle of the NN-process by executing the convolution predictors immediately after the corresponding input feature map is created. Specifically, the 40th, 52nd, 55th, 58th, 61st, and 64th layers in the feature extractor generate the feature maps for the convolution predictors. However, the baseline postprocess is not able to process each group separately. By modifying the baseline postprocess to process the proposals as soon as they are generated, therefore, the overall throughput is enhanced, as shown in Fig. 13(b).

The proposed refined postprocess decomposes the baseline postprocess into six separate postprocesses, and each postprocess is in charge of handling a single proposal group generated by a convolution predictor. The whole postprocess is conducted incrementally in the refined postprocess. The detection results of the ith postprocess are included in the nonmaximum suppression process of the (i + 1)th postprocess. The final detections are determined at the end of the last postprocess and are the same as the results of the baseline postprocess. The details of the refined postprocess are described in Algorithm 2, where L and N_l denote the number of decomposed postprocesses and the number of proposals produced in the lth group, respectively.
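Since the box of Algorithm 2 did not survive this extraction, the following sketch shows the incremental scheme it describes, reusing the iou helper from the earlier postprocess sketch; carrying the survivors of step l into the NMS of step l + 1 is what makes the result match the baseline.

```python
def refined_postprocess(groups, th_conf, th_iou):
    """Process each of the L proposal groups as soon as it is produced.
    groups yields lists of proposals, each a dict with 'box' and 'score'
    (a single class is assumed to keep the sketch short)."""
    survivors = []                       # detections kept after step l
    for proposals in groups:             # group l arrives mid NN-process
        cand = [p for p in proposals if p["score"] > th_conf]
        pool = sorted(survivors + cand, key=lambda p: p["score"], reverse=True)
        survivors = []
        for p in pool:                   # NMS over old survivors + new group
            if all(iou(p["box"], q["box"]) < th_iou for q in survivors):
                survivors.append(p)
    return survivors                     # same detections as the baseline
```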
Fig. 14. Feature and parameter distributions. (a) Feature distribution and (b) parameter distribution obtained with and without the BNF technique.

To quantize the baseline SSDLite, we first analyze the distribution of the model parameters and that of the feature maps. Fig. 14(a) shows the mean of the top-100 maximum and minimum absolute values of the feature maps obtained from the VOC 2007 test set, and Fig. 14(b) shows the maximum and minimum absolute values of the model parameters. The ranges of the feature maps do not fluctuate largely across the layers, so the feature maps can be quantized to a single fxp representation for all the layers.

On the other hand, the range of the model parameters varies largely across the layers, so the parameter representation requires a large number of bits in an fxp representation or takes various formats to make the quantization error insignificant. To alleviate the range fluctuation, we apply the batch normalization folding (BNF) technique [43], which applies the batch normalization effect to the preceding CL to reduce the complexity of the batch normalization.


The distribution of the new model parameters indicates that the range fluctuation can be reduced by modifying the parameters of a convolutional layer in consideration of the following batch normalization, which allows a lower precision in quantization. In the FPGA implementation of SSDLite, therefore, the BNF technique plays a critical role in reducing the number of bits required to represent data and in simplifying the hardware operators.
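BNF itself is a short algebraic rewrite: BN(Wx + b) = (sW)x + s(b − μ) + β with s = γ/√(σ² + ε), so the BN stage disappears at inference time. A minimal sketch, assuming a convolution weight layout of (out_ch, in_ch, kh, kw), is shown below.

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a batch normalization into the preceding convolution:
    BN(conv(x)) == conv'(x) with the rescaled weights and bias returned."""
    scale = gamma / np.sqrt(var + eps)          # one factor per out channel
    w_folded = w * scale.reshape(-1, 1, 1, 1)   # w: (out_ch, in_ch, kh, kw)
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded
```

Because the folded parameters absorb the per-layer BN scales, their ranges fluctuate less across layers, which is the effect shown in Fig. 14(b).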
V. FPGA IMPLEMENTATION AND EVALUATIONS

This section describes the proposed computing system implemented on FPGA boards and analyzes its effectiveness. To justify the superiority of the proposed architecture, in addition, the implementation is compared with the previous state-of-the-art works.

A. FPGA Implementation

Fig. 15. Prototype FPGA implementation. (a) Block diagram of the prototype. (b) Photograph of the FPGA board.

We have described the proposed architecture in Verilog HDL and compiled it on FPGAs using Intel Quartus Prime. A prototype object detector implemented on an FPGA board is shown in Fig. 15. The overall block diagram is shown in Fig. 15(a), which includes the proposed architecture as well as additional controllers designed for external devices such as the DDR2 DRAM, SRAM, NAND flash memory, LCD display module, and camera module. The flash, display, and camera controllers are connected to the control channel through which the host processor manages them. The flash memory stores the network parameters to be forwarded to the DRAM when booting the object detector.

Each image frame captured by the camera module is stored in the external SRAM. As the stored image frame has only one color component per pixel, the image transformation unit recovers three color components of red, green, and blue (RGB) per pixel by carrying out an interpolation process called demosaicking. The display controller presents the input image stored in the external SRAM and depicts the bounding boxes for the objects detected by the postprocess running on the host processor. The DDR2 DRAM supports a maximum bandwidth of 6.4 GB/s, and its controller has been implemented by employing a 33-kB buffer on the FPGA. The host processor, designed based on the ARM architecture, has a main memory of 32 kB and a shared memory of 16 kB, and the high-bandwidth data channel provides 256 bits at a time. Both the host processor and the data channel operate at the system clock frequency.

The TCU has a queue to store the host commands and three queues to store the TCU commands separately; the former queue can hold up to 16 commands, and the latter queues hold 64 commands in total. The BP unit has a number of buffers, each of which is configured differently considering its role: an IB of 9 kB, a WB of 5 kB, a BB of 1 kB, and an OB of 8 kB. In addition, the BP unit contains one BP block that computes a 2-D output feature map of size 32 × 64. The BP block has three PE arrays, and each PE array is constructed with 32 PEs. It takes 66 cycles to calculate a 2-D output feature map. The SA unit also has several buffers that are sized differently from the BP unit: an IB of 12 kB, a WB of 65 kB, a BB of 1 kB, and an OB of 98 kB. As the computational complexity of a PCL is in general much larger than that of a DCL, in addition, it includes eight SA blocks, each of which is in charge of processing a 16 × 6 × 64 input feature map. In other words, the SA block is designed with six PE arrays that deal with vectors of length 16 and takes 64 cycles to process a 3-D feature map of size 16 × 6 × 64. The NPU is implemented on the FPGA with memories of about 398 kB in total, and the object detector is realized with memories of around 569 kB.

A Terasic DE4 board with an Intel Stratix IV FPGA has been used to implement the object detector, and the implementation characteristics are summarized in Table I. Fig. 15(b) shows a photograph of the FPGA board combined with the camera module, which captures an input image frame, and the LCD display module, which displays the input image frame as well as the bounding boxes of detected objects. In addition, a Terasic Han Pilot Platform with an Intel Arria 10 FPGA has been used to evaluate the proposed architecture on a recent FPGA. In the Arria 10 FPGA implementation, the number of SA blocks and the size of buffers in the BP and SA units are doubled, but the other configurations are unchanged compared to the Intel Stratix IV FPGA implementation. The implementation characteristics are summarized in Table I.

In the FPGA implementation, the DFF unit operates in parallel with either the BP unit or the SA unit. The throughput may increase somewhat by overlapping the BP and SA units. Running the two units simultaneously, however, is complicated due to the limitation of off-chip memory bandwidth and the control overhead. In addition, there is no need to run the two units in parallel, as the throughput required for the real-time applications has already been achieved without overlapping them.

The structure of the DSP block in the Intel Stratix IV FPGA is shown in Fig. 16 [47]. The DSP block is composed of four multipliers and a two-stage adder to accumulate the


multiplication results as well as the previous result in the output register. The PE structure employed in the BP block is designed to be similar to the DSP block, so the BP block is well suited for the FPGA. The number of DSP blocks used for the BP block is 96, which is as many as the number of PEs in the BP block.

Fig. 16. Structure of a DSP block in the Intel Stratix IV FPGA.

TABLE I
CHARACTERISTICS OF FPGAS USED IN PROTOTYPE IMPLEMENTATIONS

TABLE II
PRECISION AND DETECTION ACCURACY ACHIEVED WHEN TRAINED WITH VOC07 + 12 + COCO
B. Evaluations

This section evaluates the FPGA implementations. We demonstrate how effective the proposed computing system and the optimization techniques are in enhancing the throughput by comparing the prototype implementations with the state-of-the-art FPGA implementations.

The detection accuracies of the baseline SSDLite and the SSDLite with the BNF are shown in Table II, where the detector was trained with the VOC 2007, VOC 2012, and COCO data sets [37]–[39] at first and not retrained after quantization. If the IEEE 754 32-bit floating-point number system is used to represent features and parameters, both of them result in almost the same accuracy. However, if we take the fxp representation and reduce the bit-length of the fxp representation, the resulting accuracies become different from each other. In the fxp representation, f and p stand for the formats used to represent the feature and the parameter, respectively, and (i, f) denotes the bit-lengths of the integer part and the fractional part, respectively. The floating-point and the fxp representations show similar accuracies for the 32- and 24-bit fxp representations, as a representation of 24 bits or more has a range large enough to cover the data range of features and parameters. Let us focus on the 16-bit fxp representation. The baseline employed different fxp formats in representing the feature and the parameter to make the accuracy better, but the detection accuracy was decreased by about 9 mAP, where mAP is a measure of detection accuracy described in [40]. Applying the BNF enables us to use the same format in representing the feature and the parameter while maintaining the detection accuracy. Moreover, the baseline did not work with the 10-bit fxp representation, but applying the BNF leads to an accuracy of 61 mAP. We adopted the SSDLite that applies the BNF and realized it with the 16-bit fxp representation to maintain the accuracy.
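For reference, an (i, f) fxp quantizer can be written in a few lines; the rounding and saturation policy below is a common choice and an assumption, as the paper does not spell it out.

```python
def to_fxp(x, i_bits, f_bits):
    """Quantize x to a signed fixed-point value with i_bits integer bits and
    f_bits fractional bits: round to a step of 2**-f_bits, then saturate."""
    step = 2.0 ** -f_bits
    lo = -(2.0 ** i_bits)
    hi = 2.0 ** i_bits - step
    return min(hi, max(lo, round(x / step) * step))

# A 16-bit format such as (7, 8) covers [-128, 128) in steps of 1/256.
assert to_fxp(3.14159, 7, 8) == 804 / 256
```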
The layer-by-layer time decomposition obtained by processing SSDLite on the Intel Stratix IV FPGA board is shown in Fig. 17, where the processing time for the CL or DCL, which is computed in the BP unit, is colored in light gray, and that for the PCL, computed in the SA unit, is represented in dark gray. In processing a DCL, the BP unit needs large memory accesses to load input features and store output features, so the on-chip memory is actively used to provide a sufficient bandwidth. In addition, the IB and OB are implemented separately to increase the bandwidth and are double-buffered to access the off-chip memory in parallel with the processing. As the off-chip memory bandwidth is insufficient to meet the required bandwidth, the processing time of a DCL mainly depends on the off-chip memory bandwidth. In Fig. 17, the DCL processing time in layer 2 is larger than that in layer 3, though layer 3 has the larger computational complexity.

The processing time of a CL or a PCL, on the other hand, depends on the computation parallelism provided in the BP and SA units, since most of the data are stored in the on-chip memory and reused without accessing the off-chip memory. The input features in the IB are reused to produce multiple output feature maps, and the partial sums are stored and accumulated in the OB to make the final output features.

The implementation characteristics obtained from this work and the previous works are summarized in Table III. Almost no drop of accuracy is observed in the proposed implementation without conducting any retraining process, whereas most of the others have experienced lower detection accuracy even with the retraining process. On the Intel Stratix IV FPGA, the baseline architecture has achieved a frame rate of 14.2 frames/s, whereas the proposed architecture enhances it to 24.9 frames/s at the cost of a negligible hardware overhead.


Fig. 17. Layer-by-layer decomposition of SSDLite processing time on Intel Stratix IV FPGA.

TABLE III
FPGA IMPLEMENTATIONS OF DNN-BASED OBJECT DETECTION

The numbers of DSP blocks used in the baseline architecture and the proposed implementation are the same, since the number of PEs in the BP and SA units is unchanged, whereas the logic complexity and the block RAM utilization are slightly increased by employing the TCU in the proposed architecture.

On the Intel Arria 10 FPGA, the proposed architecture has achieved a frame rate of 84.8 frames/s, which is higher than those obtained from the other FPGA implementations except for [1], [3], and [5]. As [1], [3], and [5] compress the network model excessively to improve the throughput, however, their detection accuracies are much lower than that of the proposed architecture. For a fair comparison, the frame rate is normalized by considering the area taken in implementation. The normalized frame rate achieved in our implementation on the Intel Arria 10 is 13.6× and 11.6× higher than the Stratix 10 and Arria 10 implementations in [4], respectively. In short, the proposed architecture achieves a higher detection accuracy without conducting the retraining process and a higher throughput than all but the overly compressed network models suffering from accuracy degradation. For real-time applications, the latency is as important as the throughput. To evaluate the latency, the processing time taken for a batch size of one is summarized in Table III. The proposed architecture takes 11.79 ms on the Intel Arria 10 FPGA, which is lower than all the other FPGA implementations.

The proposed object detectors use fewer hardware resources than the previous works in Table III. The proposed BP and SA units are optimized for the DCL and the PCL, as SSDLite is mainly composed of DCLs and PCLs to reduce the size of model parameters and the complexity of computation. The power consumption has been estimated using a power analyzer tool provided by Intel.


VI. CONCLUSION

This article has proposed a novel hardware architecture and system optimization techniques that are effective in realizing real-time DNN-based object detection. In the proposed architecture, an NPU consisting of heterogeneous BP, SA, and DFF units was devised to accelerate the neural network process efficiently. The BP and SA units were optimized to process the DCL and the PCL, effectively reducing memory accesses. The DFF unit arranges the data into a form suitable for the BP and SA units, removing the data formatting latency. For system optimization, a TCU was developed to relax the excessive workload of the host processor and increase the utilization of the heterogeneous units in the NPU. In addition, the detection algorithm was optimized to remove the latency of the postprocess and to quantize the feature and parameter representations. Two prototype object detectors, implemented on the Intel Stratix IV and Intel Arria 10 FPGAs, revealed that the proposed system achieves higher throughput, lower latency, and higher energy efficiency at high detection accuracy than the previous state-of-the-art works.

The proposed optimization techniques are expected to be applicable to other object detectors to improve the throughput while maintaining the accuracy, as most object detectors have not considered imbalanced workloads and most DNN-based object detection algorithms contain the postprocess and the batch normalization. For example, the refined postprocess can be utilized in [21]–[26] and [50]–[52] to process the NN-process and the postprocess in parallel, as sketched below.
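The following sketch illustrates the idea of overlapping the two stages (our code; `nn_process` and `postprocess` are placeholder callables for the accelerator pass and the host-side box decoding/NMS, not APIs from the paper, and real overlap presumes the NN-process blocks on the FPGA and thus frees the host):

```python
from concurrent.futures import ThreadPoolExecutor

def detect_stream(frames, nn_process, postprocess):
    """Overlap frame t's NN-process with frame t-1's postprocess."""
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as host_worker:
        for frame in frames:
            feats = nn_process(frame)             # NPU works on frame t while
            if pending is not None:               # the worker thread finishes
                results.append(pending.result())  # frame t-1's postprocess
            pending = host_worker.submit(postprocess, feats)
        if pending is not None:
            results.append(pending.result())      # drain the last frame
    return results
```

A single worker is used so that detection results are emitted in frame order, which matters for video streams.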
In addition, the BNF technique can be applied to [21], [23], [25], and [50]–[52], which include the batch normalization; a sketch of the folding follows.
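Assuming BNF denotes folding the batch normalization of [35] into the preceding convolution for inference (our reading of the abbreviation), the transformation amounts to a per-channel rescaling of the weights and bias, e.g.:

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x; w) + b - mean) / sqrt(var + eps) + beta
    into an equivalent convolution with weights w' and bias b'.

    w: (out_ch, in_ch, kh, kw) weights; the others are per-out-channel vectors.
    """
    scale = gamma / np.sqrt(var + eps)         # per-channel scale factor
    w_folded = w * scale[:, None, None, None]  # rescale each output filter
    b_folded = (b - mean) * scale + beta       # fold the shift into the bias
    return w_folded, b_folded
```

After folding, the batch-normalization layer is removed at inference time, so no per-channel normalization arithmetic remains in the datapath.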
ACKNOWLEDGMENT

The authors would like to thank the IC Design Education Center (IDEC), South Korea, for supporting the EDA tool.

REFERENCES

[1] D. T. Nguyen, T. N. Nguyen, H. Kim, and H.-J. Lee, “A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1861–1873, Aug. 2019.
[2] H. Nakahara, H. Yonekawa, T. Fujii, and S. Sato, “A lightweight YOLOv2: A binarized CNN with a parallel support vector regression for an FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 31–40.
[3] T. B. Preuser, G. Gambardella, N. Fraser, and M. Blott, “Inference of quantized neural networks on heterogeneous all-programmable devices,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 833–838.
[4] Y. Ma, T. Zheng, Y. Cao, S. Vrudhula, and J.-S. Seo, “Algorithm-hardware co-design of single shot detector for fast object detection on FPGAs,” in Proc. Int. Conf. Comput.-Aided Design, Nov. 2018, pp. 1–8.
[5] S. Fang et al., “Real-time object detection and semantic segmentation hardware system with deep learning networks,” in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2018, pp. 389–392.
[6] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, “Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs,” in Proc. 54th Annu. Design Autom. Conf., Jun. 2017, pp. 1–6.
[7] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2015, pp. 161–170.
[8] A. Shawahna, S. M. Sait, and A. El-Maleh, “FPGA-based accelerators of deep learning networks for learning and classification: A review,” IEEE Access, vol. 7, pp. 7823–7859, Dec. 2019.
[9] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[10] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[11] J. Jo, S. Cha, D. Rho, and I.-C. Park, “DSIP: A scalable inference accelerator for convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 53, no. 2, pp. 605–618, Feb. 2018.
[12] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015, doi: 10.1038/nature14539.
[13] A. Hannun et al., “Deep speech: Scaling up end-to-end speech recognition,” 2014, arXiv:1412.5567. [Online]. Available: http://arxiv.org/abs/1412.5567
[14] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Vancouver, BC, Canada, May 2013, pp. 6645–6649.
[15] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 3104–3112.
[16] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” Int. J. Robot. Res., vol. 34, nos. 4–5, pp. 705–724, Mar. 2015.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–14.
[19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[20] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017, arXiv:1704.04861. [Online]. Available: http://arxiv.org/abs/1704.04861
[21] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.
[22] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[23] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[25] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[26] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 21–37.
[27] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization, and Huffman coding,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.

Authorized licensed use limited to: University of Prince Edward Island. Downloaded on May 17,2021 at 02:14:22 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

14 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 525–542.
[29] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” 2016, arXiv:1605.04711. [Online]. Available: http://arxiv.org/abs/1605.04711
[30] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” 2016, arXiv:1612.01064. [Online]. Available: http://arxiv.org/abs/1612.01064
[31] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless CNNs with low-precision weights,” 2017, arXiv:1702.03044. [Online]. Available: http://arxiv.org/abs/1702.03044
[32] J. Choi, B. Y. Kong, and I.-C. Park, “Retrain-less weight quantization for multiplier-less convolutional neural networks,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 3, pp. 972–982, Mar. 2020.
[33] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 3123–3131.
[34] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1,” 2016, arXiv:1602.02830. [Online]. Available: http://arxiv.org/abs/1602.02830
[35] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” 2015, arXiv:1502.03167. [Online]. Available: http://arxiv.org/abs/1502.03167
[36] H. Wong, V. Betz, and J. Rose, “Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture,” in Proc. 19th ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2011, pp. 5–14.
[37] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Accessed: Feb. 8, 2019. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[38] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Accessed: Feb. 8, 2019. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
[39] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” 2014, arXiv:1405.0312. [Online]. Available: http://arxiv.org/abs/1405.0312
[40] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. Accessed: Feb. 8, 2019. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf
[41] Object Detection on a FPGA Board. Accessed: Aug. 22, 2019. [Online]. Available: https://www.youtube.com/embed/9lJryP1fU2w
[42] L. Wan, D. Eigen, and R. Fergus, “End-to-end integration of a convolutional network, deformable parts model and non-maximum suppression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 851–859.
[43] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2704–2713.
[44] H. Fan et al., “A real-time object detection accelerator with compressed SSDLite on FPGA,” in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2018, pp. 14–21.
[45] D. Wu et al., “A high-performance CNN processor based on FPGA for MobileNets,” in Proc. 29th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2019, pp. 136–143.
[46] L. Du et al., “A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018.
[47] Stratix IV Device Handbook, Intel Co., Santa Clara, CA, USA, 2016, pp. 81–118.
[48] D. T. Nguyen, H. Kim, and H.-J. Lee, “Layer-specific optimization for mixed data flow with mixed precision in FPGA design for CNN-based object detectors,” 2020, arXiv:2009.01588. [Online]. Available: http://arxiv.org/abs/2009.01588
[49] X. Zhang et al., “DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs,” in Proc. Int. Conf. Comput.-Aided Design, Nov. 2018, pp. 1–8.
[50] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” 2017, arXiv:1708.02002. [Online]. Available: http://arxiv.org/abs/1708.02002
[51] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[52] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and efficient object detection,” 2019, arXiv:1911.09070. [Online]. Available: http://arxiv.org/abs/1911.09070

Suchang Kim (Student Member, IEEE) received the B.S. degree in electrical engineering from the Korea Aerospace University, Goyang, South Korea, in 2016, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2018, where he is currently working toward the Ph.D. degree at the School of Electrical Engineering.
His current research interests include VLSI architectures for neural network accelerators and computer arithmetic.

Seungho Na received the B.S. degree in electronic engineering from Sungkyunkwan University, Suwon, South Korea, in 2017, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2020.
Since 2020, he has been an Engineer with Anapass Inc., Seoul, South Korea. His current research interests include VLSI architectures for neural network accelerators.

Byeong Yong Kong (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2011, 2013, and 2017, respectively.
From 2017 to 2018, he was a Senior Researcher with the Agency for Defense Development, Daejeon, where he was involved in the research of guided missile systems. From 2018 to 2019, he was a Post-Doctoral Researcher and a Research Assistant Professor with KAIST. Since 2019, he has been an Assistant Professor with the Division of Electrical, Electronic, and Control Engineering, Kongju National University, Cheonan, South Korea. His current research interests include algorithms and very-large-scale integration architectures for digital signal processing and wireless communications.
Dr. Kong was a recipient of the First Place Award at the Altera FPGA Design Contest in 2015.

Jaewoong Choi (Student Member, IEEE) received the B.S. degree in electronic engineering from Hanyang University, Seoul, South Korea, in 2018, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2020.
His current research interests include the structure of neural network accelerators and digital signal processing.

In-Cheol Park (Senior Member, IEEE) received the B.S. degree in electronic engineering from Seoul National University, Seoul, South Korea, in 1986, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 1988 and 1992, respectively.
He joined the School of Electrical Engineering, KAIST, as an Assistant Professor in June 1996, where he is currently a Professor. Prior to joining KAIST, he was with the IBM T. J. Watson Research Center, Yorktown, NY, USA, from May 1995 to May 1996, where he researched high-speed circuit design. His current research interests include computer-aided design algorithms for high-level synthesis and very-large-scale integration architectures for general-purpose microprocessors.
Dr. Park received the Best Design Award at ASP-DAC in 1997 and the Best Paper Award at ICCD in 1999.

Authorized licensed use limited to: University of Prince Edward Island. Downloaded on May 17,2021 at 02:14:22 UTC from IEEE Xplore. Restrictions apply.

You might also like