Real-Time SSDLite Object Detection on FPGA
Authorized licensed use limited to: University of Prince Edward Island. Downloaded on May 17,2021 at 02:14:22 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
DNNs have been implemented by utilizing general-purpose graphics processing units (GPGPUs),
field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs).
As the GPGPU has numerous parallel computation cores and high memory bandwidth, it is relatively
easy to achieve high performance. Moreover, well-defined software frameworks, such as Caffe,
TensorFlow, and PyTorch, promote the GPGPU as a powerful tool for realizing DNN-based object
detection [17]–[26]. Since the GPGPU consumes large power, however, it is appropriate only for
cloud servers or workstations that are not constrained in energy consumption. On the other hand,
as the FPGA and the ASIC can achieve high throughput with much less power than the GPGPU, they
are appropriate in energy-constrained devices [1]–[11], [44], [45], [48], [49]. Specifically,
the FPGA, which consists of programmable logic blocks, DSP blocks, and block RAMs, has gained
great attention in accelerating DNNs due to its reconfigurability and low development cost
[1]–[5], [44], [45], [48], [49]. In addition, many DNN quantization techniques [27]–[34], [43],
which have been developed to reduce the amount of data to access and to allow fixed-point (fxp)
arithmetic operations, play a considerable role in enhancing the power efficiency and throughput
of an FPGA implementation.
For the implementation of real-time object detection on FPGA devices, a binarized network model
and its hardware architecture were proposed [1]–[3]. Nguyen et al. [1] quantized the weights of
YOLO to binarized ones and used low-bit feature maps, making it possible to store the entire
network model and the intermediate feature maps into the block RAM. In addition, all
convolutional layers (CLs) were pipelined to achieve high throughput. A lightweight YOLOv2 was
proposed in [2], which uses the binarized network model for feature extraction and the parallel
support vector regression (SVR) for classification and localization. A quantized network model
was also proposed in [3], where the hidden layers are binarized and the input and output layers
are quantized to eight-bit fxp values to ensure the accuracy. To increase the throughput, an
advanced extension called NEON was developed based on the single instruction multiple data (SIMD)
architecture. The binarized network model is effective in achieving high throughput, but it
requires hard retraining processes to restore the accuracy. The binarization of weights often
leads to overquantization that makes it hard to achieve the desired accuracy. Sometimes, a
network model and its hardware architecture were codesigned together [4], [5]. In [4], dilated
convolutions and convolutions associated with a stride of two were replaced with convolutions
with a stride of one in order to share hardware resources, and a dynamic quantization method was
used to preserve the accuracy. Fang et al. [5] presented a heterogeneous architecture named
Aristotle and a full stack of software tools needed for network quantization, pruning, and FPGA
deployment. Although the codesign approach is effective in enhancing the throughput, the
repetitive redesign of the network increases the development cost, and the quantization method
in [4] induces a drop of accuracy even at the expense of additional hardware resources compared
to the simple integer arithmetic unit whose precision is enough to support the dynamic range of
each layer. In addition, the workload required in performing the object detection has not been
fully analyzed, and the architectures did not consider the balance of tasks, so there is still
plenty of room to optimize in implementing a high-throughput object detector on FPGA devices.
In this article, we propose an efficient computing system that involves novel hardware
architecture and system optimization techniques for real-time SSDLite object detection on FPGA
devices. An efficient neural processing unit (NPU) is proposed, which consists of heterogeneous
units such as band processing (BP), scaling and accumulating (SA), and data fetching and
formatting (DFF) units. The BP and SA units are optimized for the depthwise CL (DCL) and the
pointwise CL (PCL), effectively reducing memory accesses. The DFF unit arranges the data into a
form suitable for the BP and SA units and operates in parallel with them, reducing the data
formatting latency. In addition, system optimization techniques are devised to enhance the
throughput further. A task control unit (TCU) is proposed to balance the workload and improve
the utilization of heterogeneous units in the NPU. The detection algorithm is refined to remove
the postprocess latency and quantize the feature and parameter representations. The proposed
object detector is implemented on FPGA boards, and a live demonstration shown in Fig. 1(b) can
be found on YouTube [41], where the camera module captures the image displayed on the tablet and
the LCD display module shows the detection results.
The rest of this article is organized as follows. Section II describes SSDLite in detail and the
fundamental architecture of object detection. The proposed hardware architecture is described in
Section III, and system optimization techniques are explained in Section IV. Section V
summarizes and evaluates the characteristics of FPGA implementations. Finally, concluding
remarks are made in Section VI.
Algorithm 1 Baseline Postprocess [42]
II. BACKGROUND
In this section, we first explain the baseline SSDLite object
detection algorithm [21] and then elaborate on the primitive
operations in the network and the postprocess. Subsequently,
we analyze the previous architectures in the literature.
A. SSDLite Overview
The baseline SSDLite is based on a feedforward convolu-
tional network and trained with the VOC 2007 data set [37].
It produces a collection of bounding boxes and classification
scores of object instances in those boxes and can suggest up
to 3000 bounding boxes and recognize 20 classes, such as
humans, cars, and buses. As shown in Fig. 2, SSDLite consists
of three processes: preprocess, NN-process, and postprocess.
In the preprocess, a raw image is divided into different chan-
nels according to the color in order to make the image conform
to the input format of the NN-process. In the meantime, all
pixel values are normalized to have a predefined mean and
variance.
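
The preprocess described above can be sketched in a few lines of NumPy. This is only an illustration: the 300 × 300 input size and the mean and standard deviation of 127.5 are assumed values, not taken from the article, and the function name `preprocess` is hypothetical.

```python
import numpy as np

# Assumed normalization constants (not from the article).
MEAN = np.array([127.5, 127.5, 127.5], dtype=np.float32)
STD = np.array([127.5, 127.5, 127.5], dtype=np.float32)

def preprocess(image_hwc, mean=MEAN, std=STD):
    """Normalize an H x W x 3 uint8 image and split it into color planes."""
    x = image_hwc.astype(np.float32)
    x = (x - mean) / std                 # broadcast over the channel axis
    return np.transpose(x, (2, 0, 1))    # channel-first layout: 3 x H x W

# Assumed 300 x 300 RGB frame filled with random pixel values.
image = np.random.default_rng(0).integers(0, 256, size=(300, 300, 3), dtype=np.uint8)
planes = preprocess(image)
```

Each of the three planes in `planes` then corresponds to one color channel of the input to the NN-process.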
The NN-process is partitioned into feature extraction and
convolution prediction. In the feature extraction, a deep con-
volutional neural network, which consists of the truncated
MobileNetV2 base network and the additional layers following
the base network, extracts features gradually from the input
image. The basic building block of the feature extractor is an
inverted residual block (IRB) composed of two PCLs and one
DCL between the PCLs. By using six multiscale intermediate
feature maps generated by the feature extractor, which is
shown on top of the feature extractor in Fig. 2, the convolution
predictors predict objects. The basic building block of the
convolution predictors is a DSC consisting of a DCL and a PCL. There are 12 DSCs, grouped into
six pairs. Each pair produces the classification scores and the shape offsets of bounding boxes.
The convolution predictors use six multiscale intermediate feature maps to make 2166, 600, 150,
54, 24, and 6 object proposals.
The postprocess is to obtain final object detections. It adjusts the bounding boxes by using the
shape offsets and performs the nonmaximum suppression to suppress some proposals that are
associated with low classification scores or have high intersections with other bounding boxes.
B. Primitive Operations
As described in Section II-A, most layers in SSDLite are DCLs or PCLs whose operations are shown
in Fig. 3. As 2-D kernels and small-sized 3-D kernels are used in the DCL and PCL, SSDLite is
effective in reducing model parameters and computation complexity compared to other network
models based on the conventional CLs.
In a DCL, a k × k kernel and an h × w input feature map are convolved to form an e × f output
feature map. There are d k × k kernels and d h × w input feature maps, and an e × f × d output
feature map is generated by conducting 2-D convolution for each pair of the kernel and the input
feature map, as shown in Fig. 3(a). The computation in the DCL is defined as
$$o_{xyz} = \sum_{l=0}^{k-1} \sum_{m=0}^{k-1} w_{lmz} \times f_{(sx+l)(sy+m)z} + b_z \tag{1}$$
where w, f, b, and s denote the kernel, the input feature, the bias, and the stride size,
respectively.
In a PCL, an h × w output feature map is generated by convolving an h × w × d input feature map
with a 1 × 1 × d kernel. c 1 × 1 × d kernels are convolved with the same h × w × d input feature
map to produce an h × w × c output feature map, as shown in Fig. 3(b). The computation in the
PCL is defined as
$$o_{xyz} = \sum_{n=0}^{d-1} w_{nz} \times f_{(sx)(sy)n} + b_z. \tag{2}$$
The primitive operation in the DCL is the channelwise 2-D convolution that does not accumulate
in the depth direction and that in the PCL is to scale 2-D input feature maps and accumulate in
the depth direction. Though both primitive operations are a kind of convolution operation, their
data flows are so different that separate accelerators are required to process the layers
efficiently.
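
To make the difference between the two primitive operations concrete, the following NumPy sketch evaluates the DCL and PCL computations of (1) and (2) directly. It is a reference model for illustration only, not the hardware dataflow; the function names, the channel-last array layout, and the absence of padding are assumptions.

```python
import numpy as np

def depthwise_conv(f, w, b, s=1):
    """DCL: f is (h, w, d), w is (k, k, d), b is (d,). No padding is applied."""
    k = w.shape[0]
    rows, cols, d = f.shape
    e = (rows - k) // s + 1
    g = (cols - k) // s + 1
    o = np.zeros((e, g, d))
    for x in range(e):
        for y in range(g):
            window = f[s * x:s * x + k, s * y:s * y + k, :]
            # Sum over l and m only; channels are never mixed.
            o[x, y, :] = np.sum(w * window, axis=(0, 1)) + b
    return o

def pointwise_conv(f, w, b, s=1):
    """PCL: f is (h, w, d), w is (d, c), b is (c,)."""
    # Each output channel is a weighted sum over the depth direction.
    return f[::s, ::s, :] @ w + b

f = np.ones((4, 4, 2))
dcl_out = depthwise_conv(f, np.ones((3, 3, 2)), np.array([1.0, 2.0]))
pcl_out = pointwise_conv(f, np.ones((2, 3)), np.zeros(3))
```

In this sketch the DCL reduces over the spatial kernel window only, while the PCL reduces over the depth direction only, which is the difference in data flow noted above.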
include PE arrays and several data buffers to reuse the data and access them rapidly. The PE
arrays in the BP and SA units are optimized for the DCL and the PCL, respectively, and the
activation function is specially implemented in hardware. The data buffers are connected to the
DFF unit through the high-bandwidth data channel. The DFF unit rearranges the data fetched from
the off-chip memory into a form suitable for the BP and SA units, and vice versa.
The high-bandwidth data channel is realized separately from the control channel to secure the
large data bandwidth required in the system. The inference process necessitates enormous amounts
of feature maps and parameters to be stored in and loaded from the off-chip memory, and the
final result of the NPU moves to the host processor, being postprocessed there. The
high-bandwidth data channels supporting the direct connections among the off-chip memory, the
NPU, and the host processor play a significant role in enhancing the data transmission rate
drastically.
B. Neural Processing Unit
The structure of the NPU is presented in Fig. 6. In the BP and SA units, input buffers (IBs),
weight buffers (WBs), and bias buffers (BBs) are used to store input features, weights, and
biases, respectively, and feed the data to the BP and SA blocks. The BP block generates a 2-D
output feature map by performing 3 × 3 convolution for a 2-D input feature map, and the SA block
generates a 2-D output feature map by conducting scalar-matrix multiplication and matrix
addition for a set of 2-D input feature maps. Passing through the activation hardware, the
outputs of the BP and SA blocks are stored in the output buffers (OBs). Note that every buffer
in the NPU is in fact realized with a pair of registers in order to make it possible to conduct
data processing and data fetching concurrently. The switch under a buffer controls which
register is connected to the BP and SA block. The multiplexer located after the BB is to handle
a CL in the BP unit and a residual block in the SA unit. The 2-to-1 multiplexer in the BP unit
is to choose either a bias or the output feature map, and the 3-to-1 multiplexer in the SA unit
is to select one among a bias, the output feature map, and the sum of them. The network model of
SSDLite has a residual connection in a PCL, which is realized in the SA unit by making the
3-to-1 multiplexer choose the sum of the bias and the output feature.
Fig. 7 shows the pipelined structure of the BP block, which is optimized for the processing of
the DCL. There are three PE arrays in the block, each of which performs 1-D convolution.
Accumulating the 1-D convolution results of the three PE arrays leads to a row of the output
feature map of 3 × 3 convolution. Each PE in a PE array computes the dot product given three
inputs and three weights and adds an extra input. The extra input is the output of the 2-to-1
multiplexer in Fig. 6 in the case of PE array 0 and the 1-D convolution result of the previous
PE array in the case of PE arrays 1 and 2. One row of the 3 × 3 kernel stays in a PE array and
is broadcast to all the PEs in the PE array, whereas one row of the input feature map is loaded
into all the three PE arrays simultaneously. A new row of the input feature map is fed into the
PE arrays every cycle while the kernel stays, and a row of the output feature map is generated
in PE array 2 with a latency of two cycles. In consequence, the BP block generates an output
feature map of (n − 2) × T in T + 2 cycles. As the generated output feature map looks like a
vertical band of width n − 2, the block is called a band-processing block. An example of
generating a 1-D output feature map is also presented in Fig. 7. The first row of the input
feature map is convolved at cycle t with the first row of the kernel in PE array 0, and the
result is forwarded to PE array 1. The second input row is processed at cycle t + 1 in PE array
1, and the convolution result is added to the forwarded one. To generate a row of the output
feature map, the third input row is convolved at cycle t + 2 in PE array 2, and the convolution
result is accumulated to the result forwarded from PE array 1. The BP block can support the
larger kernel by using the filter decomposition technique described in [46].
Fig. 7. Structure of the BP block.
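
The row-pipelined operation of the BP block can be mimicked with a small cycle-by-cycle software model. The sketch below is a behavioral illustration under simplifying assumptions (floating-point values, a scalar bias as the extra input of PE array 0, no activation hardware); the names `conv1d_row` and `bp_block` are hypothetical and do not come from the article.

```python
import numpy as np

def conv1d_row(row, kernel_row):
    """One PE array: each PE takes the dot product of three inputs and three weights."""
    n = row.shape[0]
    return np.array([np.dot(row[i:i + 3], kernel_row) for i in range(n - 2)])

def bp_block(band, kernel, bias=0.0):
    """band: (T + 2, n) input rows, kernel: (3, 3). Returns (T, n - 2) output rows."""
    reg0 = None   # result forwarded from PE array 0 to PE array 1
    reg1 = None   # result forwarded from PE array 1 to PE array 2
    out_rows = []
    for row in band:                      # one input row enters per cycle
        # All three PE arrays see the same input row in the same cycle.
        p2 = conv1d_row(row, kernel[2]) + reg1 if reg1 is not None else None
        p1 = conv1d_row(row, kernel[1]) + reg0 if reg0 is not None else None
        p0 = conv1d_row(row, kernel[0]) + bias
        if p2 is not None:
            out_rows.append(p2)           # PE array 2 emits one output row
        reg0, reg1 = p0, p1               # update the pipeline registers
    return np.array(out_rows)

rng = np.random.default_rng(0)
band = rng.standard_normal((6, 8))        # T + 2 = 6 rows, n = 8 columns
kernel = rng.standard_normal((3, 3))
out = bp_block(band, kernel, bias=0.5)    # shape (4, 6), that is, T x (n - 2)
```

The first output row appears only after the third input row has entered, which corresponds to the two-cycle latency described above, and T + 2 input rows yield T output rows of width n − 2.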
Fig. 11. Workload analysis of the baseline system. (a) Workload flow among
the units. (b) Breakdowns of workload in the Stratix IV and Arria 10 FPGA
implementations.
Fig. 12. Proposed system enhanced by the TCU. (a) Workload flow in the
proposed system and (b) resulting throughput improvement in the Stratix IV
and Arria 10 implementations.
Fig. 13. Processing flows of (a) baseline SSDLite and (b) SSDLite with
optimization.
B. Algorithm Optimization
The processing flow of SSDLite is shown in Fig. 13, where the gray boxes are processed in
software on the host processor and the others are processed in hardware. In the baseline, the
postprocess stalls until the NN-process is completed, as shown in Fig. 13(a), since the host
processor is busy managing the NN-process and the baseline postprocess waits for the total N
proposals to be generated as described in Algorithm 1. The N proposals consist of six groups
that are generated by the six convolution predictors. The number of proposals in a group
decreases from 2166 to 6, decreasing by a factor of one-fourth in the next group. The
postprocess spends most of its time in processing earlier groups generating a larger number of
proposals. In particular, the number of proposals produced in the first group is more than
two-thirds of the entire proposals.
We can make the proposals available in the middle of the NN-process by executing the convolution
predictors immediately after the corresponding input feature map is created. Specifically, the
40th, 52nd, 55th, 58th, 61st, and 64th layers in the feature extractor generate the feature maps
for the convolution predictors. However, the baseline postprocess is not able to process each
group separately. By modifying the baseline postprocess to process the proposals as soon as they
are generated, therefore, the overall throughput is enhanced, as shown in Fig. 13(b).
The proposed refined postprocess decomposes the baseline postprocess into six separate
postprocesses, and each postprocess is in charge of handling a single proposal group generated
by a convolution predictor. The whole postprocess is incrementally conducted in the refined
postprocess. The detection results of the ith postprocess are included in the nonmaximum
suppression process of the (i + 1)th postprocess. The final detections are determined at the end
of the last postprocess, having the same results as the baseline postprocess. The details of the
refined postprocess are described in Algorithm 2, where L and N_l are the number of decomposed
postprocesses and the number of proposals produced in the lth group, respectively.
Fig. 14. Feature and parameter distributions. (a) Feature distribution and (b) parameter
distribution obtained with and without the BNF technique.
To quantize the baseline SSDLite, we first analyze the distribution of model parameters and that
of feature maps. Fig. 14(a) shows the mean of top-100 maximum and minimum absolute values of the
feature maps obtained from the VOC 2007 test set, and Fig. 14(b) shows the maximum and minimum
absolute values of the model parameters. The ranges of the feature maps do not fluctuate largely
across the layers, so the feature map can be quantized to a certain fxp representation for all
the layers.
On the other hand, the range of the model parameters varies largely across the layers, so the
parameter representation requires a large number of bits in a fxp representation or takes
various formats to make the quantization error insignificant. To alleviate the range
fluctuation, we apply the batch normalization folding (BNF) technique [43], which applies
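
As background for the BNF technique, batch normalization folding in its commonly used form merges the per-channel scale and shift of a batch normalization layer into the weights and biases of the preceding convolution, so that no separate normalization is needed at inference. The sketch below shows this general formulation for a pointwise layer; whether [43] uses exactly this form is an assumption, and the function name `fold_batch_norm` and the parameter names are illustrative.

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into the convolution.

    w has the output channel as its last axis; the other arguments are per-channel vectors.
    """
    scale = gamma / np.sqrt(var + eps)
    w_folded = w * scale                   # scale every output channel
    b_folded = (b - mean) * scale + beta   # absorb the shift into the bias
    return w_folded, b_folded

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))            # 5 pixels, 4 input channels
w = rng.standard_normal((4, 3))            # pointwise weights, 3 output channels
b = rng.standard_normal(3)
gamma, beta, mean = rng.standard_normal((3, 3))
var = rng.random(3) + 0.5

w_f, b_f = fold_batch_norm(w, b, gamma, beta, mean, var)
y_reference = gamma * ((x @ w + b) - mean) / np.sqrt(var + 1e-5) + beta
y_folded = x @ w_f + b_f
```

Here `y_folded` matches `y_reference` up to floating-point rounding, so only the folded parameters need to be quantized and stored.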
TABLE I
CHARACTERISTICS OF FPGAS USED IN PROTOTYPE IMPLEMENTATIONS
TABLE II
PRECISION AND DETECTION ACCURACY ACHIEVED WHEN TRAINED WITH VOC07 + 12 + COCO
Fig. 17. Layer-by-layer decomposition of SSDLite processing time on Intel Stratix IV FPGA.
TABLE III
FPGA IMPLEMENTATIONS OF DNN-BASED OBJECT DETECTION
implementation are the same since the number of PEs in the BP and SA units is unchanged, whereas
the logic complexity and the Block RAM utilization are slightly increased by employing the TCU
in the proposed architecture.
On Intel Arria 10 FPGA, the proposed architecture has achieved a frame rate of 84.8 frames/s,
which is higher than those obtained from other FPGA implementations except for [1], [3], and
[5]. As [1], [3], and [5] compress the network model excessively to improve the throughput,
however, their detection accuracies are much lower than that of the proposed architecture. For a
fair comparison, the frame rate is normalized by considering the area taken in implementation.
The normalized frame rate achieved in our implementation on Intel Arria 10 is 13.6× and 11.6×
higher than the Stratix 10 and Arria 10 implementations in [4], respectively. In short, the
proposed architecture achieves higher detection accuracy without conducting the retraining
process, and higher throughput than all implementations except the overly compressed network
models, which suffer from accuracy degradation. For real-time applications, the latency is as
important as the throughput. To evaluate the latency, the processing time taken for a batch size
of one is summarized in Table III. The proposed architecture takes 11.79 ms on Intel Arria 10
FPGA, which is lower than all the other FPGA implementations.
The proposed object detectors use fewer hardware resources than the previous works in Table III.
The proposed BP and SA units are optimized for the DCL and the PCL as SSDLite is mainly composed
of DCLs and PCLs to reduce the size of model parameters and the complexity of computation. The
power consumption has been estimated using a power
analyzer tool provided by Intel. The proposed architecture consumes 5.1 W on Intel Stratix IV
FPGA and 9.88 W on Intel Arria 10 FPGA, which is lower than other FPGA implementations except
for [2] that uses the binarized network model. The proposed architecture on Intel Arria 10 FPGA
consumes about twice as much power as [2] but supports about twice the frame rate of [2], which
means that both architectures consume almost the same energy in processing a frame. The energy
consumed in processing a frame is summarized in Table III. The proposed architecture on Intel
Arria 10 FPGA has almost the same energy consumption as the implementations in [1], [2], and [5]
that have excessively compressed network models, and consumes 24.2× and 20.8× less energy than
the Stratix 10 and Arria 10 implementations in [4], respectively.
VI. CONCLUSION
This article has proposed novel hardware architecture and system optimization techniques
effective in realizing real-time DNN-based object detection. In the proposed architecture, an
NPU consisting of heterogeneous units, BP, SA, and DFF units, was devised to efficiently
accelerate the neural network process. The BP and SA units were optimized to process the DCL and
the PCL, effectively reducing memory accesses. The DFF unit arranges the data into a form
suitable for the BP and SA units, removing the data formatting latency. For system optimization,
a TCU was developed to relax the excessive workload of the host processor and increase the
utilization of heterogeneous units in the NPU. In addition, the detection algorithm was
optimized to remove the latency of the postprocess and quantize the feature and parameter
representations. Two prototype object detectors, which were implemented on Intel Stratix IV and
Intel Arria 10 FPGAs, revealed that the proposed system achieves higher throughput, lower
latency, and higher energy efficiency at high detection accuracy than the previous
state-of-the-art works.
The proposed optimization techniques are expected to be used in other object detectors to
improve the throughput and maintain the accuracy, as most object detectors did not consider
imbalanced workloads and most DNN-based object detection algorithms contain the postprocess and
the batch normalization. For example, the refined postprocess can be utilized in [21]–[26] and
[50]–[52] to process the NN-process and the postprocess in parallel. In addition, the BNF
technique can be applied to [21], [23], [25], and [50]–[52] that include the batch
normalization.
ACKNOWLEDGMENT
The authors would like to thank the IC Design Education Center (IDEC), South Korea, for
supporting the EDA tool.
REFERENCES
[1] D. T. Nguyen, T. N. Nguyen, H. Kim, and H.-J. Lee, “A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1861–1873, Aug. 2019.
[2] H. Nakahara, H. Yonekawa, T. Fujii, and S. Sato, “A lightweight YOLOv2: A binarized CNN with a parallel support vector regression for an FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 31–40.
[3] T. B. Preußer, G. Gambardella, N. Fraser, and M. Blott, “Inference of quantized neural networks on heterogeneous all-programmable devices,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 833–838.
[4] Y. Ma, T. Zheng, Y. Cao, S. Vrudhula, and J.-S. Seo, “Algorithm-hardware co-design of single shot detector for fast object detection on FPGAs,” in Proc. Int. Conf. Comput.-Aided Design, Nov. 2018, pp. 1–8.
[5] S. Fang et al., “Real-time object detection and semantic segmentation hardware system with deep learning networks,” in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2018, pp. 389–392.
[6] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, “Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs,” in Proc. 54th Annu. Design Autom. Conf., Jun. 2017, pp. 1–6.
[7] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2015, pp. 161–170.
[8] A. Shawahna, S. M. Sait, and A. El-Maleh, “FPGA-based accelerators of deep learning networks for learning and classification: A review,” IEEE Access, vol. 7, pp. 7823–7859, Dec. 2019.
[9] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[10] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[11] J. Jo, S. Cha, D. Rho, and I.-C. Park, “DSIP: A scalable inference accelerator for convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 53, no. 2, pp. 605–618, Feb. 2018.
[12] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015, doi: 10.1038/nature14539.
[13] A. Hannun et al., “Deep speech: Scaling up end-to-end speech recognition,” 2014, arXiv:1412.5567. [Online]. Available: http://arxiv.org/abs/1412.5567
[14] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Vancouver, BC, Canada, May 2013, pp. 6645–6649.
[15] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 3104–3112.
[16] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” Int. J. Robot. Res., vol. 34, nos. 4–5, pp. 705–724, Mar. 2015.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–14.
[19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[20] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017, arXiv:1704.04861. [Online]. Available: http://arxiv.org/abs/1704.04861
[21] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.
[22] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[23] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[25] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[26] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 21–37.
[27] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization, and Huffman coding,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 525–542.
[29] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” 2016, arXiv:1605.04711. [Online]. Available: http://arxiv.org/abs/1605.04711
[30] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” 2016, arXiv:1612.01064. [Online]. Available: http://arxiv.org/abs/1612.01064
[31] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless CNNs with low-precision weights,” 2017, arXiv:1702.03044. [Online]. Available: http://arxiv.org/abs/1702.03044
[32] J. Choi, B. Y. Kong, and I.-C. Park, “Retrain-less weight quantization for multiplier-less convolutional neural networks,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 3, pp. 972–982, Mar. 2020.
[33] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Proc. Adv. Neural. Inf. Process. Syst. (NIPS), 2015, pp. 3123–3131.
[34] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1,” 2016, arXiv:1602.02830. [Online]. Available: http://arxiv.org/abs/1602.02830
[35] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” 2015, arXiv:1502.03167. [Online]. Available: http://arxiv.org/abs/1502.03167
[36] H. Wong, V. Betz, and J. Rose, “Comparing FPGA vs. Custom CMOS and the impact on processor microarchitecture,” in Proc. 19th ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2011, pp. 5–14.
[37] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Accessed: Feb. 8, 2019. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[38] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Accessed: Feb. 8, 2019. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
[39] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” 2014, arXiv:1405.0312. [Online]. Available: http://arxiv.org/abs/1405.0312
[40] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. Accessed: Feb. 8, 2019. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf
[41] Object Detection on a FPGA Board. Accessed: Aug. 22, 2019. [Online]. Available: https://www.youtube.com/embed/9lJryP1fU2w
[42] L. Wan, D. Eigen, and R. Fergus, “End-to-end integration of a convolutional network, deformable parts model and non-maximum suppression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 851–859.
[43] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2704–2713.
[44] H. Fan et al., “A real-time object detection accelerator with compressed SSDLite on FPGA,” in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2018, pp. 14–21.
[45] D. Wu et al., “A high-performance CNN processor based on FPGA for MobileNets,” in Proc. 29th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2019, pp. 136–143.
[46] L. Du et al., “A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018.
[47] Stratix IV Device Handbook, Intel Co., Santa Clara, CA, USA, 2016, pp. 81–118.
[48] D. T. Nguyen, H. Kim, and H.-J. Lee, “Layer-specific optimization for mixed data flow with mixed precision in FPGA design for CNN-based object detectors,” 2020, arXiv:2009.01588. [Online]. Available: http://arxiv.org/abs/2009.01588
[49] X. Zhang et al., “DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs,” in Proc. Int. Conf. Comput.-Aided Design, Nov. 2018, pp. 1–8.
[50] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” 2017, arXiv:1708.02002. [Online]. Available: http://arxiv.org/abs/1708.02002
[51] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[52] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and efficient object detection,” 2019, arXiv:1911.09070. [Online]. Available: http://arxiv.org/abs/1911.09070

Suchang Kim (Student Member, IEEE) received the B.S. degree in electrical engineering from the Korea Aerospace University, Goyang, South Korea, in 2016, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2018, where he is currently working toward the Ph.D. degree at the School of Electrical Engineering.
His current research interests include VLSI architectures for neural network accelerators and computer arithmetic.

Seungho Na received the B.S. degree in electronic engineering from Sungkyunkwan University, Suwon, South Korea, in 2017, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2020.
Since 2020, he has been an Engineer with Anapass Inc., Seoul, South Korea. His current research interests include VLSI architectures for neural network accelerators.

Byeong Yong Kong (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2011, 2013, and 2017, respectively.
From 2017 to 2018, he was a Senior Researcher with the Agency for Defense Development, Daejeon, where he was involved in the research of guided missile systems. From 2018 to 2019, he was a Post-Doctoral Researcher and a Research Assistant Professor with KAIST. Since 2019, he has been an Assistant Professor with the Division of Electrical, Electronic, and Control Engineering, Kongju National University, Cheonan, South Korea. His current research interests include algorithms and very-large-scale integration architectures for digital signal processing and wireless communications.
Dr. Kong was a recipient of the First Place Award at the Altera FPGA Design Contest in 2015.

Jaewoong Choi (Student Member, IEEE) received the B.S. degree in electronic engineering from Hanyang University, Seoul, South Korea, in 2018, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2020.
His current research interests include the structure of neural network accelerators and digital signal processing.

In-Cheol Park (Senior Member, IEEE) received the B.S. degree in electronic engineering from Seoul National University, Seoul, South Korea, in 1986, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 1988 and 1992, respectively.
Since June 1996, he has been an Assistant Professor with the School of Electrical Engineering, KAIST, where he is currently a Professor. Prior to joining KAIST, he was with the IBM T. J. Watson Research Center, Yorktown, NY, USA, from May 1995 to May 1996, where he researched high-speed circuit design. His current research interests include computer-aided design algorithms for high-level synthesis and very-large-scale integration architectures for general-purpose microprocessors.
Dr. Park received the Best Design Award at ASP-DAC in 1997 and the Best Paper Award at ICCD in 1999.