
ARXIV PREPRINT 1

Origami: A 803 GOp/s/W Convolutional Network Accelerator

Lukas Cavigelli, Student Member, IEEE, and Luca Benini, Fellow, IEEE

arXiv:1512.04295v2 [cs.CV] 19 Jan 2016

Abstract—An ever increasing number of computer vision and image/video processing challenges are being approached using deep convolutional neural networks, obtaining state-of-the-art results in object recognition and detection, semantic segmentation, action recognition, optical flow and super-resolution. Hardware acceleration of these algorithms is essential to adopt these improvements in embedded and mobile computer vision systems. We present a new architecture, design and implementation, as well as the first reported silicon measurements of such an accelerator, outperforming previous work in terms of power-, area- and I/O-efficiency. The manufactured device provides up to 196 GOp/s on 3.09 mm2 of silicon in UMC 65 nm technology and can achieve a power efficiency of 803 GOp/s/W. The massively reduced bandwidth requirements make it the first architecture scalable to TOp/s performance.

Keywords—Computer Vision, Convolutional Networks, VLSI.

L. Cavigelli and L. Benini are with the Department of Electrical Engineering and Information Technology, ETH Zurich, 8092 Zurich, Switzerland. E-mail: {cavigelli, benini}@iis.ee.ethz.ch.
This work was funded by armasuisse Science & Technology and the ERC MultiTherman project (ERC-AdG-291125).
The authors would like to thank David Gschwend, Christoph Mayer and Samuel Willi for their contributions during design and testing of Origami.

I. INTRODUCTION

Today, computer vision technologies are used with great success in many application areas, solving real-world problems in entertainment systems, robotics and surveillance [1]. More and more researchers and engineers are tackling action and object recognition problems with the help of brain-inspired algorithms, featuring many stages of feature detectors and classifiers, with lots of parameters that are optimized using the wealth of data that has recently become available. These "deep learning" techniques are achieving record-breaking results on very challenging problems and datasets, outperforming either more mature concepts trying to model the specific problem at hand [2], [3], [4], [5], [6] or joining forces with traditional approaches by improving intermediate steps [7], [8]. Convolutional Networks (ConvNets) are a prime example of this powerful, yet conceptually simple paradigm [9], [10]. They can be applied to various data sources and perform best when the information is spatially or temporally well-localized, but still has to be seen in a more global context, such as in images.

As a testimony to the success of deep learning approaches, several research programs have been launched, even by major global industrial players (e.g. Facebook, Google, Baidu, Microsoft, IBM), pushing towards deploying services based on brain-inspired machine learning to their customers within a production environment [3], [8], [11]. These companies are mainly interested in running such algorithms on powerful compute clusters in large data centers.

With the increasing number of imaging devices, the importance of digital signal processing in imaging continues to grow. The amount of on- and near-sensor computation is rising to thousands of operations per pixel, requiring powerful, energy-efficient digital signal processing solutions, often co-integrated with the imaging circuitry itself to reduce overall system cost and size [12]. Such embedded vision systems that extract meaning from imaging data are enabled by more and more energy-efficient, low-cost integrated parallel processing engines (multi-core DSPs, GPUs, platform FPGAs). This permits a new generation of distributed computer vision systems, which can bring huge value to a vast range of applications by reducing the costly data transmission, forwarding only the desired information [1], [13].

Many opportunities for challenging research and innovative applications will pan out from the evolution of advanced embedded video processing and future situational awareness systems. As opposed to conventional visual monitoring systems (CCTVs, IP cameras) that send the video data to a data center to be stored and processed, embedded smart cameras process the image data directly on board. This can significantly reduce the amount of data to be transmitted and the required human intervention – the sources of the two most expensive aspects of video surveillance [14]. Embedding convolutional network classifiers in distributed computer vision systems seems a natural direction of evolution. However, deep neural networks are commonly known for their demand for computing power, making it challenging to bring this computational load within the power envelope of embedded systems – in fact, most state-of-the-art neural networks are currently not only trained, but also evaluated on workstations with powerful GPUs to achieve reasonable performance.

Nevertheless, there is strong demand for mobile vision solutions ranging from object recognition to advanced human-machine interfaces and augmented reality. The market size is estimated to grow to many billions of dollars over the next few years with an annual growth rate of more than 13% [15]. This has prompted many new commercial solutions to become available recently, specifically targeting the mobile sector [16], [17], [18].

In this paper we present:
• The architecture of a novel convolutional network accelerator, which is scalable to TOp/s performance while remaining area- and energy-efficient and keeping I/O throughput within the limits of economical packages and low power budgets. This extends our work in [19].

• An implementation of this architecture with optimized precision using fixed-point evaluations constrained for an accelerator-sized ASIC.
• Silicon measurements of the taped-out ASIC, providing experimental characterization of the silicon.
• A thorough comparison to and discussion of previous work.

Organization of the paper: Section II briefly introduces convolutional networks and highlights the need for acceleration. Previous work is investigated in Section III, discussing available software, FPGA and ASIC implementations and explaining the selection of our design objectives. In Section IV we present our architecture and its properties. The implementation aspects are shown in Section V. We present our results in Section VI and discuss and compare them in Section VII. We conclude the paper in Section VIII.

II. CONVOLUTIONAL NETWORKS

Most convolutional networks (ConvNets) are built from the same basic building blocks: convolution layers, activation layers and pooling layers. One sequence of convolution, activation and pooling is considered a stage, and modern, deep networks often consist of multiple stages. The convolutional network itself is used as a feature extractor, transforming raw data into a higher-dimensional, more meaningful representation. ConvNets particularly preserve locality through their limited filter size, which makes them very suitable for visual data (e.g., in a street scene the pixels in the top left corner contain little information on what is going on in the bottom right corner of an image, but if there are pixels showing the sky all around some segment of the image, this segment is certainly not a car). The feature extraction is then followed by a classifier, such as a normal neural network or a support vector machine.

A stage of a ConvNet can be captured mathematically as

    y^(ℓ) = conv(x^(ℓ), k^(ℓ)) + b^(ℓ),    (1)
    x^(ℓ+1) = pool(act(y^(ℓ))),            (2)

where ℓ = 1, ..., 3 indexes the stages and where we start with x^(1) being the input image. The key operation on which we focus is the convolution, which expands to

    y_o^(ℓ)(j, i) = b_o^(ℓ) + Σ_{c ∈ C_in^(ℓ)} Σ_{(b,a) ∈ S_k} k_{o,c}^(ℓ)(b, a) x_c^(ℓ)(j − b, i − a),    (3)

where o indexes the output channels C_out^(ℓ) and c indexes the input channels C_in^(ℓ). The pixel is identified by the tuple (j, i) and S_k denotes the support of the filters. In recently published networks [20], [21], [3], the pooling operation determines the maximum in a small neighborhood for each channel, often on 2×2 areas and with a stride of 2×2. For x = pool_max,2×2(v):

    x_o(j, i) = max(v_o(2j, 2i), v_o(2j, 2i + 1), v_o(2j + 1, 2i), v_o(2j + 1, 2i + 1))    (4)

The activation function is applied point-wise for every pixel and every channel. A currently popular choice is the rectified linear unit (ReLU) [2], [4], [5], which designates the function x ↦ max(0, x):

    v = act_ReLU(y),    v_o(j, i) = max(y_o(j, i), 0)    (5)

The activation function introduces non-linearity into neural networks, giving them the potential to be more powerful than linear methods. Typical filter sizes range from 5×5 to 9×9, sometimes even 11×11 [2], [4], [21].

The feature extractor with the convolutional layers is usually followed by a classification step with fully-connected neural network layers interspersed with activation functions, reducing the dimensionality from several hundred or even thousands down to the number of classes. In the case of scene labeling, these fully-connected layers are simply applied on a per-pixel basis, with the inputs being the values of all the channels at any given pixel [22].

A. Measuring Computational Complexity

Convolutional networks and deep neural networks in general are advancing into more and more domains of computer vision and are becoming increasingly accurate in their traditional application area of object recognition and detection. ConvNets are now able to compute highly accurate optical flow [5], [6], [23], super-resolution [20] and more. The newer networks are usually deeper and require more computational effort, and those for the newly tapped topics have been very deep from the beginning. Research is done on various platforms and computing devices are evolving rapidly, making time measurements meaningless. The deep learning community has thus started to measure the complexity of deep learning networks in a way that is more independent of the underlying computing platform, counting the additions and multiplications of the synapses of these networks. For a convolutional layer with n_in input feature maps of size h_in × w_in, a filter kernel size of h_k × w_k, and n_out output feature maps, this number amounts to

    2 n_out n_in h_k w_k (h_in − h_k + 1)(w_in − w_k + 1),    (6)

where n_out is the number of output channels |C_out|, n_in the number of input channels |C_in|, h_in × w_in the size of the image, and h_k × w_k the size of the filter in the spatial domain. The factor of two is there because multiplications and additions are counted as separate operations in this measure, which is the most common convention in the neural network literature [24], [25], [26], [27].

However, this way of measuring complexity still does not make it possible to determine precisely how a network performs on different platforms. Accelerators might need to be initialized or have to suspend computation to load new filter values, often performing better for some artificially large or small problems. For this reason we distinguish between the throughput obtained with a real network (actual throughput or just throughput), measurements obtained with a synthetic benchmark optimized to squeeze out the largest possible value (peak throughput), and the maximum throughput of the computation units without caring for bandwidth limits, often stated in the device specifications of non-specialized processors (theoretical throughput).

TABLE I. PARAMETERS OF THE THREE STAGES OF OUR REFERENCE SCENE LABELING CONVOLUTIONAL NETWORK.

                  Stage 1    Stage 2     Stage 3     Classif.
  Input size      240×320    117×157     55×75       49×69
  # Input ch.     3          16          64          256
  # Output ch.    16         64          256         8
  # Operations    346 MOp    1682 MOp    5428 MOp    115 MOp
  # Filter val.   2.4k       50k         803k        17k

Fig. 1. Computation time spent in different stages of our reference scene labeling convolutional network [27]. (Bar chart: on both the CPU and the GPU, the convolution layers dominate the runtime; activation, pooling and the pixel-wise classification account for the small remainder.)
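The "# Operations" row of Table I follows directly from Eq. (6). The following sketch (illustrative names, not from the paper) checks the three convolutional stages, all of which use 7×7 filters:

```python
def conv_layer_ops(n_in, n_out, h_in, w_in, h_k=7, w_k=7):
    """Eq. (6): multiplications and additions counted as separate operations."""
    return 2 * n_out * n_in * h_k * w_k * (h_in - h_k + 1) * (w_in - w_k + 1)

# The three stages of the reference network (Table I): (n_in, n_out, h_in, w_in)
stages = [(3, 16, 240, 320), (16, 64, 117, 157), (64, 256, 55, 75)]
ops = [conv_layer_ops(*s) for s in stages]
# -> approximately 346 MOp, 1682 MOp and 5428 MOp per frame
```

Adding the 115 MOp of the pixel-wise classification stage from Table I yields the total of 7.57 GOp/frame quoted in the text.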

Software and hardware implementations alike often come with a throughput dependent on the actual size of the convolutional layer. While we make sure our chip can run a large range of ConvNets efficiently, we use the one presented in [27] as a reference for performance evaluation. It has three stages and we assume input images of size 240×320. The resulting sizes and complexities of the individual layers are summarized in Table I; a filter size of 7×7 is used for all of them. The total number of operations required is 7.57 GOp/frame. To give an idea of the complexity of more well-known ConvNets, we have listed some of them in Table II. If we take an existing system like the NeuFlow SoC [25], which is able to operate at 490 GOp/s/W, we can see that very high quality, dense optical flow on 384×512 video could be computed at 25 frame/s with a power of just around 3.5 W if we could scale up the architecture. We can also see that an optimized implementation on a high-end GPU can run at around 27 frame/s.

TABLE II. NUMBER OF OPERATIONS REQUIRED TO EVALUATE WELL-KNOWN CONVOLUTIONAL NETWORKS.

  name               type            challenge/dataset          # GOp
  [27] SS 320×240    scene labeling  stanford backgr., 74.8%    7.57
  [27] SS full-HD    scene labeling  stanford backgr., 74.8%    259.5
  [27] MS 320×240    scene labeling  stanford backgr., 80.6%    16.1
  AlexNet            image recog.    imagenet/ILSVRC 2012       1.7
  OverFeat fast      image recog.    imagenet/ILSVRC 2013       5.6
  OverFeat accurate  image recog.    imagenet/ILSVRC 2013       10.7
  GoogLeNet          image recog.    imagenet/ILSVRC 2014       3.6
  VGG Oxfordnet A    image recog.    imagenet/ILSVRC 2014       15.2
  FlowNetS(-ft)      optical flow    synthetic & KITTI, Sintel  68.9

B. Computational Effort

Because convolutional networks can be evaluated significantly faster than traditional approaches of comparable accuracy (e.g. graphical models), they are approaching an area where real-time applications become feasible on workstations with one, or more often, several GPUs. However, most application areas require a complete solution to fit within the power envelope of an embedded system or even a mobile device. Taking the aforementioned scene labeling ConvNet as an example, its usage in a real-time setting at 25 frame/s amounts to 189 GOp/s, which is out of the scope of even the most recent commercially available mobile processors [27].

For a subject area changing as rapidly as deep learning, long-term usability is an important objective when thinking about hardware acceleration of the building blocks of such systems. While the structure of the networks changes from application to application and from year to year, and better activation and pooling operations are continuously being published, there is one commonality between all these ConvNets: the convolutional layer. It has been around since the early 90s and has not changed since [9], [4], [3]. Fortunately, this key element is also the computation-intensive part for well-optimized software implementations (approx. 89% of the total computation time on the CPU, or 79% on the GPU), as shown in Figure 1. The time for activation and pooling is negligible, as is the computation time for the pixel-wise classification with fully-connected layers.

III. PREVIOUS WORK

Convolutional networks have been achieving amazing results lately, even outperforming humans in image recognition on large and complex datasets such as ImageNet. The top performers have achieved a top-5 error rate (the actual class is not among the top 5 proposals predicted by the network) of only 6.67% (GoogLeNet [3]) and 7.32% (VGG Oxfordnet [28]) at the ILSVRC 2014 competition [29]. The best performance of a single human so far is 5.1% on this dataset and has been exceeded since the last large image recognition competition [30]. Also in other subjects, such as face recognition [8], ConvNets are exceeding human performance. We have listed the required number of operations to evaluate some of these networks in Table II.

In the remainder of this section, we focus on existing implementations to evaluate such ConvNets. We compare software implementations running on desktop workstations with CPUs and GPUs, as well as DSP-based works, to existing FPGA and ASIC implementations. In Section III-D we discuss why many such accelerators are not suitable to evaluate networks of this size, and we conclude the investigation into previous work by discussing the limitations of existing hardware architectures in Section III-E.

A. Software Implementations (CPU, GPU, DSP)

Acceleration of convolutional neural networks has been discussed in many papers. There are very fast and user-friendly frameworks publicly available, such as Torch [31], Caffe [32], Nvidia's cuDNN [33] and Nervana Systems' neon [34], and GPU-accelerated training and evaluation is the common way of working with ConvNets.

These and other optimized implementations can be used to obtain a performance and power efficiency baseline on desktop workstations and CUDA-compatible embedded processors, such as the Tegra K1. On a GTX780 desktop GPU,

the performance can reach up to 3059 GOp/s for some special problems and about 1800 GOp/s on meaningful ConvNets. On the Tegra K1, up to 96 GOp/s can be achieved, with 76 GOp/s being reached with an actual ConvNet. On both platforms, an energy efficiency of about 7 GOp/s/W considering the power of the entire platform, and 14.4 GOp/s/W with differential power measurements, can be obtained [27]. Except for this evaluation, the focus is usually on training speed, where multiple images are processed together in batches to attain higher performance (e.g. reusing the loaded filter values for multiple images). Batch processing is not suitable for real-time applications, since it introduces a delay of many frames.

A comparison of the throughput of many optimized software implementations for GPUs based on several well-known ConvNets is provided in [35]. The list is led by an implementation by Nervana Systems, of which details on how it works are not publicly known. They confirm that it is based on maxDNN [36], which started from an optimized matrix-matrix multiplication, adapted for convolutional layers and with fine-tuned assembly code. Their implementation is tightly followed by Nvidia's cuDNN library [33]. The edge of these two implementations over others originates from using half-precision floating-point representations instead of single-precision for storage in memory, thus reducing the required memory bandwidth, which is the current limiting factor. New GPU-based platforms such as the Nvidia Tegra X1 now support half-precision computation [37], which can be used to save power or provide further speedup, but no thorough investigations have been published on this. More computer vision silicon has been presented recently with the Movidius Myriad 2 device [16], which has been used in Google Tango, and the Mobileye EyeQ3 platform, but no benchmarking results regarding ConvNets are available yet.

A different approach to increasing throughput is the use of the Fourier transform, diagonalizing the convolution operation. While this has a positive effect for kernels larger than 9×9, the bandwidth problem generally becomes much worse and the already considerable memory requirements are boosted further, since the filters have to be padded to the input image size [38], [27].

However optimized the software running on such platforms, it will always be constrained by the underlying architecture: the arithmetic precision cannot be adapted to the needs of the computation, caches are used instead of optimized on-chip buffers, and instructions have to be loaded and decoded. This pushes the need for specialized architectures to achieve high power- and area-efficiency.

B. FPGA Implementations

Embeddability and energy efficiency are major concerns regarding the commercialization of ConvNet-based computer vision systems and have hence prompted many researchers to approach this issue using FPGA implementations. Arguably the most popular architecture is the one which started as CNP [39] and was further improved and renamed to NeuFlow [24], [25] and later on to nn-X [40].

Published in 2009, CNP was the first ConvNet-specific FPGA implementation and achieved 12 GOp/s at 15 W on a Spartan 3A DSP 3400 FPGA using 18 bit fixed-point arithmetic for the multiplications. Its architecture was designed to be self-contained, allowing it to execute the operations for all common ConvNet layers, and comes with a soft CPU to control the overall program flow. It also features a compiler, converting network implementations in Torch directly to CNP instructions.

The CNP architecture does not allow easy scaling of its performance, prompting the follow-up work NeuFlow, which uses multiple CNP convolution engines, an interconnect, and a smart DMA controller. The data flow between the processing tiles can be rerouted at runtime. The work published in 2011 features a Virtex 6 VLX240T to achieve 147 GOp/s at 11 W using 16 bit fixed-point arithmetic.

To make use of the newly available platform ICs, NeuFlow was ported to a Zynq XC7Z045 in 2014, further improved by making use of the hard-wired ARM cores, and renamed to nn-X. It further increases the throughput to about 200 GOp/s at 4 W (FPGA, memory and host) and uses 4 × 950 MB/s full-duplex memory interfaces.

Only few alternatives to CNP/NeuFlow/nn-X exist. The two most relevant are a ConvNet accelerator based on Microsoft's Catapult platform in [41], of which very few details are known, and an HLS-based implementation [42] with a performance and energy efficiency inferior to nn-X.

C. ASIC Implementations

The NeuFlow architecture was implemented as an ASIC in 2012 on 12.5 mm2 of silicon in the IBM 45 nm SOI process. The results, based on post-layout simulations, were published in [25], featuring a performance of about 300 GOp/s at 0.6 W operating at 400 MHz with an external memory bandwidth of 4 × 1.6 GB/s full-duplex.

To explore the possibilities in terms of energy efficiency, a convolution accelerator suitable for small ConvNets was implemented in ST 28 nm FDSOI technology [43]. They achieve 37 GOp/s with 206 GOp/s/W at 0.8 V and 1.39 GOp/s with 1375 GOp/s/W at 0.4 V in pre-silicon simulation with the same implementation, using aggressive voltage scaling combined with the reverse body biasing available in FDSOI technology.

Further interesting aspects are highlighted in ShiDianNao [44], [45], which evolved from DianNao [26]. The original DianNao was tailored to fully-connected layers, but was also able to evaluate convolutional layers. However, its buffering strategy did not make use of the 2D structure of the computational problem at hand. This was improved in ShiDianNao. Nevertheless, its performance strongly depends on the size of the convolutional layer to be computed, only unfolding its full performance for tiny feature maps and networks. They achieve a peak performance of 128 GOp/s with 320 mW on a core-only area of 1.3 mm2 in a TSMC 65 nm post-layout evaluation.

Another way to approach the problem at hand is to look at general convolution accelerators, such as the ConvEngine [46], which particularly targets the 1D and 2D convolutions common in computer vision applications. It comes with an array of 64 10-bit ALUs and input and output buffers optimized for the task

at hand. Based on synthesis results, they achieve a core-only power efficiency of 409 GOp/s/W.

In the last few months we have seen a wave of vision DSP IP cores and SoCs becoming commercially available: CEVA-XM4, Synopsys DesignWare EV5x, Cadence Tensilica Vision P5. They are all targeted at general vision applications and not specifically tailored to ConvNets. They are processor-based and use vector engines or many small specialized processing units. Many of the mentioned IP blocks have never been implemented in silicon, and their architecture is kept confidential and has not been peer reviewed, making a quantitative comparison impossible. However, as they use instruction-based processing, an energy efficiency gap of 10× or more with respect to specialized ASICs can be expected.

D. General Neural Network Accelerators

Besides the aforementioned efforts, there are many accelerators which are targeted at accelerating non-convolutional neural networks. One such accelerator is the K-Brain [47], which was evaluated to achieve an outstanding power efficiency of 1.93 TOp/s/W in 65 nm technology. It comes with 216 KB of SRAM to store the weights and the dataset. For most applications this is by far insufficient (GoogLeNet [3]: 6.8M, VGG-Oxfordnet [28]: 133M parameters) and the presented architectures do not scale to larger networks, requiring excessive amounts of on-chip memory [47], [48], [49], [50]. Other neural network accelerators are targeted at more experimental concepts like spiking neural networks, where thorough performance evaluations are still missing [51].

E. Discussion

Recent work on hardware accelerators for ConvNets shows that highly energy-efficient implementations are feasible, significantly improving over software implementations.

However, existing architectures are not scalable to higher performance applications as a consequence of their need for a very wide memory interface. This manifests itself in the 299 I/O pins required to achieve 320 GOp/s using NeuFlow [25]. For many interesting applications, much higher throughput is needed, e.g. scene labeling of full-HD frames requires 5190 GOp/s to process 20 frame/s, and the trend clearly points towards even more complex ConvNets. To underline the need for better options, we want to emphasize that linearly scaling NeuFlow would require almost 5000 I/O pins or 110 GB/s of full-duplex memory bandwidth. This issue is currently common to all related work, as long as the target application is not limited to tiny networks which allow caching of the entire data to be processed.

This work particularly focuses on this issue, reducing the memory bandwidth required to achieve a high computational throughput without using very large on-chip memories to store the filters and intermediate results. For state-of-the-art networks, storing the learned parameters on-chip is not feasible, with GoogLeNet requiring 6.8 M and VGG Oxfordnet 135 M parameters. The aforementioned scene labeling ConvNet requires 872 k parameters, of which 855 k are filter weights for the convolutional layers. Some experiments have been done on the required word width [52], [53], [54] and compression [55], but they were validated only on very small datasets (MNIST, CIFAR-10).

Fig. 2. Data stored in the image bank and the image window SRAM per input channel.

IV. ARCHITECTURE

In this section we first present the concept of operation of our architecture in a simple configuration. We then explain some changes which make it more suitable for an area-efficient implementation. We proceed by looking into possible inefficiencies when processing ConvNet data. We conclude this section by presenting a system architecture suitable to embed Origami in a SoC or an FPGA-based system.

A. Concept of Operation

A top-level diagram of the architecture is shown in Figure 3. It shows two different clock areas, which are explained later on. The concept of operation for this architecture first assumes a single clock for the entire circuit for simplicity. In Figure 4 we show a timeline with the input and output utilization. Note that the utilization of the internal blocks corresponds to these utilizations in a very direct way, up to a short delay.

The input data (an image with many channels) are fed in stripes of configurable height into the circuit and stored in an SRAM, which keeps a spatial window of the input image data. The data is then loaded into the image bank, where a smaller window of the size of the filter kernel is kept in registers and moved down on the image stripe before jumping to the next column. This register-based memory provides the input for the sum-of-product (SoP) units, where the inner products of the individual filter kernels are computed. Each SoP unit is fed the same image channel, but different filters, such that each SoP computes the partial sum for a different output channel. The circuit iterates over the channels of the input image while the partial sums are accumulated in the channel summer (ChSum)

[Figure: top-level block diagram. The input pixel stream feeds the image window SRAM (344 kbit); the image bank (5.4 kbit of registers) and the filter bank (37.6 kbit of registers) feed four SoP units with h_k w_k = 49 multipliers and adders each, followed by ChSum units producing the output pixel stream; a clock, configuration and test interface provides the clock domains f = 250 MHz and f_fast = 2f.]

Fig. 3. Top-level block diagram of the proposed architecture for the chosen implementation parameters.
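The interplay of the image bank, filter bank, SoP units and channel summers can be mimicked with a short behavioral model. This is an illustrative sketch only (plain Python, floating point, made-up function names); the actual datapath uses 12 bit fixed-point arithmetic, and n_ch = 4 is assumed here based on the four SoP units in the diagram.

```python
N_CH, H_K, W_K = 4, 7, 7  # assumed block size and filter dimensions

def output_block(windows, filters, biases):
    """Compute one output pixel for all N_CH output channels of a tile.

    windows: per input channel, a flattened H_K*W_K patch (from the image bank)
    filters: filters[o][c] is the flattened H_K*W_K kernel for output o, input c
    biases:  one bias value per output channel
    """
    acc = list(biases)                 # the channel summers (ChSum)
    for c in range(N_CH):              # one input channel per cycle
        for o in range(N_CH):          # N_CH SoP units working in parallel
            # each SoP unit: H_K*W_K = 49 multiply-adds on the same image
            # window, but with a different filter (one per output channel)
            acc[o] += sum(w * p for w, p in zip(filters[o][c], windows[c]))
    return acc                         # streamed out: N_CH complete results
```

Each SoP unit corresponds to one output channel of the current tile, and iterating over the input channels reproduces the accumulation in the channel summers described in Section IV-A.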
loa col wk − 1

1
k +
hts

hts

size, we want to keep the image bank as small as possible,


d w in
dc k
w

w
loa ol 1
2

1
eig

eig

while not requiring an excessive data rate from the SRAM.


ol

loa col

ol

ol

ol
dw

The size of the image bank was chosen as nch hk wk . In every


dc
dc

dc

dc
d
d

cycle a new row of wk of the current input channel elements


loa

loa

loa

loa

loa

loa

input data is loaded from the image window SRAM and shifted into the
t image bank. The situation is illustrated in Figure 2 for an
output data individual channel.
t
In order for the SRAM to be able to provide this minimum
unit to compute the complete result, which is then transmitted out of the circuit.

Fig. 4. Time diagram of the input and output data transfers. Internal activity is strongly related up to small delays (blue = load col = image window SRAM r/w active = image bank write active; red = output = ChSum active = SoP active = image bank reading = filter bank reading; green = load weights = filter bank shifting/writing).

For our architecture we tile the convolutional layer into blocks with a fixed number of input and output channels nch. We perform 2nch² hk wk operations every nch clock cycles, while transmitting and receiving nch values instead of nch². This is different from all previous work and improves the throughput per bandwidth by a factor of nch. The architecture can also be formulated for non-equal block sizes for the input and output channels, but there is no advantage in doing so, thus we keep this constraint for simplicity of notation. We proceed by presenting the individual blocks of the architecture in more detail.

1) Image Window SRAM and Image Bank: The image window SRAM and the image bank are in charge of storing newly received image data and providing the SoP units with the image patch required for every computation cycle. To minimize the amount of data, the SRAM needs to store a wk element wide window for all nch channels with a selectable height hin ≤ hin,max. A large image has to be fed into the circuit in stripes of maximum height hin,max with an overlap of hk − 1 pixels. The overlap is necessary because an evaluation of the kernel needs a surrounding of (hk − 1)/2 pixels in height and (wk − 1)/2 pixels in width. When the image bank reaches the bottom of the image window stored in SRAM, it jumps back to the top, shifted one pixel to the right. This introduces a delay of nch (hk − 1) cycles, during which the rest of the circuit is idling. This delay is due not only to the loading of the new values for the image bank, but also to receiving the new pixels for the image window SRAM through the external I/O. Choosing hin,max is thus mostly a trade-off between throughput and area: the performance penalty on the overall circuit is about a factor of (hk − 1)/hin,max. The same behavior can be observed at the beginning of a horizontal stripe, where the processing units are idling during the first nch hin (wk − 1) cycles.

2) Filter Bank: The filter bank stores all the weights of the filters, i.e. nch · nch · hk · wk values. In configuration mode the filter values are shifted into these registers, which are clocked at the lower frequency f. In normal operation, the entire filter bank is read-only. In each cycle all the filter values supplied to the SoP units have to be changed, which means that nch hk wk filter values are read per cycle. Because so many filter values have to be read in parallel and they change so frequently, it is not possible to keep them in an SRAM. Instead, the filter bank is implemented with registers and a multiplexer selecting one of nch sets of nch hk wk weights.
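The bandwidth advantage of this tiling can be checked with a few lines of arithmetic. The following sketch is illustrative only; it uses the nch = 8, 7 × 7, hin,max = 512 configuration chosen for the taped-out chip in Section V, and also includes the column-change penalty discussed above:

```python
# I/O-vs-compute arithmetic for the nch x nch channel tiling (Section IV).
n_ch, h_k, w_k = 8, 7, 7                  # block size and filter dimensions

ops_per_block = 2 * n_ch**2 * h_k * w_k   # each MAC counts as 2 operations
words_io = 2 * n_ch                       # nch values in + nch values out
tiled = ops_per_block / words_io          # operations per transferred word
naive = ops_per_block / (2 * n_ch**2)     # without tiling: hk*wk per word

assert tiled == n_ch * h_k * w_k == 392   # 392 Op per transferred word
assert tiled / naive == n_ch              # throughput/bandwidth gain: 8x

# Column-change penalty of the image bank: about (hk - 1) / hin_max.
h_in_max = 512
assert round((h_k - 1) / h_in_max, 4) == 0.0117   # ~1.2% throughput loss
```

The factor-of-nch gain is exactly the claim made in the text: the tiled block moves nch values per cycle while keeping nch² kernel evaluations in flight.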
TABLE III. THROUGHPUT AND EFFICIENCY FOR THE INDIVIDUAL STAGES OF OUR REFERENCE CONVOLUTIONAL NETWORK FOR 320×240 INPUT IMAGES.

stage          Stage 1    Stage 2    Stage 3
# channels     3→16       16→64      64→256
ηchIdle        0.38       1.00       1.00
ηfilterLoad    0.99       0.98       0.91
ηborder        0.96       0.91       0.82
η              0.36       0.89       0.75
throughput     71 GOp/s   174 GOp/s  147 GOp/s
# operations   0.35 GOp   1.68 GOp   5.43 GOp
run time       4.93 ms    9.65 ms    36.94 ms
Average throughput: 145 GOp/s → 19.4 frame/s @ 320×240

The size of the filter bank depends quadratically on the number of channels processed, which results in a trade-off between area and I/O bandwidth efficiency. When doubling the I/O efficiency (doubling nch, doubling the I/O bandwidth, quadrupling the number of operations per data word), the storage requirements for the filter bank are quadrupled.

Global memory structures which have to provide lots of data at high speed are often problematic during back-end design. It is thus important to highlight that while this filter bank can be seen as such a global memory structure, it is actually local: each SoP unit only needs to access the filters of the output channel it processes, and no other SoP unit accesses these filters.

3) Sum-of-Products Units: A SoP unit calculates the inner product between an image patch and a filter kernel. It is built from hk wk multipliers and hk wk − 1 adders arranged in a tree. Mathematically, the output of a SoP unit is described as

    Σ_(Δj,Δi)∈Sk  ko,c(Δj, Δi) xc(j − Δj, i − Δi).

While the previous steps have only loaded and stored data, we here perform a lot of arithmetic operations, which raises the question of numerical precision. A fixed-point analysis to select the word width is shown in Section V-B. In terms of architecture, the word width v is doubled by the multiplier, and the adder tree further adds log2(hk wk) bits. We truncate the result to the original word width with the same underlying fixed-point representation. This truncation also reduces the accuracy with which the adder tree and the multipliers have to be implemented. The idea of using the same fixed-point representation for the input and output is motivated by the fact that there are multiple convolutional layers and each output will also serve again as an input.

4) Channel Summer Units: Each ChSum unit sums up the inner products it receives from the SoP unit it is connected to, reducing the amount of data to be transmitted out of the circuit by a factor of 1/nch over the naive way of transmitting the individual convolution results. The SoP units are built to be able to perform this accumulation while still storing the old total results, which are one-by-one transmitted out of the circuit while the next computations are already running. The ChSum units also perform their calculations at full precision, and the results are truncated to the original fixed-point format.

B. Optimizing for Area Efficiency

To achieve a high area efficiency, it is essential that large logic blocks are operated at a high frequency. We can pipeline the multipliers and the adder tree inside the SoP units to achieve the desired clock frequency. The streaming nature of the overall architecture makes it very simple to vary this without the drawbacks encountered with closed-loop architectures.

The limiting factor for the overall clock frequency is the SRAM keeping the image window, which comes with a fixed delay and a minimum clock period, and the speed of the CMOS I/O pads. Because the SRAM's maximum frequency is much lower than that of the computation-density- and power-optimized SoP units, we have chosen to run the latter at twice the frequency. This way each unit calculates two inner products until the image bank changes the channel of the input image or takes a step forward. This makes each SoP unit responsible for two output channels. While there is little change to the image bank, the values taken from the filter bank have to be switched at the faster frequency as well. Additionally, the ChSum units have to be adapted to alternatingly accumulate the inner products of the two different output channels.

The changes induced to the filter bank reduce the number of filter values to be read to nch hk wk / 2 per cycle, however at twice the clock rate. The adapted filter bank has to be able to read one of 2nch sets of nch hk wk / 2 weights each at ffast = 2f.

C. Throughput

The peak throughput of this architecture is given by

    2 nSoP hk wk ffast = 2 nch hk wk f

operations per second. Looking at the SoP units, they can each calculate wk × hk multiplications and additions per cycle. As mentioned before, the clock is running at twice the speed (ffast = 2f) to maximize area efficiency by using only nSoP = nch/2 SoP units. All the other blocks of the circuit are designed to be able to sustain this maximum throughput. Nevertheless, several aspects may cause these core operation units to stall. We discuss these aspects in the following paragraphs.

1) Border Effects: At the borders of an image no valid convolution results can be calculated, so the core has to wait for the necessary data to be transferred to the device. These waiting periods occur at the beginning of a new image, while wk − 1 columns are preloaded, and at the beginning of each new column, while hk − 1 pixels are loaded in nch (hk − 1) cycles. The effective throughput thus depends on the size of the image:

    ηborder = (hin − hk + 1)(win − wk + 1) / (hin win).

The maximum hin is limited to some hin,max depending on the size of the image window SRAM. It is feasible to choose hin large enough to process reasonably sized images; otherwise the image has to be tiled into multiple smaller horizontal
stripes with an overlap of hk − 1 rows, with the corresponding additional efficiency loss.

Assuming hin,max is large enough and considering our reference network, this factor is 0.96, 0.91 and 0.82 for stages 1 to 3, respectively, in case of a 240 × 320 pixel input image. For larger images this is significantly improved, e.g. for a 480 × 640 image Stage 3 gets an efficiency factor of 0.91. However, the height of the input image is limited to 512 pixels due to the memory size of the image bank.

2) Filter Loading: Before the image transmission can start, the filters have to be loaded through the same bus used to transmit the image data. This causes a loss of a few more cycles. Instead of just the nch hin win input data values, an additional nch² hk wk words with the filter weights have to be transferred. This results in an additional efficiency loss by a factor of

    ηfilterLoad = (nch hin win) / (nch² hk wk + nch hin win).

If we choose nch = 8, this evaluates to 0.99, 0.98 and 0.91 for the three stages.

3) Channel Idling: The number of output and input channels usually does not correspond to the number of output and input channels processed in parallel by this core. The output and input channels are partitioned into blocks of nch × nch, filling in all-zero filters for the unused cases. The outputs of these blocks then have to be summed up pixel-wise off-chip.

This processing in blocks can have a strong additional impact on the efficiency when not fully utilizing the core. While for the reasonable choice nch = 8 the stages 2 and 3 of our reference ConvNet can be perfectly split into nch × nch blocks and thus no performance is lost, Stage 1 has only 3 input channels and can load the core only with ηblocks = 3/8. However, stages with a small number of input and/or output channels generally perform far fewer operations, and efficiency in these cases is thus not that important.

The total throughput with the reference ConvNet running on this device with the configuration used for our implementation (cf. Section V) is summarized in Table III, alongside details on the efficiency of the individual stages.

D. System Architecture

When designing the architecture it is important to keep in mind how it can be used in a larger system. This system should be able to take a video stream from a camera, analyze the content of the images using ConvNets (scene labeling, object detection, recognition, tracking), display the result, and transmit alerts or data for further analysis over the network.

1) General Architecture: We elaborate one configuration (cf. Fig. 5), based on which we show the advantages of our design. Besides the necessary peripherals and four Origami chips, there is a 32 bit 800 MHz DDR3 or LPDDR3 memory. The FPGA could be a Xilinx Zynq 7010 device (favorable properties: low cost, an ARM core for control of the circuit and external interfaces, and a decent memory interface). The FPGA has to be configured to include a preprocessing core for rescaling, color space conversion, and local contrastive normalization. To store the data in memory after preprocessing, but also to load and store the data when applying the convolutional layers using the Origami chips and when applying the fully-connected layers, there have to be a DMA controller and a memory controller.

Fig. 5. Suggested system architecture using dedicated Origami chips: a 32 bit, 800 MHz DDR3 memory (6.4 GB/s) attached to an FPGA, which feeds four Origami chips over a 375 MB/s full-duplex data link and processes RGB 320×240 video at 19 frame/s. The same system could also be integrated into a SoC.

The remaining steps of the ConvNet, like summing over the partial results returned by the Origami chips, adding the bias, and applying the ReLU non-linearity and max-pooling, have to be done on the FPGA, but require very little resources since no multipliers are needed there. The only multipliers are required to apply the fully-connected layers following the ConvNet feature extraction, but these do not have to run very fast, since the share of operations in this layer is less than 2% for the scene labeling ConvNet in [27].

For every stage of the ConvNet, we just tile the data into blocks of height hin,max, nch input channels and nch output channels. We then sum up these blocks over the input channels and reassemble the final image in terms of output channels and horizontal stripes.

2) Bandwidth Considerations: In the most naive setup, this means that we need to be able to provide memory accesses for the full I/O bandwidth of every connected Origami chip together. However, we also need to load the previous value of each output pixel, because the results are only partial sums and need to be added for the final result. In any case, the ReLU operation and max-pooling can be done in a scan-line procedure right after computing the final values of the convolutional layers, requiring only a buffer of (ℓ − 1) hin,max / ℓ values for ℓ × ℓ max-pooling, since the max operation can be applied in the vertical and horizontal direction independently (one direction can be done locally).

However, this is far from optimal, and we can improve on it using the same concept as in the Origami chip itself. We can arrange the Origami chips such that they calculate the result of a larger tile of input and output channels, making chips 1&2 and 3&4 share the same input data, while chips 1&3 and 2&4 generate output data which can immediately be summed up before writing it to memory. Analogous to the same principle applied on-chip, this saves a factor of two for read and write accesses to the memory. Of course, the same limitations as explained
in the previous section also apply at the system level.

The pixel-wise fully-connected layers can be computed in a single pass, requiring the entire image to be loaded only once. For the scene labeling ConvNet we require 256 · 64 + 64 · 8 ≈ 17k parameters, which can readily be stored within the FPGA alongside 64 intermediate values during the computations.

This system can also be integrated into a SoC for reduced system size and lower cost as well as improved energy efficiency. This makes the low memory bandwidth requirement the most important aspect of the system: being able to run with a narrow and moderate-bandwidth memory interface translates to lower cost in terms of packaging and to significantly higher energy efficiency (cf. Section VII-C for a more detailed discussion of this on a per-chip basis).

V. IMPLEMENTATION

We first present the general implementation of the circuit. Thereafter, we present the results of a fixed-point analysis to determine the number format. We finalize this section by summarizing the implementation figures and by taking a look at implementation aspects of the entire system.

Fig. 6. Classification accuracy with filter coefficients stored with 12 bit precision. The single-precision implementation achieves an accuracy of 70.3%. Choosing an input length of 12 bit results in an accuracy loss of less than 0.5%.

A. General Implementation

As discussed in Section IV, we operate our design with two clocks originating from the same source, where one is running at twice the frequency of the other. The slower clock f = 250 MHz is driving the I/O and the SRAM, but also other elements which do not need to run very fast, such as the image and filter bank. The SoP units and channel summers are doing most of the computation and run at ffast = 2f = 500 MHz to achieve a high area efficiency. To achieve this frequency, the multipliers and the subsequent adder tree are pipelined. We have added two pipeline stages each for the multipliers and the adder tree.

For the taped-out chip we set the filter size to hk = wk = 7, since we found 7 × 7 and 5 × 5 to be the most common filter sizes, and a device capable of computing larger filter sizes is also capable of calculating smaller ones. For the maximum height of a horizontal image stripe we chose hin,max = 512, requiring an image window SRAM size of 29k words. Due to its size and latency, we have split it into four blocks of 1024 words each with a word width of 7 · 12 bit, as can be seen in the floorplan (Figure 10). The alignment shown has resulted in the best performance of the various configurations we have tried. Two of the RAM macro cells are placed beside each other on the left and the right boundary of the chip, with two of the cells flipped such that the ports face towards the center of the device. For silicon testing, we have included a built-in self-test for the memory blocks.

The pads of the input data bus were placed at the top of the chip around the image bank memory, in which the data is stored after one pipeline stage. The output data bus is located at the bottom-left, together with an in-phase clock output, and the test interface at the bottom-right of the die. The control and clock pads can be found around the center of the right side. Two Vdd and GND core pads were placed at the center of the left and right side each, and one Vdd and GND core pad was placed at the top and bottom of the chip. A pair of Vdd and GND pads for I/O power was placed close to each corner of the chip.

The core clock of 500 MHz is above the capabilities of standard CMOS pads, and on-chip clock generation is unsuitable for such a small chip, while also complicating testing. To overcome this, two phase-shifted clocks of 250 MHz are fed into the circuit. One of the clocks directly drives the clock tree of the slower clock domain inside the chip. This clock is also XOR-ed with the second input clock signal to generate the faster 500 MHz clock.

B. Fixed-Point Analysis

Previous work is not conclusive on the required precision for ConvNets; 16 and 18 bit are the most common values [39], [24], [40], [43]. To determine the optimal data width for our design, we performed a fixed-point analysis based on our reference ConvNet. We replaced all the convolution operations in our software model with fixed-point versions thereof and evaluated the resulting precision depending on the input, output and weight data widths. The quality was analyzed based on the per-pixel classification accuracy of 150 test images omitted during training. We used the other 565 images of the Stanford backgrounds dataset [56] to train the network.

Our results have shown that an output length of 12 bit is sufficient to keep the implementation loss below a drop of 0.5% in accuracy. Since the convolution layers are applied repeatedly with little processing between them, we chose the same signal width for the input, although we could have reduced it further. For the filter weights a signal width of 12 bit was selected as well.

For the implementation, we fixed the filter size hk = wk = 7 and chose nch = 8. Smaller filters have to be zero-padded and larger filters have to be decomposed into multiple 7 × 7 filters and added up. To keep the cycles lost during column changes low also for larger images, we chose hin,max = 512. For the SRAM, the technology libraries we used did not provide a fast enough module to accommodate 8 · 512 words of 7 · 12 bit at 250 MHz. Therefore, the SRAM was split into 4 modules of 1024 words each.
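The bit growth through a SoP unit can be made concrete with a small numeric sketch; the Q1.11 fractional split used for the truncation example is our own illustrative assumption, not something specified in the paper:

```python
import math

V = 12             # word width of pixels and weights (Section V-B)
HK = WK = 7        # filter size of the taped-out chip

prod_bits = 2 * V                                       # multiplier doubles v
tree_bits = prod_bits + math.ceil(math.log2(HK * WK))   # adder-tree growth
assert (prod_bits, tree_bits) == (24, 30)

# Truncating back to the original fixed-point format: with F fractional
# bits per input, a product carries 2F fractional bits, so F are dropped.
F = 11                                  # assumed Q1.11 split (illustrative)
def truncate(product):
    return product >> F

assert truncate((1 << F) * (1 << F)) == 1 << F   # 1.0 * 1.0 stays 1.0
```

This is why the SoP result can be cut back to 12 bit before leaving the unit: the same representation is reused as the input of the next convolutional layer.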
Fig. 7. Final core area breakdown: filter bank 33.9%, SoP units 31.8%, image SRAM 23.6%, others 6.2%, image bank 4.6%.

C. System Implementation

The four Origami chips require 750 MB/s FD for the given implementation (12 bit word length, nch = 8 input and output channels, 250 MHz), using the input and output feature map sharing discussed in Section IV-D to save a factor of 2. The inputs and outputs are read directly from and again written directly to memory. Since the chips output partial sums, these have to be read again from memory and added up for the final convolution result. This can be combined with activation and pooling, adding slightly less than 750 MB/s read and 187.5 MB/s (for 2×2 pooling) write memory bandwidth. For the third stage in the scene labeling ConvNet there is no such pooling layer; instead there is the subsequent pixel-wise classification, which can be applied directly and which reduces the feature maps from 256 to 8, yielding an even lower memory write throughput requirement. To sum up, we require 2.45 GB/s memory bandwidth during processing. To achieve the maximum performance, we have to load the filters at full speed for all four chips independently at the beginning of each processing burst, requiring a memory bandwidth of 1.5 GB/s – less than during computation. This leaves enough bandwidth available for some pre- and post-processing and memory access overhead. In the given configuration, 320×240 frames can be processed at over 75 frame/s, or at an accordingly higher resolution.

VI. RESULTS

We have analyzed and measured multiple metrics of our architecture: I/O bandwidth, throughput per area, and power efficiency. We can run our chip at two operating points: a high-speed configuration with Vdd = 1.2 V and a high-efficiency configuration with Vdd = 0.8 V. We have taped out this chip, and the results below are actual silicon measurement results (cf. Table IV), as opposed to post-layout results, which are known to be quite inaccurate. We now proceed by presenting implementation results and silicon measurements.

1) Implementation: We show the resulting final area breakdown in Figure 7. The filter bank accounts for more than a third of the area and consists of registers storing the filter weights (0.41 mm²) and the multiplexers switching between them (0.03 mm²). The SoP units take up almost another third of the circuit and consist mostly of logic (98188 µm²/unit) and some pipeline registers (26615 µm²/unit). The rest of the space is shared between the image window SRAM, the image bank (0.05 mm² registers and 7614 µm² logic), and other circuitry (I/O registers, data output bus mux, control logic; 0.1 mm²). The chip area is clearly dominated by logic and is thus suited to benefit from voltage and technology scaling.

We have used Synopsys Design Compiler 2013.12 for synthesis and Cadence SoC Encounter 13.14 for back-end design. Synthesis and back-end design have been performed for a clock frequency of f = 350 MHz with typical case corners for the functional setup view and best case corners for the functional hold view. Clock trees have been synthesized for the fast and the slow clock with a maximum delay of 240 ps and a maximum skew of 50 ps. For scan testing, a different view with looser constraints was used. Reset and scan enable signals have also been inserted as clock trees with relaxed constraints. Clock gating was not used. We performed a post-layout power estimation based on switching activity information from a simulation running parts of the scene labeling ConvNet. The total core power was estimated to be 620.8 mW, of which 35.5 mW are used in the ffast clock tree and 41.7 mW in the lower frequency clock tree. Each SoP unit uses 66.9 mW, the filter bank 122.5 mW, the image bank 18.3 mW, and the image window SRAM 43.6 mW. The remaining power is used by the buffers connecting these blocks, the I/O buffers and the control logic. The entire core has only one power domain with a nominal voltage of 1.2 V, and the pad frame uses 1.8 V. The power used for the pads is 205 mW with line termination (50 Ω towards 0.9 V).

2) Silicon Measurements: The ASIC has been named Origami and has been taped out in UMC 65 nm CMOS technology. The key measurement results of the ASIC have been compiled in Table IV. In the high-speed configuration we can apply a 500 MHz clock to the core, achieving a peak throughput of 196 GOp/s. Running the scene labeling ConvNet from [27], we achieve an actual throughput of 145 GOp/s while the core (logic and on-chip memory) consumes 448 mW. This amounts to a power efficiency of 437 GOp/s/W, measured with respect to the peak throughput for comparability to related work. The I/O data interface consists of one input and one output 12 bit data bus running at half of the core frequency, providing a peak bandwidth of 375 MB/s full-duplex. We achieve a very high throughput density of 63.4 GOp/s/mm² despite the generously chosen core area of 3.09 mm² (to accommodate a large enough pad frame for all 55 pins), while the logic and on-chip memory occupy a total area of just 1.31 mm², which would correspond to a throughput density of 150 GOp/s/mm².

When operating our chip in the high-efficiency configuration, the maximum clock speed without inducing any errors is 189 MHz. The peak throughput is scaled accordingly to 74 GOp/s, and 55 GOp/s running our reference ConvNet. The core's power consumption is reduced dramatically to only 93 mW, yielding a power efficiency of 803 GOp/s/W. The required I/O bandwidth shrinks to 142 MB/s full-duplex or 1.92 MB/GOp. The throughput density amounts to 23.9 GOp/s/mm² for this configuration. The chip was originally targeted at the 1.2 V operating point and has hold violations operating at 0.8 V at room temperature. Thus the
measurements have been obtained at a forced ambient temperature of 125°C. The resulting Shmoo plot is shown in Figure 8. Besides the two mentioned operating points there are many more, allowing for a continuous trade-off between throughput and energy efficiency by changing the core supply voltage, as evaluated empirically in Figure 9. As expected, the figures are slightly worse for the measurements at the higher temperature. Static power dissipation takes a share of around 1.25% across the entire voltage range at 25°C and a share of about 10.5% in the interval [0.95 V, 1.25 V], increasing to 14.7% for a core voltage of 0.8 V at 125°C.

TABLE IV. MEASURED SILICON KEY FIGURES.

Physical Characteristics
  Technology              UMC 65 nm, 8 metal layers
  Core/Pad Voltage        1.2 V / 1.8 V
  Package                 QFN-56
  # Pads                  55 (i: 14, o: 13, clk/test: 8, pwr: 20)
  Core Area               3.09 mm²
  Circuit Complexity (a)  912 kGE (1.31 mm²)
  Logic (std. cells)      697 kGE (1.00 mm²)
  On-chip SRAM            344 kbit

Performance & Efficiency @ 1.2 V
  Max. Clock Frequency    core: 500 MHz, I/O: 250 MHz
  Power (a) @ 500 MHz     449 mW (core) + 205 mW (pads)
  Peak Throughput         196 GOp/s
  Effective Throughput    145 GOp/s
  Core Power-Efficiency   437 GOp/s/W

Performance & Efficiency @ 0.8 V
  Max. Clock Frequency    core: 189 MHz, I/O: 95 MHz
  Power (b) @ 189 MHz     93 mW (core) + 144 mW (pads)
  Peak Throughput         74 GOp/s
  Effective Throughput    55 GOp/s
  Core Power-Efficiency   803 GOp/s/W

(a) Including the SRAM blocks.
(b) The power usage was measured running with real data and at maximum load.

Fig. 8. Shmoo plot showing the number of incorrect results in dependence of the frequency (f = ffast/2, x-axis, in MHz) and the core voltage (Vcore, y-axis, in V) at 125°C. Green means no errors.

Fig. 9. Measured energy efficiency and throughput in dependence of Vcore for 25°C and 125°C (axes: efficiency in GOp/s/W and throughput in GOp/s over Vcore = 0.8 V to 1.2 V).

VII. DISCUSSION

None of the previous work on ConvNet accelerators has silicon measurement results. We will thus compare to post-layout and post-synthesis results of state-of-the-art related works, although such simulation results are known to be optimistic. We have listed the key figures of all these works in Table V and discuss the various results in the sections below.

A. Area Efficiency

Our chip is the most area-efficient ConvNet accelerator reported in the literature. We measure the area in terms of 2-input NAND gate equivalents to compensate for technology differences to some extent. With 90.7 GOp/s/MGE our implementation is by far the most area-efficient, and even in the high power-efficiency configuration we outperform previous state-of-the-art results. The next best implementation is a NeuFlow design at 33.8 GOp/s/MGE, requiring a factor 3 more space for the same performance. ShiDianNao is of comparable area efficiency with 26.3 GOp/s/MGE. Also note that the chip size was limited by the pad frame, and that the area occupied by the standard cells and the on-chip SRAM is only 1.31 mm² (0.91 MGE). We would thus achieve a throughput density of an enormous 215 GOp/s/MGE in this very optimistic scenario. This would require a more complex and expensive pad-frame architecture, e.g. flip-chip with multiple rows of pads, which we decided not to implement.

We see the reason for these good results in our approach to compute multiple input and output channels in parallel. This way we have to buffer a window of 8 input channels to compute 64 convolutions, instead of buffering 64 input images, a significant saving of storage, particularly also because the window of the input images that has to be buffered is a lot larger than the size of a single convolution kernel. Another 25% can be attributed to the use of 12 bit instead of 16 bit words, which expresses itself mostly in the size of the SRAM and the filter kernel buffer.
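As a sanity check, the measured peak-throughput and area-efficiency figures are internally consistent with the throughput formula of Section IV-C (nSoP = 4, hk = wk = 7) and the 2.16 MGE die size listed in Table V:

```python
# Peak throughput = 2 * nSoP * hk * wk * ffast operations per second.
def peak_gops(f_core_hz, n_sop=4, h_k=7, w_k=7):
    return 2 * n_sop * h_k * w_k * f_core_hz / 1e9

assert round(peak_gops(500e6)) == 196    # high-speed configuration, 1.2 V
assert round(peak_gops(189e6)) == 74     # high-efficiency configuration

# Area efficiency in GOp/s/MGE, using the 2.16 MGE die size from Table V:
die_mge = 2.16
assert round(196 / die_mge, 1) == 90.7   # high-speed
assert round(74 / die_mge, 1) == 34.3    # high-efficiency
```

The same 392 Op/cycle figure underlies both operating points; only the achievable core clock changes with the supply voltage.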
TABLE V. S UMMARY OF R ELATED W ORK FOR A W IDE R ANGE OF P LATFORMS (CPU, GPU, FPGA, ASIC).

publication type platform theor.a peaka act.a powerb power eff. prec. Vcore areaj area eff.h
GOp/s GOp/s GOp/s W GOp/s/W V MGE GOp/s/MGE

Cavigelli et al. [27] CPU Xeon E5-1620v2 118 35d 230 0.15 float32
Cavigelli et al. [27] GPU GTX780 3977 3030 1908d sd:200 14g float32
cuDNN R3 [35] GPU Titan X 6600 6343 d:250 25.6g float32
Cavigelli et al. [27] SoC Tegra K1 365 95 84d s:11 8.6 float32

CNP [39] FPGA Virtex4 40 40 37 s:10 3.7 fixed16


NeuFlow [24] FPGA Virtex6 VLX240T 160 160 147 s:10 14.7 fixed16
nn-X [40] FPGA Zynq XC7Z045 227 227 200 s:8 d+m:4 25 fixed16
Zhang et al. [42] FPGA/HLS Virtex7 VX485T 62 62 s:18.6 3.3 float32

ConvEngine [46] synth. 45nm 410 c:1.0 409 fixed10 0.9


ShiDianNao [44] layout TSMC 65nm 128 c:0.32 400 fixed16 dh :4.86 26.3
NeuFlow [24] layout IBM 45nm SOI 1280 1280 1164 d:5 230 fixed16 1.0 d:38.46 33.3
NeuFlow [25] layout IBM 45nm SOI 320 320 294 c:0.6 490 fixed16 1.0 d:19.23 16.6
HWCE [43] layout ST 28nm FDSOI 37 37 c:0.18 206 fixed16 0.8
HWCE [43] layout ST 28nm FDSOI 1 1 c:0.73m 1375 fixed16 0.4

this work silicon umc 65nm 196 196 145d c:0.51f 437 fixed12 1.2 c:0.91 d:2.16 90.7
this work silicon umc 65nm 74 74 55d c:0.093f 803 fixed12 0.8 c:0.91 d:2.16 34.3
a
We distinguish between theoretical performance, where we consider the maximum throughput of the arithmetic units, the peak throughput, which is the maximum throughput for
convolutional layers of any size, and the actual throughput, which has been benchmarked for a real ConvNet and without processing in batches.
b
For the different types of power measurements, we abbreviate: s (entire system), d (device/chip), c (core), m (memory), io (pads), sd (system differential load-vs-idle).
c
We use the abbreviations c (core area, incl. SRAM), d (die size)
d
These values were obtained for the ConvNet described in [27].
f
The static power makes up for around 1.3% of the total power at 25°C for the entire range of feasible Vcore , and about 11% at 125°C.
g
The increased energy efficiency of the Titan X over the GTX780 is significant and can neither be attributed solely to technology (both 28 nm) nor the software implementation or
memory interface (both GDDR5). Instead the figures published by Nvidia suggest that architectural changes from Kepler to Maxwell are the source of this improvement.
h
We take the theoretical performance to be able to compare more works and the device/chip size for the area. ShiDianNao does not include a pad ring in their layout (3.09mm2 ),
so we added it for better comparability (height 90µm).
j
We measure area in terms of size of millions of 2-input NAND gates. 1GE: 1.44µm2 (umc 65nm), 1.17µm2 (TSMC 65nm), 0.65µm2 (45nm), 0.49µm2 (ST 28nm FDSOI).

B. Bandwidth Efficiency

The number of I/O pins is often one of the most congested resources when designing a chip, and the fight for bandwidth is even more present when the valuable memory bandwidth of an SoC has to be shared with accelerators. We achieve a bandwidth efficiency of 521 GOp/GB, an improvement by a factor of more than 10× over the best previous work: NeuFlow comes with a memory bandwidth of 6.4 GB/s to provide 320 GOp/s, i.e. it can perform 50 GOp/GB. ShiDianNao does not provide any information on the external bandwidth, and the HWCE can do 6.1 GOp/GB. These large differences, particularly between this work and previous results, can be attributed to us having focused on reducing the required bandwidth while maximizing throughput on a small piece of silicon. The architecture has been designed to maximize reuse of the input data by calculating pixels of multiple output channels in parallel, bringing a significant improvement over caching as in [44] or accelerating individual 2D convolutions [46], [43].

C. Power Efficiency

Our chip performs second-best in terms of energy efficiency of the core with 803 GOp/s/W (high-efficiency configuration) and 437 GOp/s/W (high-performance configuration), being outperformed only by the HWCE. The HWCE can reach up to 1375 GOp/s/W in its high-efficiency setup when running at 0.4 V and making use of reverse body biasing, available only with FDSOI technology to this extent. Our chip is then followed by NeuFlow (490 GOp/s/W), the Convolution Engine (409 GOp/s/W) and ShiDianNao (400 GOp/s/W).

However, technology has a strong impact on the energy efficiency. Our design was done in UMC 65 nm, while NeuFlow was using IBM 45 nm SOI and the HWCE even resorted to ST 28 nm FDSOI. In order to analyze our architecture independently of the particular implementation technology used, we take a look at its effect using the simple model

$\tilde{P} = P \, \frac{\ell_{\text{new}}}{\ell_{\text{old}}} \left(\frac{V_{\text{dd,new}}}{V_{\text{dd,old}}}\right)^{2}.$

The projected results are shown in Table VI. To obtain the operating voltage in 28 nm technology, we scale the operating voltage linearly with respect to the common operating voltage of the used technology. This projection, although not based on a very accurate model, gives an idea of how the various implementations perform in a recent technology. Clearly, the only competitive results in terms of core power efficiency are ShiDianNao and this work. Previous work has always excluded I/O power, although it is a major contributor to the overall power usage. We estimate this power based on an energy usage of 21 pJ/bit, which has been reported for an LPDDR3 memory model and the PHY on the chip in 28 nm [57], assuming a reasonable output load and a very high page hit rate. For our chip, this amounts to an additional 63 mW or 24 mW for the high-throughput and high-efficiency configurations, respectively.
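Several of the comparison figures above can be reproduced with a short script. The sketch below (illustrative Python, not part of the paper; the 21 pJ/bit figure is the LPDDR3+PHY energy reported in [57], and 8 bits per byte is assumed) cross-checks the bandwidth-efficiency and I/O-power estimates and implements the stated technology-scaling model:

```python
# Cross-check of the comparison figures quoted in Sections B and C.
# Assumption: 21 pJ per transferred bit (LPDDR3 memory + PHY, 28 nm [57]).
PJ_PER_BIT = 21e-12  # joules per transferred bit

def bandwidth_efficiency(gops, gb_per_s):
    """Throughput per unit of external memory bandwidth, in GOp/GB."""
    return gops / gb_per_s

def io_power(gb_per_s):
    """Estimated I/O power in watts for a given external bandwidth."""
    return gb_per_s * 1e9 * 8 * PJ_PER_BIT

def scale_power(p, l_old_nm, l_new_nm, v_old, v_new):
    """Technology-scaling model P~ = P * (l_new/l_old) * (V_new/V_old)^2."""
    return p * (l_new_nm / l_old_nm) * (v_new / v_old) ** 2

# NeuFlow: 320 GOp/s over 6.4 GB/s -> 50 GOp/GB, about 1.08 W of I/O power.
print(bandwidth_efficiency(320, 6.4))      # 50.0
print(round(io_power(6.4), 2))             # 1.08

# This work (high-throughput): 196 GOp/s at 521 GOp/GB -> ~0.38 GB/s,
# i.e. roughly 63 mW of I/O power.
print(round(io_power(196 / 521) * 1e3))    # 63

# Scaling from 65 nm / 0.8 V down to 28 nm / 0.53 V shrinks power to ~19%.
print(round(scale_power(1.0, 65, 28, 0.8, 0.53), 2))   # 0.19
```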
For NeuFlow, due to the much higher I/O bandwidth, it looks worse with an additional 1.08 W for their 320 GOp/s implementation. If we assume the power efficiency of these devices in their original technology, this reduces the power efficiency including I/O to 342 GOp/s/W, 632 GOp/s/W and 191 GOp/s/W for our chip in high-throughput and high-efficiency configuration as well as NeuFlow. If we look at their projected 28 nm efficiency, they are decreased to 1315 GOp/s/W, 2326 GOp/s/W and 243 GOp/s/W. This clearly shows the importance of the reduced I/O bandwidth in our design, and the relevance of I/O power in general, with it making up a share of 42% to 82% of the total power consumption for these three devices.

TABLE VI. Projected Power and Power-Efficiency When Scaled to 28 nm Technology

publication        Vcore (V)   power (mW)   efficiency (GOp/s/W)
ConvEngine [46]    0.72        398          1030
ShiDianNao [44]    0.8         61.3         2098
NeuFlow [24]       0.8         239          1339
HWCE [43]          0.8         180          260
HWCE [43]          0.4         0.73         1375
this work          0.8         86.1         2276
this work          0.53        7.81         9475

Fig. 10. Floorplan and die shot of the final chip. In the floorplan view, the cells are colored by functional unit, making the low density in the sum-of-products computation units clearly visible.

VIII. CONCLUSIONS & FUTURE WORK

We have presented the first silicon measurement results of a convolutional network accelerator. The developed architecture is also the first to scale to multi-TOp/s performance by significantly improving on the external memory bottleneck of previous architectures. It is more area efficient than any previously reported results and comes with the lowest-ever reported power consumption when compensating for technology scaling. Further work with newer technologies, programmable logic and further configurability to build an entire high-performance low-power system is planned, alongside investigations into the ConvNet learning phase to adapt networks for very-low-precision accelerators during training.

REFERENCES

[1] F. Porikli, F. Bremond, S. L. Dockstader et al., "Video surveillance: past, present, and now the future [DSP Forum]," IEEE Signal Process. Mag., vol. 30, pp. 190–198, 2013.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification With Deep Convolutional Neural Networks," in Adv. Neural Inf. Process. Syst., 2012.
[3] C. Szegedy, W. Liu, Y. Jia et al., "Going Deeper with Convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., sep 2015.
[4] P. Sermanet, D. Eigen, X. Zhang et al., "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks," in Int. Conf. Learn. Represent., dec 2014.
[5] P. Fischer, A. Dosovitskiy, E. Ilg et al., "FlowNet: Learning Optical Flow with Convolutional Networks," in arXiv:1504.06852, 2015.
[6] J. Revaud, P. Weinzaepfel, Z. Harchaoui et al., "EpicFlow: Edge-preserving interpolation of correspondences for optical flow," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., jun 2015, pp. 1164–1172.
[7] J. Zbontar and Y. LeCun, "Computing the Stereo Matching Cost with a Convolutional Neural Network," Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[8] Y. Taigman and M. Yang, "DeepFace: Closing the gap to human-level performance in face verification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013.
[9] Y. LeCun, B. Boser, J. S. Denker et al., "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
[10] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional Networks and Applications in Vision," in Proc. IEEE Int. Symp. Circuits Syst., may 2010, pp. 253–256.
[11] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: Common Objects in Context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[12] C. Shi, J. Yang, Y. Han et al., "7.3 A 1000fps vision chip based on a dynamically reconfigurable hybrid architecture comprising a PE array and self-organizing map neural network," in Proc. IEEE Int. Conf. Solid-State Circuits, feb 2014, pp. 128–129.
[13] C. Labovitz, S. Iekel-Johnson, D. McPherson et al., "Internet inter-domain traffic," ACM SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 75–86, 2011.
[14] C. Bobda and S. Velipasalar, Eds., Distributed Embedded Smart Cameras. Springer, 2014.
[15] Markets and Markets Inc., "Machine Vision Market Worth $9.50 Billion by 2020," 2014. [Online]. Available: http://www.marketsandmarkets.com/PressReleases/machine-vision-systems.asp
[16] Movidius Inc., "Myriad 2 Vision Processor Product Brief," 2014. [Online]. Available: http://uploads.movidius.com/1441734401-Myriad-2-product-brief.pdf
[17] J. Campbell and V. Kazantsev, "Using an Embedded Vision Processor to Build an Efficient Object Recognition System," 2015.
[18] Mobileye Inc., "System-on-Chip for driver assistance systems," CAN Newsl., vol. 4, pp. 4–5, 2011.
[19] L. Cavigelli, D. Gschwend, C. Mayer et al., "Origami: A Convolutional Network Accelerator," in Proc. ACM Gt. Lakes Symp. VLSI. ACM Press, 2015, pp. 199–204.
[20] C. Dong, C. C. Loy, K. He et al., "Learning a deep convolutional network for image super-resolution," Proc. Eur. Conf. Comput. Vis., pp. 184–199, 2014.
[21] C. Farabet, C. Couprie, L. Najman et al., "Learning Hierarchical Features for Scene Labeling," IEEE Trans. Pattern Anal. Mach. Intell., 2013.
[22] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[23] P. Weinzaepfel, J. Revaud, Z. Harchaoui et al., "DeepFlow: Large Displacement Optical Flow with Deep Matching," in Proc. IEEE Int. Conf. Comput. Vis., dec 2013, pp. 1385–1392.
[24] C. Farabet, B. Martini, B. Corda et al., "NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Work., jun 2011, pp. 109–116.
[25] P. H. Pham, D. Jelaca, C. Farabet et al., "NeuFlow: Dataflow vision processing system-on-a-chip," in Proc. Midwest Symp. Circuits Syst., 2012, pp. 1044–1047.
[26] T. Chen, Z. Du, N. Sun et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," in Proc. ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 269–284.
[27] L. Cavigelli, M. Magno, and L. Benini, "Accelerating Real-Time Embedded Scene Labeling with Convolutional Networks," in Proc. ACM/IEEE Des. Autom. Conf., 2015.
[28] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in Proc. Int. Conf. Learn. Represent., sep 2015.
[29] O. Russakovsky, J. Deng, H. Su et al., "ImageNet Large Scale Visual Recognition Challenge," 2014.
[30] K. He, X. Zhang, S. Ren et al., "Deep Residual Learning for Image Recognition," vol. 7, no. 3, pp. 171–180, dec 2015.
[31] R. Collobert, "Torch7: A Matlab-like Environment for Machine Learning," Adv. Neural Inf. Process. Syst. Work., 2011.
[32] Y. Jia, "Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding," 2013. [Online]. Available: http://caffe.berkeleyvision.org
[33] S. Chetlur, C. Woolley, P. Vandermersch et al., "cuDNN: Efficient Primitives for Deep Learning," in arXiv:1410.0759, oct 2014.
[34] Nervana Systems Inc., "Neon Framework," 2015. [Online]. Available: http://neon.nervanasys.com
[35] S. Chintala, "convnet-benchmarks," 2015. [Online]. Available: https://github.com/soumith/convnet-benchmarks
[36] A. Lavin, "maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs," in arXiv:1501.06633v3, jan 2015.
[37] Nvidia Inc., "NVIDIA Tegra X1 Whitepaper," 2015.
[38] M. Mathieu, M. Henaff, and Y. LeCun, "Fast Training of Convolutional Networks through FFTs," in arXiv:1312.5851, dec 2014.
[39] C. Farabet, C. Poulet, J. Y. Han et al., "CNP: An FPGA-based processor for Convolutional Networks," in Proc. IEEE Int. Conf. F. Program. Log. Appl., 2009, pp. 32–37.
[40] V. Gokhale, J. Jin, A. Dundar et al., "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 682–687.
[41] K. Ovtcharov, O. Ruwase, J.-y. Kim et al., "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware," Microsoft Research, Tech. Rep., 2015.
[42] C. Zhang, P. Li, G. Sun et al., "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in Proc. ACM Int. Symp. Field-Programmable Gate Arrays, 2015, pp. 161–170.
[43] F. Conti and L. Benini, "A Ultra-Low-Energy Convolution Engine for Fast Brain-Inspired Vision in Multicore Clusters," in Proc. IEEE Des. Autom. Test Eur. Conf., 2015.
[44] Z. Du, R. Fasthuber, T. Chen et al., "ShiDianNao: Shifting Vision Processing Closer to the Sensor," in Proc. ACM/IEEE Int. Symp. Comput. Architecture, 2015.
[45] T. Chen, Z. Du, N. Sun et al., "A High-Throughput Neural Network Accelerator," IEEE Micro, vol. 35, no. 3, pp. 24–32, 2015.
[46] W. Qadeer, R. Hameed, O. Shacham et al., "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing," in Proc. ACM Int. Symp. Comput. Archit., 2013, pp. 24–35.
[47] S. Park, K. Bong, D. Shin et al., "A 1.93TOPS/W Scalable Deep Learning/Inference Processor with Tetra-Parallel MIMD Architecture for Big-Data Applications," in Proc. IEEE Int. Conf. Solid-State Circuits, 2015, pp. 80–82.
[48] Q. Yu, C. Wang, X. Ma et al., "A Deep Learning Prediction Process Accelerator Based FPGA," Proc. IEEE/ACM Int. Symp. Clust. Cloud Grid Comput., no. 500, pp. 1159–1162, 2015.
[49] T. Moreau, M. Wyse, J. Nelson et al., "SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2015, pp. 603–614.
[50] Q. Zhang, T. Wang, Y. Tian et al., "ApproxANN: An Approximate Computing Framework for Artificial Neural Network," in Proc. IEEE Des. Autom. Test Eur. Conf., 2015, pp. 701–706.
[51] F. Akopyan, J. Sawada, A. Cassidy et al., "TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip," IEEE Trans. Comput. Des. Integr. Circuits Syst., vol. 34, no. 10, pp. 1537–1557, oct 2015.
[52] M. Courbariaux, Y. Bengio, and J.-P. David, "Training Deep Neural Networks with Low Precision Multiplications," in Proc. Int. Conf. Learn. Represent., 2015.
[53] S. Gupta, A. Agrawal, K. Gopalakrishnan et al., "Deep Learning with Limited Numerical Precision," in Proc. Int. Conf. Mach. Learn., 2015.
[54] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Adv. Neural Inf. Process. Syst. Work., 2011.
[55] G. Soulié, V. Gripon, and M. Robert, "Compression of Deep Neural Networks on the Fly," in arXiv:1509.08745, 2015.
[56] S. Gould, R. Fulton, and D. Koller, "Decomposing a scene into geometric and semantically consistent regions," in Proc. IEEE Int. Conf. Comput. Vis., 2009.
[57] M. Schaffner, F. K. Gürkaynak, A. Smolic et al., "DRAM or no-DRAM? Exploring Linear Solver Architectures for Image Domain Warping in 28 nm CMOS," in Proc. IEEE Des. Autom. Test Eur., 2015.

Lukas Cavigelli received the M.Sc. degree in electrical engineering and information technology from ETH Zurich, Zurich, Switzerland, in 2014. Since then he has been with the Integrated Systems Laboratory, ETH Zurich, pursuing a Ph.D. degree. His current research interests include deep learning, computer vision, digital signal processing, and low-power integrated circuit design. Mr. Cavigelli received the best paper award at the 2013 IEEE VLSI-SoC Conference.

Luca Benini is the Chair of Digital Circuits and Systems at ETH Zurich and a Full Professor at the University of Bologna. He has served as Chief Architect for the Platform2012/STHORM project at STMicroelectronics, Grenoble. He has held visiting and consulting researcher positions at EPFL, IMEC, Hewlett-Packard Laboratories, and Stanford University. Dr. Benini's research interests are in energy-efficient system and multi-core SoC design. He is also active in the area of energy-efficient smart sensors and sensor networks for biomedical and ambient intelligence applications. He has published more than 700 papers in peer-reviewed international journals and conferences, four books and several book chapters. He is a Fellow of the IEEE and a member of the Academia Europaea.