Origami
Abstract—An ever increasing number of computer vision and image/video processing challenges are being approached using deep convolutional neural networks, obtaining state-of-the-art results in object recognition and detection, semantic segmentation, action recognition, optical flow and superresolution. Hardware acceleration of these algorithms is essential to adopt these …

… production environment [3], [8], [11]. These companies are mainly interested in running such algorithms on powerful compute clusters in large data centers.

With the increasing number of imaging devices the importance of digital signal processing in imaging continues to grow. The amount of on- and near-sensor computation is …
• An implementation of this architecture with optimized precision using fixed-point evaluations, constrained for an accelerator-sized ASIC.
• Silicon measurements of the taped-out ASIC, providing experimental characterization of the silicon.
• A thorough comparison to and discussion of previous work.

Organization of the paper: Section II briefly introduces convolutional networks and highlights the need for acceleration. Previous work is investigated in Section III, discussing available software, FPGA and ASIC implementations and explaining the selection of our design objectives. In Section IV we present our architecture and its properties. The implementation aspects are shown in Section V. We present our results in Section VI and discuss and compare them in Section VII. We conclude the paper in Section VIII.

… rectified linear unit (ReLU) [2], [4], [5], which designates the function x ↦ max(0, x). The activation function introduces non-linearity into neural networks, giving them the potential to be more powerful than linear methods. Typical filter sizes range from 5 × 5 to 9 × 9, sometimes even 11 × 11 [2], [4], [21].

v = act_ReLU(y),   v_o(j, i) = max(y_o(j, i), 0)   (5)

The feature extractor with the convolutional layers is usually followed by a classification step with fully-connected neural network layers interspersed with activation functions, reducing the dimensionality from several hundred or even thousands down to the number of classes. In case of scene labeling these fully-connected layers are just applied on a per-pixel basis, with the inputs being the values of all the channels at any given pixel [22].
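To make the activation function of Equation (5) and the per-pixel classification concrete, the following minimal Python/NumPy sketch applies the ReLU element-wise to a stack of feature maps and then evaluates a fully-connected classifier independently at every pixel. The array shapes, the number of classes and all variable names are illustrative assumptions, not parameters of the accelerator described here.

    import numpy as np

    def relu(y):
        # Equation (5): v_o(j, i) = max(y_o(j, i), 0), applied element-wise
        return np.maximum(y, 0.0)

    # y: feature maps of shape (channels, height, width); values are arbitrary here
    y = np.random.randn(16, 8, 8).astype(np.float32)
    v = relu(y)

    # Per-pixel fully-connected classification (scene labeling): the same weight
    # matrix W (classes x channels) is applied at every pixel position, with the
    # channel values at that pixel as the inputs.
    W = np.random.randn(4, 16).astype(np.float32)   # 4 hypothetical classes
    b = np.zeros(4, dtype=np.float32)
    scores = np.einsum('kc,chw->khw', W, v) + b[:, None, None]
    labels = scores.argmax(axis=0)                   # per-pixel class map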
… the performance can reach up to 3059 GOp/s for some special problems and about 1800 GOp/s on meaningful ConvNets. On the Tegra K1 up to 96 GOp/s can be achieved, with 76 GOp/s being achieved with an actual ConvNet. On both platforms an energy-efficiency of about 7 GOp/s/W considering the power of the entire platform and 14.4 GOp/s/W with differential power measurements can be obtained [27]. Except for this evaluation, the focus is usually on training speed, where multiple images are processed together in batches to attain higher performance (e.g. using the loaded filter values for multiple images). Batch processing is not suitable for real-time applications, since it introduces a delay of many frames.

A comparison of the throughput of many optimized software implementations for GPUs based on several well-known ConvNets is provided in [35]. The list is led by an implementation by Nervana Systems, of which details on how it works are not known publicly. They confirm that it is based on maxDNN [36], which started from an optimized matrix-matrix multiplication, adapted for convolutional layers and with fine-tuned assembly code. Their implementation is tightly followed by Nvidia's cuDNN library [33]. The edge of these two implementations over others originates from using half-precision floating-point representations instead of single-precision for storage in memory, thus reducing the required memory bandwidth, which is the currently limiting factor. New GPU-based platforms such as the Nvidia Tegra X1 now support half-precision computation [37], which can be used to save power or provide further speedup, but no thorough investigations have been published on this. More computer vision silicon has been presented recently with the Movidius Myriad 2 device [16], which has been used in Google Tango, and the Mobileye EyeQ3 platform, but no benchmarking results regarding ConvNets are available yet.

A different approach to increase throughput is the use of the Fourier transform, diagonalizing the convolution operation. While this has a positive effect for kernels larger than 9 × 9, the bandwidth problem generally becomes much worse and the already considerable memory requirements are boosted further, since the filters have to be padded to the input image size [38], [27].

However optimized the software running on such platforms, it will always be constrained by the underlying architecture: the arithmetic precision cannot be adapted to the needs of the computation, caches are used instead of optimized on-chip buffers, and instructions have to be loaded and decoded. This pushes the need for specialized architectures to achieve high power- and area-efficiency.

B. FPGA Implementations

Embeddability and energy-efficiency are a major concern regarding commercialization of ConvNet-based computer vision systems and have hence prompted many researchers to approach this issue using FPGA implementations. Arguably the most popular architecture is the one which started as CNP [39], was further improved and renamed to NeuFlow [24], [25], and later on to nn-X [40].

Published in 2009, CNP was the first ConvNet-specific FPGA implementation and achieved 12 GOp/s at 15 W on a Spartan 3A DSP 3400 FPGA using 18 bit fixed-point arithmetic for the multiplications. Its architecture was designed to be self-contained, allowing it to execute the operations for all common ConvNet layers, and coming with a soft CPU to control the overall program flow. It also features a compiler, converting network implementations with Torch directly to CNP instructions.

The CNP architecture does not allow easy scaling of its performance, prompting the follow-up work NeuFlow, which uses multiple CNP convolution engines, an interconnect, and a smart DMA controller. The data flow between the processing tiles can be rerouted at runtime. The work published in 2011 features a Virtex 6 VLX240T to achieve 147 GOp/s at 11 W using 16 bit fixed-point arithmetic.

To make use of the newly available platform ICs, NeuFlow was ported to a Zynq XC7Z045 in 2014, further improved by making use of the hard-wired ARM cores, and renamed to nn-X. It further increases the throughput to about 200 GOp/s at 4 W (FPGA, memory and host) and uses 4 × 950 MB/s full-duplex memory interfaces.

Only a few alternatives to CNP/NeuFlow/nn-X exist. The two most relevant are a ConvNet accelerator based on Microsoft's Catapult platform in [41], about which very few details are known, and an HLS-based implementation [42] with a performance and energy efficiency inferior to nn-X.

C. ASIC Implementations

The NeuFlow architecture was implemented as an ASIC in 2012 on 12.5 mm2 of silicon for the IBM 45 nm SOI process. The results based on post-layout simulations were published in [25], featuring a performance of about 300 GOp/s at 0.6 W, operating at 400 MHz with an external memory bandwidth of 4 × 1.6 GB/s full-duplex.

To explore the possibilities in terms of energy efficiency, a convolution accelerator suitable for small ConvNets was implemented in ST 28 nm FDSOI technology [43]. They achieve 37 GOp/s with 206 GOp/s/W at 0.8 V and 1.39 GOp/s with 1375 GOp/s/W at 0.4 V during simulation (pre-silicon) with the same implementation, using aggressive voltage scaling combined with reverse body biasing available with FDSOI technology.

Further interesting aspects are highlighted in ShiDianNao [44], [45], which evolved from DianNao [26]. The original DianNao was tailored to fully-connected layers, but was also able to evaluate convolutional layers. However, its buffering strategy did not make use of the 2D structure of the computational problem at hand. This was improved in ShiDianNao. Nevertheless, its performance strongly depends on the size of the convolutional layer to be computed, only unfolding its performance for tiny feature maps and networks. They achieve a peak performance of 128 GOp/s with 320 mW on a core-only area of 1.3 mm2 in a TSMC 65 nm post-layout evaluation.

Another way to approach the problem at hand is to look at general convolution accelerators, such as the ConvEngine [46], which particularly targets 1D and 2D convolutions common in computer vision applications. It comes with an array of 64 10-bit ALUs and input and output buffers optimized for the task …
[Fig. 3 block diagram: input stream, image window SRAM (12 nch hin,max wk = 344 kbit), image bank (12 nch hk wk = 5.4 kbit of registers), filter bank (12 nch nch hk wk = 37.6 kbit of registers), four SoP units with hk wk = 49 multipliers and adders each, and a test interface; clocks f = 250 MHz and ffast = 2 f.]
Fig. 3. Top-level block diagram of the proposed architecture for the chosen implementation parameters.
Fig. 4. Time diagram of the input and output data transfers. Internal activity is strongly related up to small delays (blue = load col = image window SRAM r/w active = image bank write active; red = output = ChSum active = SoP active = image bank reading = filter bank reading; green = load weights = filter bank shifting/writing).

… unit to compute the complete result, which is then transmitted out of the circuit.

For our architecture we tile the convolutional layer into blocks with a fixed number of input and output channels nch. We perform 2 nch² hk wk operations every nch clock cycles, while transmitting and receiving nch values instead of nch². This is different from all previous work and improves the throughput per bandwidth by a factor of nch. The architecture can also be formulated for non-equal block sizes for the input and output channels, but there is no advantage in doing so, thus we keep this constraint for simplicity of notation. We proceed by presenting the individual blocks of the architecture in more detail.

1) Image Window SRAM and Image Bank: The image window SRAM and the image bank are in charge of storing newly received image data and providing the SoP units with the image patch required for every computation cycle. To minimize … input data is loaded from the image window SRAM and shifted into the image bank. The situation is illustrated in Figure 2 for an individual channel.

In order for the SRAM to be able to provide this minimum amount of data, it needs to store a wk element wide window for all nch channels and have selectable height hin ≤ hin,max. A large image has to be fed into the circuit in stripes of maximum height hin,max with an overlap of hk − 1 pixels. The overlap is needed because an evaluation of the kernel requires a surrounding of (hk − 1)/2 pixels in height and (wk − 1)/2 pixels in width. When the image bank reaches the bottom of the image window stored in SRAM, it jumps back to the top, but shifted one pixel to the right. This introduces a delay of nch (hk − 1) cycles, during which the rest of the circuit is idling. This delay is not only due to the loading of the new values for the image bank, but also to the time needed to receive the new pixels for the image window SRAM through the external I/O. Choosing hin,max is thus mostly a trade-off between throughput and area. The performance penalty on the overall circuit is about a factor of (hk − 1)/hin,max. The same behavior can be observed at the beginning of a horizontal stripe: during the first nch hin (wk − 1) cycles the processing units are idling.

2) Filter Bank: The filter bank stores all the weights of the filters, i.e. nch² hk wk values. In configuration mode the filter values are shifted into these registers, which are clocked at the lower frequency f. In normal operation, the entire filter bank is read-only. In each cycle all the filter values supplied to the SoP units have to be changed, which means that nch hk wk filter values are read per cycle. Because so many filter values have to be read in parallel and they change so frequently, it is not possible to keep them in an SRAM. Instead, the filter bank is implemented with registers and a multiplexer capable of selecting one of nch sets of nch hk wk weights.
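The following Python/NumPy sketch illustrates the tiling scheme described above for one block of nch input and nch output channels: over a group of nch cycles, only nch input values enter and nch output values leave the block, while the SoP units perform 2 nch² hk wk operations on the buffered window and the filter bank. The parameters nch = 8 and hk = wk = 7 match the implementation parameters of Fig. 3; the reference computation and the variable names are assumptions made only for this illustration, and the exact cycle-by-cycle assignment of channel pairs to the four SoP units is simplified.

    import numpy as np

    n_ch, h_k, w_k = 8, 7, 7        # channels per block and kernel size (cf. Fig. 3)
    f_fast = 500e6                   # SoP clock in Hz (ffast = 2 f)

    # Filter bank: n_ch x n_ch kernels of h_k x w_k weights (read-only in operation)
    filters = np.random.randn(n_ch, n_ch, h_k, w_k)
    # Image bank: one h_k x w_k window per input channel, refilled from the SRAM
    window = np.random.randn(n_ch, h_k, w_k)

    # One output pixel for each of the n_ch output channels; all (input, output)
    # channel pairs are accumulated while the window stays in the image bank, so
    # each input pixel is reused n_ch times and never re-read from off-chip.
    out_pixels = np.einsum('oihw,ihw->o', filters, window)

    ops_per_block = 2 * n_ch**2 * h_k * w_k        # one multiply + one add per weight
    print(ops_per_block / n_ch)                    # 784 operations per (slow) cycle
    # 4 SoP units x 49 multiply-add units at f_fast = 500 MHz give the peak figure:
    print(4 * h_k * w_k * 2 * f_fast / 1e9)        # 196.0 GOp/s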
Physical Characteristics
  Technology: UMC 65 nm, 8 metal layers
  Core/Pad Voltage: 1.2 V / 1.8 V
  Package: QFN-56
  # Pads: 55 (i: 14, o: 13, clk/test: 8, pwr: 20)
  Core Area: 3.09 mm2
  Circuit Complexity (a): 912 kGE (1.31 mm2)
  Logic (std. cells): 697 kGE (1.00 mm2)
  On-chip SRAM: 344 kbit

Performance & Efficiency @ 1.2 V
  Max. Clock Frequency: core: 500 MHz, i/o: 250 MHz
  Power (a) @ 500 MHz: 449 mW (core) + 205 mW (pads)
  Peak Throughput: 196 GOp/s
  Effective Throughput: 145 GOp/s
  Core Power-Efficiency: 437 GOp/s/W

Performance & Efficiency @ 0.8 V
  Max. Clock Frequency: core: 189 MHz, i/o: 95 MHz
  Power (b) @ 189 MHz: 93 mW (core) + 144 mW (pads)
  Peak Throughput: 74 GOp/s
  Effective Throughput: 55 GOp/s
  Core Power-Efficiency: 803 GOp/s/W

(a) Including the SRAM blocks.
(b) The power usage was measured running with real data and at maximum load.

Fig. 8. Shmoo plot showing the number of incorrect results in dependence of frequency (f = ffast/2, x-axis, in MHz) and core voltage (Vcore, y-axis, in V) at 125°C. Green means no errors.

Fig. 9. Measured energy efficiency and throughput in dependence of Vcore for 25°C and 125°C. (Axes: efficiency [GOp/s/W] and throughput [GOp/s] vs. Vcore [V].)

… measurements have been obtained at a forced ambient temperature of 125°C. The resulting Shmoo plot is shown in Figure 8. Besides the two mentioned operating points there are many more, allowing for a continuous trade-off between throughput and energy efficiency by changing the core supply voltage, as evaluated empirically in Figure 9. As expected, the figures are slightly worse for the measurements at higher temperature. Static power dissipation takes a share of around 1.25% across the entire voltage range at 25°C, and a share of about 10.5% in the interval [0.95 V, 1.25 V], increasing to 14.7% for a core voltage of 0.8 V at 125°C.

VII. DISCUSSION

None of the previous work on ConvNet accelerators has silicon measurement results. We will thus compare to post-layout and post-synthesis results of state-of-the-art related works, although such simulation results are known to be optimistic. We have listed the key figures of all these works in Table V and discuss the various results in the sections below.

A. Area Efficiency

Our chip is the most area-efficient ConvNet accelerator reported in the literature. We measure the area in terms of 2-input NAND gate equivalents to compensate for technology differences to some extent. With 90.7 GOp/s/MGE our implementation is by far the most area-efficient, and even in the high power-efficiency configuration we outperform previous state-of-the-art results. The next best implementation is a NeuFlow design at 33.8 GOp/s/MGE, requiring a factor of 3 more space for the same performance. ShiDianNao is of comparable area-efficiency with 26.3 GOp/s/MGE. Also note that the chip size was limited by the pad-frame, and that the area occupied by the standard cells and the on-chip SRAM is only 1.31 mm2 (0.91 MGE). We would thus achieve a throughput density of an enormous 215 GOp/s/MGE in this very optimistic scenario. This would require a more complex and expensive pad-frame architecture, e.g. flip-chip with multiple rows of pads, which we decided not to implement.

We see the reason for these good results in our approach to compute multiple input and output channels in parallel. This way we have to buffer a window of 8 input channels to compute 64 convolutions, instead of buffering 64 input images, a significant saving of storage, particularly also because the window of the input images that has to be buffered is a lot larger than a single convolution kernel. Another 25% can be attributed to the use of 12 bit instead of 16 bit words, which expresses itself mostly in the size of the SRAM and the filter kernel buffer.
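The area- and power-efficiency figures quoted above can be reproduced with a few lines, using the gate-equivalent conversion factor given in the footnotes of Table V (1 GE = 1.44 µm² in umc 65 nm); the small differences to the quoted numbers are rounding. The snippet below is only a sanity check of this arithmetic, not part of the measurement flow.

    GE_AREA_UM2 = 1.44                       # um^2 per 2-input NAND gate equivalent, umc 65 nm
    core_mge = 1.31e6 / GE_AREA_UM2 / 1e6    # 1.31 mm^2 of std. cells + SRAM -> ~0.91 MGE

    peak_gops = 196.0
    print(peak_gops / 2.16)                  # ~90.7 GOp/s/MGE, area efficiency with the die size
    print(peak_gops / core_mge)              # ~215 GOp/s/MGE, core-only (optimistic scenario)
    print(peak_gops / 0.449)                 # ~437 GOp/s/W, core power efficiency at 1.2 V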
TABLE V. SUMMARY OF RELATED WORK FOR A WIDE RANGE OF PLATFORMS (CPU, GPU, FPGA, ASIC).

publication           | type    | platform       | theor. (a) [GOp/s] | peak (a) [GOp/s] | act. (a) [GOp/s] | power (b) [W] | power eff. [GOp/s/W] | prec.   | Vcore [V] | area (j) [MGE]   | area eff. (h) [GOp/s/MGE]
Cavigelli et al. [27] | CPU     | Xeon E5-1620v2 | 118                |                  | 35 (d)           | 230           | 0.15                 | float32 |           |                  |
Cavigelli et al. [27] | GPU     | GTX780         | 3977               | 3030             | 1908 (d)         | sd: 200       | 14 (g)               | float32 |           |                  |
cuDNN R3 [35]         | GPU     | Titan X        | 6600               | 6343             |                  | d: 250        | 25.6 (g)             | float32 |           |                  |
Cavigelli et al. [27] | SoC     | Tegra K1       | 365                | 95               | 84 (d)           | s: 11         | 8.6                  | float32 |           |                  |
this work             | silicon | umc 65 nm      | 196                | 196              | 145 (d)          | c: 0.51 (f)   | 437                  | fixed12 | 1.2       | c: 0.91, d: 2.16 | 90.7
this work             | silicon | umc 65 nm      | 74                 | 74               | 55 (d)           | c: 0.093 (f)  | 803                  | fixed12 | 0.8       | c: 0.91, d: 2.16 | 34.3

(a) We distinguish between theoretical performance, where we consider the maximum throughput of the arithmetic units, the peak throughput, which is the maximum throughput for convolutional layers of any size, and the actual throughput, which has been benchmarked for a real ConvNet and without processing in batches.
(b) For the different types of power measurements, we abbreviate: s (entire system), d (device/chip), c (core), m (memory), io (pads), sd (system differential load-vs-idle).
(c) We use the abbreviations c (core area, incl. SRAM), d (die size).
(d) These values were obtained for the ConvNet described in [27].
(f) The static power makes up around 1.3% of the total power at 25°C for the entire range of feasible Vcore, and about 11% at 125°C.
(g) The increased energy efficiency of the Titan X over the GTX780 is significant and can neither be attributed solely to technology (both 28 nm) nor to the software implementation or memory interface (both GDDR5). Instead, the figures published by Nvidia suggest that architectural changes from Kepler to Maxwell are the source of this improvement.
(h) We take the theoretical performance to be able to compare more works and the device/chip size for the area. ShiDianNao does not include a pad ring in their layout (3.09 mm2), so we added it for better comparability (height 90 µm).
(j) We measure area in terms of the size of millions of 2-input NAND gates. 1 GE: 1.44 µm2 (umc 65 nm), 1.17 µm2 (TSMC 65 nm), 0.65 µm2 (45 nm), 0.49 µm2 (ST 28 nm FDSOI).
B. Bandwidth Efficiency

The number of I/O pins is often one of the most congested resources when designing a chip, and the fight for bandwidth is even more present when the valuable memory bandwidth of an SoC has to be shared with accelerators. We achieve a bandwidth efficiency of 521 GOp/GB, providing an improvement by a factor of more than 10× over the best previous work: NeuFlow comes with a memory bandwidth of 6.4 GB/s to provide 320 GOp/s, i.e. it can perform 50 GOp/GB. ShiDianNao does not provide any information on the external bandwidth, and the HWCE can do 6.1 GOp/GB. These large differences, particularly between this work and previous results, can be attributed to us having focused on reducing the required bandwidth while maximizing throughput on a small piece of silicon. The architecture has been designed to maximize reuse of the input data by calculating pixels of multiple output channels in parallel, bringing a significant improvement over caching as in [44] or accelerating individual 2D convolutions [46], [43].

C. Power Efficiency

Our chip performs second-best in terms of energy efficiency of the core with 803 GOp/s/W (high-efficiency configuration) and 437 GOp/s/W (high-performance configuration), being outperformed only by the HWCE. The HWCE can reach up to 1375 GOp/s/W in its high-efficiency setup when running at 0.4 V and making use of reverse body biasing, available only with FDSOI technology to this extent. Our chip is then followed by NeuFlow (490 GOp/s/W), the Convolution Engine (409 GOp/s/W) and ShiDianNao (400 GOp/s/W).

However, technology has a strong impact on the energy efficiency. Our design was done in UMC 65 nm, while NeuFlow was using IBM 45 nm SOI and the HWCE even resorted to ST 28 nm FDSOI. In order to analyze our architecture independently of the particular implementation technology used, we take a look at its effect. We take the simple model

P̃ = P · (ℓ_new / ℓ_old) · (V_dd,new / V_dd,old)².

The projected results are shown in Table VI. To obtain the operating voltage in 28 nm technology, we scale the operating voltage linearly with respect to the common operating voltage of the used technology. This projection, although not based on a very accurate model, gives an idea of how the various implementations perform in a recent technology. Clearly, the only competitive results in terms of core power efficiency are ShiDianNao and this work. Previous work has always excluded I/O power, although it is a major contributor to the overall power usage. We estimate this power based on an energy usage of 21 pJ/bit, which has been reported for an LPDDR3 memory model and the PHY on the chip in 28 nm [57], assuming a reasonable output load and a very high page hit rate. For our chip, this amounts to an additional 63 mW or …
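A minimal Python sketch of the technology projection described above, assuming the simple scaling model P̃ = P (ℓ_new/ℓ_old)(V_dd,new/V_dd,old)². The starting point is this work's 65 nm operating point; the 28 nm target voltage of 1.0 V is only an assumed value for illustration and is not taken from Table VI.

    def project_power(p_watt, l_old_nm, l_new_nm, vdd_old, vdd_new):
        # P~ = P * (l_new / l_old) * (Vdd_new / Vdd_old)^2
        return p_watt * (l_new_nm / l_old_nm) * (vdd_new / vdd_old) ** 2

    # Example: project the 449 mW core power of this work (umc 65 nm, 1.2 V)
    # to a 28 nm node with an assumed core voltage of 1.0 V.
    p_proj = project_power(0.449, 65, 28, 1.2, 1.0)
    print(p_proj)            # ~0.13 W projected core power
    print(196.0 / p_proj)    # projected core power efficiency in GOp/s/W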
REFERENCES

[20] C. Dong, C. C. Loy, K. He et al., "Learning a deep convolutional network for image super-resolution," Proc. Eur. Conf. Comput. Vis., pp. 184–199, 2014.
[21] C. Farabet, C. Couprie, L. Najman et al., "Learning Hierarchical Features for Scene Labeling," IEEE Trans. Pattern Anal. Mach. Intell., 2013.
[22] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[23] P. Weinzaepfel, J. Revaud, Z. Harchaoui et al., "DeepFlow: Large Displacement Optical Flow with Deep Matching," in Proc. IEEE Int. Conf. Comput. Vis., dec 2013, pp. 1385–1392.
[24] C. Farabet, B. Martini, B. Corda et al., "NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Work., jun 2011, pp. 109–116.
[25] P. H. Pham, D. Jelaca, C. Farabet et al., "NeuFlow: Dataflow vision processing system-on-a-chip," in Proc. Midwest Symp. Circuits Syst., 2012, pp. 1044–1047.
[26] T. Chen, Z. Du, N. Sun et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," in Proc. ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 269–284.
[27] L. Cavigelli, M. Magno, and L. Benini, "Accelerating Real-Time Embedded Scene Labeling with Convolutional Networks," in Proc. ACM/IEEE Des. Autom. Conf., 2015.
[28] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in Proc. Int. Conf. Learn. Represent., sep 2015.
[29] O. Russakovsky, J. Deng, H. Su et al., "ImageNet Large Scale Visual Recognition Challenge," 2014.
[30] K. He, X. Zhang, S. Ren et al., "Deep Residual Learning for Image Recognition," vol. 7, no. 3, pp. 171–180, dec 2015.
[31] R. Collobert, "Torch7: A Matlab-like Environment for Machine Learning," Adv. Neural Inf. Process. Syst. Work., 2011.
[32] Y. Jia, "Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding," 2013. [Online]. Available: http://caffe.berkeleyvision.org
[33] S. Chetlur, C. Woolley, P. Vandermersch et al., "cuDNN: Efficient Primitives for Deep Learning," in arXiv:1410.0759, oct 2014.
[34] Nervana Systems Inc., "Neon Framework," 2015. [Online]. Available: http://neon.nervanasys.com
[35] S. Chintala, "convnet-benchmarks," 2015. [Online]. Available: https://github.com/soumith/convnet-benchmarks
[36] A. Lavin, "maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs," in arXiv:1501.06633v3, jan 2015.
[37] Nvidia Inc., "NVIDIA Tegra X1 Whitepaper," 2015.
[38] M. Mathieu, M. Henaff, and Y. LeCun, "Fast Training of Convolutional Networks through FFTs," in arXiv:1312.5851, dec 2014.
[39] C. Farabet, C. Poulet, J. Y. Han et al., "CNP: An FPGA-based processor for Convolutional Networks," in Proc. IEEE Int. Conf. F. Program. Log. Appl., 2009, pp. 32–37.
[40] V. Gokhale, J. Jin, A. Dundar et al., "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 682–687.
[41] K. Ovtcharov, O. Ruwase, J.-y. Kim et al., "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware," Microsoft Research, Tech. Rep., 2015.
[42] C. Zhang, P. Li, G. Sun et al., "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in Proc. ACM Int. Symp. Field-Programmable Gate Arrays, 2015, pp. 161–170.
[43] F. Conti and L. Benini, "A Ultra-Low-Energy Convolution Engine for Fast Brain-Inspired Vision in Multicore Clusters," in Proc. IEEE Des. Autom. Test Eur. Conf., 2015.
[44] Z. Du, R. Fasthuber, T. Chen et al., "ShiDianNao: Shifting Vision Processing Closer to the Sensor," in Proc. ACM/IEEE Int. Symp. Comput. Architecture, 2015.
[45] T. Chen, Z. Du, N. Sun et al., "A High-Throughput Neural Network Accelerator," IEEE Micro, vol. 35, no. 3, pp. 24–32, 2015.
[46] W. Qadeer, R. Hameed, O. Shacham et al., "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing," in Proc. ACM Int. Symp. Comput. Archit., 2013, pp. 24–35.
[47] S. Park, K. Bong, D. Shin et al., "A 1.93TOPS/W Scalable Deep Learning/Inference Processor with Tetra-Parallel MIMD Architecture for Big-Data Applications," in Proc. IEEE Int. Conf. Solid-State Circuits, 2015, pp. 80–82.
[48] Q. Yu, C. Wang, X. Ma et al., "A Deep Learning Prediction Process Accelerator Based FPGA," Proc. IEEE/ACM Int. Symp. Clust. Cloud Grid Comput., no. 500, pp. 1159–1162, 2015.
[49] T. Moreau, M. Wyse, J. Nelson et al., "SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2015, pp. 603–614.
[50] Q. Zhang, T. Wang, Y. Tian et al., "ApproxANN: An Approximate Computing Framework for Artificial Neural Network," in Proc. IEEE Des. Autom. Test Eur. Conf., 2015, pp. 701–706.
[51] F. Akopyan, J. Sawada, A. Cassidy et al., "TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip," IEEE Trans. Comput. Des. Integr. Circuits Syst., vol. 34, no. 10, pp. 1537–1557, oct 2015.
[52] M. Courbariaux, Y. Bengio, and J.-P. David, "Training Deep Neural Networks with Low Precision Multiplications," in Proc. Int. Conf. Learn. Represent., 2015.
[53] S. Gupta, A. Agrawal, K. Gopalakrishnan et al., "Deep Learning with Limited Numerical Precision," in Proc. Int. Conf. Mach. Learn., 2015.
[54] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Adv. Neural Inf. Process. Syst. Work., 2011.
[55] G. Soulié, V. Gripon, and M. Robert, "Compression of Deep Neural Networks on the Fly," in arXiv:1509.08745, 2015.
[56] S. Gould, R. Fulton, and D. Koller, "Decomposing a scene into geometric and semantically consistent regions," in Proc. IEEE Int. Conf. Comput. Vis., 2009.
[57] M. Schaffner, F. K. Gürkaynak, A. Smolic et al., "DRAM or no-DRAM? Exploring Linear Solver Architectures for Image Domain Warping in 28 nm CMOS," in Proc. IEEE Des. Autom. Test Eur., 2015.

Lukas Cavigelli received the M.Sc. degree in electrical engineering and information technology from ETH Zurich, Zurich, Switzerland, in 2014. Since then he has been with the Integrated Systems Laboratory, ETH Zurich, pursuing a Ph.D. degree. His current research interests include deep learning, computer vision, digital signal processing, and low-power integrated circuit design. Mr. Cavigelli received the best paper award at the 2013 IEEE VLSI-SoC Conference.

Luca Benini is the Chair of Digital Circuits and Systems at ETH Zurich and a Full Professor at the University of Bologna. He has served as Chief Architect for the Platform2012/STHORM project at STMicroelectronics, Grenoble. He has held visiting and consulting researcher positions at EPFL, IMEC, Hewlett-Packard Laboratories, and Stanford University. Dr. Benini's research interests are in energy-efficient system and multi-core SoC design. He is also active in the area of energy-efficient smart sensors and sensor networks for biomedical and ambient intelligence applications. He has published more than 700 papers in peer-reviewed international journals and conferences, four books and several book chapters. He is a Fellow of the IEEE and a member of the Academia Europaea.