Review
Survey of Deep Learning Accelerators for Edge and
Emerging Computing
Shahanur Alam 1,*, Chris Yakopcic 1, Qing Wu 2, Mark Barnell 2, Simon Khan 2 and Tarek M. Taha 1,*
1 Department of Electrical and Computer Engineering, University of Dayton, Dayton, OH 45469, USA;
cyakopcic1@udayton.edu
2 Information Directorate, Air Force Research Laboratory, Rome, NY 13411, USA; qing.wu.2@us.af.mil (Q.W.);
mark.barnell.1@us.af.mil (M.B.); simon.khan@us.af.mil (S.K.)
* Correspondence: alamm8@udayton.edu (S.A.); tarek.taha@udayton.edu (T.M.T.)
Abstract: The unprecedented progress in artificial intelligence (AI), particularly in deep learning
algorithms, combined with ubiquitous internet-connected smart devices, has created a high demand
for AI computing on edge devices. This review studies commercially available edge processors as
well as processors that are still in industrial research stages. We categorized state-of-the-art edge processors
based on the underlying architecture, such as dataflow, neuromorphic, and processing in-memory
(PIM) architecture. The processors are analyzed based on their performance, chip area, energy effi-
ciency, and application domains. The supported programming frameworks, model compression, data
precision, and the CMOS fabrication process technology are discussed. Currently, most commercial
edge processors utilize dataflow architectures. However, emerging non-von Neumann computing
architectures have attracted the attention of the industry in recent years. Neuromorphic processors
are highly efficient for performing computation with fewer synaptic operations, and several neuro-
morphic processors offer online training for secured and personalized AI applications. This review
found that the PIM processors show significant energy efficiency and consume less power compared
to dataflow and neuromorphic processors. A future direction of the industry could be to implement
state-of-the-art deep learning algorithms in emerging non-von Neumann computing paradigms for
low-power computing on edge devices.
Keywords: AI accelerator; AI frameworks; deep learning; edge computing; low-power applications; quantization; PIM or CIM computing; neuromorphic computing

Citation: Alam, S.; Yakopcic, C.; Wu, Q.; Barnell, M.; Khan, S.; Taha, T.M. Survey of Deep Learning Accelerators for Edge and Emerging Computing. Electronics 2024, 13, 2988. https://doi.org/10.3390/electronics13152988
Figure 1. Illustration of edge computing with cloud interconnection.
Numerical formats such as integer (INT), floating point (FP), and brain-float (BF) are used in
inference processors. Typically, INT4, INT8, INT16, FP16, or BF16 numerical precision is used in
inference processors. However, recently released
processors from multiple startups can compute with very low precision while trading off
accuracy to some extent [21].
The current trend in computing technology is to make data movement faster, enabling
higher-speed and more efficient computing. To achieve this, AI edge processors must meet some
essential prerequisites: lower energy consumption, smaller area, and higher performance.
Neuromorphic and PIM processors are becoming more popular for their higher energy
efficiency and lower latency [9,10,19,20]. However, a single edge processor usually does
not support all types of DNN networks and frameworks. There are multiple types of
DNN models, and each usually excels at particular application domains. For example,
recurrent neural networks (RNNs), long short-term memory (LSTM), and gated recurrent
units (GRUs) are suitable for natural language processing [22–28], while convolutional neural
networks (CNNs), residual neural networks (ResNet), and visual geometry group (VGG)
networks are better for detection and classification [29–31].
The CMOS technology node used for fabricating each device has a significant impact
on its area, energy consumption, and speed. TSMC currently uses 3 nm extreme ultraviolet
(EUV) technology for the Apple A17 processor [32]. TSMC also aims to deliver 2 nm technology
by 2025 for higher-performance and more energy-efficient AI computing processors [33].
Samsung's smartphone processor Exynos 2200, developed with 4 nm technology, is on the
market [34]. Intel utilized its Intel 4 (7 nm class) technology for its Loihi 2 neuromorphic
processor [9].
This article provides a comprehensive review of commercial deep learning edge
processors. Over 100 edge processors are listed along with their key specifications. We
believe this is the most comprehensive technical analysis at present. The main contributions
of this review are as follows:
1. It provides a comprehensive and easy-to-follow description of state-of-the-art edge
devices and their underlying architecture.
2. It reviews the supported programming frameworks of the processors and general
model compression techniques to enable edge computing.
3. It analyzes the technical details of the processors for edge computing and provides
charts on hardware parameters.
This paper is arranged as follows: Section 2 describes key deep learning algorithms
very briefly. Section 3 describes model compression techniques commonly used to optimize
deep learning networks for edge applications. Section 4 discusses the frameworks available
for deep learning AI applications. Section 5 describes the frameworks for developing AI
applications on SNN processors. The processors are reviewed briefly in Section 6. Section 7
discusses the data on the processors and performs a comparative analysis. A brief summary
of this review study is presented in Section 8.
Supervised topologies include CNN models such as VGG, ResNet, and GoogleNet [45]. Semi-supervised neural networks use a few labels to learn categories
and could be generative models or time-based sequence learning models. The semi-
supervised topologies include GAN, GRU, RNN, and LSTM. The internal layers of these
NN models are composed of CNNs and fully connected network topologies. A number of
edge processors support semi-supervised network models for automation applications. For
example, DeepVision (now Kinara) introduced ARA-1 (2020) and ARA-2 (2022) [46], which
target autonomous applications such as robotics, autonomous vehicles, smart tracking,
and autonomous security systems. Kneron introduced KL720 in 2021, which supports
semi-supervised network topologies for a wide range of applications [47]. In 2021, Syntiant
released a new PIM AI processor for extreme edge applications, accommodating supervised
and semi-supervised network topologies and supporting CNN, GRU, RNN, and LSTM
topologies [20].
The computational complexity of DL models is a barrier to implementing these models
on resource-constrained edge or IoT devices. For edge applications, deep neural networks
should be designed and optimized to remain efficient without losing a significant amount
of accuracy. Common deep learning application areas for
the edge include [48–55] image classification, object detection, object tracking, speech
recognition, health care, and natural language processing (NLP). This section will dis-
cuss some lightweight DL models for edge applications that perform classification and
object detection.
2.1. Classification
Classification is probably the most popular use of CNNs and is one of the key applica-
tions in the computer vision field [56–58]. While larger networks with higher accuracies
are utilized in desktop and server systems, smaller and more highly efficient networks are
typically used for edge applications.
SqueezeNet [59,60] utilizes a modified convolutional model that is split into squeeze
and expand layers. Instead of 3 × 3 convolution operations seen in typical CNNs, a
much simpler 1 × 1 convolution operation is used. SqueezeNet achieves AlexNet levels of
accuracy with 50× fewer network parameters [60]. Using model compression techniques,
SqueezeNet can be compressed to 0.5 MB, which is about 510× smaller than AlexNet.
MobileNet [61] was created by Google and is one of the most popular DL models for
edge applications. MobileNet replaces the traditional convolution operation with a more
efficient depthwise separable convolution, significantly reducing computational cost. The
depthwise separable technique performs two operations: depthwise convolution
and pointwise convolution. There are three available versions of MobileNet networks:
MobileNet v1 [61], MobileNet v2 [62], and MobileNet v3 [63]. MobileNet v2 builds on
MobileNet v1 by adding a linear bottleneck and an inverted residual block at the end. The
latest MobileNet v3 utilizes NAS (neural architecture search) and NetAdapt to design a
more accurate and efficient network architecture for inference applications [63].
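The computational saving is easy to see: a standard 3 × 3 convolution with C_in input and C_out output channels costs about 9·C_in·C_out multiply–accumulates per output pixel, while the separable version costs only 9·C_in + C_in·C_out. Below is a minimal sketch of such a block, assuming PyTorch; the class name and channel arguments are illustrative, not MobileNet's exact configuration.

```python
import torch.nn as nn

# Minimal depthwise separable block in the spirit of MobileNet v1 (illustrative).
class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```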
ShuffleNet [64] utilizes group convolution and channel shuffle to reduce computational
complexity: group convolution cuts the computation, while channel shuffle lets information
flow between groups to preserve accuracy at minimal computational cost. There
are two versions of ShuffleNet: ShuffleNet v1 and ShuffleNet v2 [64,65].
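The channel shuffle itself is just a cheap reshape-and-transpose; a PyTorch sketch (illustrative, not the reference implementation):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Rearrange channels so information flows across convolution groups.
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)
```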
EfficientNet is a type of CNN with a corresponding scaling method that is able to
find a balance between computational efficiency and performance. It can uniformly scale
all the network dimensions, such as width, depth, and resolution, by using a compound
coefficient [66]. The scaling method facilitates the development of a family of networks.
Unlike other DL models, the EfficientNet model focuses not only on accuracy but also on
the efficiency of the model.
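Concretely, the compound scaling rule from the EfficientNet paper [66] chooses a single coefficient φ and scales all three dimensions together, where α, β, and γ are constants found by a small grid search:

```latex
% Compound scaling (EfficientNet [66]): one coefficient \phi scales all dimensions.
d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi},
\qquad \text{s.t.}\ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\quad
\alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
```

Here d, w, and r multiply the baseline network's depth, width, and input resolution, so the total FLOPs grow by roughly 2^φ.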
2.2. Detection
Object detection is an important task in computer vision that identifies and localizes
all the objects in an image. It has a wide range of applications, including autonomous
vehicles, smart cities, target tracking, and security systems [67]. The broad range of object
detection and DL network applications is discussed in [56,68]. DL networks
for object detection can be categorized into two types: (i) single-stage (such as SSD, YOLO,
and CenterNet) and (ii) two-stage (such as Fast/Faster RCNN). There are multiple criteria
for choosing the right architecture for the edge application. Single-stage detectors are
computationally more efficient than two-stage architectures, making them a better choice
for edge applications. For example, YOLO v5 demonstrates better performance compared
to Faster-RCNN-ResNet-50 [67].
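As an illustration of how little host code a single-stage detector needs, the following sketch runs a small YOLOv5 model via torch.hub. It assumes the public ultralytics/yolov5 hub entry point and network access; the image path is a placeholder.

```python
import torch

# Load a small single-stage detector; weights download on first use.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.eval()

# Run detection on one image (path is a placeholder).
results = model("street_scene.jpg")
results.print()  # prints class, confidence, and bounding box per detection
```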
3. Model Compression
Unoptimized DL models contain considerable redundancy in parameters and are
generally designed without consideration of power or latency. Lightweight and optimized
DL models enable AI application on edge devices. Designing effective models for running
on resource-constrained systems is challenging. DNN model compression techniques
are utilized to convert unoptimized models to forms that are suitable for edge devices.
Model compression techniques are studied extensively and discussed in [91–96]. The
techniques include parameter pruning, quantization, low-rank factorization, compact
filtering, and knowledge distillation. In this section, we will discuss some of the key model
compression techniques.
3.1. Quantization
Quantization is a promising approach to optimizing DNN models for edge devices.
Data quantization for edge AI has been studied extensively in [92–99]. Parameter quan-
tization takes a DL model and compresses its parameters by changing the floating point
weights to a lower precision to avoid costly floating point computations. As shown in
Table 2, most edge inference engines support INT4, INT8, or INT16 precision. Quantization
techniques can be taken to the limit by developing binary neural networks (BNNs) [99]. A BNN
uses a single bit to represent weights and activations, greatly reducing memory requirements.
LeapMind's Efficiera, for example, targets 1-bit weights and 2-bit activations [21].
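As a concrete illustration of parameter quantization, the following is a minimal NumPy sketch of symmetric per-tensor INT8 quantization. It is a generic illustration of the idea, not any particular vendor's calibration flow.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization (illustrative sketch)."""
    scale = max(np.max(np.abs(w)) / 127.0, 1e-12)   # map largest weight to +/-127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale             # approximate reconstruction

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()            # error is bounded by ~scale/2
```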
Table 2. Commercial edge processors with operation technology, process technology, and numerical precision.

| Company | Latest Chip | Max Power (W) | Process (nm) | Area (mm2) | Precision INT/FP | Max Performance (TOPS) | Energy Efficiency (TOPS/W) | Architecture | Reference |
|---|---|---|---|---|---|---|---|---|---|
| Analog Devices | MAX78000 | 1 pJ/MAC | -- | 64 | 1, 2, 4, 8 | -- | -- | Dataflow | [100,101] |
| Apple | M1 | 10 | 5 | 119 | 64 | 11 | 1.1 | Dataflow | [102] |
| Apple | A14 | 6 | 5 | 88 | 64 | 11 | 1.83 | Dataflow | [103] |
| Apple | A15 | 7 | 5 | -- | 64 | 15.8 | 2.26 | Dataflow | [103] |
| Apple | A16 | 5.5 | 4 | -- | 64 | 17 | 3 | Dataflow | [104] |
| * AIStorm | AIStorm | 0.225 | -- | -- | 8 | 2.5 | 11 | Dataflow | [105] |
| * AlphaIC | RAP-E | 3 | -- | -- | 8 | 30 | 10 | Dataflow | [106] |
| aiCTX | Dynap-CNN | 0.001 | 22 | 12 | 1 | 0.0002 | 0.2 | Neuromorphic | [15,107] |
| * ARM | Ethos78 | 1 | 5 | -- | 16 | 10 | 10 | Dataflow | [108,109] |
| * AIMotive | Apache5 IEP | 0.8 | 16 | 121 | 8 | 1.6–32 | 2 | Dataflow | [110,111] |
| * Blaize | Pathfinder, El Cano | 6 | 14 | -- | 64, FP-8, BF16 | 16 | 2.7 | Dataflow | [112] |
| * Bitmain | BM1880 | 2.5 | 28 | 93.52 | 8 | 2 | 0.8 | Dataflow | [113,114] |
| * BrainChip | Akida1000 | 2 | 28 | 225 | 1, 2, 4 | 1.5 | 0.75 | Neuromorphic | [115,116] |
| * Canaan | Kendryte K210 | 2 | 28 | -- | 8 | 1.5 | 1.25 | Dataflow | [117,118] |
| * CEVA | CEVA-Neuro-S | -- | 16 | -- | 2, 5, 8, 12, 16 | 12.7 | -- | Dataflow | [119] |
| * CEVA | CEVA-Neuro-M | 0.83 | 16 | -- | 2, 5, 8, 12, 16 | 20 | 24 | Dataflow | [120] |
| * Cadence | DNA100 | 0.85 | 16 | -- | 16 | 4.6 | 3 | Dataflow | [121,122] |
| * Deepvision | ARA-1 | 1.7 | 28 | -- | 8, 16 | 4 | 2.35 | Dataflow | [123] |
| * Deepvision | ARA-2 | -- | 16 | -- | -- | -- | -- | Dataflow | [124] |
| * Eta | ECM3532 | 0.01 | 55 | 25 | 8 | 0.001 | 0.1 | Dataflow | [125] |
| * Flex Logix | InferX X1 | 13.5 | 7 | 54 | 8 | 7.65 | 0.57 | Dataflow | [126] |
| * Google | Edge TPU | 2 | 28 | 96 | 8, BF16 | 4 | 2 | Dataflow | [127,128] |
| * Gyrfalcon | Lightspeeur 2803S | 0.7 | 28 | 81 | 8 | 16.8 | 24 | PIM | [47,89] |
| * Gyrfalcon | Lightspeeur 5801 | 0.224 | 28 | 36 | 8 | 2.8 | 12.6 | PIM | [89] |
| * Gyrfalcon | Janux GS31 | 650/900 | 28 | 10,457.5 | 8 | 2150 | 3.30 | PIM | [129] |
| * GreenWaves | GAP9 | 0.05 | 22 | 12.25 | 8, 16, 32 | 0.05 | 1 | Dataflow | [130–132] |
| * Horizon | Journey 3 | 2.5 | 16 | -- | 8 | 5 | 2 | Dataflow | [133] |
| * Horizon | Journey 5/5P | 30 | 16 | -- | 8 | 128 | 4.8 | Dataflow | [134,135] |
| * Hailo | Hailo-8 M2 | 2.5 | 28 | 225 | 4, 8, 16 | 26 | 2.8 | Dataflow | [136,137] |
| Intel | Loihi 2 | 0.1 | 7 | 31 | 8 | 0.3 | 3 | Neuromorphic | [9] |
| Intel | Loihi | 0.11 | 14 | 60 | 1–9 | 0.03 | 0.3 | Neuromorphic | [9,138] |
| * Intel | Intel Movidius | 2 | 16 | 71.928 | 16 | 4 | 2 | Dataflow | [139] |
| IBM | TrueNorth | 0.065 | 28 | 430 | 8 | 0.0581 | 0.4 | Neuromorphic | [10,138] |
| IBM | NorthPole | 74 | 12 | 800 | 2, 4, 8 | 200 (INT8) | 2.7 | Dataflow | [90,140] |
| * Imagination | PowerVR Series3NX | -- | -- | -- | FP-(8, 16) | 0.60 | -- | Dataflow | [141,142] |
| * Imec | DIANA | -- | 22 | 10.244 | 2 | 29.5 (A), 0.14 (D) | 14.4 | PIM + Digital | [143,144] |
| * Imagination | IMG 4NX MC1 | 0.417 | -- | -- | 4, 16 | 12.5 | 30 | Dataflow | [145] |
| * Kalray | MPPA3 | 15 | 16 | -- | 8, 16 | 25 | 1.67 | Dataflow | [13] |
| * Kneron | KL720 AI | 1.56 | 28 | 81 | 8, 16 | 1.4 | 0.9 | Dataflow | [47] |
| * Kneron | KL530 | 0.5 | -- | -- | 8 | 1 | 2 | Dataflow | [47] |
| * Koniku | Konicore | -- | -- | -- | -- | -- | -- | Neuromorphic | [12] |
| * LeapMind | Efficiera | 0.237 | 12 | 0.422 | 1, 2, 4, 8, 16, 32 | 6.55 | 27.7 | Dataflow | [21] |
| * Memryx | MX3 | 1 | -- | -- | 4, 8, 16, BF16 | 5 | 5 | Dataflow | [146] |
| * Mythic | M1108 | 4 | -- | 361 | 8 | 35 | 8.75 | PIM | [87] |
| * Mythic | M1076 | 3 | 40 | 294.5 | 8 | 25 | 8.34 | PIM | [18,86,88] |
| * Mobileye | EyeQ5 | 10 | 7 | 45 | 4, 8 | 24 | 2.4 | Dataflow | [147–149] |
| * Mobileye | EyeQ6 | 40 | 7 | -- | 4, 8 | 128 | 3.2 | Dataflow | [150] |
| * Mediatek | i350 | -- | 14 | -- | -- | 0.45 | -- | Dataflow | [151] |
| * NVIDIA | Jetson Nano B01 | 10 | 20 | 118 | FP16 | 1.88 | 0.188 | Dataflow | [152] |
| * NVIDIA | AGX Orin | 60 | 7 | -- | 8 | 275 | 3.33 | Dataflow | [153] |
| * NXP | i.MX 8M+ | -- | 14 | 196 | FP16 | 2.3 | -- | Dataflow | [84,85] |
| * NXP | i.MX9 | 4 × 10⁻⁶ | 12 | -- | -- | -- | -- | Dataflow | [83] |
| * Perceive | Ergo | 0.073 | 5 | 49 | 8 | 4 | 55 | Dataflow | [154] |
| TSU & Polar Bear Tech | QM930 | 12 | 12 | 1089 | 4, 8, 16 | 20 (INT8) | 1.67 | Dataflow | [155] |
| Qualcomm | QCS8250 | 7 | -- | 157.48 | 8 | 15 | -- | Dataflow | [156,157] |
| Qualcomm | Snapdragon 888+ | 5 | 5 | -- | FP32 | 32 | 6.4 | Dataflow | [158–160] |
| Qualcomm | Snapdragon 8 Gen2 | -- | 4 | -- | 4, 8, 16, FP16 | 51 | -- | Dataflow | [161] |
| * RockChip | rk3399Pro | 3 | 28 | 729 | 8, 16 | 3 | 1 | Dataflow | [162] |
| Rokid | Amlogic A311D | -- | 12 | -- | -- | 5 | -- | Dataflow | [163] |
| Samsung | Exynos 2100 | -- | 5 | -- | -- | 26 | -- | Dataflow | [164,165] |
| Samsung | Exynos 2200 | -- | 4 | -- | 8, 16, FP16 | -- | -- | Dataflow | [166] |
| Samsung | HBM-PIM | 0.9 | 20 | 46.88 | -- | 1.2 | 1.34 | PIM | [167,168] |
| Sima.ai | MLSoC | 10 | 16 | 175.55 | 8 | 50 | 5 | Dataflow | [169,170] |
| Synopsys | EV7x | -- | 16 | -- | 8, 12, 16 | 2.7 | -- | Dataflow | [171,172] |
| * Syntiant | NDP100 | 0.00014 | 40 | 2.52 | -- | 0.000256 | 20 | PIM | [173,174] |
| * Syntiant | NDP101 | 0.0002 | 40 | 25 | 1, 2, 4, 8 | 0.004 | 20 | PIM | [173,175] |
| * Syntiant | NDP102 | 0.0001 | 40 | 4.2921 | 1, 2, 4, 8 | 0.003 | 20 | PIM | [173,175] |
| * Syntiant | NDP120 | 0.0005 | 40 | 7.75 | 1, 2, 4, 8 | 0.0019 | 3.8 | PIM | [173,176] |
| * Syntiant | NDP200 | 0.001 | 40 | -- | 1, 2, 4, 8 | 0.0064 | 6.4 | PIM | [173,177] |
| Think Silicon | NEMA®\|pico XS | 0.0003 | 28 | 0.11 | FP16, 32 | 0.0018 | 6 | Dataflow | [178] |
| Tesla/Samsung | FSD Chip | 36 | 14 | 260 | 8, FP-8 | 73.72 | 2.04 | Dataflow | [179] |
| Videntis | TEMPO | -- | -- | -- | -- | -- | -- | Neuromorphic | [11] |
| Verisilicon | VIP9000 | -- | 16 | -- | 16, FP16 | 0.5–100 | -- | Dataflow | [180,181] |
| Untether | TsunAImi | 400 | 16 | -- | 8 | 2008 | 8 | PIM | [182,183] |
| UPMEM | UPMEM-PIM | 700 | 20 | -- | 32, 64 | 0.149 | -- | PIM | [184–187] |

* Processors are available for purchase. Integer precision is indicated by the precision number(s) only; floating point precision is denoted FP in the precision column. "--" indicates data not reported.
Recent hardware studies show that lower precision does not have a major impact on
inference accuracy. For example, Intel and Tsinghua University have presented QNAP [188],
where they utilize 8 bits for weights and activations. They show an inference accuracy loss
of only 0.11% and 0.40% for VGG-Net and GoogleNet, respectively, when compared to a
software baseline with the ImageNet dataset. Samsung and Arizona State University have
experimented with extremely-low-precision inference in PIMCA [189], where they utilized
1 bit for weights and activations. They showed that VGG-9 and ResNet-18 had accuracy
losses of 3.89% and 6.02%, respectively.
Lower precision increases the energy and area efficiency of a system. PIMCA can com-
pute 136 and 35 TOPS/W in 1- and 2-bit precision, respectively, for ResNet-18. TSMC [190]
has studied the impact of low-precision computations on area efficiency. They showed
221 and 55 TOPS/mm2 area efficiency in 4- and 8-bit precision, respectively. Thus, with 4-bit
computation, they achieved about 4× higher computation throughput per unit area compared
to 8-bit computation.
Brain-Float-16 (or BF-16) [191] is a limited precision floating point format that is
becoming popular for AI applications in edge devices. BF16 combines certain components
of FP32 and FP16. Like FP16, BF16 uses 16 bits overall; like FP32, it uses 8 bits for the
exponent field (instead of the 5 bits used in FP16). A key benefit of BF16 is that it retains the
same dynamic range as FP32 while achieving comparable inference accuracy [75]. BF16 also
speeds up MAC operations, enabling faster AI inference on edge devices. Both
the GDDR6-AiM from SK Hynix [192] and Pathfinder-1600 from Blaize [112,193] support
BF16 for AI applications. The supported precision levels of various edge processors are
presented in Table 2.
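Because BF16 is simply the upper 16 bits of an FP32 word, conversion is a cheap bit-level operation. A NumPy sketch of round-to-nearest-even truncation (illustrative; NaN/overflow handling omitted):

```python
import numpy as np

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """BF16 keeps FP32's sign and 8 exponent bits but only 7 mantissa bits,
    so conversion can operate directly on the raw FP32 bit pattern.
    Illustrative sketch; NaN and overflow handling omitted."""
    bits = x.astype(np.float32).view(np.uint32)
    rounding = np.uint32(0x7FFF) + ((bits >> 16) & 1)  # round to nearest even
    return ((bits + rounding) >> 16).astype(np.uint16)

def bf16_bits_to_fp32(b: np.ndarray) -> np.ndarray:
    return (b.astype(np.uint32) << 16).view(np.float32)
```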
3.2. Pruning
Pruning is a technique that removes unnecessary network connections to make a
network lightweight enough for deployment on edge processors. Several studies [92–100,194–196] show
that up to 91% of weights in AlexNet can be pruned with minimal accuracy reduction.
Various training methods have been proposed to apply pruning to pre-trained networks [99].
Pruning, however, has drawbacks such as creating sparsity in the weight matrices. This
sparsity leads to unbalanced parallelism in the computation and irregular access to the
on-chip memory. Several techniques have been developed [197,198] to reduce the sparsity.
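A minimal sketch of unstructured magnitude pruning follows; it is a generic illustration of the surveyed technique, not a specific framework's API. Note how the resulting zeros are scattered, which is exactly the sparsity that causes the irregular memory access discussed above.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights (illustrative sketch).
    sparsity=0.9 removes 90% of the weights."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k]   # k-th smallest magnitude
    mask = np.abs(w) >= threshold
    return w * mask                                 # pruned weights become zero

w = np.random.randn(256, 256)
w_pruned = magnitude_prune(w, 0.91)                 # ~91% sparsity, as reported for AlexNet
```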
3.3. Knowledge Distillation
Knowledge distillation trains a small student model to mimic a larger teacher model. Studies show that the distillation of knowledge from a larger regularized model into a
smaller model works effectively. Various algorithms have been proposed to improve the
process of transferring knowledge, such as adversarial distillation, multi-teacher distil-
lation, cross-modal distillation, attention-based distillation, quantized distillation, and
NAS-based distillation [205]. Although knowledge distillation techniques are mainly used
for classification applications, they are also applied to other applications, such as object
detection, semantic segmentation, language modeling, and image synthesis [81].
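A common form of the distillation objective, following Hinton et al.'s original formulation, combines the hard-label loss with a soft-target term; here T is the softmax temperature, α a weighting factor, and z_s and z_t the student and teacher logits:

```latex
% Soft-target distillation loss (Hinton et al.); z_s, z_t are student/teacher logits.
\mathcal{L} = \alpha \, \mathcal{L}_{\mathrm{CE}}\big(y, \sigma(z_s)\big)
            + (1-\alpha) \, T^{2} \,
              \mathrm{KL}\big(\sigma(z_t / T) \,\big\|\, \sigma(z_s / T)\big)
```

The T² factor keeps the gradient magnitude of the soft-target term comparable as the temperature changes.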
Table 3. Processors, supported neural network models, deep learning frameworks, and application domains.
Some edge processors are compatible only with their in-house frameworks. For example,
Kalray's MPPA3 edge processor is compatible with KaNN (Kalray Neural Network), so
any trained deep network must be converted to KaNN to run on the MPPA3 processor [13].
CEVA introduced its own software framework, CEVA-DNN, for converting pre-trained
network models and weights from offline training frameworks (such as Caffe, TensorFlow)
for inference applications on the CEVA processors [119,217]. CEVA added a retrain feature
in CEVA-DNN for the Neuro-Pro processor to enable a deployed device to be updated with-
out uploading a database to the server [119]. The developer can also use CEVA-DNN tools
on a simulator or test device and then transfer the updated model to edge devices [217].
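Vendor tools such as KaNN and CEVA-DNN play the same role as the widely used TensorFlow Lite converter. For comparison, a minimal TFLite conversion with post-training dynamic-range quantization looks like this (the saved-model path is a placeholder):

```python
import tensorflow as tf

# Convert a trained TensorFlow SavedModel for edge deployment.
# "model_dir" is a placeholder path; Optimize.DEFAULT enables
# post-training dynamic-range quantization of the weights.
converter = tf.lite.TFLiteConverter.from_saved_model("model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```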
6. Edge Processors
At present, GPUs are the most popular platform for implementing DNNs. These,
however, are usually not suitable for edge computing (except the NVIDIA Jetson systems)
due to their high power consumption. A large variety of AI hardware has been developed,
many of which target edge applications. Several articles have reviewed AI hardware in
broad categories, giving an overall idea of the current trend in AI accelerators [230–232].
Earlier works [2,233–235] have reviewed a small selection of older edge AI processors.
This paper presents a very broad coverage of edge AI processors and PIM processors
from the industry. This includes processors already released, processors that have been
announced, and processors that have been published in research venues (such as the
ISSCC and the VLSI conferences). The data presented here are collected from open-source
platforms that include scientific articles, tech news portals, and company websites. The
exact numbers may therefore differ from the vendors' current figures; readers interested in a
particular processor should verify the performance data with the provider. This section
is divided into four subsections: subsection (i) describes dataflow processors; subsection
(ii) describes neuromorphic processors; and subsection (iii) describes PIM processors. All of
these sections describe industrial products that have been announced or released. Finally,
subsection (iv) describes the processors in industrial research.
Table 2 describes the key hardware characteristics of the commercial edge-AI and PIM-
AI processors. Table 4 lists the same key characteristics for the processors from industrial
research. Table 3 describes the key software/application characteristics of the processors in
Table 2.
Table 4. Edge processors in industrial research with technology, process technology, and numerical precision.
Typical application areas are object detection, classification, facial recognition, time series
data processing, and noise cancellation.
Apple released the bionic SoC A16 with an NPU unit for the iPhone 14 [104]. The A16
processor exhibits about 20% better performance with the same power consumption as
their previous version, A15. It is embedded with a 6-core ARM8.6a CPU, 16-core NPU, and
8-core GPU [104]. The Apple M2 processor was released in 2022 primarily for Macbooks
and then optimized for iPads. This processor includes a 10-core GPU and 16-core NPU [265].
The M1 performs 11 TOPS with 10 W of power consumption [109]. The M2 has an 18%
more powerful CPU and a 35% more powerful GPU for faster computation.
ARM recently announced the Ethos-N78 with an 8-core NPU for automotive applica-
tions [108]. Ethos-N78 is an upgraded version of Ethos-N77. Both NPUs support INT8 and
INT16 precision. Ethos-N78 performs more than two times better than the earlier version.
The most significant improvement of Ethos-N78 is a new data compression method that
reduces the bandwidth and improves performance and energy efficiency [109].
Blaize released its Pathfinder P1600 El Cano AI inference processor. This processor
integrates 16 graph streaming processors (GSPs) that deliver 16 TOPS at its peak perfor-
mance [112]. It uses a dual Cortex-A53 for running the operating system at up to 1 GHz.
Blaize GSP processors integrate data pipelining and support up to INT-64 and FP-8-bit
operations [112].
AIMotive [110] introduced the inference edge processor Apache5, which supports a
wide range of DNN models. The system has an aiWare3p NPU with an energy efficiency
of 2 TOPS/W. Apache5 supports INT8 MAC and INT32 internal precision [111]. This
processor is mainly targeted at autonomous vehicles [266].
CEVA [119] released the Neupro-S on-device AI processor for computer vision appli-
cations. Neupro comprises two separate cores. One is the DSP-based Vector Processor Unit
(VPU), and the other is the Neupro Engine. VPU is the controller, and the Neupro Engine
performs most of the computing work with INT8 or INT16 precision. A single processor
performs up to 12.5 TOPS, while the performance can be scaled to 100 TOPS with multicore
clusters [119,120]. The deep learning edge processors are mostly employed for inference
tasks. CEVA added a retraining capability to its CDNN (CEVA DNN) framework for online
learning on client devices [217].
Cadence introduced the Tensilica DNA 100, which is a comprehensive SoC for domain-
specific on-device AI edge accelerators [121]. It has low-, mid-, and high-end AI products.
Tensilica DNA 100 currently offers 8 GOPS to 32 TOPS of AI processing performance, with
100 TOPS predicted in future releases [122]. The target applications of the DNA 100 include
IoT, intelligent sensors, vision, and voice applications. The mid- and high-end applications
include smart surveillance and autonomous vehicles, respectively.
Deepvision has updated their edge inference coprocessor ARA-1 for applications to
autonomous vehicles and smart industries [123]. It includes eight compute engines with
4 TOPS and consumes 1.7–2.3 W of power [123]. The computing engine supports INT8
and INT16 precision. Deepvision has recently announced its second-generation inference
engine, ARA-2, which will be released later in 2022 [124]. The newer version will support
LSTM and RNN neural networks in addition to the networks supported in ARA-1.
Horizon announced its next automotive AI inference processor Journey 5/5P [133],
which is the updated version of Journey 3. The mass production of Journey 5 will be
starting in 2022. The processor exhibits a performance of 128 TOPS, and has a power of
30 W, giving an energy efficiency of 4.3 TOPS/W [134,135].
Hailo released its Hailo-8 M-2 SoC for various edge applications [136]. The com-
puting engine supports INT8 and INT16 precision. This inference engine is capable of
26 TOPS and requires 2.5 W of power. The processor can be employed as a standalone or
coprocessor [137].
Google introduced its Coral Edge TPU, which comprises only 29% of the floorplan
area of the original TPU for edge applications [127]. The Coral TPU shows high energy
efficiency in DNN computations compared to the original TPUs which are used in cloud
inference applications [267]. Coral Edge TPU supports INT8 precision and can perform
4 TOPS with 2 Watts of power consumption [127].
Google released its Tensor processor for mobile applications, coming with its recent
Pixel series mobile phone [268]. Tensor is an 8-core cortex CPU chipset fabricated with 5 nm
process technology. The processor has a 20-core Mali-G78 MP20 GPU with 2170 GFLOPS
computing speed. The processor has a built-in NPU to accelerate AI models with a perfor-
mance of 5.7 TOPS. The maximum power consumption of the processor is 10 W.
GreenWaves announced its edge inference chip GAP9 [130]. It is a very low-cost,
low-power device that consumes 50 mW and performs 50 GOPS at its peak [132], consuming
330 µW/GOP [131]. GAP9 enables hearables development through its DSP, AI accelerator,
and ultra-low-latency audio streaming on IoT devices. GAP9 supports a wide range of
computing precisions, such as INT8, 16, 24, 32 and FP16, 32 [131].
IBM introduced the NorthPole [90,140], a non-von Neumann deep learning inference
engine, at the HotChips 2023 conference. The processor shows massive parallelism with
256 cores. Each core has 768 KB of near-computer memory to store weights, activations,
and programs. The total on-chip memory capacity is 192 MB. The NorthPole processor
does not use off-chip memory to load weights or store intermediate values during deep
learning computations. Thus, it dramatically improves latency, throughput, and energy
consumption, which helps outperform existing commercial deep learning processors. The
external host processor works on three commands: write tensor, run network, and read
tensor. The NorthPole processor follows a set of pre-scheduled deterministic operations in
the core array. It is implemented in 12 nm technology and has 22 billion transistors taking
up 800 mm2 of chip area. The performance data released for the NorthPole processor are
computed in frames/s; performance metrics in integer or floating point operations/s are
currently unavailable in the public domain. However, the operations
per cycle are available for different data precisions. In vector–matrix multiplication, 8-, 4-,
and 2-bit cases can perform 2048, 4096, and 8192 operations/cycle. The FP16 can compute
256 operations/cycle (the number of cycles/s has not been released at this time). NorthPole
can compute 800, 400, and 200 TOPS with INT 2, 4, and 8 precisions. The processor can
be applied to a broad area of applications and can execute inference with a wide range of
network models applied in classification, detection, segmentation, speech recognition, and
transformer models in NLP.
Imagination introduced a wide range of edge processors with targeted applications in
IoTs to autonomous vehicles [182]. The edge processor series is categorized as the PowerVR
Series3NX and can achieve up to 160 TOPS with multicore implementations. For ultra-low-
power applications, one can choose PowerVR AX3125, which has a 0.6 TOPS computing
performance [183]. IMG 4NX MC1 is a single-core Series 4 processor for autonomous vehicle
applications and performs at 12.5 TOPS with less than 0.5 W of power consumption [184].
Intel released multiple edge AI processors such as the Nirvana Spring Crest NNP-I [269]
and Movidius [139]. Recently, they have announced a scalable fourth-generation Xeon
processor series that can be used for desktop to extreme edge devices [270]. The power
consumption for an ultra-mobile processor is around 9 W when computed with INT8
precision. The development utilizes the SuperFin fabrication technology with 10 nm
process technology. Intel is comparing its core architecture to the Skylake processor, and it
claims an efficient core achieves 40% better performance with 40% less power.
IBM developed the Artificial Intelligence Unit (AIU) based on their AI accelerator
used in the 7-nanometer Telum chip that powers its z16 system [271]. AIU is a scaled
version developed using 5 nm process technology and features a 32-core design with a total
of 23 billion transistors. AIU uses IBM’s approximate computing frameworks where the
computing executes with FP16 and FP32 precisions [272].
Leapmind has introduced the Efficiera for edge AI inference implemented in FPGA or
ASIC [21]. Efficiera targets ultra-low-power applications. The computations are typically
performed in 8-, 16-, or 32-bit precision. However, the company claims that 1-bit weight
and 2-bit activation can be achieved while still maintaining accuracy for better power and
area efficiency. They show 6.55 TOPS at 800 MHz clock frequency with an energy efficiency
of 27.7 TOPS/W [273].
Kneron released its edge inference processor, KL 720, for various applications, such as
autonomous vehicles and smart industry [47]. The KL 720 is an upgraded version of the
earlier KL 520 for similar applications. The revised version performs at 0.9 TOPS/W and
shows up to 1.4 TOPS. The neural computation supports INT8 and INT16 precisions [47].
Kneron’s most up-to-date heterogeneous AI chip is KL 530 [47]. It is enabled with a brand
new NPU, which supports INT4 precision and offers 70% higher performance than that
of INT8. The maximum power consumption of KL 530 is 500 mW and can deliver up to
1 TOPS [47].
Memryx [146] released an inference processor, MX3. This processor computes deep
learning models with 4-, 8-, or 16-bit weight and BF16 activation functions. MX3 consumes
about 1 W of power and computes 5 TFLOPS. A single die stores 10 million parameters,
so multiple chips are needed to implement larger networks.
MobileEye and STMicroelectronics released EyeQ 5 SoC for autonomous driving [147].
EyeQ 5 is four times faster than their earlier version, EyeQ 4. It can produce 2.4 TOPS/W
and goes up to 24 TOPS with 10 W of power [148]. Recently, MobileEye has announced
their next-generation processor, EyeQ6, which is around 5× faster than EyeQ5 [149]. For
INT8 precision, EyeQ5 performs 16 TOPS, and EyeQ6 shows 34 TOPS [150].
NXP introduced their edge processor i.MX 8M+ for targeted applications in vision,
multimedia, and industrial automation [84]. The system includes a powerful Cortex-A53
processor integrated with an NPU. The neural network performs 2.3 TOPS with 2 W of
power consumption. The neural computation supports INT16 precision [85]. NXP is sched-
uled to launch its next AI processor, iMX9, in 2023, with more features and efficiency [84].
NVIDIA released the Jetson Nano, which can run multiple applications in parallel,
such as image classification, object detection, segmentation, and speech processing [152].
This developer kit is supported by the NVIDIA JetPack SDK and can run state-of-the-art AI
models. The Jetson Nano consumes around 5–10 W of power and computes 472 GFLOPS
in FP16 precision. The new version of Jetson Nano B01 can perform 1.88 TOPS [274].
NVIDIA released Jetson Orin, which includes specialized development hardware,
AGX Orin. It is embedded with 32 GB of memory, has a 12-core CPU, and can exhibit a
computing performance of 275 TOPS while using INT8 precision [152]. The computer is
powered by NVIDIA ampere architecture with 2048 cores, 64 tensor cores, and 2 NVDLA
v2.0 accelerators for deep learning [153].
Qualcomm developed the QCS8250 SoC for intensive camera and edge applica-
tions [156]. This processor supports Wi-Fi and 5G for IoT applications. A quad Hexagon vector
extension V66Q with a Hexagon DSP is used for machine learning. An integrated NPU is used
for advanced video analysis. The NPU supports INT8 precision and runs at 15 TOPS [157].
Qualcomm has released the Snapdragon 888+ 5G processor for use in smartphones. It
takes the smartphone experience to a new level with AI-enhanced gaming, streaming, and
photography [158]. It includes a sixth-generation Qualcomm AI engine with the Qualcomm
Hexagon780 CPU [159,160]. The throughput of the AI engine is 32 TOPS with 5 W of
power consumption [159]. The Snapdragon 8 Gen2 mobile platform was presented at the
HotChips 2023 conference and exhibited 60% better energy efficiency than the Snapdragon
8 in INT4 precision.
Samsung announced the Exynos 2100 AI edge processor for smartphones, smart-
watches, and automobiles [164]. The Exynos supports 5G networks and performs on-device AI
computations with triple NPUs. It is fabricated using 5 nm extreme ultraviolet (EUV) technology.
The Exynos 2100 consumes 20% lower power and delivers 10% higher performance than the Exynos
990. Exynos 2100 can perform up to 26 TOPS, and it is two times more power-efficient than
the earlier version of Exynos [165]. A more powerful mobile processor, Exynos 2200, was
released recently.
SiMa.ai [169] introduced the MLSoC for computer vision applications. The MLSoC is
implemented on TSMC 16 nm technology. The accelerator can compute 50 TOPS while
consuming 10 W of power. MLSoC uses INT8 precision in computation. The processor has
4 MB of on-chip memory for deep learning operations. The processor is 1.4× more efficient
than Orin, measured in frames/W.
Tsinghua and Polar Bear Tech released their QM930 accelerator consisting of seven
chiplets [155]. The chiplets are organized as one hub chiplet and six side chiplets, forming
a hub-side processor. The processor is implemented in 12 nm CMOS technology. The
total area for the chiplets is 209 mm2 for seven chiplets. However, the total substrate area
of the processor is 1089 mm2 . The processor can compute with INT4, INT8, and INT16
precision, showing peak performances of 40, 20, and 10 TOPS, respectively. The system
energy efficiency is 1.67 TOPS/W when computed in INT8. The power consumption can
be varied from 4.5 to 12 W.
Verisilicon introduced VIP 9000 for face and voice recognition. It adopts Vivante’s
latest VIP V8 NPU architecture for processing neural networks [180]. The computing
engine supports INT8, INT16, FP16, and BF16. The performance can be scaled from 0.5 to
100 TOPS [181].
Synopsys developed the EV7x multi-core processor family for vision applications [171].
The processor integrates vector DSP, vector FPU, and a neural network accelerator. Each
VPU supports a 32-bit scalar unit. The MAC can be configured for INT8, INT16, or INT32
precisions. The chip can achieve up to 2.7 TOPS in performance [172].
Tesla designed the FSD processor which was manufactured by Samsung for au-
tonomous vehicle operations [179]. The SoC processor includes two NPUs and one GPU.
The NPUs support INT8 precision, and each NPU can compute 36.86 TOPS. The peak
performance of the FSD chip is 73.7 TOPS. The total TDP power consumption of each FSD
chip is 36 W [179].
Several other companies have also developed edge processors for various applications
but did not share hardware performance details on their websites or through publicly
available publications. For instance, Ambarella [275] has developed various edge processors
for automotive, security, consumer, industrial, and robotics applications. Ambarella's
processors are SoC types, mainly using ARM processors and GPUs for DNN computations.
processor with 5–10 W of power consumption, and its latency is 10× less than that of the
NVIDIA Jetson Nano [277]. The target applications are audio/video processing on end devices.
IBM developed the TrueNorth neuromorphic spiking system for real-time tracking,
identification, and detection [10]. It consists of 4096 neurosynaptic cores and 1 million
digital neurons. The typical power consumption is 65 mW, and the processor can execute
46 GSOPS/W, with 26 pJ per synaptic operation [10,278]. The total area of the chip is
430 mm2 , which is almost 14× bigger than that of Intel’s Loihi 2.
Innatera announced a neuromorphic chip that is fabricated using TSMC’s 28 nm
process [279]. When tested with audio signals [280], each spike event consumed about
200 fJ, while the chip consumed only 100 µW per inference event. The target application
areas are mainly audio, healthcare, radar, and voice recognition [280].
Intel released the Loihi [9], a spiking neural network chip, in 2018 and an updated
version, the Loihi 2 [9], in 2021. The Loihi 2 is fabricated using Intel’s 7 nm technology
and has 2.3 billion transistors with a chip area of 31 mm2 . This processor has 128 neuron
cores and 6 low-power x86 cores. It can evaluate up to 1 million neurons and 120 million
synapses. The Loihi chips support online learning. Loihi processors support INT8 precision.
Loihi 1 can deliver 30 GSOPS with 15 pJ per synaptic operation [138]. Both Loihi 1 and Loihi
2 consume similar amounts of power (110 mW and 100 mW, respectively [221]). However,
the Loihi 2 outperforms the Loihi 1 by 10 times. The chips can be programmed through
several frameworks, including Nengo, NxSDK, and Lava [229]. The latter is a framework
developed by Intel and is being pushed as the primary platform for programming the Loihi 2.
IMEC developed a RISC-V processor-based digital neuromorphic processor with
22 nm process technology in 2022 [281]. They implemented an optimized BF-16 processing
pipeline inside the neural process engine. The computation can also support INT4 and
INT8 precision. They used three-layer memory to reduce the chip area.
Koniku combines biological machines with silicon devices to design a microelectrode
array system core [12]. They are developing hardware and algorithms that mimic the
smell sensory receptors found in some animal noses. However, the detailed device
parameters are not publicly available. The device is mainly used in security, agriculture,
and safe flight operation [282].
Gyrfalcon's Lightspeeur PIM processors use matrix processing engines to compute a series of matrices for CNNs. The Lightspeeur 5801 has a performance of 2.8 TOPS
at 224 mW and can be scaled up to 12.6 TOPS/W. The Lightspeeur 2803S is their latest
PIM processor for the advanced edge, desktop, and data center deployments [19]. Each
Lightspeeur 2803S chip performs 16.8 TOPS while consuming 0.7 W of power, giving an
efficiency of 24 TOPS/W. Lightspeeur 2801 can compute 5.6 TOPS with an energy efficiency
of 9.3 TOPS/W. Gyrfalcon introduced its latest processor, Lightspeeur 2802, using TSMC’s
magnetoresistive random access memory technology. Lightspeeur 2802 exhibits an energy
efficiency of 9.9 TOPS/W. Janux GS31 is the edge inference server which is built with
128 Lightspeeur 2803S chips [129]. It can perform 2150 TOPS and consumes 650 W.
Mythic has announced its new analog matrix processor, the M1076 [18]. The latest version
of Mythic's PIM processor reduced its size by combining 76 analog computing tiles, while
the original one (M1108) uses 108 tiles. The smaller size makes it easier to integrate into
edge devices. The processor supports 79.69 M on-chip weights in an array of flash memory
and 19,456 ADCs for parallel processing; no external DRAM storage is required. The DNN
models are quantized from FP32 to INT8 and retrained in Mythic's analog compute engine.
A single M1076 chip can deliver up to 25 TOPS while consuming 3 W of power [88]. The
system can be scaled for high performance up to 400 TOPS by combining 16 M1076 chips,
which require 75 W [86,87].
Samsung has announced its HBM-PIM machine learning-enabled memory system
with PIM architecture [16]. This is the first successful integration of a PIM architecture into
high bandwidth memory. This technology incorporates the AI processing function into
the Samsung HBM2 Aquabolt to speed up high-speed data processing in supercomputers.
The system delivered 2.5× better performance with 60% lower energy consumption than
the earlier HBM1 [16]. Samsung's LPDDR5-PIM memory technology for mobile devices
is targeted at bringing AI capability into mobile devices without connecting to the data
center [167]. The HBM-PIM architecture is different from the traditional analog PIM
architecture, as outlined in Figure 2. It does not require data conversion and sensing
circuits, as the actual computation takes place in a near-computing module in the digital
domain. Instead, it uses a GPU surrounded by HBM stacks to realize the parallel processing
and minimize data movement [168]. Therefore, it is similar to a dataflow processor.
Syntiant has developed a line of flash memory array-based edge inference processors,
such as the NDP10x, NDP120, and NDP200 [173]. Syntiant's PIM architecture is very
energy-efficient, and it is combined with an edge-optimized training pipeline. A Cortex-M0
embedded in the system runs the NDP firmware. The NDP10x processors can hold 560 k
weights of INT4 precision and perform MAC operations with INT8 activations. The
training pipeline can build neural networks for various applications according to the
specifications, with optimized latency, memory size, and power consumption [173]. Syntiant
has released five different versions of application processors. The NDP 100 is their first AI
processor, updated in 2020, with a tiny dimension of 2.52 mm2 and ultra-low power
consumption of less than 140 µW [174]. Syntiant continues to provide more PIM processors,
named NDP 101, 102, 120, and NDP 200 [175,177,283]. The application domains are mainly
smartphones, wearable and hearable equipment, remote controls, and IoT endpoints. The
neural computations support INT 1, 2, 4, and 8 precision. The energy efficiency of the
NDP 10x series, which includes the NDP 100, NDP 101, and NDP 102, is 2 TOPS/W [284].
The NDP 120 [175] and NDP 200 exhibit 1.9 GOPS/W and 6.4 GOPS/W [177], respectively.
Untether has developed its PIM AI accelerator card TsunAImi [182] for inference
at the data center or in the server. At the heart of the TsunAImi are four runAI200 chips
fabricated by TSMC with standard SRAM arrays. Each runAI200 chip features 511 cores
and 192 MB of SRAM memory. runAI200 computes in INT8 precision and performs
502 TOPS at 8 TOPS/W, which is 3× more than NVIDIA's Ampere A100 GPU. The resulting
performance of the TsunAImi system is 2008 TOPS at 400 W [183].
UPMEM PIM innovatively places thousands of DPU units within DRAM memory
chips [184]. The DPUs are controlled by high-level applications running on the main CPU.
Each DIMM consists of 16 PIM-enabled chips, and each PIM chip has 8 DPUs; thus, 128 DPUs
are contained in each UPMEM DIMM [185].
The system is massively parallel: up to 2560 DPU units can be assembled into a single
server with 256 GB of PIM DRAM. The computing power is 15× that of an x86 server using
only the main CPU. The throughput benchmarked for INT32 addition is 58.3 MOPS/DPU [186];
across 2560 DPUs, this corresponds to roughly 0.149 TOPS, the figure listed in Table 2. This
system is suitable for DNA sequencing, genome comparison, phylogenetics, metagenomic
analysis, and more [187].
The IBM NorthPole has 200 TOPS for INT8 precision at 60 W (based on a discussion
with IBM). However, the NorthPole can reach higher TOPS of 400 and 800 at 4- and 2-bit
precision, respectively. According to a recent NorthPole article, the maximum power
consumption of the NorthPole processor is 74 W [90].
Among neuromorphic processors, Loihi 2 outperforms the other neuromorphic processors,
except for the Akida AKD1000. The AKD1000, however, consumes 20× more power than
the Loihi 2 (see Table 2). Although the neuromorphic processors seem less impressive in
terms of TOPS vs. W, it is important to note that they generally need far fewer synaptic
operations to perform a task if the task is performed with an algorithm that is natively
spiking (i.e., not a deep network implemented with spiking neurons) [287].
The neuromorphic processors consume significantly less energy than other processors
for inference tasks [227]. For example, the Loihi processor consumes 5.3× less energy than
the Intel Movidius and 20.5× less energy than the NVIDIA Jetson Nano [227]. Figure 3
shows that higher-performance PIM processors (such as the M1076, M1108, LS-2803S, and
AnIA) exhibit similar computing speeds as dataflow or neuromorphic processors within
the same range of power consumption (0.5 to 1.5 W).
Precision: Data precision is an important consideration when comparing processor
performance. Figure 4 presents the precision of the processors from Figure 3. Figure 5 shows
the distribution of precision and the total number of processors for each architecture category.
A processor may support more than one type of computing precision; Figures 3 and 4 are
based on the highest precision supported by each processor.
Among dataflow processors, INT8 is the most widely supported precision for DNN
computations. NVIDIA's Orin achieves 275 TOPS with INT8 precision, the maximum
computing speed for INT8 precision in Figure 5. However, some processors utilize INT1
(Efficiera), INT64 (A15, A14, and M1), FP16 (ARA-1, DNA100, Jetson Nano, Snapdragon
888+), and INT16 (Ethos78 and Movidius). Neuromorphic and PIM processors mainly
support INT1 to INT8 data precisions. Lower computing precisions generally reduce the
inference accuracy. According to [236], VGG-9 and ResNet-18 have accuracy losses of
3.89% and 6.02%, respectively, for inference computed with INT1 precision. A more
in-depth discussion of the relationship between quantization and accuracy is presented in
Section 3.1. A higher precision provides better accuracy but incurs more computing costs.
Figure 5 shows that the most common precision among the processors examined is INT8,
which provides a good balance between accuracy and computational costs.
Figure 3. Power consumption and performance of AI edge processors.
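The earlier observation that spiking processors need far fewer synaptic operations can be made concrete with a toy leaky integrate-and-fire (LIF) layer, where synaptic work occurs only when input spikes arrive. This is purely illustrative; real neuromorphic cores implement this event-driven behavior in hardware.

```python
import numpy as np

# Toy leaky integrate-and-fire layer: synaptic work is event-driven,
# so cost scales with the number of spikes, not the number of synapses.
rng = np.random.default_rng(0)
n_pre, n_post = 64, 32
w = rng.normal(0, 0.5, (n_pre, n_post))      # synaptic weights
v = np.zeros(n_post)                          # membrane potentials
leak, v_th = 0.9, 1.0                         # decay factor and firing threshold

synops = 0
for t in range(100):
    pre_spikes = rng.random(n_pre) < 0.05     # sparse input spikes (~5% active)
    active = np.flatnonzero(pre_spikes)
    synops += active.size * n_post            # synaptic operations actually performed
    v = leak * v + w[active].sum(axis=0)      # accumulate only the active rows
    fired = v >= v_th
    v[fired] = 0.0                            # reset neurons that fired
```

With ~5% input activity, the loop above performs roughly 20× fewer synaptic operations than a dense layer evaluated every step, which is the source of the energy advantage discussed above.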
Figure 4. Power vs. performance of edge processors with computing precision.
Figure 5. Number of edge processors supporting various degrees of data precision. The total number of processors is indicated in the legend.
As shown in Figures 4 and 5, almost all the neuromorphic processors use INT8 for
synaptic computations. The exception to this is the AKD1000, which uses INT4 and shows
the best performance among neuromorphic processors in terms of operations per second
(1.5 TOPS). However, it consumes around 18× more power than Loihi processors. At INT8
precision, the Loihi 1 performs 30 GSOPS using 110 mW [138,223], whereas Loihi 2 surpasses
this throughput by 10×, with a similar power consumption [9].
As shown in Figures 4 and 5, PIM processors primarily support precisions of INT1 to
INT8. Figure 5 shows the performance of PIM processors in INT4 and INT8 precisions due
to the unavailability of data for all supported precisions. Mythic processors (M1108 and
M1076) manifest the best performance among PIM processors. Mythic and Syntiant have
developed their PIM processors with flash memory devices. However, Mythic processors
require significantly higher power to compute DNNs in INT8 precision with their 76 computing
tiles. Syntiant processors use INT4 precision and compute with about 13,000× lower
throughput than the Mythic M1076 while consuming about 6000× less power. The Syntiant
processors are limited to smaller networks with up to 64 classes in the NDP10x. On the other
hand, Mythic processors can handle 10× more weights with greater precision [283]. The
Samsung DRAM architecture-based PIM processor uses computing modules near the
memory banks and supports INT64 precision [16].
Energy Efficiency: Figure 6 presents the performance vs. energy efficiency of dataflow,
PIM, and neuromorphic processors. Efficiency determines the computing throughput
of a processor per watt. The energy efficiency of all PIM processors lies within 1 to
16 TOPS/W, whereas most of the dataflow processors lie in the 0.1 to 55 TOPS/W
range. The PIM architecture reduces latency by executing the computation inside the
memory modules, which increases computing performance and reduces power consumption.
Loihi 2 manifests the best energy efficiency among all neuromorphic processors. Energy
efficiency vs. power consumption, as shown in Figure 7, gives us a better understanding
of the processors. Loihi 2 shows better energy efficiency than many high-performance
edge AI processors while consuming very little power. Ergo is the most energy-efficient
processor among all dataflow processors, at 55 TOPS/W.
Figure 6. Performance and energy efficiency of edge processors.
Chip Area: The area is an important factor for choosing a processor for AI applications on edge devices. Modern processor technologies are pushing the boundaries to fabricate systems with very high density and superior performance at the same time. A smaller die area and lower power consumption are very important for battery-powered edge devices. The chip area is related to the cost of silicon fabrication and also defines the application area. A smaller chip with high performance is desirable for edge applications.
Figures 7 and 8 present the power consumption and performance, respectively, vs. the chip area. It can be observed that, in general, both the power consumption and performance increase with chip area. Based on the available chip sizes, the NorthPole has the largest chip size of 800 mm2 and performs 200 TOPS in INT8. The lowest-area chips have a dataflow architecture. Figure 9 shows the energy efficiency vs. area as the combined relationship of Figures 7 and 8. In this figure, the PIM processors form a cluster. The overall energy efficiency of this PIM cluster is higher than that of dataflow and neuromorphic processors of similar chip area. Some dataflow processors (such as Nema Pico, Efficiera, and IMG 4NX) exhibit higher energy efficiency and better performance vs. area than other processors.
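Performance per unit area can be derived from Figure 8 in the same way; a short sketch using the NorthPole numbers above (800 mm2, 200 TOPS) shows the computation:

# Area efficiency (TOPS/mm^2) from the chip size and throughput quoted above.
northpole_tops = 200.0
northpole_area_mm2 = 800.0
print(northpole_tops / northpole_area_mm2)   # 0.25 TOPS/mm^2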
Figure 7. Power consumption vs. area of edge processors.
Figure 8. Area vs. performance of edge processors.
Figure 9. Area vs. energy efficiency of edge processors.
7.2. AI Edge Processors with PIM Architecture
While Figures 3–11 describe processors of all types, Figure 12 shows the relationship only between PIM processors that have either been announced as products or are still in industrial research. The research processors were presented at conferences such as ISSCC and VLSI. The PIM processors at the lower right corner of Figure 11 are candidates for data center and intensive computing applications [182–187]. PIM processors with higher energy efficiency are suitable for edge and IoT applications because of their smaller size and lower power consumption. From Figure 12, we can see that most of the PIM processors under industrial research show higher energy efficiency than already announced processors. This indicates that future PIM processors are likely to have much better performance and efficiency.
The PIM processors compute the MAC operation inside the memory array, thus reducing the data transfer latency. Generally, PIM processors compute in lower-integer/fixed-point precision. A PIM processor generally supports INT1–INT16 precision. However, according to our study, around 59% of the PIM processors support INT8 precision for MAC operation, as shown in Figure 5. Low-precision computation is faster and requires lower power consumption compared to dataflow processors. PIM edge processors consume 0.0001 to 4 W for deep learning inference applications, as presented in Table 2 and Figure 3. However, the dataflow processors suffer from high memory requirements and latency issues, and they consume higher power than most of the PIM processors in order to achieve the same performance, as seen in Figures 3–5.
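To illustrate the low-precision MAC arithmetic described above (a minimal NumPy sketch of generic INT8 quantized computation, not the circuit-level behavior of any particular PIM macro):

import numpy as np

# Quantize weights and activations to INT8, accumulate in INT32, dequantize.
# The per-tensor scales below are illustrative, not taken from any processor.
w_fp = np.random.randn(64).astype(np.float32)        # weights
x_fp = np.random.randn(64).astype(np.float32)        # activations
w_scale = np.abs(w_fp).max() / 127.0
x_scale = np.abs(x_fp).max() / 127.0
w_q = np.clip(np.round(w_fp / w_scale), -128, 127).astype(np.int8)
x_q = np.clip(np.round(x_fp / x_scale), -128, 127).astype(np.int8)
acc = np.sum(w_q.astype(np.int32) * x_q.astype(np.int32))   # wide accumulator
y = acc * (w_scale * x_scale)                                # dequantized output
print(y, np.dot(w_fp, x_fp))   # close to the FP32 dot product, not identical

In a PIM array, the multiply-accumulate step happens where w_q is stored, so only x_q and the accumulated result cross the memory boundary.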
From Figures 3 and 4, Syntiant's NDP200 consumes less than 1 mW and shows the highest performance for extreme edge applications. Mythic M1108 consumes 4 W and exhibits higher performance (35 TOPS) than any dataflow or neuromorphic processor that consumes below 10 W of power. For the same chip area, the M1108 consumes 9× less power than Tesla's dataflow processor FSD, while the FSD computes 2× faster than the M1108, as presented in Figures 7 and 8.
Figure 12. Performance vs. energy efficiency of PIM/CIM processors. Processors with an asterisk (*) are still undergoing industrial research; the other processors have been released or announced by the manufacturer. The references used in this chart are [69,189,192,236–241,245–251,257,258,262,263] for PIM and [74,188,242–244,252–257,259,261,264] for dataflow architectures.
For the processors below 100 mm2, Gyrfalcon's LS2803 shows the highest performance except for the EyeQ5. However, the EyeQ5 consumes about 14× higher power and performs 1.4× better than the LS2803. The benefit of deploying PIM processors for edge applications is high performance with low power consumption, and PIM processors reduce the computing latency significantly as the MAC operations are performed inside the memory array.
Figure 13. PIM (red) and dataflow (blue) processors in industrial research. The references used in this chart are [69,189,192,236–241,245–251,257,258,262,263] for the PIM processors and [74,188,242–244,252–257,259,261,264] for the dataflow processors.
7.3. Edge Processors in Industrial Research
Several companies, along with their collaborators, are developing edge computing architectures and infrastructures with state-of-the-art performance. Figure 13 shows the power consumption vs. energy efficiency of the industrial research processors, which were presented at high-tier conferences (such as ISSCC and VLSI). The chart includes both PIM [69,189,192,236–241,245–251,257,258,262,263] and dataflow [74,188,242–244,252–257,259,261,264] processors.
Renesas Electronics presented a near-memory system at ISSCC 2024, developed in a 14 nm process, that achieved 130.55 TOPS with 23.9 TOPS/W [264]. TSMC and National Tsing Hua University presented a near-memory system in a 22 nm CMOS process at ISSCC 2023 that computes 0.59 TOPS at 160 TOPS/W in 8-8-26-bit (input–weight–output) precision [260]. This system showed the highest energy efficiency amongst the near-memory systems.
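A brief aside on the 8-8-26-bit notation, under our reading (the bit-width breakdown is not spelled out in this survey): a signed 8 × 8-bit product occupies at most 16 bits, and each doubling of the number of accumulated products costs one more accumulator bit, so a 26-bit output absorbs on the order of 2^10 = 1024 accumulations without overflow. A quick check:

# Worst-case accumulation of N signed INT8 x INT8 products (illustrative N).
N = 1024                                  # hypothetical MACs per output
worst_product = (-128) * (-128)           # 16384, largest-magnitude INT8 product
bits_needed = (N * worst_product).bit_length() + 1   # +1 for the sign bit
print(bits_needed)                        # 26 -> fits a 26-bit signed accumulator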
Candidate processors are Horizon's Journey series, Tesla's FSD, NVIDIA's Orin, Mobileye's EyeQ, and IBM's NorthPole.
However, the prices of commercially available edge processors vary based on computing capability, energy efficiency, and the type of application. In this context, we can say that, in general, the cost of a processor scales with performance (TOPS). The processors located in the lower left corner in Figures 3 and 4
exhibit the lowest performance and are in use in wearable AI devices that cost only a few
dollars (USD 3–USD 10) [177]. The mid-range processors cost around USD 100, and the
target applications are security and tracking applications. In this category, the Google Coral
Edge TPU board costs USD 98 [288]. High-end edge processors can compute more than
100 TOPS and cost a few hundred to a couple of thousand dollars. These processors are
mainly used in autonomous vehicles and in industry. For example, the current market
price of the Tesla FSD is USD 8000 [289], and the NVIDIA Jetson Orin costs around USD
2000 [290].
8. Summary
This article reviewed different aspects and paradigms of AI edge processors released or
announced recently by various tech companies. About 100 edge processors were examined.
This work, however, did not cover DNN algorithms, HPC computing processors, or
cloud computing. We categorized state-of-the-art edge processors and analyzed their
performance, area, and energy efficiency to support the research community in edge
computing. Multiple processing architectures including dataflow, neuromorphic, and PIM
were examined. The performance and power consumption were analyzed to narrow down edge AI processor choices for specific applications. The supported deep neural networks and software frameworks were discussed and are presented in tables.
Several of the edge processors offer on-chip retraining in real time. This enables the
retraining of networks without having to send sensitive data to the cloud, thus increasing
security and privacy. Intel’s Loihi 2 and Brainchip’s Akida processor can be retrained on
local data for personalized applications and faster response rates.
This study found that the power consumption and performance of processors vary across architectures and application domains. For extreme wearable edge devices, power
consumption ranges from 100 µW to a few mW, and computing throughput is around
1 GOPS. We found that many applications require higher computing performance, such as
video processing and autonomous car operations. These high-performance applications
consume a higher amount of power than extreme edge processors. For example, IBM’s
NorthPole computes at 200 TOPS with INT8 while consuming 60 W of power. This study
found that for the same range of power consumption and chip size, PIM architectures
perform better than dataflow or neuromorphic processors. This review found that the PIM processors show significant energy efficiency and consume less power compared to dataflow and neuromorphic processors. For example, the Mythic M1108 is a PIM processor and has higher performance (35 TOPS) than any dataflow or neuromorphic processor that consumes less than 10 W of power. Neuromorphic processors are highly efficient at performing computation with fewer synaptic operations but may not be ideal for deep learning applications yet.
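The efficiency argument for neuromorphic processors can be made concrete with a simple operation count; this is a minimal sketch under the common assumption that a synaptic operation is only charged when a presynaptic neuron actually spikes, with the layer size and 5% activity chosen purely for illustration:

# Dense MACs vs. event-driven synaptic ops for one 256 -> 256 layer.
n_in, n_out = 256, 256
activity = 0.05                                # assumed spiking fraction
dense_macs = n_in * n_out                      # every weight touched once
synaptic_ops = int(activity * n_in) * n_out    # only spiking inputs fan out
print(dense_macs, synaptic_ops)                # 65536 vs. 3072, ~21x fewer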
There are different types of deep learning frameworks for developing edge accelerators.
The most common frameworks are TFL, ONNX, and Caffe2. Some providers developed their own frameworks to ease development for users; for example, Kalray provides KaNN, and CEVA provides CEVA-DNN. Overall, TFL, Caffe2, and ONNX are the most
popular platforms for developing DNN accelerator systems. Neuromorphic processors
have different frameworks which support spike generation and computation, such as
Nengo and Lava.
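As a concrete example of the most common deployment path, the sketch below uses the public TensorFlow Lite converter API for full-INT8 post-training quantization; the toy Keras model and random calibration data are placeholders, and the resulting .tflite file is what edge runtimes and many vendor toolchains consume:

import numpy as np
import tensorflow as tf

# Toy model standing in for a real edge network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4),
])

def representative_data():
    # Calibration samples set the INT8 quantization ranges.
    for _ in range(100):
        yield [np.random.rand(1, 32).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()
open("edge_model.tflite", "wb").write(tflite_model)   # deployable flatbuffer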
There are several emerging deep learning applications that are attracting significant
interest. This includes generative AI models, such as transformer models used in ChatGPT
and DALL-E for automated art generation. Transformer models are taking the AI world by storm.
Author Contributions: S.A. and T.M.T. are the main contributors. They conceptualized the review
idea and collected data, visualized and critically analyzed hardware performance. C.Y., Q.W., M.B.
and S.K. have contributed equally to review and editing. All authors have read and agreed to the
published version of the manuscript.
Funding: Funding is provided by the Department of Electrical and Computer Engineering, University
of Dayton, Dayton, OH 45469, USA.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Merenda, M. Edge machine learning for ai-enabled iot devices: A review. Sensors 2020, 20, 2533. [CrossRef] [PubMed]
2. Vestias, M.P.; Duarte, R.P.; de Sousa, J.T.; Neto, H.C. Moving Deep Learning to the Edge. Algorithms 2020, 13, 125. [CrossRef]
3. IBM. Why Organizations Are Betting on Edge Computing? May 2020. Available online: https://www.ibm.com/thought-
leadership/institute-business-value/report/edge-computing (accessed on 1 June 2023).
4. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646.
[CrossRef]
5. Statista. IoT: Number of Connected Devices Worldwide 2015–2025. November 2016. Available online: https://www.statista.com/
statistics/471264/iot-number-of-connected-devices-worldwide/ (accessed on 5 June 2023).
6. Chabas, J.M.; Gnanasambandam, C.; Gupte, S.; Mahdavian, M. New Demand, New Markets: What Edge Computing Means for
Hardware Companies; McKinsey & Company: New York, NY, USA, 2018. Available online: https://www.mckinsey.com/
industries/technology-media-and-telecommunications/our-insights/new-demand-new-markets-what-edge-computing-
means-for-hardware-companies (accessed on 22 July 2023).
7. Google. Cloud TPU. Available online: https://cloud.google.com/tpu (accessed on 5 May 2023).
8. Accenture Lab. Driving Intelligence at the Edge with Neuromorphic Computing. 2021. Available online: https://www.accenture.
com/_acnmedia/PDF-145/Accenture-Neuromorphic-Computing-POV.pdf (accessed on 3 June 2023).
9. Intel Labs. Technology Brief. Taking Neuromorphic Computing to the Next Level with Loihi 2. 2021. Available online:
https://www.intel.com/content/www/us/en/research/neuromorphic-computing-loihi-2-technology-brief.html (accessed on
10 May 2023).
10. Akopyan, F. TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip. IEEE Trans.
Comput. Des. Integr. Circuits Syst. 2015, 34, 1537–1557. [CrossRef]
11. Videantis. July 2020. Available online: https://www.videantis.com/videantis-processor-adopted-for-tempo-ai-chip.html
(accessed on 11 June 2022).
12. Konikore. A living Breathing Machine. 2021. Available online: https://good-design.org/projects/konikore/ (accessed on
10 July 2022).
13. Kalray. May 2023. Available online: https://www.kalrayinc.com/press-release/projet-ip-cube/ (accessed on 7 July 2023).
14. Brainchip. 2023. Available online: https://brainchipinc.com/akida-neuromorphic-system-on-chip/ (accessed on 21 July 2023).
15. Synsence. May 2023. Available online: https://www.synsense-neuromorphic.com/technology (accessed on 1 June 2023).
16. Samsung. HBM-PIM. March 2023. Available online: https://www.samsung.com/semiconductor/solutions/technology/hbm-
processing-in-memory/ (accessed on 25 July 2023).
17. Upmem. Upmem-PIM. October 2019. Available online: https://www.upmem.com/nextplatform-com-2019-10-03-accelerating-
compute-by-cramming-it-into-dram/ (accessed on 7 May 2023).
48. Wang, Q.; Yu, N.; Zhang, M.; Han, Z.; Fu, G. N3LDG: A Lightweight Neural Network Library for Natural Language Processing.
Beijing Da Xue Xue Bao 2019, 55, 113–119. [CrossRef]
49. Desai, S.; Goh, G.; Babu, A.; Aly, A. Lightweight convolutional representations for on-device natural language processing. arXiv
2020, arXiv:2002.01535.
50. Zhang, M.; Yang, J.; Teng, Z.; Zhang, Y. Libn3l: A lightweight package for neural nlp. In Proceedings of the Tenth International
Conference on Language Resources and Evaluation (LREC’16), Portoroz, Slovenia, 23–28 May 2016; pp. 225–229. Available online:
https://aclanthology.org/L16-1034 (accessed on 6 July 2023).
51. Tay, Y.; Zhang, A.; Tuan, L.A.; Rao, J.; Zhang, S.; Wang, S.; Fu, J.; Hui, S.C. Lightweight and efficient neural natural language
processing with quaternion networks. arXiv 2019, arXiv:1906.04393.
52. Gyrfalcon. Lightspeeur 5801S Neural Accelerator. 2022. Available online: https://www.gyrfalcontech.ai/solutions/lightspeeur-
5801/ (accessed on 10 December 2022).
53. Liu, D.; Kong, H.; Luo, X.; Liu, W.; Subramaniam, R. Bringing AI to edge: From deep learning’s perspective. Neurocomputing 2022,
485, 297–320. [CrossRef]
54. Li, H. Application of IOT deep learning in edge computing: A review. Acad. J. Comput. Inf. Sci. 2021, 4, 98–103.
55. Zaidi, S.S.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A survey of modern deep learning based object detection
models. Digit. Signal Process. 2022, 126, 103514. [CrossRef]
56. Chen, J.; Ran, X. Deep Learning with Edge Computing: A Review. Proc. IEEE 2019, 107, 1655–1674. [CrossRef]
57. Rawat, W.; Wang, Z. Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review. Neural Comput.
2017, 29, 2352–2449. [CrossRef] [PubMed]
58. Al-Saffar, A.M.; Tao, H.; Talab, M.A. Review of deep convolution neural network in image classification. In Proceedings of
the 2017 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), Jakarta,
Indonesia, 23–24 October 2017; pp. 26–31. [CrossRef]
59. Iandola, N.F.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer
parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
60. Elhassouny, A.; Smarandache, F. Trends in deep convolutional neural Networks architectures: A review. In Proceedings of
the 2019 International Conference of Computer Science and Renewable Energies (ICCSRE), Agadir, Morocco, 22–24 July 2019;
pp. 1–8.
61. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient
convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
62. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. arXiv 2018,
arXiv:1801.04381.
63. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching
for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27
October-2 November 2019; pp. 1314–1324.
64. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018;
pp. 6848–6856.
65. Ningning, M.X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the
European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
66. Mingxing, T.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International
Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
67. Niv, V. Hailo blog. Object Detection at the Edge: Making the Right Choice. AI on the Edge: The Hailo Blog. October 2022.
Available online: https://hailo.ai/blog/object-detection-at-the-edge-making-the-right-choice/ (accessed on 4 January 2023).
68. Zhao, Z.-Q.; Zheng, P.; Xu, S.T.; Wu, X. Object Detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019,
30, 3212–3232. [CrossRef] [PubMed]
69. Hung, J.-M.; Huang, Y.H.; Huang, S.P.; Chang, F.C.; Wen, T.H.; Su, C.I.; Khwa, W.S.; Lo, C.C.; Liu, R.S.; Hsieh, C.C.; et al. An 8-Mb
DC-Current-Free Binary-to-8b Precision ReRAM Nonvolatile Computing-in-Memory Macro using Time-Space-Readout with
1286.4-21.6TOPS/W for Edge-AI Devices. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC),
San Francisco, CA, USA, 20–26 February 2022; pp. 1–3. [CrossRef]
70. Oruh, J.; Viriri, S.; Adegun, A. Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition.
IEEE Access 2022, 10, 30069–30079. [CrossRef]
71. Liu, B.; Zhang, W.; Xu, X.; Chen, D. Time Delay Recurrent Neural Network for Speech Recognition. J. Phys. Conf. Ser. 2019, 1229,
012078. [CrossRef]
72. Zhao, Y.; Li, J.; Wang, X.; Li, Y. The Speechtransformer for Large-scale Mandarin Chinese Speech Recognition. In Proceedings
of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK,
12–17 May 2019; pp. 7095–7099. [CrossRef]
73. Omar, M.; Choi, S.; Nyang, D.; Mohaisen, D. Natural Language Processing: Recent Advances, Challenges, and Future Directions.
arXiv 2022, arXiv:2201.00768. [CrossRef]
74. Yuan, Z.; Yang, Y.; Yue, J.; Liu, R.; Feng, X.; Lin, Z.; Wu, X.; Li, X.; Yang, H.; Liu, Y. 14.2 A 65 nm 24.7 µJ/Frame 12.3
mW Activation-Similarity-Aware Convolutional Neural Network Video Processor Using Hybrid Precision, Inter-Frame Data
Reuse and Mixed-Bit-Width Difference-Frame Data Codec. In Proceedings of the 2020 IEEE International Solid-State Circuits
Conference (ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 232–234. [CrossRef]
75. Geoff, T. Advantages of BFloat16 for AI Inference. October 2019. Available online: https://semiengineering.com/advantages-of-
bfloat16-for-ai-inference/ (accessed on 7 January 2023).
76. OpenAI. GPT-4: Technical Report. arXiv 2023, arXiv:2303.08774.
77. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multi-task learners. OpenAI
Blog 2019, 1, 9.
78. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
79. Fedus, W. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv 2021,
arXiv:2101.03961.
80. Cao, Q.; Trivedi, H.; Balasubramanian, A.; Balasubramanian, N. DeFormer: Decomposing pre-trained transformers for faster
question answering. arXiv 2020, arXiv:2005.00697.
81. Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. Mobilebert: A compact task-agnostic bert for resource-limited devices. arXiv
2020, arXiv:2004.02984.
82. Garret. The Syntiant Journey and the Pervasive NDP. Blog Post, Processor, August 2021. Available online: https://www.edge-
ai-vision.com/2021/08/the-syntiant-journey-and-the-pervasive-ndp/#:~:text=In%20the%20summer%20of%202019,will%20
capitalize%20on%20the%20momentum (accessed on 5 May 2022).
83. NXP. iMX Application Processors. Available online: https://www.nxp.com/products/processors-and-microcontrollers/arm-
processors/i-mx-applications-processors/i-mx-9-processors:IMX9-PROCESSORS (accessed on 10 July 2023).
84. NXP. i.MX 8M Plus-Arm Cortex-A53, Machine Learning Vision, Multimedia and Industrial IoT. Available online: https://www.
nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-8-processors/i-mx-
8m-plus-arm-cortex-a53-machine-learning-vision-multimedia-and-industrial-iot:IMX8MPLUS (accessed on 17 June 2023).
85. NXP Datasheet. i.MX 8M Plus SoM Datasheet. Available online: https://www.solid-run.com/wp-content/uploads/2021/06/i.
MX8M-Plus-Datasheet-2021-.pdf (accessed on 10 February 2023).
86. Deleo, Cision, PRNewwire. Mythic Expands Product Lineup with New Scalable, Power-Efficient Analog Matrix Processor for
Edge AI Applications. Mythic 1076. Available online: https://www.prnewswire.com/news-releases/mythic-expands-product-
lineup-with-new-scalable-power-efficient-analog-matrix-processor-for-edge-ai-applications-301306344.html (accessed on 10
May 2023).
87. Foxton, S.W. EETimes. Mythic Launches Second AI Chip. Available online: https://www.eetasia.com/mythic-launches-second-
ai-chip/ (accessed on 20 April 2022).
88. Fick, L.; Skrzyniarz, S.; Parikh, M.; Henry, M.B.; Fick, D. Analog Matrix Processor for Edge AI Real-Time Video Analytics. In
Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February
2022; pp. 260–262.
89. Gyrfalcon. PIM AI Accelerators. Available online: https://www.gyrfalcontech.ai/ (accessed on 1 August 2023).
90. Modha, D.S.; Akopyan, F.; Andreopoulos, A.; Appuswamy, R.; Arthur, J.V.; Cassidy, A.S.; Datta, P.; DeBole, M.V.; Esser, S.K.;
Otero, C.O.; et al. IBM NorthPole neural inference machine. In Proceedings of the HotChips Conference, Palo Alto, CA, USA,
27–29 August 2023.
91. Cheng, Y.; Wang, D.; Zhou, P.; Zhang, T. A survey of model compression and acceleration for deep neural networks. arXiv 2020,
arXiv:1710.09282.
92. Deng, L.; Li, G.; Han, S.; Shi, L.; Xie, Y. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive
Survey. Proc. IEEE 2020, 108, 485–532. [CrossRef]
93. Nan, K.; Liu, S.; Du, J.; Liu, H. Deep model compression for mobile platforms: A survey. Tsinghua Sci. Technol. 2019, 24, 677–693.
[CrossRef]
94. Berthelier, A.; Chateau, T.; Duffner, S.; Garcia, C.; Blanc, C. Deep Model Compression and Architecture Optimization for
Embedded Systems: A Survey. J. Signal Process. Syst. 2021, 93, 863–878. [CrossRef]
95. Lei, J.; Gao, X.; Song, J.; Wang, X.L.; Song, M.L. Survey of Deep Network Model Compression. J. Softw. 2018, 29, 251–266.
[CrossRef]
96. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and
huffman coding. arXiv 2015, arXiv:1510.00149.
97. Qin, Q.; Ren, J.; Yu, J.; Wang, H.; Gao, L.; Zheng, J.; Feng, Y.; Fang, J.; Wang, Z. To compress, or not to compress: Characterizing
deep learning model compression for embedded inference. In Proceedings of the 2018 IEEE Intl Conf on Parallel & Distributed
Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing
& Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), Melbourne,
Australia, 11–13 December 2018; pp. 729–736. [CrossRef]
98. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural
Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2704–2713. [CrossRef]
99. Chunyu, Y.; Agaian, S.S. A comprehensive review of Binary Neural Network. arXiv 2023, arXiv:2110.06804.
100. Analog Devices Inc. MAX78000. Available online: https://www.analog.com/en/products/max78000.html (accessed on
9 July 2024).
101. Mouser Electronics. Maxim Integrated’s New Neural-Network-Accelerator MAX78000 SoC Now Available at Mouser. Available
online: https://www.mouser.com/publicrelations_maxim_max78000_2020final/ (accessed on 9 July 2024).
102. Apple. Press Release. Apple Unleashes M1. 10 November 2020. Available online: https://www.apple.com/newsroom/2020/11/
apple-unleashes-m1/ (accessed on 5 December 2021).
103. Nanoreview.net. A14 Bionic vs. A15 Bionic. Available online: https://nanoreview.net/en/soc-compare/apple-a15-bionic-vs-
apple-a14-bionic (accessed on 16 June 2023).
104. Cross, J. Macworld. Apple’s A16 Chip Doesn’t Live up to Its ‘Pro’ Price or Expectations. Available online: https://www.
macworld.com/article/1073243/a16-processor-cpu-gpu-lpddr5-memory-performance.html (accessed on 1 January 2023).
105. Merrit, R. Startup Accelerates AI at the Sensor. EETimes 11 February 2019. Available online: https://www.eetimes.com/startup-
accelerates-ai-at-the-sensor/ (accessed on 10 June 2023).
106. Clarke, P. Indo-US Startup Preps Agent-based AI Processor. EENews. 26 August 2018. Available online: https://www.
eenewsanalog.com/en/indo-us-startup-preps-agent-based-ai-processor-2/ (accessed on 20 June 2023).
107. Ghilardi, M. Synsense Secures Additional Capital from Strategic Investors. News Synsecse. 18 April 2023. Available online:
https://www.venturelab.swiss/SynSense-secures-additional-capital-from-strategic-investors (accessed on 5 May 2023).
108. ARM, NPU, Ethos-N78. Highly Scalable and Efficient Second Generation ML Inference Processor. Available online: https://www.
arm.com/products/silicon-ip-cpu/ethos/ethos-n78 (accessed on 15 May 2022).
109. Frumusanu. Arm Announces Ethos-N78: Bigger and More Efficient. Anandtech. 27 May 2020. Available online: https://www.
anandtech.com/show/15817/arm-announces-ethosn78-npu-bigger-and-more-efficient (accessed on 25 April 2022).
110. AIMotive. Industry High 98% Efficiency Demonstrated Aimotive and Nextchip. 15 April 2021. Available online: https://aimotive.
com/-/industry-high-98-efficiency-demonstrated-by-aimotive-and-nextchip (accessed on 25 March 2022).
111. AIMotive. NN Acceleration for Automotive AI. Available online: https://aimotive.com/aiware-apache5 (accessed on
25 May 2022).
112. Blaize. 2022 Best Edge AI Processor Blaize Pathfinder P1600 Embedded System on Module. Available online: https://www.
blaize.com/products/ai-edge-computing-platforms/ (accessed on 5 December 2022).
113. Wheeler, B. Bitmain SoC Brings AI to the Edge. Available online: https://www.linleygroup.com/newsletters/newsletter_detail.
php?num=5975&year=2019&tag=3 (accessed on 23 July 2023).
114. Liang, W. Get Started, Neural Network Stick. Github. 10 May 2019. Available online: https://github.com/BM1880-BIRD/bm188
0-system-sdk/wiki/GET-STARTED (accessed on 16 May 2023).
115. Brainchip. Introducing the ADK1000 IP and NSOM for Edge AI IoT. May 2020. Available online: https://www.youtube.com/
watch?v=EUGx45BCKlE (accessed on 20 November 2022).
116. Clarke, P. eeNews. Akida Spiking Neural Processor Could Head to FDSOI. 2 August 2021. Available online: https://www.
eenewsanalog.com/news/akida-spiking-neural-processor-could-head-fdsoi (accessed on 25 November 2022).
117. Gwennap, L. Kendryte Embeds AI for Surveillance. Available online: https://www.linleygroup.com/newsletters/newsletter_
detail.php?num=5992 (accessed on 14 July 2023).
118. Canaan. Kendryte K210. Available online: https://canaan.io/product/kendryteai (accessed on 15 May 2023).
119. CEVA. Edge AI & Deep Learning. Available online: https://www.ceva-dsp.com/app/deep-learning/ (accessed on 10 July 2023).
120. Demler, M. CEVA NeuPro Accelerates Neural Nets. Microprocessor Report, January 2018. Available online: https://www.ceva-
dsp.com/wp-content/uploads/2018/02/Ceva-NeuPro-Accelerates-Neural-Nets.pdf. (accessed on 10 July 2023).
121. Cadence. Tensilica AI Platform. Available online: https://www.cadence.com/en_US/home/tools/ip/tensilica-ip/tensilica-ai-
platform.html (accessed on 12 December 2022).
122. Cadence Newsroom. Cadence Accelerates Intelligent SoC Development with Comprehensive On-Device Tensilica AI Platform.
13 September 2021. Available online: https://www.cadence.com/en_US/home/company/newsroom/press-releases/pr/2021
/cadence-accelerates-intelligent-soc-development-with-comprehensi.html (accessed on 25 August 2022).
123. Maxfield, M. Say Hello to Deep Vision’s Polymorphic Dataflow Architecture. EE Journal 24 December 2020. Available online:
https://www.eejournal.com/article/say-hello-to-deep-visions-polymorphic-dataflow-architecture/ (accessed on 5 December
2022).
124. Ward-Foxton, S. AI Startup Deepvision Raises Funds Preps Next Chip. EETimes. 15 September 2021. Available online:
https://www.eetasia.com/ai-startup-deep-vision-raises-funds-preps-next-chip/ (accessed on 5 December 2022).
125. Eta Compute. Micropower AI Vision Platform. Available online: https://etacompute.com/tensai-flow/ (accessed on
15 May 2023).
126. FlexLogic. Flexlogic Announces InferX High Performance IP for DSP and AI Inference. 24 April 2023. Available online:
https://flex-logix.com/inferx-ai/inferx-ai-hardware/ (accessed on 12 June 2023).
127. Edge TPU. Coral Technology. Available online: https://coral.ai/technology/ (accessed on 20 May 2022).
128. Coral. USB Accelerator. Available online: https://coral.ai/products/accelerator/ (accessed on 13 June 2022).
129. SolidRun. Janux GS31 AI Server. Available online: https://www.solid-run.com/embedded-networking/nxp-lx2160a-family/ai-
inference-server/ (accessed on 25 May 2022).
130. GreenWaves. GAP9 Processor for Hearables and Sensors. Available online: https://greenwaves-technologies.com/gap9
_processor/ (accessed on 18 June 2023).
131. Deleo. GreenWaves. GAP9. GreenWaves Unveils Groundbreaking Ultra-Low Power GAP9 IoT Application Processor for the Next
Wave of Intelligence at the Very Edge. Available online: https://greenwaves-technologies.com/gap9_iot_application_processor/
(accessed on 8 August 2023).
132. France, G. Design & Reuse, GreenWaves, GAP9. Available online: https://www.design-reuse.com/news/47305/greenwaves-iot-
processor.html (accessed on 7 July 2024).
133. Horizon, A.I. Efficient AI Computing for Automotive Intelligence. Available online: https://en.horizon.ai/ (accessed on 6
December 2022).
134. Horizon Robotics. Horizon Robotics and BYD Announce Cooperation on BYD’s BEV Perception Solution Powered by
Journey 5 Computing Solution at Shanghai Auto Show 2023. Cision PR Newswire. 19 April 2023. Available online:
https://www.prnewswire.com/news-releases/horizon-robotics-and-byd-announce-cooperation-on-byds-bev-perception-
solution-powered-by-journey-5-computing-solution-at-shanghai-auto-show-2023-301802072.html (accessed on 20 June 2023).
135. Zheng. Horizon Robotics’ AI Chip with up to 128 TOPS Computing Power Gets Key Certification. Cnevpost. 6 July 2021.
Available online: https://cnevpost.com/2021/07/06/horizon-robotics-ai-chip-with-up-to-128-tops-computing-power-gets-
key-certification/ (accessed on 16 June 2022).
136. Hailo. The World’s Top Performance AI Processor for Edge Devices. Available online: https://hailo.ai/ (accessed on
20 May 2023).
137. Brown. Hailo-8 NPU Ships on Linux-Powered Lanner Edge System. 1 June 2021. Available online: https://linuxgizmos.com/
hailo-8-npu-ships-on-linux-powered-lanner-edge-systems/ (accessed on 10 July 2022).
138. Rajendran, B.; Sebastian, A.; Schmuker, M.; Srinivasa, N.; Eleftheriou, E. Low-Power Neuromorphic Hardware for Signal
Processing Applications: A Review of Architectural and System-Level Design Approaches. IEEE Signal Process. Mag. 2019, 36,
97–110. [CrossRef]
139. Carmelito. Intel Neural Compute Stick 2-Review. Element14. 8 March 2021. Available online: https://community.element14.
com/products/roadtest/rv/roadtest_reviews/954/intel_neural_compute_3 (accessed on 24 March 2023).
140. Modha, D.S.; Akopyan, F.; Andreopoulos, A.; Appuswamy, R.; Arthur, J.V.; Cassidy, A.S.; Datta, P.; DeBole, M.V.; Esser, S.K.;
Otero, C.O.; et al. Neural inference at the frontier of energy, space, and time. Science 2023, 382, 329–335. [CrossRef]
141. Imagination. Power Series3NX, Advanced Compute and Neural Network Processors Enabling the Smart Edge. Available online:
https://www.imaginationtech.com/vision-ai/powervr-series3nx/ (accessed on 10 June 2022).
142. Har-Even, B. Separating the Wheat from the Chaff in Embedded AI with PowerVR Series3NX. 24 January 2019. Available online:
https://www.imaginationtech.com/blog/separating-the-wheat-from-the-chaff-in-embedded-ai/ (accessed on 25 July 2022).
143. Ueyoshi, K.; Papistas, I.A.; Houshmand, P.; Sarda, G.M.; Jain, V.; Shi, M.; Zheng, Q.; Giraldo, S.; Vrancx, P.; Doevenspeck, J.; et al.
DIANA: An End-to-End Energy-Efficient Digital and ANAlog Hybrid Neural Network SoC. In Proceedings of the 2022 IEEE
International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 1–3. [CrossRef]
144. Flaherty, N. Axelera Shows DIANA Analog In-Memory Computing Chip. EENews. 21 February 2022. Available online:
https://www.eenewseurope.com/en/axelera-shows-diana-analog-in-memory-computing-chip/ (accessed on 22 July 2023).
145. Imagination. The Ideal Single Core Solution for Neural Network Acceleration. Available online: https://www.imaginationtech.
com/product/img-4nx-mc1/ (accessed on 16 June 2022).
146. Memryx. Available online: https://memryx.com/products/ (accessed on 1 August 2023).
147. MobileEye. One Automatic Grade SoC, Many Mobility Solutions. Available online: https://www.mobileye.com/our-technology/
evolution-eyeq-chip/ (accessed on 4 August 2023).
148. EyeQ5. Wikichip. March 2021. Available online: https://en.wikichip.org/wiki/mobileye/eyeq/eyeq5 (accessed on 22 June 2023).
149. Casil, D. Mobileye Presents EyeQ Ultra, the Chip That Promises True Level 4 Autonomous Driving in 2025. 1 July 2022. Available
online: https://www.gearrice.com/update/mobileye-presents-eyeq-ultra-the-chip-that-promises-true-level-4-autonomous-
driving-in-2025/ (accessed on 5 June 2023).
150. MobileEye. Meet EyeQ6: Our Most Advanced Driver-Assistance Chip Yet. 25 May 2022. Available online: https://www.
mobileye.com/blog/eyeq6-system-on-chip/ (accessed on 27 May 2023).
151. Mediatek. i350. Mediatek Introduces i350 Edge AI Platform Designed for Voice and Vision Processing Applications. 14 October
2020. Available online: https://corp.mediatek.com/news-events/press-releases/mediatek-introduces-i350-edge-ai-platform-
designed-for-voice-and-vision-processing-applications (accessed on 16 May 2023).
152. Nvidia. Jetson Nano. Available online: https://elinux.org/Jetson_Nano#:~:text=Useful%20for%20deploying%20computer%20
vision,5-10W%20of%20power%20consumption (accessed on 26 May 2023).
153. Nvidia, Jetson Orin. The Future of Industrial-Grade Edge AI. Available online: https://www.nvidia.com/en-us/autonomous-
machines/embedded-systems/jetson-orin/ (accessed on 25 July 2023).
154. Perceive. Put High Power Intelligence in a Low Power Device. Available online: https://perceive.io/product/ergo/ (accessed on
16 May 2023).
155. Tan, Z.; Wu, Y.; Zhang, Y.; Shi, H.; Zhang, W.; Ma, K. A scalable multi-chiplet deep learning accelerator with hub-side 2.5D
heterogeneous integration. In Proceedings of the HotChip Conference 2023, Palo Alto, CA, USA, 27–29 August 2023.
156. Deleon, L. Build Enhanced Video Conference Experiences. Qualcom. 7 March 2023. Available online: https://developer.
qualcomm.com/blog/build-enhanced-video-conference-experiences (accessed on 5 May 2023).
157. Qualcomm, QCS8250. Premium Processor Designed to Help You Deliver Maximum Performance for Compute Intensive Camera,
Video Conferencing and Edge AI Applications with Support Wi-Fi 6 and 5G for the Internet of Things (IoT). Available online:
https://www.qualcomm.com/products/qcs8250 (accessed on 15 July 2023).
158. Snapdragon. 888+ 5G Mobile Platform. Available online: https://www.qualcomm.com/products/snapdragon-888-plus-5g-
mobile-platform (accessed on 24 May 2023).
159. Qualcomm. Qualcomm Snapdragon 888 Plus, Benchmark, Test and Spec. CPU Monkey. 16 June 2023. Available online:
https://www.cpu-monkey.com/en/cpu-qualcomm_snapdragon_888_plus (accessed on 15 July 2023).
160. Hsu. Training ML Models at the Edge with Federated Learning. Qualcomm 7 June 2021. Available online: https://developer.
qualcomm.com/blog/training-ml-models-edge-federated-learning (accessed on 7 July 2023).
161. Mahurin, E. Qualcomm Hexagon NPU. In Proceedings of the HotChip Conference 2023, Palo Alto, CA, USA, 27–29 August 2023.
162. Yida. Introducing the Rock Pi N10 RK3399Pro SBC for AI and Deep Learning. Available online: https://www.seeedstudio.com/
blog/2019/12/04/introducing-the-rock-pi-n10-rk3399pro-sbc-for-ai-and-deep-learning/ (accessed on 17 May 2023).
163. GadgetVersus. Amlogic A311D Processor Benchmarks and Specs. Available online: https://gadgetversus.com/processor/
amlogic-a311d-specs/ (accessed on 16 May 2023).
164. Samsung. The Core that Redefines Your Device. Available online: https://www.samsung.com/semiconductor/minisite/exynos/
products/all-processors/ (accessed on 25 May 2023).
165. GSMARENA. Exynos 2100 Vs Snapdragon 888: Benchmarking the Samsung Galaxy S21 Ultra Versions. GSMARENA. 7 February
2021. Available online: https://www.gsmarena.com/exynos_2100_vs_snapdragon_888_benchmarking_the_samsung_galaxy_
s21_ultra_performance-news-47611.php (accessed on 10 June 2023).
166. Samsung. Exynos 2200. Available online: https://semiconductor.samsung.com/us/processor/mobile-processor/exynos-2200/
(accessed on 1 June 2023).
167. Samsung. Samsung Brings PIM Technology to Wider Applications. 24 August 2021. Available online: https://www.
samsung.com/semiconductor/newsroom/news-events/samsung-brings-in-memory-processing-power-to-wider-range-of-
applications/ (accessed on 18 May 2023).
168. Kim, J.H.; Kang, S.-H.; Lee, S.; Kim, H.; Song, W.; Ro, Y.; Lee, S.; Wang, D.; Shin, H.; Phuah, B.; et al. Aquabolt-XL: Samsung HBM2-
PIM with in-memory processing for ML accelerators and beyond. In Proceedings of the 2021 IEEE Hot Chips 33 Symposium
(HCS), Palo Alto, CA, USA, 22–24 August 2021; pp. 1–26.
169. Dhruvanarayan, S.; Bittorf, V. MLSoCTM —An Overview. In Proceedings of the HotChips Conference 2023, Palo Alto, CA, USA,
27–29 August 2023.
170. SiMa.ai. Available online: https://sima.ai/ (accessed on 3 September 2023).
171. Synopsys. Designware ARC EV Processors for Embedded Vision. Available online: https://www.synopsys.com/designware-ip/
processor-solutions/ev-processors.html (accessed on 25 July 2022).
172. Synopsys. Synopsys EV7x Vision Processor. Available online: https://www.synopsys.com/dw/ipdir.php?ds=ev7x-vision-
processors (accessed on 25 May 2023).
173. Syntiant. Making Edge AI a Reality: A New Processor for Deep Learning. Available online: https://www.syntiant.com/
(accessed on 18 June 2023).
174. Syntiant. NDP100 Neural Decision Processor- NDP100- Always-on Speech Recognition. Available online: https://www.syntiant.
com/ndp100 (accessed on 28 June 2023).
175. Tyler, N. Syntiant Introduces NDP102 Neural Decision Processor. Newelectronics. 16 September 2021. Available online: https:
//www.newelectronics.co.uk/content/news/syntiant-introduces-ndp102-neural-decision-processor (accessed on 28 June 2023).
176. Demler, M. Syntiant NDP120 Sharpens Its Hearing, Wake-Word Detector Combines Ultra-Low Power DLA with HiFi 3 DSP. 2021.
Available online: https://www.linleygroup.com/mpr/article.php?id=12455 (accessed on 20 June 2023).
177. Halfacree, G. Syntiant’s NDP200 Promises 6.4GOP/s of Edge AI Compute in a Tiny 1mW Power Envelope. Hackster.io. 2021.
Available online: https://www.hackster.io/news/syntiant-s-ndp200-promises-6-4gop-s-of-edge-ai-compute-in-a-tiny-1mw-
power-envelope-96590283ffbc (accessed on 29 June 2023).
178. Think Silicon. Nema Pico XS. Available online: https://www.think-silicon.com/nema-pico-xs#features (accessed on 23
May 2023).
179. Wikichip. FSD Chip. Wikichip. Available online: https://en.wikichip.org/wiki/tesla_(car_company)/fsd_chip (accessed on 28
May 2023).
180. Kong, M. VeriSilicon VIP9000 NPU AI Processor and ZSPNano DSP IP bring AI-Vision and AI-Voice to Low Power Automotive
Image Processing SoC. VeriSilicon Press Release. 12 May 2020. Available online: https://www.verisilicon.com/en/PressRelease/
VIP9000andZSPAdoptedbyiCatch (accessed on 16 July 2022).
181. VeriSilicon. VeriSilicon Launches VIP9000, New Generation of Neural Processor Unit IP. VeriSilicon Press Release. 8 July 2019.
Available online: https://www.verisilicon.com/en/PressRelease/VIP9000 (accessed on 25 May 2022).
182. Untether. The Most Efficient AI Computer Engine Available. Available online: https://www.untether.ai/press-releases/untether-
ai-ushers-in-the-petaops-era-with-at-memory-computation-for-ai-inference-workloads (accessed on 18 May 2023).
183. Untether. Untether AI. Available online: https://www.colfax-intl.com/downloads/UntetherAI-tsunAImi-Product-Brief.pdf
(accessed on 18 May 2023).
184. Upmem. The PIM Reference Platform. Available online: https://www.upmem.com/technology/ (accessed on 19 May 2023).
185. Lavenier, D.; Cimadomo, R.; Jodin, R. Variant Calling Parallelization on Processor-in-Memory Architecture. In Proceedings of the
2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Republic of Korea, 16–19 December 2020;
pp. 204–207. [CrossRef]
186. Gómez-Luna, J.; El Hajj, I.; Fernandez, I.; Giannoula, C.; Oliveira, G.F.; Mutlu, O. Benchmarking Memory-Centric Computing
Systems: Analysis of Real Processing-in-Memory Hardware. arXiv 2021, arXiv:2110.01709.
187. Ian Cutress. Hot Chips 31 Analysis: In Memory Processing by Upmem. Anandtech. 18 August 2019. Available online:
https://www.anandtech.com/show/14750/hot-chips-31-analysis-inmemory-processing-by-upmem (accessed on 20 May 2023).
188. Mo, H.; Zhu, W.; Hu, W.; Wang, G.; Li, Q.; Li, A.; Yin, S.; Wei, S.; Liu, L. 9.2 A 28nm 12.1TOPS/W Dual-Mode CNN Processor Using
Effective-Weight-Based Convolution and Error-Compensation-Based Prediction. In Proceedings of the 2021 IEEE International
Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; pp. 146–148. [CrossRef]
189. Yin, S.; Zhang, B.; Kim, M.; Saikia, J.; Kwon, S.; Myung, S.; Kim, H.; Kim, S.J.; Seok, M.; Seo, J.S. PIMCA: A 3.4-Mb Programmable
In-Memory Computing Accelerator in 28nm for On-Chip DNN Inference. In Proceedings of the 2021 Symposium on VLSI
Circuits, Kyoto, Japan, 13–19 June 2021; pp. 1–2. [CrossRef]
190. Fujiwara, H.; Mori, H.; Zhao, W.C.; Chuang, M.C.; Naous, R.; Chuang, C.K.; Hashizume, T.; Sun, D.; Lee, C.F.; Akarvardar, K.; et al.
A 5-nm 254-TOPS/W 221-TOPS/mm2 Fully Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-
Frequency Scaling and Simultaneous MAC and Write Operations. In Proceedings of the 2022 IEEE International Solid-State
Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 1–3. [CrossRef]
191. Wang, S.; Kanwar, P. BFloat16: The Secret to High Performance on Cloud TPUs. August 2019. Available online: https://cloud.google.
com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus (accessed on 18 September 2022).
192. Lee, S.; Kim, K.; Oh, S.; Park, J.; Hong, G.; Ka, D.; Hwang, K.; Park, J.; Kang, K.; Kim, J.; et al. A 1ynm 1.25V 8Gb, 16Gb/s/pin
GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning
Applications. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA,
20–26 February 2022; pp. 1–3. [CrossRef]
193. Demer, M. Blaize Ignites Edge-AI Performance, Microprocessor Report. September 2020. Available online: https://www.blaize.
com/wp-content/uploads/2020/09/Blaize-Ignites-Edge-AI-Performance.pdf (accessed on 1 June 2023).
194. Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey.
Neurocomputing 2021, 461, 370–403. [CrossRef]
195. Mahdi, B.M.; Ghatee, M. A systematic review on overfitting control in shallow and deep neural networks. Artif. Intell. Rev. 2021,
54, 6391–6438. [CrossRef]
196. Yang, H.; Kang, G.; Dong, X.; Fu, Y.; Yang, Y. Soft filter pruning for accelerating deep convolutional neural networks. arXiv 2018,
arXiv:1808.06866.
197. Torsten, H.; Alistarh, D.; Ben-Nun, T.; Dryden, N.; Peste, A. Sparsity in Deep Learning: Pruning and growth for efficient inference
and training in neural networks. arXiv 2021, arXiv:2102.00554.
198. Sanh, V.; Wolf, T.; Rush, A. Movement pruning: Adaptive sparsity by fine-tuning. arXiv 2020, arXiv:2005.07683.
199. Cristian, B.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 535–541. [CrossRef]
200. Jianping, G.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [CrossRef]
201. Kim, Y.; Rush, A.M. Sequence-level knowledge distillation. arXiv 2016, arXiv:1606.07947.
202. Zeyuan, Z.; Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv 2023,
arXiv:2012.09816.
203. Huang, M.; You, Y.; Chen, Z.; Qian, Y.; Yu, K. Knowledge Distillation for Sequence Model. In Proceedings of the Interspeech,
Hyderabad, India, 2–6 September 2018; pp. 3703–3707. [CrossRef]
204. Hyun, C.J.; Hariharan, B. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4794–4802.
205. Tambe, T.; Hooper, C.; Pentecost, L.; Jia, T.; Yang, E.Y.; Donato, M.; Sanh, V.; Whatmough, P.; Rush, A.M.; Brooks, D.; et al.
EdgeBERT: Optimizing On-chip inference for multi-task NLP. arXiv 2020, arXiv:2011.14203.
206. Tensorflow. An End-to-End Open-Source Machine Learning Platform. Available online: https://www.tensorflow.org/ (accessed
on 1 May 2023).
207. Li, S. TensorFlow Lite: On-Device Machine Learning Framework. J. Comput. Res. Dev. 2020, 57, 1839–1853. [CrossRef]
208. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch:
An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037.
209. Pytorch, Pytorch Mobile. End to End Workflow from Training to Deployment for iOS and Android Mobile Devices. Available
online: https://pytorch.org/mobile/home/ (accessed on 20 December 2022).
210. Keras. Keras API References. Available online: https://keras.io/api/ (accessed on 20 December 2022).
211. Caffe2. A New Lightweight, Modular, and Scalable Deep Learning Framework. Available online: https://research.facebook.
com/downloads/caffe2/ (accessed on 21 December 2022).
212. Zelinsky, A. Learning OpenCV—Computer Vision with the OpenCV Library (Bradski, G.R. et al.; 2008) [On the Shelf]. IEEE
Robot. Autom. Mag. 2009, 16, 100. [CrossRef]
213. ONNX. Open Neural Network Exchange-the Open Standard for Machine Learning Interoperability. Available online: https://onnx.ai/
(accessed on 22 December 2022).
214. MXNet. A Flexible and Efficient Library for Deep Learning. Available online: https://mxnet.apache.org/versions/1.9.0/
(accessed on 22 December 2022).
215. ONNX. Meta AI. Available online: https://ai.facebook.com/tools/onnx/ (accessed on 23 December 2022).
216. Vajda, P.; Jia, Y. Delivering Real-Time AI in the Palm of Your Hand. Available online: https://engineering.fb.com/2016/11/08
/android/delivering-real-time-ai-in-the-palm-of-your-hand/ (accessed on 27 December 2022).
217. CEVA. CEVA NeuPro-S On-Device Computer Vision Processor Architecture. September 2020. Available online: https://www.
ceva-dsp.com/wpcontent/uploads/2020/11/09_11_20_NeuPro-S_Brochure_V2.pdf (accessed on 17 July 2022).
218. Merolla, P.A.; Arthur, J.V.; Alvarez-Icaza, R.; Cassidy, A.S.; Sawada, J.; Akopyan, F.; Jackson, B.L.; Imam, N.; Guo, C.;
Nakamura, Y.; et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science
2014, 345, 668–673. [CrossRef]
219. Yakopcic, C.; Rahman, N.; Atahary, T.; Taha, T.M.; Douglass, S. Solving Constraint Satisfaction Problems Using the Loihi Spiking
Neuromorphic Processor. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE),
Grenoble, France, 9–13 March 2020; pp. 1079–1084. [CrossRef]
220. Bohnstingl, T. Neuromorphic Hardware Learns to Learn. Front. Neurosci. 2019, 13, 483. [CrossRef] [PubMed]
221. Shrestha, S.B.; Orchard, G. Slayer: Spike layer error reassignment in time. arXiv 2018, arXiv:1810.08646.
222. Davidson, S.; Furber, S.B. Comparison of Artificial and Spiking Neural Networks on Digital Hardware. Front. Neurosci. 2021, 15,
345. [CrossRef]
223. Blouw, P.; Choo, X.; Hunsberger, E.; Eliasmith, C. Benchmarking keyword spotting efficiency on neuromorphic hardware. In
Proceedings of the 7th Annual Neuro-inspired Computational Elements Workshop, Albany, NY, USA, 26–28 March 2019; pp. 1–8.
[CrossRef]
224. NengoLoihi. Available online: https://www.nengo.ai/nengo-loihi/ (accessed on 20 November 2022).
225. Nengo. Spinnaker backend for Nengo. Available online: https://nengo-spinnaker.readthedocs.io/en/latest/ (accessed on 20
November 2022).
226. NengoDL. Available online: https://www.nengo.ai/nengo-dl/ (accessed on 20 November 2022).
227. Brainchip. MetaTF. Available online: https://brainchip.com/metatf-development-environment/ (accessed on 10 July 2023).
228. Demer, M. Brainchip Akida Is a Fast Learner. Microprocessor Report, Linley Group. 28 October 2019. Available online:
https://d1io3yog0oux5.cloudfront.net/brainchipinc/files/BrainChip+Akida+Is+a+Fast+Learner.pdf (accessed on 12 July 2023).
229. Lava. Lava Software Framework. Available online: https://lava-nc.org/ (accessed on 26 November 2022).
230. Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J. AI and ML Accelerator Survey and Trends. In Proceedings
of the 2022 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 19–23 September 2022; pp.
1–10. [CrossRef]
231. Chen, Y.; Xie, Y.; Song, L.; Chen, F.; Tang, T. A Survey of Accelerator Architectures for Deep Neural Networks. Engineering 2020, 6,
264–274. [CrossRef]
232. Li, W.; Liewig, M. A survey of AI accelerators for edge environments. In Proceedings of the World Conference on Information
Systems and Technologies, Budva, Montenegro, 7–10 April 2020; Springer: Cham, Switzerland, 2020; pp. 35–44. [CrossRef]
233. Murshed, M.S.; Murphy, C.; Hou, D.; Khan, N.; Ananthanarayanan, G.; Hussain, F. Machine Learning at the Network Edge: A
Survey. ACM Comput. Surv. 2021, 54, 1–37. [CrossRef]
234. Lin, W.; Adetomi, A.; Arslan, T. Low-Power Ultra-Small Edge AI Accelerators for Image Recognition with Convolution Neural
Networks: Analysis and Future Directions. Electronics 2021, 10, 2048. [CrossRef]
235. Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J. Survey of Machine Learning Accelerators. In Proceedings
of the 2020 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 22–24 September 2020; pp.
1–12. [CrossRef]
236. Xue, C.-X.; Hung, J.M.; Kao, H.Y.; Huang, Y.H.; Huang, S.P.; Chang, F.C.; Chen, P.; Liu, T.W.; Jhang, C.J.; Su, C.I.; et al. 16.1 A 22nm
4Mb 8b-Precision ReRAM Computing-in-Memory Macro with 11.91 to 195.7TOPS/W for Tiny AI Edge Devices. In Proceedings of
the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; pp. 245–247.
[CrossRef]
237. Chih, Y.-D.; Lee, P.H.; Fujiwara, H.; Shih, Y.C.; Lee, C.F.; Naous, R.; Chen, Y.L.; Lo, C.P.; Lu, C.H.; Mori, H.; et al. 16.4 An
89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In-Memory Macro in 22nm for Machine-Learning Edge Applications. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco,
CA, USA, 13–22 February 2021; pp. 252–254. [CrossRef]
238. Dong, Q.; Sinangil, M.E.; Erbagci, B.; Sun, D.; Khwa, W.S.; Liao, H.J.; Wang, Y.; Chang, J. 15.3 A 351TOPS/W and 372.4GOPS
Compute-in-Memory SRAM Macro in 7nm FinFET CMOS for Machine-Learning Applications. In Proceedings of the 2020 IEEE
International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 242–244. [CrossRef]
239. Yuan, G.; Behnam, P.; Li, Z.; Shafiee, A.; Lin, S.; Ma, X.; Liu, H.; Qian, X.; Bojnordi, M.N.; Wang, Y.; et al. FORMS: Fine-grained
Polarized ReRAM-based In-situ Computation for Mixed-signal DNN Accelerator. In Proceedings of the 2021 ACM/IEEE 48th
Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–19 June 2021; pp. 265–278. [CrossRef]
240. Khaddam-Aljameh, R.; Stanisavljevic, M.; Mas, J.F.; Karunaratne, G.; Brandli, M.; Liu, F.; Singh, A.; Muller, S.M.; Petropoulos, A.;
Antonakopoulos, T.; et al. HERMES Core—A 14nm CMOS and PCM-based In-Memory Compute Core using an array of
300ps/LSB Linearized CCO-based ADCs and local digital processing. In Proceedings of the 2021 Symposium on VLSI Technology,
Kyoto, Japan, 13–19 June 2021; pp. 1–2.
241. Caminal, H.; Yang, K.; Srinivasa, S.; Ramanathan, A.K.; Al-Hawaj, K.; Wu, T.; Narayanan, V.; Batten, C.; Martínez, J.F. CAPE:
A Content-Addressable Processing Engine. In Proceedings of the 2021 IEEE International Symposium on High-Performance
Computer Architecture (HPCA), Seoul, Republic of Korea, 27 February–3 March 2021; pp. 557–569. [CrossRef]
242. Park, S.; Park, C.; Kwon, S.; Jeon, T.; Kang, Y.; Lee, H.; Lee, D.; Kim, J.; Kim, H.S.; Lee, Y.; et al. A Multi-Mode 8K-MAC HW-
Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC. In Proceedings of
the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 246–248.
[CrossRef]
243. Zhu, H.; Jiao, B.; Zhang, J.; Jia, X.; Wang, Y.; Guan, T.; Wang, S.; Niu, D.; Zheng, H.; Chen, C.; et al. COMB-MCM: Computing-on-
Memory-Boundary NN Processor with Bipolar Bitwise Sparsity Optimization for Scalable Multi-Chiplet-Module Edge Machine
Learning. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26
February 2022; pp. 1–3. [CrossRef]
244. Niu, D.; Li, S.; Wang, Y.; Han, W.; Zhang, Z.; Guan, Y.; Guan, T.; Sun, F.; Xue, F.; Duan, L.; et al. 184QPS/W 64Mb/mm2 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 1–3. [CrossRef]
245. Chiu, Y.-C.; Yang, C.S.; Teng, S.H.; Huang, H.Y.; Chang, F.C.; Wu, Y.; Chien, Y.A.; Hsieh, F.L.; Li, C.Y.; Lin, G.Y.; et al. A 22nm
4Mb STT-MRAM Data-Encrypted Near-Memory Computation Macro with a 192GB/s Read-and-Decryption Bandwidth and
25.1–55.1TOPS/W 8b MAC for AI Operations. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference
(ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 178–180. [CrossRef]
246. Khwa, W.-S.; Chiu, Y.C.; Jhang, C.J.; Huang, S.P.; Lee, C.Y.; Wen, T.H.; Chang, F.C.; Yu, S.M.; Lee, T.Y.; Chang, M.F. 11.3 A 40-nm,
2M-Cell, 8b-Precision, Hybrid SLC-MLC PCM Computing-in-Memory Macro with 20.5–65.0TOPS/W for Tiny-AI Edge Devices. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February
2022; pp. 1–3. [CrossRef]
247. Spetalnick, S.D.; Chang, M.; Crafton, B.; Khwa, W.S.; Chih, Y.D.; Chang, M.F.; Raychowdhury, A. A 40nm 64kb 26.56TOPS/W
2.37Mb/mm2 RRAM Binary/Compute-in-Memory Macro with 4.23× Improvement in Density and >75% Use of Sensing Dynamic Range. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26
February 2022; pp. 1–3. [CrossRef]
248. Chang, M.; Spetalnick, S.D.; Crafton, B.; Khwa, W.S.; Chih, Y.D.; Chang, M.F.; Raychowdhury, A. A 40nm 60.64TOPS/W
ECC-Capable Compute-in-Memory/Digital 2.25MB/768KB RRAM/SRAM System with Embedded Cortex-M3 Microprocessor for Edge Recommendation Systems. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC),
San Francisco, CA, USA, 20–26 February 2022; pp. 1–3. [CrossRef]
249. Wang, D.; Lin, C.T.; Chen, G.K.; Knag, P.; Krishnamurthy, R.K.; Seok, M. DIMC: 2219TOPS/W 2569F2/b Digital In-Memory
Computing Macro in 28nm Based on Approximate Arithmetic Hardware. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 266–268. [CrossRef]
250. Yue, J.; Feng, X.; He, Y.; Huang, Y.; Wang, Y.; Yuan, Z.; Zhan, M.; Liu, J.; Su, J.W.; Chung, Y.L.; et al. 15.2 A 2.75-to-75.9TOPS/W
Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero Skipping and Ping-Pong CIM with Simultaneous
Computation and Weight Updating. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San
Francisco, CA, USA, 13–22 February 2021; pp. 238–240. [CrossRef]
251. Yue, J.; Yuan, Z.; Feng, X.; He, Y.; Zhang, Z.; Si, X.; Liu, R.; Chang, M.F.; Li, X.; Yang, H.; et al. 14.3 A 65nm Computing-in-
Memory-Based CNN Processor with 2.9-to-35.8TOPS/W System Energy Efficiency Using Dynamic-Sparsity Performance-Scaling
Architecture and Energy-Efficient Inter/Intra-Macro Data Reuse. In Proceedings of the 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 234–236. [CrossRef]
252. Wang, Y.; Qin, Y.; Deng, D.; Wei, J.; Zhou, Y.; Fan, Y.; Chen, T.; Sun, H.; Liu, L.; Wei, S.; et al. A 28nm 27.5TOPS/W Approximate-
Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing. In Proceedings
of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 1–3.
[CrossRef]
253. Matsubara, K.; Lieske, H.; Kimura, M.; Nakamura, A.; Koike, M.; Morikawa, S.; Hotta, Y.; Irita, T.; Mochizuki, S.;
Hamasaki, H.; et al. 4.2 A 12nm Autonomous-Driving Processor with 60.4TOPS, 13.8TOPS/W CNN Executed by Task-
Separated ASIL D Control. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco,
CA, USA, 13–22 February 2021; pp. 56–58. [CrossRef]
254. Agrawal, A.; Lee, S.K.; Silberman, J.; Ziegler, M.; Kang, M.; Venkataramani, S.; Cao, N.; Fleischer, B.; Guillorn, M.; Cohen, M.; et al.
9.1 A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling. In
Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February
2021; pp. 144–146. [CrossRef]
255. Park, J.-S.; Jang, J.W.; Lee, H.; Lee, D.; Lee, S.; Jung, H.; Lee, S.; Kwon, S.; Jeong, K.; Song, J.H.; et al. 9.5 A 6K-MAC Feature-
Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; pp. 152–154. [CrossRef]
256. Eki, R.; Yamada, S.; Ozawa, H.; Kai, H.; Okuike, K.; Gowtham, H.; Nakanishi, H.; Almog, E.; Livne, Y.; Yuval, G.; et al. 9.6 A
1/2.3inch 12.3Mpixel with On-Chip 4.97TOPS/W CNN Processor Back-Illuminated Stacked CMOS Image Sensor. In Proceedings
of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; pp. 154–156.
[CrossRef]
257. Lin, C.-H.; Cheng, C.C.; Tsai, Y.M.; Hung, S.J.; Kuo, Y.T.; Wang, P.H.; Tsung, P.K.; Hsu, J.Y.; Lai, W.C.; Liu, C.H.; et al. 7.1 A
3.4-to-13.3TOPS/W 3.6TOPS Dual-Core Deep-Learning Accelerator for Versatile AI Applications in 7nm 5G Smartphone SoC. In
Proceedings of the 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 16–20 February
2020; pp. 134–136. [CrossRef]
258. Huang, W.-H.; Wen, T.H.; Hung, J.M.; Khwa, W.S.; Lo, Y.C.; Jhang, C.J.; Hsu, H.H.; Chin, Y.H.; Chen, Y.C.; Lo, C.C.; et al. A
Nonvolatile AI-Edge Processor with 4MB SLC-MLC Hybrid-Mode ReRAM Compute-in-Memory Macro and 51.4–251TOPS/W. In Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 19–23 February
2023; pp. 15–17. [CrossRef]
259. Tambe, T.; Zhang, J.; Hooper, C.; Jia, T.; Whatmough, P.N.; Zuckerman, J.; Dos Santos, M.C.; Loscalzo, E.J.; Giri, D.;
Shepard, K.; et al. 22.9 A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision
Predication and Fine-Grained Power Management. In Proceedings of the 2023 IEEE International Solid-State Circuits Conference
(ISSCC), San Francisco, CA, USA, 19–23 February 2023; pp. 342–344. [CrossRef]
260. Chiu, Y.-C.; Khwa, W.S.; Li, C.Y.; Hsieh, F.L.; Chien, Y.A.; Lin, G.Y.; Chen, P.J.; Pan, T.H.; You, D.Q.; Chen, F.Y.; et al. A 22nm 8Mb
STT-MRAM Near-Memory-Computing Macro with 8b-Precision and 46.4-160.1TOPS/W for Edge-AI Devices. In Proceedings of
the 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 19–23 February 2023; pp. 496–498.
[CrossRef]
261. Desoli, G.; Chawla, N.; Boesch, T.; Avodhyawasi, M.; Rawat, H.; Chawla, H.; Abhijith, V.S.; Zambotti, P.; Sharma, A.;
Cappetta, C.; et al. 16.7 A 40-310TOPS/W SRAM-Based All-Digital Up to 4b In-Memory Computing Multi-Tiled NN Accelerator
in FD-SOI 18nm for Deep-Learning Edge Applications. In Proceedings of the 2023 IEEE International Solid-State Circuits
Conference (ISSCC), San Francisco, CA, USA, 19–23 February 2023; pp. 260–262. [CrossRef]
262. Shih, M.-E.; Hsieh, S.-W.; Tsai, P.-Y.; Lin, M.-H.; Tsung, P.-K.; Chang, E.-J.; Liang, J.; Chang, S.-H.; Nian, Y.-Y.; Wan, Z.; et al.
NVE: A 3nm 23.2TOPS/W 12b-Digital-CIM-Based Neural Engine for High Resolution Visual-Quality Enhancement on Smart
Devices. In Proceedings of the 2024 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA,
18–22 February 2024.
263. Khwa, W.-S.; Wu, P.-C.; Wu, J.-J.; Su, J.-W.; Chen, H.-Y.; Ke, Z.-E.; Chiu, T.-C.; Hsu, J.-M.; Cheng, C.-Y.; Chen, Y.-C.; et al.
A 16nm 96Kb Integer/Floating-Point Dual-Mode Gain-Cell Computing-in-Memory Macro Achieving 73.3–163.3TOPS/W and 33.2–91.2TFLOPS/W for AI-Edge Devices. In Proceedings of the 2024 IEEE International Solid-State Circuits Conference (ISSCC),
San Francisco, CA, USA, 18–22 February 2024.
264. Nose, K.; Fujii, T.; Togawa, K.; Okumura, S.; Mikami, K.; Hayashi, D.; Tanaka, T.; Toi, T. A 23.9TOPS/W @ 0.8V, 130TOPS AI Accelerator with 16× Performance-Accelerable Pruning in 14nm Heterogeneous Embedded MPU for Real-Time Robot Applications. In Proceedings of the 2024 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA,
18–22 February 2024.
265. Apple. Press Release. Apple Unveils M2, Taking the Breakthrough Performance and Capabilities of M1 Even Further. 6 June
2022. Available online: https://www.apple.com/newsroom/2022/06/apple-unveils-m2-with-breakthrough-performance-and-
capabilities/ (accessed on 10 July 2023).
266. Dahad, N. Hardware Inference Chip Targets Automotive Applications. 24 December 2019. Available online: https://www.
embedded.com/hardware-inference-chip-targets-automotive-applications/ (accessed on 25 June 2022).
267. Jouppi, N.P.; Yoon, D.H.; Kurian, G.; Li, S.; Patil, N.; Laudon, J.; Young, C.; Patterson, D. A domain-specific supercomputer for
training deep neural networks. Commun. ACM 2020, 63, 67–78. [CrossRef]
268. Google. How Google Tensor Powers Up Pixel Phones. Available online: https://store.google.com/intl/en/ideas/articles/
google-tensor-pixel-smartphone/ (accessed on 6 July 2022).
269. Wikichip. Intel Nervana, Neural Network Processor (NNP). Available online: https://en.wikichip.org/wiki/nervana/nnp
(accessed on 14 July 2023).
270. Smith, L. 4th Gen Intel Xeon Scalable Processors Launched. StorageReview. 10 January 2023. Available online: https://www.
storagereview.com/news/4th-gen-intel-xeon-scalable-processors-launched (accessed on 12 May 2023).
271. Burns, J.; Chang, L. Meet the IBM Artificial Intelligence Unit. 18 October 2022. Available online: https://research.ibm.com/blog/
ibm-artificial-intelligence-unit-aiu (accessed on 16 December 2022).
272. Gupta, K. IBM Research Introduces Artificial Intelligence Unit (AIU): Its First Complete System-on-Chip Designed to Run and Train Deep Learning Models Faster and More Efficiently than a General-Purpose CPU. MarkTechPost. 27 October 2022. Available
online: https://www.marktechpost.com/2022/10/27/ibm-research-introduces-artificial-intelligence-unit-aiu-its-first-complete-
system-on-chip-designed-to-run-and-train-deep-learning-models-faster-and-more-efficiently-than-a-general-purpose-cpu/ (ac-
cessed on 20 December 2022).
273. Clarke, P. Startup Launches Near-Binary Neural Network Accelerator. EENews. 19 May 2020. Available online: https://www.
eenewseurope.com/en/startup-launches-near-binary-neural-network-accelerator/ (accessed on 20 December 2022).
274. NVIDIA Jetson Nano B01. Deep Learning with Raspberry Pi and Alternatives. 5 April 2023. Available online: https://qengineering.
eu/deep-learning-with-raspberry-pi-and-alternatives.html#Compare_Jetson (accessed on 3 July 2023).
275. Ambarella. Available online: https://www.ambarella.com/products/iot-industrial-robotics/ (accessed on 5 March 2024).
276. Research and Markets. Neuromorphic Chips: Global Strategic Business Report. Research and Markets, ID: 4805280. Available
online: https://www.researchandmarkets.com/reports/4805280/neuromorphic-chips-global-strategic-business (accessed on 16
May 2023).
277. GrAI VIP. Life Ready AI Processors. Available online: https://www.graimatterlabs.ai/product (accessed on 16 July 2023).
278. Cassidy, A.S.; Alvarez-Icaza, R.; Akopyan, F.; Sawada, J.; Arthur, J.V.; Merolla, P.A.; Datta, P.; Tallada, M.G.; Taba, B.;
Andreopoulos, A.; et al. Real-Time Scalable Cortical Computing at 46 Giga-Synaptic OPS/Watt with ~100× Speedup in
Time-to-Solution and ~100,000× Reduction in Energy-to-Solution. In Proceedings of SC '14: The International
Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, USA, 16–21 November
2014; pp. 27–38. [CrossRef]
279. Ward-Foxton, S. Innatera Unveils Neuromorphic AI Chip to Accelerate Spiking Networks. EETimes. 7 July 2021. Available online:
https://www.linleygroup.com/newsletters/newsletter_detail.php?num=6302&year=2021&tag=3 (accessed on 25 May 2023).
280. Aufranc, J.-L. Innatera Neuromorphic AI Accelerator for Spiking Neural Networks Enables Sub-mW AI Inference. CNX Software-
Embedded Systems News. 16 July 2021. Available online: https://www.cnx-software.com/2021/07/16/innatera-neuromorphic-
ai-accelerator-for-spiking-neural-networks-snn-enables-sub-mw-ai-inference/ (accessed on 25 May 2023).
281. Yousefzadeh, A.; Van Schaik, G.J.; Tahghighi, M.; Detterer, P.; Traferro, S.; Hijdra, M.; Stuijt, J.; Corradi, F.; Sifalakis, M.;
Konijnenburg, M. SENeCA: Scalable energy-efficient neuromorphic computer architecture. In Proceedings of the 2022 IEEE 4th
International Conference on Artificial Intelligence Circuits and Systems (AICAS), Incheon, Republic of Korea, 13–15 June 2022;
pp. 371–374.
282. Konikore. Technology That Sniffs Out Danger. Available online: https://theindexproject.org/post/konikore (accessed on
26 May 2023).
283. Syntiant. NDP200 Neural Decision Processor, NDP200 Always-on Vision, Sensor and Speech Recognition. Available online:
https://www.syntiant.com/ndp200 (accessed on 28 June 2023).
284. Demler, M. Syntiant Knows All the Best Words, NDP10x Speech-Recognition Processors Consume Just 200uW. Microprocessor Report. 2019. Available online: https://www.syntiant.com/post/syntiant-knows-all-the-best-words (accessed on 29 June 2023).
285. MemComputing. MEMCPU. Available online: https://www.memcpu.com/ (accessed on 1 July 2023).
286. IniLabs. Available online: https://inilabs.com/ (accessed on 1 July 2023).
287. Tavanaei, A.; Ghodrati, M.; Kheradpisheh, S.R.; Masquelier, T.; Maida, A. Deep learning in spiking neural networks. Neural Netw.
2019, 111, 47–63. [CrossRef] [PubMed]
288. Amazon. Coral Edge TPU: USB Edge TPU ML Accelerator Coprocessor for Raspberry Pi and Other Embedded Single Board Computers. Available online: https://www.amazon.com/Google-Coral-Accelerator-coprocessor-Raspberry/dp/B07R53D12W (accessed on 2 July 2024).
289. Shakir, U. Tesla Slashes Full Self-Driving Price after Elon Musk Said It Would Only Get More Expensive. 22 April 2024. Available
online: https://www.theverge.com/2024/4/22/24137056/tesla-full-self-driving-fsd-price-cut-8000 (accessed on 5 July 2024).
290. Amazon. NVIDIA Jetson AGX Orin. NVIDIA Jetson AGX Orin 64GB Developer Kit. Available online: https://www.amazon.
com/NVIDIA-Jetson-Orin-64GB-Developer/dp/B0BYGB3WV4?th=1 (accessed on 8 July 2024).
291. Dally, B. Hardware for Deep Learning. NVIDIA Corporation. In Proceedings of the HotChips Conference 2023, Palo Alto, CA, USA, 27–29 August 2023.
292. Kim, J.H.; Ro, Y.; So, J.; Lee, S.; Kang, S.-H.; Cho, Y.; Kim, H.; Kim, B.; Kim, K.; Park, S.; et al. Samsung PIM/PNM for Transformer
based AI: Energy Efficiency on PIM/PNM Cluster. In Proceedings of the HotChips Conference, Palo Alto, CA, USA, 27–29
August 2023.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.