
DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration

Ahmed J. Abdelmaksoud, Shady Agwa and Themis Prodromakis
arXiv:2412.09709v1 [cs.AR] 12 Dec 2024

Abstract—Transformers are gaining increasing attention across Natural Language Processing (NLP) application domains due to their outstanding accuracy. However, these data-intensive models add significant performance demands to the existing computing architectures. Systolic arrays are spatial architectures that have been adopted by commercial AI computing platforms (like Google TPUs), due to their energy-efficient approach of data reusability. However, these spatial architectures face a penalty in throughput and energy efficiency due to the need for input and output synchronization using First-In-First-Out (FIFO) buffers. This paper proposes a novel scalable systolic-array architecture featuring Diagonal-Input and Permutated weight-stationary (DiP) dataflow for the acceleration of matrix multiplication. The proposed architecture eliminates the synchronization FIFOs required by state-of-the-art weight stationary systolic arrays. Aside from the area, power, and energy savings achieved by eliminating these FIFOs, the DiP architecture maximizes the computational resources (PEs) utilization. Thus, it outperforms the weight-stationary counterparts in terms of throughput by up to 50%. Analytical models are developed for both weight stationary and DiP architectures, including latency, throughput, time to full PEs utilization, and FIFOs overhead. Additionally, a comprehensive hardware design space exploration is demonstrated using commercial 22nm technology, highlighting the scalability advantages of DiP over the conventional approach across various dimensions, where DiP offers an improvement of energy efficiency per area up to 2.02x. Furthermore, DiP is evaluated using various transformer workloads from widely-used models, consistently outperforming TPU-like architectures, achieving energy improvements of up to 1.81x and latency improvements of up to 1.49x across a range of transformer workloads. At a 64x64 size with 4096 PEs, DiP achieves a peak performance of 8.2 TOPS with an energy efficiency of 9.55 TOPS/W.

Index Terms—Hardware Acceleration, Systolic Arrays, Spatial Architectures, Matrix Multiplication, Weight Stationary.

This work was supported by the EPSRC FORTE Programme (Grant No. EP/R024642/2) and by the RAEng Chair in Emerging Technologies (Grant No. CiET1819/2/93). Ahmed J. Abdelmaksoud, S. Agwa and T. Prodromakis are with the Centre for Electronics Frontiers, Institute for Integrated Micro and Nano Systems, School of Engineering, The University of Edinburgh, EH9 3BF, Edinburgh, United Kingdom (e-mails: a.j.abdelmaksoud@ed.ac.uk; shady.agwa@ed.ac.uk; t.prodromakis@ed.ac.uk).

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I. INTRODUCTION

ARTIFICIAL Intelligence (AI) is continuously dominating various application domains that are vital in our daily life [1]. Natural Language Processing (NLP) is one of the emerging AI applications that are gaining increasing attention nowadays [2, 3]. Thanks to Transformers, NLP tasks have been revolutionized by providing highly effective and scalable models for language understanding and generation [4]. The exceptional capabilities of these new NLP models have led to a transformer-driven transformation across numerous application domains, including Machine Translation [5], Speech Recognition [6], Multimodal Applications [7], and Computer Vision [8].

However, transformers are data-intensive models that handle massive workloads in comparison to Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) [9]. Additionally, transformer models have been growing exponentially, evolving from the original vanilla transformer model with around 65 million parameters to models with hundreds of billions of parameters [10, 11]. A clear example of the new challenging level of scalability is the GPT (Generative Pre-trained Transformer) model (the core of ChatGPT), which incorporates billions of parameters, primarily involving matrix multiplications [12].

Conventional Von-Neumann architectures are struggling to meet these increasing performance demands due to the memory/data-movement bottleneck. Systolic arrays, introduced in the 1970s, are spatial architectures that aim at maximizing data utilization to mitigate the memory/data-movement bottleneck. These architectures are receiving increased attention nowadays as a promising architecture for AI hardware acceleration [13]. Usually, the systolic array consists of a set of two-dimensional (2D) interconnected Processing Elements (PEs). The PE is composed of basic arithmetic, mainly multiplication and accumulation, along with register units. Systolic arrays are spatial architectures that enhance local data utilization by increasing the number of computation operations per memory access. Accordingly, data flows among different PEs in a wave fashion, while the communication with the synchronization First-In-First-Out (FIFO) buffers occurs only at the boundary PEs. Moreover, the interconnection of the systolic array naturally realizes data reuse through the interchange of data via PEs, especially in matrix multiplications [14].

Although many systolic arrays are extensively used for hardware accelerators [15–21], different adopted dataflows require further synchronization interfacing hardware that limits not only the energy efficiency, but also the performance capabilities of these dataflows. Therefore, transformers with massive matrix multiplication workloads will have serious challenges to leverage the systolic array scalability for a next generation of AI hardware.

The Tensor Processing Unit (TPU) is one of the well-known AI computing architectures, introduced by Google to handle massive matrix multiplication workloads with higher performance and energy efficiency than CPUs and GPUs [22].
TPUs adopt weight stationary dataflow, which maximizes the data utilization of both weights and inputs. The first-generation TPU (TPU v1) is designed primarily for inference, featuring a 256x256 systolic array optimized for 8-bit integer (INT8) operations, achieving a peak throughput of 92 TOPS [23]. TPU v2 shifted to mixed-precision training with a smaller 128x128 systolic array per core optimized for FP16 and bfloat16 operations, boosting throughput to 180 TeraFLOPS [24]. TPU v3 and v4 maintained the same array size, but v3 doubled the throughput to 420 TFLOPS per chip, aided by high memory bandwidth, while TPU v4 uses four cores of 128x128 architecture, achieving up to 1 PFLOPS [25]. In the conventional WS systolic array, synchronization FIFOs are necessary to synchronize both inputs and outputs, adding significant overhead in terms of throughput, energy, latency, power, and area. Moreover, the computations propagate as a diagonal wavefront from the top-left corner to the bottom-right corner of the systolic array due to the WS dataflow. This significantly decreases the overall PEs utilization, resulting in degraded performance and increased latency.

Meissa is one of the systolic architectures that separates the multipliers from the adders rather than combining them in a unified array [26]. It stands for Multiplying matrices Efficiently In a Scalable Systolic Architecture. Like the TPUs, it adopted the WS dataflow for the systolic array, but it eliminates the input synchronization FIFOs to reduce the overall latency. However, it has bulky adder trees per each column of the systolic array instead of having the partial summations accumulated through the PEs. These adder trees impose scalability limitations due to serious physical implementation challenges. The larger the adder trees, the deeper pipelines they require to achieve higher frequency. This increases the overall latency, area, and energy consumption. Routing congestion is another expensive challenge, caused by delivering all products from all PEs in the same column to the adder tree. Consequently, Meissa is not scalable to large NxN dimensions, which is vital for large language models. Moreover, it still requires the output synchronization FIFOs, which add a considerable area/power/energy penalty.

In this paper, we present a novel Diagonal-Input Permutated (DiP) weight-stationary Systolic Array that overcomes the main challenges of the conventional WS systolic arrays. DiP is a scalable architecture of NxN PEs that maximizes the PEs utilization by featuring diagonal-input movement and permutated weight-stationary dataflow for the acceleration of matrix multiplication.

The main contributions of this work are highlighted as follows:

• We introduce DiP, a novel scalable spatial architecture that maximizes the PEs utilization and energy efficiency, achieving improvement in throughput by up to 1.49x and energy efficiency per area by up to 2.017x.
• The proposed architecture eliminates the input/output synchronization FIFOs of the WS by implementing a new dataflow with diagonal-input movement and permutated weights.
• The analytical models for DiP and WS, including throughput, latency, time to full PEs utilization (TFPU), and registers overhead, are extracted for different systolic array sizes.
• Hardware design space exploration and implementation are presented for DiP and WS using commercial 22nm technology, offering energy efficiency per area improvement up to 2.02x with area and power consumption savings up to 8.12% and 19.95%, respectively.
• DiP is evaluated using various transformer workloads from widely-used models, outperforming TPU-like architectures, and achieving energy improvement up to 1.81x and latency improvement up to 1.49x across various transformer workloads.

This paper is organized as follows: Section II discusses the systolic arrays background. Section III presents the DiP architecture. Section IV shows hardware design space exploration, evaluation, and results. Finally, Section V concludes the work.

II. SYSTOLIC ARRAYS BACKGROUND

Systolic arrays usually adopt one of the dataflows to control the data movement across the PEs. Each dataflow retains one of the input/output/weight data to be stationary during computations for the maximum duration to exploit the data reuse [27, 28]. The following dataflows are the most common for systolic array design:

• Weight Stationary (WS): weights are initially loaded to PEs and kept stationary during processing. The input matrix and partial summations are moved among PEs during processing.
• Input Stationary (IS): the input matrix is initially loaded to the systolic array, while the weight matrix and partial summations are moved among PEs during processing.
• Output Stationary (OS): input and weight matrices are moved across the PEs, while the partial summations are accumulated inside the PEs.
• Row Stationary (RS): this dataflow is proposed by Eyeriss [29]. It adopts a spatial architecture that uses coarse-grained PEs with internal memories to store weights and inputs. Inputs are broadcasted diagonally across the PEs, while the weight matrix is broadcasted horizontally, and partial summations move vertically.

OS dataflow moves both input and weight matrices simultaneously, which effectively doubles the required memory bandwidth for the systolic array. With RS dataflow, data redundancy increases because copies of the data are loaded into different PEs. Additionally, the circulation of weights within each PE reduces energy efficiency. On the other hand, the WS dataflow is widely used in many architectures, such as the Google TPU, due to its scalability and flexibility in handling convolutions and matrix multiplication [23–25]. Additionally, it requires less memory bandwidth. Therefore, we will focus on improving it.

A. WS Dataflow

WS dataflow is widely used in many architectures where weights are initially loaded to the systolic array's PEs, while the input matrix circulates among the PEs in a systolic fashion.
Fig. 1. Top-level schematic for an NxN weight stationary systolic array. There are two FIFO groups for input/output synchronization. The weights (black crossed buses) are loaded vertically, and psums (grey buses) are accumulated vertically. Inputs (I) (blue buses) are shifted horizontally from the input FIFO group, and the output is shifted out to the output FIFO group.

This approach mitigates the memory bottleneck by reducing the number of memory accesses and increasing data reusability. However, the WS dataflow requires input and output FIFOs to synchronize the data for proper functionality. Figure 1 shows an NxN weight stationary systolic array. There are two FIFO groups for input/output synchronization. The input FIFO group consists of a series of FIFOs with incrementally increasing depth, starting from one element in the second row up to N − 1 elements in the last row. The output FIFO group is structured similarly to the input FIFO group, but the FIFO depths decrease from N − 1 elements to one element moving from left to right across the systolic array columns. Weights are loaded vertically, and partial summations (psums) are accumulated vertically. The input matrix is shifted horizontally from the input FIFO group, and the output from the last PE row is shifted out to the output FIFO group. The use of FIFOs increases area, power, energy consumption, and latency for matrix multiplication. Consequently, FIFO-based systolic arrays suffer from lower energy efficiency and reduced PE utilization, leading to higher latency and decreased throughput.

The analytical modeling of the WS systolic array is studied for latency, throughput, TFPU, and FIFOs overhead, providing insights into dataflow performance. For the WS latency analytical model, as shown in (1), WS requires 3N + S − 3 cycles to complete the processing, where N represents the number of rows/columns per WS systolic array, and S is the number of pipeline stages per Multiply-Accumulate (MAC) unit, which equals one for a 1-stage pipelined MAC and two for a 2-stage pipelined MAC. The throughput, as indicated in (2), is calculated as the ratio of total operations to WS latency. Regarding register overhead, the WS systolic array uses two FIFO groups for input and output synchronization. Each group consists of N − 1 FIFOs, as shown in Fig. 1. Additionally, each FIFO group includes N(N − 1)/2 registers. Consequently, the total register overhead for a typical WS systolic array is calculated as shown in (3). TFPU is another metric introduced to calculate the required number of cycles to reach full utilization of PEs. This metric shows the overhead when the input matrix is initially loaded to PEs. TFPU is calculated as shown in (4), where it takes 2N − 1 cycles for WS to reach full PEs utilization.

Latency for WS = 3N + S − 3    (1)
Throughput for WS = 2N³ / (3N + S − 3)    (2)
Registers overhead for WS = N(N − 1)    (3)
TFPU for WS = 2N − 1    (4)
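To make the analytical model concrete, the following short Python sketch (illustrative only; the function names are ours, and S = 2 is assumed for the 2-stage pipelined MAC) evaluates equations (1)–(4) for several array sizes:

def ws_latency(N, S=2):
    # Equation (1): cycles to finish one NxN tile on an NxN WS array
    return 3 * N + S - 3

def ws_throughput(N, S=2):
    # Equation (2): 2*N^3 operations (multiplications + additions) per WS latency
    return 2 * N**3 / ws_latency(N, S)

def ws_registers_overhead(N):
    # Equation (3): registers inside the two FIFO groups (input + output)
    return N * (N - 1)

def ws_tfpu(N):
    # Equation (4): cycles until all PEs become active
    return 2 * N - 1

for N in (3, 4, 8, 16, 32, 64):
    print(N, ws_latency(N), round(ws_throughput(N), 1),
          ws_registers_overhead(N), ws_tfpu(N))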

III. DIP ARCHITECTURE

In this section, we discuss the DiP architecture, the DiP dataflow, and the DiP analytical models. Then, we compare the analytical models for DiP and WS.

A. DiP Architecture

DiP is a scalable spatial architecture consisting of NxN PEs, designed to accelerate matrix multiplication computations, as shown in Fig. 2 (a). The input matrix moves diagonally across the PEs, passing from one row of PEs to the next. The boundary PEs are diagonally connected, such that the registered inputs of the leftmost PE column are connected to the inputs of the rightmost PE column in the subsequent row. The weight matrix is loaded vertically, and psums are accumulated vertically. Figure 2 (b) shows the architecture of each PE. The proposed PE uses a 2-stage pipelined MAC unit to perform multiply and accumulate operations. It employs four enabled registers for the weight, input, multiplier output, and adder output. The weight and input registers are 8-bit, while the multiplier and adder registers are 16-bit. Control signals wshift, pe_en, mul_en, and adder_en manage the PE operations. Specifically, wshift enables the weight register, while pe_en enables the input register. mul_en and adder_en selectively enable their respective registers only during active computation cycles, reducing power consumption during inactive cycles. The wshift signal is shared across all PEs in the systolic array, whereas pe_en, mul_en, and adder_en are shared across each row.

B. DiP Dataflow

The proposed DiP architecture adopts a novel dataflow to control the movement of inputs, weights, and partial summations across the whole systolic array. The DiP dataflow relies on two major upgrades compared to WS dataflow: the diagonal movement of the input matrix, and the weight matrix permutation. Firstly, the input matrix moves diagonally across PE rows, as shown in Fig. 2 (a). Secondly, the proposed dataflow permutates the weights by shifting and rotating each column by its column index, as shown in Fig. 2 (c).
Fig. 2. (a) General NxN DiP systolic array architecture: inputs (I) move diagonally across PE rows, transitioning from one row to the next. The boundary PEs are diagonally connected, so that the registered inputs from the leftmost PE column feed into the inputs of the rightmost PE column in the subsequent row. Weights are loaded vertically, and psums are accumulated vertically along the columns as well. (b) PE block diagram, consisting of a 2-stage pipelined MAC unit and four enabled registers. Control signals wshift, pe_en, mul_en, and adder_en are used for operations control. wshift is shared between all systolic array PEs, while pe_en, mul_en, and adder_en are shared across each PE row. The grey buses represent the partial sum (psum) buses, while the weight buses are indicated by black crossed buses. Additionally, the input data buses are shown in blue, and control signals are represented with dashed lines. (c) General weight matrix permutation for DiP dataflow. The weight matrix is permutated by shifting and rotating each column by its column index.

The weights permutation is prepared offline at the software level according to the permutation pseudocode in Fig. 3. For each column, it iterates over all rows, assigning each element in the permutated matrix based on its row and column index. The permutation is done at the software level or at run-time in memory at almost zero cost. This permutation eliminates the input and output synchronization FIFOs required by conventional WS systolic arrays. Moreover, it increases the PE utilization and throughput, and decreases the required chip area and latency.

Fig. 3. Pseudocode for weight matrix permutation, where each column undergoes an incremental row shift based on the column index, creating a unique, wrap-around pattern across rows.
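As an illustration of the permutation in Fig. 3, the following Python sketch (not the authors' implementation; names are ours) rotates each column upward by its column index with wrap-around, which reproduces the 3x3 permutated matrix of Fig. 4 (b):

def permute_weights(W):
    # P[r][c] = W[(r + c) mod N][c]: column c is rotated by c positions,
    # matching the shift-and-rotate-by-column-index rule of Fig. 3.
    N = len(W)
    return [[W[(r + c) % N][c] for c in range(N)] for r in range(N)]

# 3x3 example from the paper: the columns of W are (a, b, c), (d, e, f), (g, h, i)
W = [["a", "d", "g"],
     ["b", "e", "h"],
     ["c", "f", "i"]]
for row in permute_weights(W):
    print(row)
# Expected permutated rows, as in Fig. 4 (b): (a, e, i), (b, f, g), (c, d, h)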
Figure 4 shows a complete example for a 3x3 DiP systolic array. It consists of three rows: a) Row-0: PE00, PE01, PE02; b) Row-1: PE10, PE11, PE12; c) Row-2: PE20, PE21, PE22. The PE array is diagonally connected, as shown in Fig. 4 (a). The leftmost PE in each PE row is connected to the rightmost PE in the next PE row. The weight matrix is initially permutated to be prepared for weights loading. Each column is shifted and rotated by its column index, as shown in Fig. 4 (b). The weights are initially loaded, row by row, to the systolic array, as shown in Fig. 4 from Cycle -2 to Cycle 0. The loading of the last weight row and the loading of the first input matrix row are performed in parallel at Cycle 0. The input data is loaded in parallel to Row-0, including the inputs for PE00, PE01, and PE02. After accomplishing the required computations by Row-0, the input data is shifted diagonally from PE00 to PE12, from PE01 to PE10, and from PE02 to PE11. Then, after accomplishing the required computations by Row-1, the input data is shifted diagonally again from PE10 to PE22, from PE11 to PE20, and from PE12 to PE21. The processing starts from Cycle 1 to Cycle 5, while the first output row becomes ready at Cycle 3, and the last output row becomes ready at Cycle 5.
Fig. 4. Example for 3x3 DiP systolic array (a) shows the diagonal input connections for 3x3 DiP, (b) shows 3x3 DiP weight matrix permutation by shifting
and rotating each column by its column index, and (c) shows the processing flow for 3x3 DiP example cycle by cycle. Cycles (-2, -1, 0) are dedicated to
weight matrix loading, while Cycle 0 involves loading the last row of the weight matrix and the first row of the input matrix. Cycles from Cycle 1 to Cycle
5 are allocated for matrix multiplication processing, with final output rows becoming ready starting from Cycle 3.

The processing goes as follows:

• Cycle -2: The last row of the permutated weight matrix (c, d, h) is loaded to the first PE row.
• Cycle -1: The last row of the permutated weight matrix (c, d, h) is shifted to the second PE row, and a new weights row (b, f, g) is loaded to the first PE row.
• Cycle 0: The last row of the permutated weight matrix (c, d, h) is shifted to the last PE row, the weights in the first PE row (b, f, g) are shifted to the second PE row, and the new weights row (a, e, i) is loaded to the first PE row. To save one cycle, the first input matrix row (1, 2, 3) is loaded to the first PE row simultaneously.
• Cycle 1: The first PE row shifts the partial summations (1a, 2e, 3i) to the second row. Using the diagonal connections, the input matrix row (1, 2, 3) is permutated to (2, 3, 1) and loaded to the second PE row at the same cycle.
• Cycle 2: The first PE row shifts the partial summations (4a, 5e, 6i) to the second row, and the input matrix row (4, 5, 6) is permutated to (5, 6, 4) and loaded to the second PE row. Similarly, the second PE row shifts the partial summations (1a+2b, 2e+3f, 3i+1g) to the third row, and the input matrix row (2, 3, 1) is permutated to (3, 1, 2) and loaded to the third PE row.
• Cycle 3: The first PE row shifts the partial summations (7a, 8e, 9i) to the second row, and the input matrix row (7, 8, 9) is permutated to (8, 9, 7) and loaded to the second PE row. Similarly, the second PE row shifts the partial summations (4a+5b, 5e+6f, 6i+4g) to the third row, and the input matrix row (5, 6, 4) is permutated to (6, 4, 5) and loaded to the third PE row. In addition, the third PE row shifts out the first output row (1a+2b+3c, 2e+3f+1d, 3i+1g+2h).
• Cycle 4: The first PE row becomes idle, unless more input rows are loaded. The second PE row shifts the partial summations (7a+8b, 8e+9f, 9i+7g) to the third row, and the input matrix row (8, 9, 7) is permutated to (9, 7, 8) and loaded to the third PE row. In addition, the third PE row shifts out the second output row (4a+5b+6c, 5e+6f+4d, 6i+4g+5h).
• Cycle 5: The first and second PE rows are idle, and the third PE row shifts out the third output row (7a+8b+9c, 8e+9f+7d, 9i+7g+8h).

Meanwhile, more new inputs may be loaded if the input matrix is larger, and the processing continues till the end of the workload.
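The cycle-by-cycle behaviour above can be checked with a small functional model of the DiP dataflow. The sketch below is illustrative only: it captures the offline weight permutation, the diagonal (rotate-by-one) input movement, and the vertical psum accumulation, but not the pipelined MAC timing or the control signals:

def dip_matmul(X, W):
    # Functional model of DiP: weights are permutated offline
    # (P[r][c] = W[(r + c) % N][c]) and stay stationary per PE row;
    # each input row is rotated left by one position as it moves
    # diagonally to the next PE row, while psums accumulate vertically.
    N = len(W)
    P = [[W[(r + c) % N][c] for c in range(N)] for r in range(N)]
    out = []
    for x in X:                      # one input row at a time
        psum = [0] * N
        row = list(x)
        for r in range(N):           # pass through PE rows 0..N-1
            psum = [psum[c] + row[c] * P[r][c] for c in range(N)]
            row = row[1:] + row[:1]  # diagonal move = rotate left by 1
        out.append(psum)
    return out

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert dip_matmul(X, W) == [[sum(X[i][k] * W[k][j] for k in range(3))
                             for j in range(3)] for i in range(3)]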
C. DiP Analytical Model

The analytical models for DiP are studied for latency and throughput. Regarding DiP latency, the DiP systolic array consumes 2N + S − 2 cycles for processing, where N is the number of rows/columns per DiP systolic array, and S is the number of pipelined stages per MAC unit. As a result, it takes 2N − 1 cycles for a 1-stage pipelined PE, and 2N cycles for a 2-stage pipelined PE, as shown in (5). The throughput is calculated as the number of operations (multiplications and additions) divided by the latency, as shown in (6). TFPU is calculated as shown in (7), where it takes N cycles to reach full PEs utilization, outperforming WS by N − 1 cycles. For the FIFO overhead, the DiP systolic array eliminates the FIFOs overhead by passing the whole input row in parallel without using any input synchronization FIFOs. Correspondingly, the output is generated row by row without any need for output synchronization FIFOs.

Latency for DiP = 2N + S − 2    (5)
Throughput for DiP = 2N³ / (2N + S − 2)    (6)
TFPU for DiP = N    (7)
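Assuming the 2-stage pipelined MAC (S = 2), a short illustrative snippet contrasts the WS and DiP latency models and the resulting throughput ratio; with this assumption the printed gains grow from roughly 33% at 3x3 to about 49% at 64x64, in line with the improvement curve discussed for Fig. 5 (b):

def ws_latency(N, S=2):
    # Equation (1), repeated here for the comparison
    return 3 * N + S - 3

def dip_latency(N, S=2):
    # Equation (5)
    return 2 * N + S - 2

for N in (3, 4, 8, 16, 32, 64):
    # Both dataflows process 2*N^3 operations per tile, so the DiP/WS
    # throughput ratio reduces to WS latency / DiP latency.
    gain = ws_latency(N) / dip_latency(N)
    print(N, ws_latency(N), dip_latency(N), round(100 * (gain - 1), 1))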
D. DiP/WSSA Analytical Comparison

The scalability of systolic arrays is important to meet the acceleration requirements. The proposed systolic array is gradually scaled up from 3x3 to 64x64 with sizes (3x3, 4x4, 8x8, 16x16, 32x32, 64x64). An analytical comparison of DiP and WS is conducted, evaluating throughput, latency, register savings, and TFPU across different systolic array sizes. Figure 5 (a) shows the latency for DiP compared to WS, with the percentage of latency savings calculated as the difference between WS and DiP latencies, divided by the WS latency. The saved percentage starts at 28% for a 3x3 systolic array and reaches 33% for a 64x64 systolic array.

In addition, the throughput for both the DiP and WS systolic arrays is compared, as shown in Fig. 5 (b). The throughput improvement is calculated as the ratio between DiP and WS throughput. The improvement ratio starts at 33.3% for a 3x3 systolic array and reaches 49.2% for a 64x64 systolic array. The proposed architecture significantly increases the PEs utilization. Thus, it outperforms the conventional WS counterparts in terms of throughput by up to 50%.

Moreover, Fig. 5 (c) shows the percentage of saved registers as another design improvement of the proposed design compared to the WS systolic array. Eliminating the input/output FIFOs leads to a percentage of saved registers reaching up to 20% for a 64x64 systolic array. The saved registers are calculated as the difference between the WS and DiP used registers, divided by the number of registers used by WS. The registers of the WS systolic array are distributed between input synchronization FIFOs, output synchronization FIFOs, and internal PE registers. In contrast, the proposed diagonal-input systolic array relies solely on internal PE registers, eliminating the need for input/output FIFOs. The represented numbers of registers are normalized to 8-bit as the baseline bandwidth.

TFPU calculates the required number of cycles to reach full utilization of PEs. This metric shows the overhead when the inputs are initially loaded, particularly for large matrix-matrix multiplication. Figure 5 (d) shows TFPU for WS and DiP systolic arrays. The proposed DiP rapidly utilizes all PEs row by row, whereas WS gradually activates PEs in a diagonal pattern, starting from the top-left and moving to the bottom-right. Consequently, DiP outperforms WS, requiring only almost half the time of WS to fully utilize the entire systolic array.

IV. EVALUATION & RESULTS

This section explores the hardware design space for the proposed DiP architecture, followed by benchmarking DiP using transformers and evaluating it against a TPU-like architecture across various transformer workloads. Finally, DiP is compared with existing accelerators in the literature.

A. Hardware Design Space Exploration

A hardware design space exploration is developed for DiP and WS at different sizes. Both designs are scaled from 4x4 to 64x64 with variants (4x4, 8x8, 16x16, 32x32, 64x64). A parameterized HDL design using Verilog is developed. Then, all designs are implemented from synthesis to GDSII using commercial 22nm technology at a frequency of 1GHz. Table I shows a comparison between WS and DiP at different sizes in terms of area and power consumption. Additionally, the saved area and power consumption are presented for each design. It is depicted that the saved area percentage reaches up to 8.12%. For the power consumption, the saved percentage reaches up to 19.95%.

Table II presents improvements in throughput, power consumption, area, and overall improvement (energy efficiency per area) for the different WS/DiP design space points. DiP outperforms WS across all metrics, with overall improvement from 1.7x to 2.02x. At a size of 32x32, DiP achieves 1.48x higher throughput than WS, 1.25x lower power consumption, and a 1.09x smaller area footprint, resulting in a total improvement of 2.02x. Additionally, at a size of 64x64, throughput is improved by 1.49x, power consumption is reduced by 1.21x, and area is decreased by 1.07x compared to WS, with an overall improvement of 1.93x.
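As a side note, the overall figures in Table II are consistent (to within rounding) with taking energy efficiency per area as the product of the three individual factors (throughput, power, and area improvements multiplied together); a quick illustrative check:

rows = {          # size: (throughput_x, power_x, area_x, overall_x) from Table II
    "4x4":   (1.38, 1.16, 1.06, 1.70),
    "8x8":   (1.44, 1.18, 1.08, 1.84),
    "16x16": (1.47, 1.20, 1.09, 1.93),
    "32x32": (1.48, 1.25, 1.09, 2.02),
    "64x64": (1.49, 1.21, 1.07, 1.93),
}
for size, (thr, pwr, area, overall) in rows.items():
    # product of the three improvement factors vs. the reported overall value
    print(size, round(thr * pwr * area, 2), overall)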
Fig. 5. (a) Latency (per single tile processing) for both WS and DiP systolic arrays. The grey curve indicates the percentage of saved latency of DiP over WS, (b) Throughput, measured in operations per cycle (OPS/Cycle), for WS and DiP, with the grey curve showing the throughput improvement percentage of DiP compared to WS, (c) The number of used registers for DiP compared to WS, normalized to 8-bit (baseline datawidth). The grey curve represents the percentage of saved registers, and (d) TFPU represents the time required to activate all PEs, with the grey curve representing the TFPU improvement percentage. WS and DiP are shown in blue and black, respectively.

TABLE I
COMPARISON OF AREA, POWER CONSUMPTION, AND SAVING PERCENTAGES FOR DIFFERENT WS/DIP SIZES USING COMMERCIAL 22NM TECHNOLOGY AT FREQUENCY OF 1GHZ

Size   | Area (WS)   | Area (DiP)  | Saved Area (%) | Power WS (mW) | Power DiP (mW) | Saved Power Consumption (%)
4x4    | 5,178 µm²   | 4,872 µm²   | 5.91           | 4.168         | 3.582          | 14.06
8x8    | 18,703 µm²  | 17,376 µm²  | 7.10           | 16.2          | 13.72          | 15.31
16x16  | 71,204 µm²  | 65,421 µm²  | 8.12           | 64.28         | 53.63          | 16.57
32x32  | 0.275 mm²   | 0.253 mm²   | 7.97           | 264.2         | 211.5          | 19.95
64x64  | 1.085 mm²   | 1.012 mm²   | 6.73           | 1041          | 857.8          | 17.60

B. Transformers Benchmarking

Transformer workloads are becoming increasingly massive and heavily dependent on matrix multiplication, especially in Multi-Head Attention (MHA) and Feed-Forward Networks (FFN). MHA, a core component of the transformer models proposed by Vaswani et al., enables the capture of complex dependencies in data [10]. By leveraging multiple attention heads, MHA captures diverse representational subspaces, allowing the model to understand relationships across different perspectives simultaneously.

In MHA, the input X is first projected into Queries (Qi), Keys (Ki), and Values (Vi) per each head i using learned weight matrices, as shown in (8.1). The attention scores (Si) for each head are computed by taking the scaled dot product of Queries (Qi) and transposed Keys (Ki), followed by a softmax normalization, as described in (8.2). The attention scores (Si) are multiplied with Values (Vi), producing the attention output Attni for each head according to (8.3).
TABLE II
IMPROVEMENTS IN THROUGHPUT, POWER CONSUMPTION, AREA, AND OVERALL PERFORMANCE OF DIP COMPARED TO WS AT DIFFERENT SIZES

Size   | Throughput Improvement (×) | Power Consumption Improvement (×) | Area Improvement (×) | Overall Improvement* (×)
4x4    | 1.38                       | 1.16                              | 1.06                 | 1.70
8x8    | 1.44                       | 1.18                              | 1.08                 | 1.84
16x16  | 1.47                       | 1.20                              | 1.09                 | 1.93
32x32  | 1.48                       | 1.25                              | 1.09                 | 2.02
64x64  | 1.49                       | 1.21                              | 1.07                 | 1.93

* Overall improvement represents energy efficiency per area

The outputs from all heads are concatenated into a single matrix Attnconcat (8.4), which is finally projected back to the model's hidden dimension (dmodel) using a learned output projection matrix WO, as shown in (8.5). This process enables the model to capture information from multiple representation subspaces simultaneously, enhancing the model's ability to represent complex sequences effectively.

Qi = X WiQ,  Ki = X WiK,  Vi = X WiV    (8.1)
Si = softmax(Qi KiT / √dk)    (8.2)
Attni = Si Vi    (8.3)
Attnconcat = concat(Attn1, Attn2, . . . , Attnh)    (8.4)
MHA = Attnconcat WO    (8.5)

where X is the input matrix; WiQ, WiK, and WiV are the input projection weight matrices; and Qi, Ki, and Vi are the resulting query, key, and value matrices per head, respectively. Si is the score matrix, and Attni is the attention output per head. After concatenating all attention outputs from h heads, Attnconcat is obtained. Finally, WO is the output projection weight matrix, and MHA is the final multi-head attention output.

FFN in transformers consists of two linear transformations with a non-linear activation function applied between them. The process begins with the first matrix multiplication, which projects the input y into a higher-dimensional space using a weight matrix W1 and adding a bias b1, as shown in (9.1). Next, a non-linear activation function, such as ReLU or GELU, is applied. Finally, the second matrix multiplication applies the second linear transformation to the non-linearly transformed output, mapping it back to the original dimensionality, as shown in (9.2).

Z = Non-Linear(y W1 + b1)    (9.1)
FFN = Z W2 + b2    (9.2)

where y is the FFN input, W1 is the first weight matrix, and b1 is the first bias. W2 is the second FFN weight matrix, and b2 is the second bias. The output of the second transformation is the FFN result.
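A compact NumPy sketch of equations (8.1)–(8.5) and (9.1)–(9.2) is shown below with toy dimensions; the variable names and sizes are illustrative and are not the benchmarked configurations:

import numpy as np

l, d_model, d_k, d_ffn, h = 4, 8, 2, 16, 4   # toy sizes; d_model = h * d_k
rng = np.random.default_rng(0)
X = rng.standard_normal((l, d_model))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Multi-Head Attention, equations (8.1)-(8.5)
heads = []
for i in range(h):
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # (8.1): each l x d_k
    S = softmax(Q @ K.T / np.sqrt(d_k))        # (8.2): l x l
    heads.append(S @ V)                        # (8.3): l x d_k
attn_concat = np.concatenate(heads, axis=-1)   # (8.4): l x (h*d_k)
Wo = rng.standard_normal((h * d_k, d_model))
mha = attn_concat @ Wo                         # (8.5): l x d_model

# Feed-Forward Network, equations (9.1)-(9.2)
W1, b1 = rng.standard_normal((d_model, d_ffn)), rng.standard_normal(d_ffn)
W2, b2 = rng.standard_normal((d_ffn, d_model)), rng.standard_normal(d_model)
Z = np.maximum(mha @ W1 + b1, 0)               # (9.1) with ReLU
ffn = Z @ W2 + b2                              # (9.2): l x d_model
print(mha.shape, ffn.shape)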
Table III presents the matrix multiplication dimensions for MHA and FFN operations within transformer models, highlighting the relationship between input sequence length (l), model size (dmodel), head size (dk), and FFN size (dFFN).

TABLE III
MATRIX MULTIPLICATION DIMENSIONS FOR MHA AND FFN WORKLOADS IN TERMS OF SEQUENCE LENGTH (l), MODEL HIDDEN SIZE (dmodel), HEAD SIZE (dk), AND FFN SIZE (dFFN). INPUT MATRICES SIZES ARE M × N AND N × K, AND THE OUTPUT MATRIX IS M × K.

Stage | Workload                          | Dimensions (M, N, K)
MHA   | Input projections Qi, Ki, Vi      | l × dmodel × dk
MHA   | Attention scores Qi KiT           | l × dk × l
MHA   | Attni = Si Vi                     | l × l × dk
MHA   | Output projection Attnconcat WO   | l × dmodel × dmodel
FFN   | W1 projection                     | l × dmodel × dFFN
FFN   | W2 projection                     | l × dFFN × dmodel

These dimensions provide insights into the computational requirements of transformer models, emphasizing how the sequence length and model size affect the matrix operations in both the attention mechanism and the feed-forward network.
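For a given configuration, the GEMM shapes of Table III can be enumerated directly. The helper below is an illustrative sketch (the function name and the BERT-base-like numbers are ours, not the paper's benchmarked set); it also counts 64x64 tiles per GEMM, matching the tiling used later in the evaluation:

def transformer_matmuls(l, d_model, d_k, d_ffn):
    # (M, N, K) GEMM dimensions from Table III for one layer,
    # where the multiplication is (M x N) by (N x K).
    return {
        "QKV input projection": (l, d_model, d_k),
        "attention scores Q.K^T": (l, d_k, l),
        "attention output S.V": (l, l, d_k),
        "output projection": (l, d_model, d_model),
        "FFN W1 projection": (l, d_model, d_ffn),
        "FFN W2 projection": (l, d_ffn, d_model),
    }

# Example: a BERT-base-like setting (l=512, dmodel=768, dk=64, dFFN=3072)
for name, (M, N, K) in transformer_matmuls(512, 768, 64, 3072).items():
    tiles = -(-M // 64) * -(-N // 64) * -(-K // 64)   # 64x64 tiles (ceil division)
    print(f"{name}: {M}x{N}x{K}, {tiles} tile-multiplications")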
C. DiP Evaluation

DiP is assessed against a TPU-like (WS-based) architecture using transformer workloads. Nine widely used transformer models are chosen to span various application domains and represent a spectrum of model sizes, from small language models (SLMs) to large language models (LLMs). These models are organized in three types: Encoder-Decoder, Encoder-only, and Decoder-only. Encoder-Decoder models include the Vanilla Transformer [10], T5 [30], and BART [31]; Encoder-only models include BERT [35], ALBERT [36], and Transformer-XL [37]; while Decoder-only models include GPT-2 [32], GPT-3 [33], and LLaMA [34]. The models are selected with hyper-parameters to cover a diverse range of workloads. Sequence lengths are chosen from a range of 64 to 2048, as (64, 128, 256, 512, 1024, 2048). Additionally, the model's hidden size (dmodel) varies across (512, 768, 1024, 1280, 5120), while the head size (dk) is set to either 64 or 128. Moreover, the FFN size (dFFN) is configured with values of (2048, 3072, 4096, 5120).

DiP and TPU-like architectures, each with a size of 64x64, are used for the evaluation. This architecture size aligns well with matrix tiling, as the head size for most transformer models is either 64 or 128. Cycle-accurate simulations are performed to evaluate both DiP and TPU-like implementations in terms of actual latency and energy consumption for each workload.
Matrix tiling is used to process matrix multiplication workloads on DiP and TPU-like architectures by dividing the input matrices M1 and M2 into sub-matrices (tiles) of 64x64. By studying many transformer models, the majority of MHA and FFN workload dimensions are divisible by 64. The multiplication is performed per tile as follows. For DiP and TPU-like architectures, every tile of M2 is loaded once and remains stationary throughout the computation for the corresponding output tile. For each tile of M2, the respective tiles from M1 are iteratively loaded, multiplied, and saved as output partial summation (psum) tiles. After processing all tiles, the final output matrix O is constructed by accumulating the associated psum tiles.
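A minimal NumPy sketch of this tiling scheme is given below; it is illustrative only, assumes dimensions divisible by the 64x64 tile size, and models just the accumulation order, not the hardware timing or the stationary-weight loading cost:

import numpy as np

def tiled_matmul(M1, M2, T=64):
    # M1: (M, N), M2: (N, K). Every TxT tile of M2 is kept "stationary" while the
    # matching M1 tiles stream through, producing psum tiles that are accumulated
    # into the corresponding output tile of O.
    M, N = M1.shape
    _, K = M2.shape
    O = np.zeros((M, K), dtype=M1.dtype)
    for n in range(0, N, T):                    # reduction dimension
        for k in range(0, K, T):
            w_tile = M2[n:n + T, k:k + T]       # stationary M2 tile
            for m in range(0, M, T):            # stream M1 tiles
                psum = M1[m:m + T, n:n + T] @ w_tile
                O[m:m + T, k:k + T] += psum     # accumulate psum tiles
    return O

A = np.random.rand(128, 256)
B = np.random.rand(256, 192)
assert np.allclose(tiled_matmul(A, B), A @ B)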
Figure 6 compares the energy consumption and latency of the DiP and TPU-like 64x64 architectures for MHA and FFN workloads across varying dimensions (M-N-K). DiP consistently outperforms the TPU-like implementation for MHA and FFN workloads. Energy improvements for MHA workloads range from 1.81x for smaller workloads to 1.25x for larger ones, while FFN workloads show a similar trend with improvements from 1.8x to 1.25x. These results highlight the energy efficiency of DiP. Additionally, the actual latency for MHA and FFN workloads demonstrates DiP's performance against the TPU-like implementation, offering up to 1.49x improvements for smaller workloads, gradually reducing to approximately 1.03x for larger workloads. The reduction of the latency improvement for larger workloads happens because the TPU-like architecture hides the latency associated with loading more M1 tiles per every new M2 tile. In contrast, for small to medium-sized workloads, TPU-like architectures incur the TFPU penalty of loading the M1 tile associated with each new M2 tile. Additionally, TPU-like architectures still face the overhead of input/output FIFOs, which impacts power consumption, latency, and TFPU. The evaluation highlights DiP as an energy-efficient design, making it a compelling alternative to TPU-like architectures.

D. Comparison with Related Work

Table IV compares the DiP architecture with the Google TPU [23], Groq ThinkFast TSP [38], and Alibaba Hanguang 800 [39], highlighting DiP's performance and energy efficiency. The DiP architecture features 4,096 MACs in a 64x64 configuration, operating at 1 GHz with INT8 precision on 22nm technology. As each accelerator is implemented in a different technology, the performance metrics are normalized to 22nm using DeepScaleTool [40]. The proposed DiP achieves significant energy efficiency, reaching 9.55 TOPS/W, and delivers a normalized performance per area of 8.2 TOPS/mm².

Fig. 6. Evaluation of DiP and TPU-like architecture at size 64x64 with MHA and FFN transformer workloads. The evaluation includes actual energy consumption (a, b) and latency (c, d) across various workload dimensions of matrix multiplication (M-N-K).
TABLE IV
COMPARISON WITH OTHER ACCELERATORS

Metric                              | DiP                | Google TPU [23]      | Groq ThinkFast TSP [38] | Alibaba Hanguang 800 [39]
Architecture                        | 64×64, 4,096 MACs  | 256×256, 65,536 MACs | Tensor Stream Processor | Tensor Cores
Max Frequency                       | 1GHz               | 700MHz               | 900MHz                  | 700MHz
Precision                           | INT8               | INT8                 | INT8, FP16              | INT8, INT16, FP24
Technology                          | 22nm               | 28nm                 | 14nm                    | 12nm
Power (W)                           | 0.858              | 40–50                | 300                     | 275.9
Area (mm²)                          | 1                  | 200                  | 725                     | 709
Peak Performance (TOPS)             | 8.2                | 92                   | 820                     | 825
Norm. Perform. (TOPS)¹              | 8.2                | 5.75                 | –                       | –
Area Norm. Perform. (TOPS/mm²)²,³   | 8.2                | 0.46                 | 0.411                   | 0.423
Energy Efficiency (TOPS/W)³         | 9.55               | 2.15                 | 2.73                    | 2.99

¹ Normalized Peak Performance at systolic array size of 64×64
² Normalized Peak Performance to die area (mm²)
³ Power and Area are normalized to 22nm using DeepScaleTool [40]

This demonstrates its capability to provide high computational throughput while minimizing energy consumption, showcasing its compact and optimized design. These metrics make DiP particularly well-suited for energy-efficient Transformer-based applications.

V. CONCLUSION

In this paper, a diagonal-input and permutated weight-stationary (DiP) systolic array is proposed to accelerate matrix multiplication. DiP features an architecture of NxN PEs, where each PE performs Multiply-Accumulate operations. DiP adopts a novel dataflow that eliminates the input and output synchronization FIFO buffers required by the conventional WS systolic array. As a result, DiP outperforms WS across all metrics, including throughput, latency, area, and power consumption. Additionally, the analytical models for latency, throughput, FIFO overhead, and TFPU are developed for the DiP and WS architectures. The proposed DiP architecture outperforms the WS counterparts in terms of throughput by up to 50%, and TFPU by up to 50%. Moreover, a hardware design space exploration is presented for both DiP and WS architectures using commercial 22nm technology, demonstrating power consumption savings of up to 19.95%, area savings of up to 8.12% at 1 GHz, and an energy efficiency per area improvement of up to 2.02x. Furthermore, DiP is evaluated using various transformer workloads from widely-used models such as GPT-2, GPT-3, BERT, BART, and LLaMA. DiP outperforms the TPU-like architecture, achieving energy improvements ranging from 1.25x to 1.81x and latency improvements ranging from 1.03x to 1.49x across various transformer MHA and FFN workloads. A comparison between relevant accelerators and DiP is discussed, achieving a performance of 8.2 TOPS and an energy efficiency of 9.55 TOPS/W, which is promising for energy-efficient transformer-based applications. This paper serves as the foundation for the DiP architecture and dataflow. Future extensions aim to scale the architecture and explore sparsity in transformers, which will further enhance energy efficiency and acceleration rates.

REFERENCES

[1] B. Mondal, "Artificial Intelligence: State of the Art," Recent Trends Adv. Artif. Intell. Internet Things, vol. 172, pp. 389-425, 2020.
[2] G. G. Chowdhury, "Natural language processing," Annu. Rev. Inf. Sci. Technol., vol. 37, no. 1, pp. 51-89, 2003.
[3] D. Khurana et al., "Natural language processing: State of the art, current trends and challenges," Multimedia Tools Appl., vol. 82, pp. 3713-3744, 2023.
[4] A. Gillioz et al., "Overview of the Transformer-based Models for NLP Tasks," Proc. 15th Conf. Comput. Sci. Inf. Syst. (FedCSIS), pp. 179-183, 2020.
[5] H. Zhang et al., "A Survey of Controllable Text Generation Using Transformer-based Pre-trained Language Models," ACM Comput. Surv., vol. 56, no. 64, 37 pages, Mar. 2024.
[6] Y. Wang et al., "Transformer in Action: A Comparative Study of Transformer-Based Acoustic Models for Large Scale Speech Recognition Applications," Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), pp. 6778-6782, 2021.
[7] P. Xu, X. Zhu, and D. A. Clifton, "Multimodal Learning With Transformers: A Survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 10, pp. 12113-12132, Oct. 2023.
[8] K. Han et al., "A Survey on Vision Transformer," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87-110, Jan. 2023.
[9] T. Lin et al., "A survey of transformers," AI Open, vol. 3, pp. 111-132, 2022.
[10] A. Vaswani et al., "Attention is all you need," Proc. Adv. Neural Inf. Process. Syst., pp. 6000-6010, 2017.
[11] J. Hoffmann et al., "Training compute-optimal large language models," Proc. 36th Int. Conf. Neural Inf. Process. Syst., 2024.
[12] L. Floridi and M. Chiriatti, "GPT-3: Its nature, scope, limits and consequences," Minds Mach., vol. 30, pp. 681-694, 2020.
[13] H.-T. Kung, "Why systolic architectures?," IEEE Comput., vol. 15, no. 1, pp. 37-46, Jan. 1982.
[14] Y. H. Hu and S. Kung, "Systolic arrays," Handbook Signal Process. Syst., pp. 939-977, 2018.
[15] Z. Yang et al., "Systolic array based accelerator and algorithm mapping for deep learning algorithms," Proc. Netw. Parallel Comput., pp. 153-158, 2018.
[16] M. Soltaniyeh et al., "An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-Matrix Multiplication," ACM Trans. Archit. Code Optim., vol. 19, no. 3, Article 42, 2022.
[17] B. Wang et al., "A novel systolic array processor with dynamic dataflows," Integration, vol. 85, pp. 42-47, 2022.
[18] H. Waris et al., "AxSA: On the design of high-performance and power-efficient approximate systolic arrays for matrix multiplication," J. Signal Process. Syst., vol. 93, no. 6, pp. 605-615, Jun. 2021.
[19] G. Shomron et al., "SMT-SA: Simultaneous multithreading in systolic arrays," IEEE Comput. Archit. Lett., vol. 18, no. 2, pp. 99-102, Jul. 2019.
[20] M. A. Hanif et al., "MPNA: A massively-parallel neural array accelerator with dataflow optimization for convolutional neural networks," arXiv:1810.12910, 2018.
[21] C. Peltekis et al., "ArrayFlex: A Systolic Array Architecture with Configurable Transparent Pipelining," Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), pp. 1-6, 2023.
[22] J. Dean, D. Patterson, and C. Young, "A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution," IEEE Micro, vol. 38, no. 2, pp. 21-29, 2018.
[23] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," Proc. 44th Annu. Int. Symp. Comput. Archit., pp. 1-12, Jun. 2017.
[24] N. P. Jouppi et al., "A Domain-Specific Supercomputer for Training Deep Neural Networks," Commun. ACM, vol. 61, no. 10, pp. 67-79, 2017.
[25] N. P. Jouppi et al., "Ten Lessons From Three Generations Shaped Google's TPUv4i," IEEE Micro, vol. 41, no. 2, pp. 57-66, 2021.
[26] B. Asgari et al., "Meissa: Multiplying matrices efficiently in a scalable systolic architecture," Proc. 2020 IEEE 38th Int. Conf. Comput. Design (ICCD), pp. 130-137, 2020.
[27] Y. Xu et al., "A Survey of Design and Optimization for Systolic Array-Based DNN Accelerators," ACM Comput. Surv., vol. 56, no. 1, pp. 1-37, 2023.
[28] E. Yago et al., "Impact of the Array Shape and Memory Bandwidth on the Execution Time of CNN Systolic Arrays," Proc. 23rd Euromicro Conf. Digit. Syst. Design (DSD), pp. 510-517, 2020.
[29] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks," Proc. 43rd Int. Symp. Comput. Archit. (ISCA), pp. 367-379, 2016.
[30] C. Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," J. Mach. Learn. Res., vol. 21, no. 140, pp. 1-67, 2020.
[31] M. Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension," arXiv preprint arXiv:1910.13461, 2019.
[32] A. Radford et al., "Language Models are Unsupervised Multitask Learners," OpenAI, 2019. [Online].
[33] T. B. Brown et al., "Language Models are Few-Shot Learners," arXiv preprint arXiv:2005.14165, 2020.
[34] H. Touvron et al., "LLaMA: Open and Efficient Foundation Language Models," arXiv preprint arXiv:2302.13971, 2023.
[35] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv preprint arXiv:1810.04805, 2019.
[36] Z. Lan et al., "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations," arXiv preprint arXiv:1909.11942, 2019.
[37] Z. Dai et al., "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context," arXiv preprint arXiv:1901.02860, 2019.
[38] D. Abts et al., "Think fast: A tensor streaming processor (TSP) for accelerating deep learning workloads," Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit., pp. 145-158, 2020.
[39] Y. Jiao et al., "A 12nm programmable convolution-efficient neural-processing-unit chip achieving 825TOPS," Proc. IEEE Int. Solid-State Circuits Conf., pp. 136-140, 2020.
[40] S. Sarangi and B. Baas, "DeepScaleTool: A tool for the accurate estimation of technology scaling in the deep-submicron era," Proc. IEEE Int. Symp. Circuits Syst., pp. 1-5, 2021.

Ahmed J. Abdelmaksoud (Member, IEEE) is currently pursuing his PhD with the Centre for Electronics Frontiers (CEF) at the University of Edinburgh, UK. He received his BSc and MSc in Electronics Engineering from Cairo University, Egypt, in 2018 and 2022, respectively. Since 2018, he has been actively involved in Digital ASIC design projects across both research and industry. His professional experience includes working as a Research Associate at the System-on-Chip Center, Khalifa University, UAE; an ASIC Physical Design Engineer at Si-Vision, Egypt; and a Research Assistant at the Opto-Nano Electronics Lab, Egypt. His current research interests primarily focus on developing spatial and specialized architectures for efficient AI hardware acceleration.

Shady Agwa (Member, IEEE) is a Research Fellow at the Centre for Electronics Frontiers (CEF), The University of Edinburgh (UK). He received his BSc and MSc degrees from Assiut University (Egypt), both in Electrical Engineering. He got his PhD in Electronics Engineering from The American University in Cairo (Egypt) in 2018. Following his PhD, he joined the Computer Systems Laboratory at Cornell University (USA) as a Postdoctoral Associate for two years. In 2021, Shady joined the Centre for Electronics Frontiers at the University of Southampton (UK) as a Senior Research Fellow and then as a Research Fellow at the University of Edinburgh (UK). His research interests span across VLSI and Computer Architecture for AI using conventional and emerging technologies. His work focuses on ASIC-driven AI architectures with extensive expertise in In-Memory Computing, Stochastic Computing, Systolic Arrays, Beyond Von Neumann Architectures, Memories, and Energy-Efficient Digital ASIC Design.

Themis Prodromakis (Senior Member, IEEE) received the bachelor's degree in electrical and electronic engineering from the University of Lincoln, U.K., the M.Sc. degree in microelectronics and telecommunications from the University of Liverpool, U.K., and the Ph.D. degree in electrical and electronic engineering from Imperial College London, U.K. He then held a Corrigan Fellowship in nanoscale technology and science with the Centre for Bio-Inspired Technology, Imperial College London, and a Lindemann Trust Visiting Fellowship with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, USA. He was a Professor of nanotechnology at the University of Southampton, U.K. He holds the Regius Chair of Engineering at the University of Edinburgh and is Director of the Centre for Electronics Frontiers. He is currently a Royal Academy of Engineering Chair in Emerging Technologies and holds a Royal Society Industry Fellowship. His background is in electron devices and nanofabrication techniques. His current research interests include memristive technologies for advanced computing architectures and biomedical applications. He is a fellow of the Royal Society of Chemistry, the British Computer Society, the IET, and the Institute of Physics.
