CN118132507B - A new in-memory computing architecture that supports a variety of workloads - Google Patents
- Publication number
- CN118132507B (application CN202410558566.XA)
- Authority
- CN
- China
- Prior art keywords
- memory
- data
- module
- signals
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7821—Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C11/00—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
- G11C11/21—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
- G11C11/34—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
- G11C11/40—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
- G11C11/41—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
- G11C11/412—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger using field-effect transistors only
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention discloses a novel in-memory computing architecture that supports multiple workloads. It relates to in-memory computing technology and addresses the limited diversity of computing tasks in the prior art. The architecture comprises an in-memory computing array, a precharge module, a read word line and write word line driver module, an input value and control signal driver module, a bit line driver module, a sense amplifier group, a peripheral computing logic module, a reconfigurable address generation unit module, and a top-level control module. Under the control of the top-level control module, these modules perform data computation and data movement. Its advantage is that it can support any algorithm that can run on a CPU, achieving general-purpose computing capability.
Description
Technical Field
The present invention relates to in-memory computing technology, and in particular to a novel in-memory computing architecture that supports multiple workloads.
Background Art
In the era of artificial intelligence, graphics processing units (GPUs) are the cornerstone of computing power and have contributed greatly to accelerating AI workloads. However, GPUs also have notable limitations, including high overall power consumption, significant instruction execution latency, and a lack of hardware programmability. Under standard operating conditions, a GPU typically consumes more energy than a highly customized chip: even the latest Nvidia GPUs based on the Volta architecture require more power than fully custom ASICs to reach the same computing performance. In addition, when a GPU executes multi-threaded tasks, frequent DRAM reads and writes introduce latency that lengthens instruction execution. Finally, a GPU's hardware structure is fixed at design time and offers no flexibility for ad-hoc reprogramming.
In-memory computing offers an innovative response to these challenges. By integrating SRAM storage cells with computing units, it drastically shortens the distance data must travel between the two, greatly reducing the energy and latency cost of data movement. In the prior art, however, an in-memory computing architecture can optimize only one operator or a limited set of operators. AI training and inference frequently switch between different layers of the same neural network, or between multiple networks sharing the same hardware resources (for example, performing image classification and speech recognition at the same time). This requires the in-memory computing architecture to support diverse computing tasks, yet the in-memory computing architectures actually available lack such generality.
Summary of the Invention
The purpose of the present invention is to provide a novel in-memory computing architecture that supports multiple workloads, so as to solve the above problems in the prior art.
The novel in-memory computing architecture supporting multiple workloads described in the present invention comprises:
an in-memory computing array, for storing input data and performing data computation according to operator selection control signals;
a precharge module, for charging the write bit line signal and the inverted write bit line signal and pulling them up to VDD when the in-memory computing array enters the read phase;
a read word line and write word line driver module, for providing the read word line signals and the write word line signals of the in-memory computing array;
an input value and control signal driver module, for providing part of the data signals and the operator selection control signals of the in-memory computing array for the next round, and for providing the operator selection control signals of the peripheral computing logic;
a bit line driver module, for providing the driving signals of the write bit lines and their inverted signals;
a sense amplifier group, for reading out the data stored in the in-memory computing array;
a peripheral computing logic module, for performing left shift, right shift, or high-bit fill operations on the data read out by the sense amplifier group;
a reconfigurable address generation unit module, for reading out the logical addresses of each computing module for the next round, and for sending out the data signals when the algorithm instruction mapping addresses and the handshake signal are synchronized;
a top-level control module, for parsing external instruction signals, controlling the flow of data signals through the novel in-memory computing architecture, and adjusting the read/write access timing of the data signals.
The advantage of the novel in-memory computing architecture described in the present invention is that it can support any algorithm that can run on a CPU, achieving general-purpose computing capability. At a 1.8 V supply voltage and 800 MHz operation, the architecture reaches 47.65 TOPS/W, with an average throughput of 3.7 GOPS and a peak of 5.2 GOPS. Because the data movement path is drastically shortened, its in-memory computing power consumption is on average about 10x lower than that of other 64-bit processors.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of the novel in-memory computing architecture described in the present invention.
FIG. 2 is a schematic diagram of the structure of the SRAM storage cell described in the present invention.
FIG. 3 is a schematic diagram of the interaction between the SRAM storage cell and the internal computing logic unit described in the present invention.
FIG. 4 is a schematic diagram of the addition and subtraction operations of the internal computing logic unit described in the present invention.
FIG. 5 is a functional schematic diagram of the peripheral computing logic unit described in the present invention.
FIG. 6 is a functional schematic diagram of the reconfigurable address generation unit module described in the present invention.
FIG. 7 is a schematic flowchart of compiling the mapping address table in the novel in-memory computing architecture described in the present invention.
FIG. 8 is a timing diagram of the phase in which the reconfigurable address generation unit module described in the present invention writes mapping address table data.
FIG. 9 is a timing diagram of the phase in which the reconfigurable address generation unit module described in the present invention participates in the computing process of the in-memory computing array.
FIG. 10 is a schematic workflow diagram of the novel in-memory computing architecture described in the present invention.
FIG. 11 is a schematic diagram of the data flow and control flow of the novel in-memory computing architecture described in the present invention.
DETAILED DESCRIPTION OF THE EMBODIMENTS
As shown in FIG. 1, the novel in-memory computing architecture supporting multiple workloads described in the present invention comprises, as a circuit entity, the following core circuit modules:
(1) An in-memory computing array (CIMA), comprising a plurality of SRAM storage cells arranged in rows and columns, and internal computing logic units arranged between the rows. The SRAM storage cells are configured to store written data signals as well as partial data signals for the next round of computation. The internal computing logic units are configured to store partial data signals for the next round of computation, read the data signals of the SRAM storage cells, and jointly participate in addition, subtraction, multiplication, and Boolean logic operations.
(2) A precharge module, configured to charge the write bit line signal WBL and the inverted write bit line signal WBLb and pull them up to VDD when the in-memory computing array enters the read phase.
(3) A read word line and write word line driver module, configured to provide a plurality of read word line signals RWL and write word line signals WWL.
(4) An input value and control signal driver module, configured to provide partial data signals for the next round and the plurality of operator selection control signals required by the internal computing logic units and the peripheral computing logic.
(5) A bit line driver module, configured to provide the driving signals of the write bit line signal WBL and the inverted write bit line signal WBLb.
(6) A sense amplifier group, configured to provide a plurality of sense amplifiers (SAs) that read out the data signals stored in the in-memory computing array.
(7) A peripheral computing logic unit, configured to receive the SA read-out signals and perform left shift, right shift, and high-bit fill operations.
(8) A reconfigurable address generation unit module (RAGU), comprising an algorithm instruction mapping address RAM and a FIFO memory. The algorithm instruction mapping address RAM is configured so that the algorithm instruction mapping address table can be written into it, and so that the addresses of the in-memory computing array, the internal computing logic units, and the peripheral computing logic for the next round can be read out from it, while sending a handshake signal to the FIFO memory. The FIFO memory is configured to receive the data signals output by the internal computing logic units and the peripheral computing logic unit, and to send them out when synchronized with the handshake signal of the algorithm instruction mapping address RAM.
(9) A top-level control module, comprising a top-level control unit and a timing generation unit. The top-level control unit is configured to parse external instruction signals and control the flow of data signals through the novel in-memory computing architecture. The timing generation unit is configured to adjust data signals whose clock frequency does not exceed the 150 MHz threshold so as to meet the read/write access times of t1 = 7 ns for writing data into the in-memory computing array and t2 = 10 ns for reading data out of it.
As shown in FIG. 2, the read-write-decoupled 8T SRAM storage cell comprises a first inverter formed by the NMOS/PMOS transistor pair M1 and M2, a second inverter formed by the NMOS/PMOS transistor pair M3 and M4, and access transistors/transmission gates M5, M6, M7, and M8. When the write word line signal WWL 100 is pulled high, the write bit line signal WBL 102 and the inverted write bit line signal WBLb 103 write the data signal. When the read word line signal RWL 101 is pulled high, the read bit line signal RBL 111 and the inverted read bit line signal RBLb 112 read out the data signal. Read and write operations are thus decoupled, eliminating read-write interference.
The input value and control signal driver module drives the partial data signals 104 for the next round to the internal computing logic units, and drives the control signals 105 to the internal computing logic units and the peripheral computing logic circuit. Data with a precision of 8, 16, 32, or 64 bits can be fed into the internal computing logic units for computation. The control signals 105 select the computation type of the internal computing logic units and the peripheral computing logic circuit.
As shown in FIG. 3, when the read word line signal RWL 101 is selected and pulled high to VDD, the 8T SRAM outputs its data signal onto the read bit line signal RBL 111 and the inverted read bit line signal RBLb 112. The data bits are arranged from the least significant bit (LSB) to the most significant bit (MSB), left to right. The inverted read bit line signal RBLb 112 can be regarded as the one's complement of the read bit line signal RBL 111; adding one to it through a chained full adder yields the two's complement 121 of RBL 111. The two's complement 121 and the read bit line signal RBL 111 are routed through the data selector MUX to the operator selector OP_MUX, where they participate in the computation together with the input data 104, producing the computation result 106. The control signals 105 drive the MUX and OP_MUX to select between the true value and the two's complement 121, and to select the computation type.
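The one's-complement-plus-one step described above can be sketched in software. The following Python sketch is an illustration of the arithmetic, not the patented circuit: it models RBLb as the bitwise inverse of RBL and ripples a carry of 1 through it bit by bit, mirroring the chained full adder.

```python
def twos_complement_from_inverse(rbl: int, width: int = 64) -> int:
    """Model the FIG. 3 path: RBLb (bitwise inverse of RBL) plus 1.

    The hardware obtains the inverse for free on the complementary
    bit line; here we compute it, then ripple a carry through it
    from the LSB, as a chained full adder would.
    """
    mask = (1 << width) - 1
    rblb = ~rbl & mask          # one's complement, as read from RBLb
    carry = 1                   # the "+1" injected at the LSB
    result = 0
    for i in range(width):      # ripple-carry increment, LSB first
        s = ((rblb >> i) & 1) + carry
        result |= (s & 1) << i
        carry = s >> 1
    return result               # final carry-out is dropped

# Two's complement of x equals (-x) mod 2^width:
assert twos_complement_from_inverse(5, 8) == (-5) & 0xFF  # 251
```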
As shown in FIG. 4, the input value 104 and the true value/two's complement 121 pass through a chained full adder, which outputs the 64-bit addition/subtraction result 106.
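The same chained full adder serves both operations: subtraction is performed by adding the two's complement selected upstream. A minimal Python sketch of this 64-bit add/subtract datapath (illustrative only; function names are invented for the example) is:

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1

def chained_full_adder(a: int, b: int, carry: int = 0) -> int:
    """64-bit ripple-carry addition; the final carry-out is dropped."""
    result = 0
    for i in range(WIDTH):
        s = ((a >> i) & 1) + ((b >> i) & 1) + carry
        result |= (s & 1) << i
        carry = s >> 1
    return result

def add_sub(x: int, y: int, subtract: bool) -> int:
    """Feed either y or its two's complement to the adder chain."""
    operand = ((~y & MASK) + 1) & MASK if subtract else y
    return chained_full_adder(x & MASK, operand)

assert add_sub(7, 5, subtract=False) == 12
assert add_sub(7, 5, subtract=True) == 2
assert add_sub(5, 7, subtract=True) == (-2) & MASK  # wrap-around
```

Subtraction thus costs nothing beyond the complement path already present in FIG. 3, which is why a single adder chain suffices for both operators.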
As shown in FIG. 5, the peripheral computing logic circuit takes as input the data signal 107 read out by the sense amplifier group SAs, and outputs the shift-and-fill result 108 via the operator selector OP_MUX. The computation type of OP_MUX is selected by the control signals 105.
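The peripheral logic's three operations can be sketched as follows. This is a software illustration under assumptions: the opcode names and the exact high-bit-fill semantics are invented for the example, since the patent text names the operations but not their encoding.

```python
MASK64 = (1 << 64) - 1

def peripheral_op(data: int, op: str, amount: int = 1) -> int:
    """Model the FIG. 5 OP_MUX on a 64-bit word read out by the SAs."""
    if op == "shl":                       # left shift, zeros enter at the LSB
        return (data << amount) & MASK64
    if op == "shr":                       # logical right shift
        return data >> amount
    if op == "fill_high":                 # high-bit fill: set the top `amount` bits
        fill = ((1 << amount) - 1) << (64 - amount)
        return data | fill
    raise ValueError(f"unknown op: {op}")

assert peripheral_op(0b1011, "shl", 2) == 0b101100
assert peripheral_op(0b1011, "shr", 1) == 0b101
assert peripheral_op(0, "fill_high", 4) == 0xF000_0000_0000_0000
```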
As shown in FIG. 6, the reconfigurable address generation unit module comprises an algorithm instruction mapping address RAM and a FIFO memory. The algorithm instruction mapping address RAM can be configured for writing the data signals of the mapping address table, with a counter set to count the number of computation steps. When computation starts, the algorithm instruction mapping address RAM is configured for reading, and the mapping address table data are read out. On each read, the counter is decremented by one and a handshake signal is sent to the FIFO memory. When the counter reaches zero, the algorithm computation is complete, and an interrupt signal is output to the top-level control unit. The FIFO memory is configured to receive the output data signal 106 of the in-memory computing array and the output data signal 108 of the peripheral computing logic circuit and, upon receiving the handshake signal from the algorithm instruction mapping address RAM, to synchronously send out the next round of data signals.
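The RAGU's bookkeeping (read an address, decrement the step counter, handshake the FIFO, raise an interrupt at zero) can be sketched behaviorally. This is a simplified software model of the description above, not the circuit; the class and method names are invented for the example.

```python
from collections import deque

class RAGU:
    """Behavioral sketch of the FIG. 6 reconfigurable address generator."""

    def __init__(self, mapping_table):
        self.table = list(mapping_table)   # algorithm instruction mapping addresses
        self.counter = len(self.table)     # one table entry per computation step
        self.fifo = deque()                # intermediate results (signals 106/108)
        self.interrupt = False

    def push_result(self, value):
        self.fifo.append(value)            # data arriving from CIMA / peripheral logic

    def step(self):
        """Read one mapping entry; the handshake releases one FIFO datum."""
        addr = self.table[len(self.table) - self.counter]
        self.counter -= 1
        data = self.fifo.popleft() if self.fifo else None  # handshake-synchronized
        if self.counter == 0:
            self.interrupt = True          # algorithm finished; notify top level
        return addr, data

ragu = RAGU(mapping_table=[0x10, 0x24, 0x38])
ragu.push_result(42)
assert ragu.step() == (0x10, 42)
ragu.step(); ragu.step()
assert ragu.interrupt
```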
The compilation of the algorithm instruction mapping address table is shown in FIG. 7, taking a CSR-scalar SpMV algorithm as an example. The pseudocode is converted into ANSI C program code, and the high-level code is then converted into an assembly language, for example that of the RISC-V 64 instruction set architecture. This assembly is then pruned during compilation to remove redundant instructions, such as the lw (load word) instruction, and converted into a mapping address table with one 32-bit-wide entry per row. The table contains the addresses for the next round of computation in the in-memory computing array, the internal computing logic units, and the peripheral computing logic circuit, as well as the signals selecting between the internal computing logic units and the peripheral computing logic circuit.
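For reference, the CSR-scalar SpMV kernel named above (one scalar accumulator per output row) looks like this in plain code. This is the standard textbook formulation of the algorithm being compiled, not the patent's mapped version:

```python
def spmv_csr_scalar(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each nonzero
    row_ptr : start offset of each row in `values` (length n_rows + 1)
    """
    n_rows = len(row_ptr) - 1
    y = [0] * n_rows
    for i in range(n_rows):
        acc = 0                            # the "scalar" accumulator
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] times [1, 1, 1]:
assert spmv_csr_scalar([1, 2, 3], [0, 2, 1], [0, 2, 3], [1, 1, 1]) == [3, 3]
```

The inner multiply-accumulate loop is what the mapping address table ultimately schedules onto the in-memory computing array, while explicit loads (such as lw) are elided because the operands already reside in the array.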
The timing diagrams of the reconfigurable address generation unit module while writing mapping address table data and while participating in the computing process of the in-memory computing array are shown in FIG. 8 and FIG. 9.
FIG. 8 shows a 7-cycle operation. The inputs include the clock signal CLK, the write enable signal WR_EN, the read enable signal RD_EN, the mapping address table input address signal ADDR[31:0], and the mapping address table input data signal RAM DATA Input[31:0]. When the write enable is pulled high, the write operation is valid: the input address signal selects the RAM address, and the input data signals are written in sequence.
FIG. 9 shows a 14-cycle operation. The inputs include the clock signal CLK, the write enable signal WR_EN, the read enable signal RD_EN, the mapping address table output data signal RAM DATA Output[31:0], the handshake signal Valid, and the FIFO memory output signal FIFO DATA Output[63:0]. When the read enable is pulled high, the read operation is valid: the mapping address table data signals are output in sequence after a one-clock-cycle delay, and the handshake pulse signal is triggered at the same time. After a rising edge of the handshake signal is detected, the FIFO memory data output signal is output synchronously.
The workflow of the novel in-memory computing architecture described in the present invention is shown in FIG. 10. In the first step, the reconfigurable address generation unit module writes the mapping address table (Address LUT), and the read/write counter records the number of LUT rows. In the second step, the in-memory computing array stores the initial data into the SRAM storage cells and outputs part of the data to the RAGU. In the third step, the reconfigurable address generation unit module sends the first computation type and the data participating in the computation to the internal computing logic units or the peripheral computing logic circuit; the in-memory computing array completes the first computation and passes the intermediate result to the reconfigurable address generation unit module. In the fourth step, the counter in the reconfigurable address generation unit module is decremented by one and checked against zero: if zero, the computation is finished and the result is output; otherwise the next round of computation proceeds, until the computation is complete.
The data flow and control flow of the novel in-memory computing architecture described in the present invention are shown in FIG. 11, which details the process of data input, computation, and result output for an arbitrary algorithm. The architecture first receives instructions issued by the central processing unit and external data moved in from other RAM; the control circuit decodes the received instructions and passes the control signals to the downstream modules. External initial data, such as the mapping address table data and the algorithm's initial data, travel over the data path to the RAGU module and the CIMA, respectively. When the CIMA starts computing as directed by the mapping address table, its intermediate values are passed to the RAGU and, together with the next-round addresses given by the mapping address table, redistributed into the CIMA for computation. When the computation is complete, the RAGU raises an interrupt signal and the CIMA outputs the result; the algorithm's computation then ends, and the architecture waits for the next run.
Those skilled in the art may make various other corresponding changes and modifications based on the technical solutions and concepts described above, and all such changes and modifications shall fall within the protection scope of the claims of the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410558566.XA CN118132507B (en) | 2024-05-08 | 2024-05-08 | A new in-memory computing architecture that supports a variety of workloads |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118132507A CN118132507A (en) | 2024-06-04 |
CN118132507B true CN118132507B (en) | 2024-07-12 |
Family
ID=91232053
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410558566.XA Active CN118132507B (en) | 2024-05-08 | 2024-05-08 | A new in-memory computing architecture that supports a variety of workloads |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118132507B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110414677A (en) * | 2019-07-11 | 2019-11-05 | 东南大学 | An in-memory computing circuit suitable for fully connected binary neural networks |
CN117079688A (en) * | 2023-09-12 | 2023-11-17 | 安徽大学 | A current domain 8TSRAM unit and dynamic adaptive quantization storage and calculation circuit |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102342994B1 (en) * | 2020-07-21 | 2021-12-24 | 고려대학교 산학협력단 | In memory computing supporting arithmetic operations |
CN114398308A (en) * | 2022-01-18 | 2022-04-26 | 上海交通大学 | Near memory computing system based on data-driven coarse-grained reconfigurable array |
US20220366968A1 (en) * | 2022-08-01 | 2022-11-17 | Intel Corporation | Sram-based in-memory computing macro using analog computation scheme |
CN115565581A (en) * | 2022-11-07 | 2023-01-03 | 上海浦东复旦大学张江科技研究院 | High-energy-efficiency edge storage calculation circuit |
CN116860696A (en) * | 2023-07-07 | 2023-10-10 | 北京航空航天大学 | An in-memory computing circuit based on non-volatile memory |
CN117389466A (en) * | 2023-08-29 | 2024-01-12 | 清华大学 | Reconfigurable intelligent storage and computing integrated processor and storage and computing architecture design device |
2024
- 2024-05-08 CN CN202410558566.XA patent/CN118132507B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110414677A (en) * | 2019-07-11 | 2019-11-05 | 东南大学 | An in-memory computing circuit suitable for fully connected binary neural networks |
CN117079688A (en) * | 2023-09-12 | 2023-11-17 | 安徽大学 | A current domain 8TSRAM unit and dynamic adaptive quantization storage and calculation circuit |
Also Published As
Publication number | Publication date |
---|---|
CN118132507A (en) | 2024-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11335387B2 (en) | In-memory computing circuit for fully connected binary neural network | |
JP4989900B2 (en) | Parallel processing unit | |
JPH08194679A (en) | Method and device for processing digital signal and memory cell reading method | |
US11934824B2 (en) | Methods for performing processing-in-memory operations, and related memory devices and systems | |
CN110058839A (en) | A kind of circuit structure based on subtraction in Static RAM memory | |
US6301185B1 (en) | Random access memory with divided memory banks and data read/write architecture therefor | |
CN110633069B (en) | Multiplication circuit structure based on static random access memory | |
US20210209022A1 (en) | Processing-in-memory (pim) device | |
US20210223996A1 (en) | Processing-in-memory (pim) devices | |
CN114115507B (en) | Memory and method for writing data | |
CN101645305A (en) | Static random access memory (SRAM) for automatically tracking data | |
US11861369B2 (en) | Processing-in-memory (PIM) device | |
US12106819B2 (en) | Processing-in-memory (PIM) device | |
CN118132507B (en) | A new in-memory computing architecture that supports a variety of workloads | |
CN118093507A (en) | Memory calculation circuit structure based on 6T-SRAM | |
WO2024151370A1 (en) | Flexible sram pre-charge systems and methods | |
JPH0390942A (en) | Control system for main storage device | |
US11474787B2 (en) | Processing-in-memory (PIM) devices | |
Lee et al. | Design of 16-Kb 6T SRAM Supporting Wide Parallel Data Access for Enhanced Computation Speed | |
JPH0877769A (en) | Synchronous semiconductor storage device | |
CN119248225B (en) | Five-tube half adder circuit, digital in-memory computing array and static random access memory | |
US12260900B2 (en) | In-memory computing circuit and method, and semiconductor memory | |
JPS618791A (en) | Static semiconductor memory | |
CN111522753B (en) | SDRAM (synchronous dynamic random access memory) control method and system based on state machine | |
JP3923642B2 (en) | Semiconductor memory device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||