
WO2019136752A1 - Artificial intelligence convolution processing method, apparatus, readable storage medium, and terminal - Google Patents

Artificial intelligence convolution processing method, apparatus, readable storage medium, and terminal

Info

Publication number
WO2019136752A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
matrix
processed
module
data matrix
Prior art date
Application number
PCT/CN2018/072665
Other languages
English (en)
French (fr)
Inventor
肖梦秋
Original Assignee
深圳鲲云信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司 filed Critical 深圳鲲云信息科技有限公司
Priority to PCT/CN2018/072665 priority Critical patent/WO2019136752A1/zh
Priority to CN201880002147.0A priority patent/CN109313723B/zh
Publication of WO2019136752A1 publication Critical patent/WO2019136752A1/zh
Priority to US16/929,819 priority patent/US11874898B2/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • the present invention relates to the field of artificial intelligence, and in particular to an artificial intelligence convolution processing method, apparatus, readable storage medium, and terminal.
  • Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence.
  • the artificial intelligence algorithm is a neural network model algorithm that simulates the human brain, and its computational workload is enormous: AlphaGo, which likewise uses artificial intelligence algorithms, requires thousands of conventional processors (CPUs) and hundreds of graphics processors (GPUs). Clearly, as artificial intelligence enjoys a new wave of revival, conventional processors are becoming a bottleneck that hinders its spread.
  • the object of the present invention is to provide an artificial intelligence convolution processing method and an artificial intelligence processing apparatus, for solving technical problems in the prior art such as an insufficient degree of pipelining of artificial intelligence algorithms.
  • the present invention provides an artificial intelligence convolution processing method, applied to a processing module, the method including: adding multiple columns of invalid data at the head of a first to-be-processed data matrix stored in a first cache module to form a second to-be-processed data matrix, wherein the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism; and causing a data transmission module to take the second to-be-processed data matrix out of the first cache module to a convolution module in a preset manner, to await the convolution operation.
  • the step of adding multiple columns of invalid data at the head of the first to-be-processed data matrix stored in the first cache module includes: letting the value of the data transmission parallelism be pv, adding (pv-2) columns of invalid data at the head of the first to-be-processed data matrix so that, together with the first two columns of valid data of the first to-be-processed data matrix, they form pv columns of data.
  • the step of causing the data transmission module to take the second to-be-processed data matrix out of the first cache module to the convolution module in a preset manner, to await the convolution operation, includes: causing the data transmission module to take the second to-be-processed data matrix out of the first cache module row by row, in batches of data size pv*1, and place it into a second cache module; and causing the data transmission module to take the second to-be-processed data matrix out of the second cache module row by row, in batches of data size pv*k, and place it into a matrix module for data combination, where k is the size of the convolution kernel matrix.
  • the step of causing the data transmission module to take the second to-be-processed data matrix out of the second cache module in batches of data size pv*k and place it into the matrix module specifically includes: treating every k rows of the second to-be-processed data matrix as a group of data; and causing the data transmission module to perform the following operation on each group of data in turn: in each clock cycle, take a third to-be-processed data matrix of data size pv*k out of the group in sequence and place it into the matrix module, until all the data of the group has been taken out.
  • in each group of data, the first third to-be-processed data matrix taken out by the data transmission module includes (pv-2) columns of invalid data and two columns of valid data, so that the calculation result value of the first third to-be-processed data matrix is an invalid value.
  • in each group of data, starting from the second third to-be-processed data matrix taken out by the data transmission module, each third to-be-processed data matrix is combined with the last two columns of the previous third to-be-processed data matrix to form a k*(pv+2) fourth to-be-processed data matrix; matrix extraction with a stride of 1 can be performed on each k*(pv+2) fourth to-be-processed data matrix to obtain pv k*k fifth to-be-processed data matrices for transmission to the convolution module for convolution calculation with the convolution kernel matrix.
  • an artificial intelligence processing apparatus includes: a first cache module storing a first to-be-processed data matrix; a processing module for adding multiple columns of invalid data at the head of the first to-be-processed data matrix to form a second to-be-processed data matrix, wherein the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism; and a data transmission module, communicatively connected to and controlled by the processing module, for taking the second to-be-processed data matrix out of the first cache module to the convolution module in a preset manner, to await the convolution operation.
  • adding multiple columns of invalid data at the head of the first to-be-processed data matrix includes: if the data transmission parallelism is pv, the processing module adds (pv-2) columns of invalid data at the head of the first to-be-processed data matrix so that, together with the first two columns of valid data of the first to-be-processed data matrix, they form pv columns of data.
  • the artificial intelligence processing apparatus includes: a second cache module, configured to store the second to-be-processed data matrix that the data transmission module takes out of the first cache module in batches of data size pv*1; and a matrix module, configured to store the second to-be-processed data matrix that the data transmission module takes out of the second cache module in batches of data size pv*k, where k is the size of the convolution kernel matrix.
  • in the artificial intelligence processing apparatus, every k rows of the second to-be-processed data matrix act as a group of data, and the data transmission module performs the following operation on each group of data: in each clock cycle, pv*k third to-be-processed data matrices are taken out of the group in sequence until all the data of the group has been taken out; the matrix module is further used to, starting from the second third to-be-processed data matrix taken out of each group of data by the data transmission module, combine each third to-be-processed data matrix with the last two columns of the previous third to-be-processed data matrix to form a k*(pv+2) fourth to-be-processed data matrix, so that each fourth to-be-processed data matrix yields pv calculation result values.
  • the present invention provides a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the artificial intelligence convolution processing method.
  • an artificial intelligence processing terminal includes: a processor and a memory; the memory is for storing a computer program, and the processor is configured to execute the computer program stored in the memory, so that the terminal executes the artificial intelligence convolution processing method.
  • the artificial intelligence convolution processing method, apparatus, readable storage medium, and terminal of the present invention have the following beneficial effects: the present invention adds multiple columns of invalid data to the to-be-processed data matrix so that the number of columns of the matrix after the invalid data is added is a multiple of the data transmission parallelism, whereby the number of output convolution calculation results is uniformly pv; the pipelined processing of artificial intelligence convolution can therefore be realized, which greatly improves the operational efficiency of artificial intelligence convolution calculation and substantially improves convolution calculation performance.
  • FIG. 1 is a flow chart showing a method for processing artificial intelligence convolution in an embodiment of the present invention.
  • FIG. 2 is a schematic diagram showing a data matrix to be processed in an embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing data to be processed by a data transmission module according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing data to be processed by a data transmission module according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram showing an artificial intelligence processing apparatus according to an embodiment of the present invention.
  • the artificial intelligence convolution processing method is applied to a processing module, which may be, for example, an ARM module, an MCU module, or an SoC module.
  • the artificial intelligence convolution processing method specifically includes:
  • S101: Add multiple columns of invalid data at the head of the first to-be-processed data matrix stored in the first cache module to form a second to-be-processed data matrix, where the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism.
  • the first cache module may be a RAM or a ROM memory, such as DDR3 or DDR4 SDRAM.
  • the cache module stores the data to be processed in matrix form; in this embodiment, this is the first to-be-processed data matrix.
  • FIG. 2 shows a schematic diagram of a to-be-processed data matrix in an embodiment of the present invention; the first to-be-processed data matrix is set to a 34*34 matrix, and the data transmission parallelism is set to pv=8.
  • the data transmission parallelism pv denotes the number of columns of to-be-processed data the data transmission module transfers each time, and its size is related to the efficiency of the artificial intelligence convolution processing method; the data transmission module may be, for example, a DMA controller, i.e., a DMA interface circuit, for data transfer between the external memory and the Programmable Logic side.
  • the processing module adds 6 columns of invalid data at the head of the first to-be-processed data matrix to form a 34*40 second to-be-processed data matrix; the number of columns of the second to-be-processed data matrix is 40, which is divisible by the data transmission parallelism.
  • in FIG. 2, blank boxes represent valid data, and the added invalid data is represented by boxes filled with slashes.
  • note that the valid data may include zero-padding data; the zero-padded data and the non-zero-padded data are collectively referred to as valid data.
  • S102: the data transmission module is caused to take the second to-be-processed data matrix out of the first cache module in a preset manner, to await the convolution operation.
  • the data transmission module takes the second to-be-processed data matrix out of the first cache module row by row, in batches of data size pv*1, and places it into the second cache module.
  • how the data transmission module takes out the second to-be-processed data matrix is described below with reference to a specific illustration.
  • FIG. 3 shows a schematic diagram of the data transmission module taking out to-be-processed data in an embodiment of the present invention.
  • the data transmission module starts from the leftmost side of the to-be-processed data in the first row and takes out pv*1 data each time, until all the to-be-processed data of the first row has been taken out. On the same principle, the data transmission module continues with the second row, the third row, and so on, until the entire second to-be-processed data matrix has been taken out.
  • the first pv*1 datum includes 6 invalid data and 2 valid data, and every pv*1 datum from the second one onward includes 8 valid data.
  • after the data transmission module has deposited the second to-be-processed data matrix into the second cache module, it takes the matrix out of the second cache module in batches of data size pv*k and places it into a matrix module for data combination; k is the size of the convolution kernel matrix, the convolution kernel matrix being the weight matrix used for the convolution calculation; the convolution kernel matrix may be set as an odd-order matrix, and in this embodiment it is set as a 3*3 matrix.
  • in each clock cycle, the data transmission module takes a 3*8 third to-be-processed data matrix out of the first three rows of the 34*40 second to-be-processed data matrix, in order from left to right; that is, a total of five 3*8 third to-be-processed data matrices can be taken out of the first three rows.
  • the data transmission module continues to fetch the to-be-processed data of the subsequent rows after the first three rows have been taken.
  • the third to-be-processed data matrices of the first three rows are represented by the rectangular dashed boxes R1 to R5 in FIG. 2.
  • in the first clock cycle T1, the first third to-be-processed data matrix M1 taken out by the data transmission module includes 6 columns of invalid data and 2 columns of valid data, and the convolution result of the third to-be-processed data matrix M1 is an invalid value.
  • in the second clock cycle T2, the data transmission module takes out the second third to-be-processed data matrix M2, and the third to-be-processed data matrix M2 is combined with the last two columns of the third to-be-processed data matrix M1 into the 3*10 fourth to-be-processed data matrix M12; the straight line L1 represents the to-be-processed data combined with each other.
  • by combining with the last two columns of the data matrix M1, the data matrix M2 yields the data matrix M12 with 10 columns.
  • matrix extraction with a stride of 1 can be performed on the 3*10 fourth to-be-processed data matrix M12 to obtain 8 3*3 fifth to-be-processed data matrices; the 8 3*3 fifth to-be-processed data matrices are transmitted to the convolution module for convolution calculation with the 3*3 convolution kernel matrix, yielding 8 calculation result values.
  • the eight 3*3 fifth to-be-processed data matrices refer specifically to the following: starting from the matrix covered by the rectangular dashed box R6 shown in FIG. 4, the box moves to the right column by column with a stride of 1, and each move of one column yields a matrix of size 3*3. It can be seen that the rectangular dashed box R6 can be moved a total of 7 times within the 3*10 fourth to-be-processed data matrix M12, for a total of 8 3*3 matrices, i.e., pv k*k matrices.
  • likewise, in the third clock cycle T3, the data transmission module takes out the third third to-be-processed data matrix M3, and the third to-be-processed data matrix M3 is combined with the last two columns of the third to-be-processed data matrix M2 into the 3*10 fourth to-be-processed data matrix M23; the straight line L2 represents the to-be-processed data combined with each other.
  • by combining with the last two columns of the data matrix M2, the data matrix M3 yields the data matrix M23 with 10 columns.
  • matrix extraction with a stride of 1 can be performed on the 3*10 fourth to-be-processed data matrix M23 to obtain 8 3*3 fifth to-be-processed data matrices; the 8 3*3 fifth to-be-processed data matrices are transmitted to the convolution module for convolution calculation with the 3*3 convolution kernel matrix, yielding 8 calculation result values.
  • on the same principle, the data transmission module can finish processing the entire second to-be-processed data matrix after a number of clock cycles.
  • it is worth noting that, without the added invalid data, the first 3*8 third to-be-processed data matrix read in could yield only 6 3*3 matrices, so the convolution would output 6 calculation result values.
  • from the second 3*8 third to-be-processed data matrix onward, however, each can be combined with the last two columns of the previous third to-be-processed data matrix into a 3*10 matrix, from which 8 3*3 matrices are extracted in turn, so the convolution outputs 8 calculation result values.
  • the obtained convolution results would therefore follow a repeating pattern of 6 convolution calculation result values, then 8, then 8, and so on; the non-uniform number of result values would make pipelined processing impossible.
  • in the present invention, the result of the convolution calculation of the first three rows of the 34*40 second to-be-processed data matrix with the 3*3 convolution kernel matrix is: an invalid value, 8 convolution calculation result values, 8 convolution calculation result values, 8 convolution calculation result values, and 8 convolution calculation result values.
  • the convolution result of the convolution calculation of the entire 34*40 second to-be-processed data matrix with the 3*3 convolution kernel matrix thus keeps cycling through an invalid value, 8 convolution calculation result values, 8 convolution calculation result values, and so on.
  • the artificial intelligence convolution processing method provided by the present invention makes the number of output convolution calculation result values uniformly pv, thereby realizing the pipelined processing of artificial intelligence convolution, greatly improving the operating efficiency of the artificial intelligence convolution calculation, and substantially improving convolution calculation performance.
  • an artificial intelligence processing apparatus includes: a first cache module 51, a second cache module 52, a data transmission module 53, a processing module 54, and a matrix module 55.
  • the first cache module 51, the second cache module 52, the data transmission module 53, the matrix module 55, and the convolution module 56 are jointly disposed on the Programmable Logic side 50 of the FPGA, commonly referred to as the PL side.
  • the first cache module 51 stores a first to-be-processed data matrix, the first to-be-processed data being fetched from the external storage module 57 by the data transmission module 53 via a system bus.
  • the external storage module 57 is, for example, a DDR memory.
  • the processing module 54 is configured to add multiple columns of invalid data at the head of the first to-be-processed data matrix to form a second to-be-processed data matrix, where the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism; the data transmission module 53 is communicatively connected to and controlled by the processing module 54, and is configured to take the second to-be-processed data matrix out of the first cache module 51, to await the convolution operation.
  • the first cache module 51 may be, for example, a BRAM memory, i.e., Block RAM, which is a RAM storage resource of an FPGA (Field-Programmable Gate Array).
  • the processing module 54 may be, for example, an ARM module, an MCU module, or an SoC module.
  • the implementation of the artificial intelligence processing apparatus is similar to that of the artificial intelligence convolution processing method and is therefore not repeated here; those skilled in the art should be able to understand the principle and implementation of the artificial intelligence processing apparatus on the basis of the artificial intelligence convolution processing method.
  • the aforementioned computer program may be stored in a computer-readable storage medium.
  • when executed, the program performs steps including those of the above-described method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • the present invention also provides an artificial intelligence processing terminal, comprising a processor and a memory; the memory is for storing a computer program, and the processor is configured to execute the computer program stored in the memory, so that the terminal performs the artificial intelligence convolution processing method.
  • the above memory may include a random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory.
  • the above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; or a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the artificial intelligence processing apparatus, method, readable storage medium, and terminal provided by the present invention add multiple columns of invalid data to the to-be-processed data matrix so that the number of columns of the matrix after the invalid data is added is a multiple of the data transmission parallelism, whereby the number of output convolution calculation result values is uniformly pv; the pipelined processing of artificial intelligence convolution can therefore be realized, which greatly improves the operating efficiency of the artificial intelligence convolution calculation and substantially improves convolution calculation performance, as the sketch below illustrates. The present invention therefore effectively overcomes various shortcomings in the prior art and has high industrial utilization value.
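  • to make this concrete, below is a minimal NumPy sketch (ours, not part of the patent; the 34*34 input, pv=8, and k=3 are the embodiment's example values, and using NaN as the invalid-data marker is our assumption) of the head padding and the per-cycle window count it produces:

```python
import numpy as np

pv, k = 8, 3                       # data transmission parallelism, kernel size
data = np.random.rand(34, 34)      # first to-be-processed data matrix

# S101: add (pv - 2) invalid columns at the head -> a 34*40 second matrix
# whose column count is an integer multiple of pv.
invalid = np.full((data.shape[0], pv - 2), np.nan)
padded = np.hstack([invalid, data])
assert padded.shape[1] % pv == 0

# For one group of k rows, fetch k*pv "third" matrices left to right; from
# the second fetch on, prepend the previous fetch's last 2 columns to form
# a k*(pv+2) "fourth" matrix, which always yields pv k*k windows.
rows, prev_tail = padded[0:k, :], None
for start in range(0, rows.shape[1], pv):
    third = rows[:, start:start + pv]
    if prev_tail is None:
        print("invalid")                        # first result is discarded
    else:
        fourth = np.hstack([prev_tail, third])  # k*(pv+2) combination
        print(fourth.shape[1] - k + 1)          # stride-1 windows: always pv
    prev_tail = third[:, -2:]
# prints: invalid, 8, 8, 8, 8 -- a uniform pv results per cycle
```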

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An artificial intelligence convolution processing method, applied to a processing module, the method comprising: adding multiple columns of invalid data at the head of a first to-be-processed data matrix stored in a first cache module to form a second to-be-processed data matrix, wherein the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism (S101); and causing a data transmission module to take the second to-be-processed data matrix out of the first cache module to a convolution module in a preset manner, to await the convolution operation (S102). The method adds multiple columns of invalid data to the to-be-processed data matrix so that the number of columns of the matrix after the invalid data is added is a multiple of the data transmission parallelism, whereby the number of output convolution calculation result values is uniformly pv; pipelined processing of artificial intelligence convolution can therefore be realized, greatly improving the operating efficiency of artificial intelligence convolution calculation and substantially improving convolution calculation performance.

Description

Artificial Intelligence Convolution Processing Method, Apparatus, Readable Storage Medium, and Terminal
TECHNICAL FIELD
The present invention relates to the field of artificial intelligence, and in particular to an artificial intelligence convolution processing method, apparatus, readable storage medium, and terminal.
BACKGROUND
Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence.
Artificial intelligence algorithms are neural network model algorithms that simulate the human brain, and their computational workload is enormous: AlphaGo, which likewise uses artificial intelligence algorithms, requires thousands of conventional processors (CPUs) and hundreds of graphics processors (GPUs). Clearly, as artificial intelligence enjoys a new wave of revival, conventional processors are becoming a bottleneck that hinders its spread.
At present, however, the pipelining of artificial intelligence algorithms is insufficient, and achieving a high degree of pipelining has become a key problem in the field of artificial intelligence technology.
SUMMARY
In view of the above shortcomings of the prior art, an object of the present invention is to provide an artificial intelligence convolution processing method and an artificial intelligence processing apparatus, for solving technical problems in the prior art such as an insufficient degree of pipelining of artificial intelligence algorithms.
To achieve the above and other related objects, the present invention provides an artificial intelligence convolution processing method, applied to a processing module, the method comprising: adding multiple columns of invalid data at the head of a first to-be-processed data matrix stored in a first cache module to form a second to-be-processed data matrix, wherein the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism; and causing a data transmission module to take the second to-be-processed data matrix out of the first cache module to a convolution module in a preset manner, to await the convolution operation.
In an embodiment of the present invention, adding multiple columns of invalid data at the head of the first to-be-processed data matrix stored in the first cache module specifically comprises: letting the value of the data transmission parallelism be pv, adding (pv-2) columns of invalid data at the head of the first to-be-processed data matrix so that, together with the first 2 columns of valid data of the first to-be-processed data matrix, they form pv columns of data.
In an embodiment of the present invention, causing the data transmission module to take the second to-be-processed data matrix out of the first cache module to the convolution module in a preset manner, to await the convolution operation, specifically comprises: causing the data transmission module to take the second to-be-processed data matrix out of the first cache module row by row, in batches of data size pv*1, and place it into a second cache module; and causing the data transmission module to take the second to-be-processed data matrix out of the second cache module row by row, in batches of data size pv*k, and place it into a matrix module for data combination, wherein k is the size of the convolution kernel matrix.
In an embodiment of the present invention, causing the data transmission module to take the second to-be-processed data matrix out of the second cache module row by row, in batches of data size pv*k, and place it into the matrix module specifically comprises: treating every k rows of the second to-be-processed data matrix as a group of data; and causing the data transmission module to perform the following operation on each group of data in turn: in each clock cycle, take a third to-be-processed data matrix of data size pv*k out of the group in sequence and place it into the matrix module, until all the data of the group has been taken out.
In an embodiment of the present invention, in each group of data, the first third to-be-processed data matrix taken out by the data transmission module comprises (pv-2) columns of invalid data and 2 columns of valid data, so that the calculation result value of the first third to-be-processed data matrix is an invalid value.
In an embodiment of the present invention, in each group of data, starting from the second third to-be-processed data matrix taken out by the data transmission module, each third to-be-processed data matrix is combined with the last 2 columns of the previous third to-be-processed data matrix to form a k*(pv+2) fourth to-be-processed data matrix; matrix extraction with a stride of 1 can be performed on each k*(pv+2) fourth to-be-processed data matrix to obtain pv k*k fifth to-be-processed data matrices, which are transmitted to the convolution module for convolution calculation with the convolution kernel matrix.
To achieve the above and other related objects, the present invention provides an artificial intelligence processing apparatus, comprising: a first cache module storing a first to-be-processed data matrix; a processing module for adding multiple columns of invalid data at the head of the first to-be-processed data matrix to form a second to-be-processed data matrix, wherein the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism; and a data transmission module, communicatively connected to and controlled by the processing module, for taking the second to-be-processed data matrix out of the first cache module to a convolution module in a preset manner, to await the convolution operation.
In an embodiment of the present invention, adding multiple columns of invalid data at the head of the first to-be-processed data matrix specifically comprises: if the data transmission parallelism is pv, the processing module adds (pv-2) columns of invalid data at the head of the first to-be-processed data matrix so that, together with the first 2 columns of valid data of the first to-be-processed data matrix, they form pv columns of data.
In an embodiment of the present invention, the artificial intelligence processing apparatus comprises: a second cache module for storing the second to-be-processed data matrix that the data transmission module takes out of the first cache module row by row in batches of data size pv*1; and a matrix module for storing the second to-be-processed data matrix that the data transmission module takes out of the second cache module row by row in batches of data size pv*k, wherein k is the size of the convolution kernel matrix.
In an embodiment of the present invention, in the artificial intelligence processing apparatus, every k rows of the second to-be-processed data matrix act as a group of data, and the data transmission module performs the following operation on each group of data: in each clock cycle, pv*k third to-be-processed data matrices are taken out of the group in sequence until all the data of the group has been taken out; the matrix module is further configured to, starting from the second third to-be-processed data matrix taken out of each group of data by the data transmission module, combine each third to-be-processed data matrix with the last 2 columns of the previous third to-be-processed data matrix to form a k*(pv+2) fourth to-be-processed data matrix, so that each fourth to-be-processed data matrix yields pv calculation result values.
To achieve the above and other related objects, the present invention provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the artificial intelligence convolution processing method.
To achieve the above and other related objects, the present invention provides an artificial intelligence processing terminal, comprising a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so as to cause the terminal to execute the artificial intelligence convolution processing method.
As described above, the artificial intelligence convolution processing method, apparatus, readable storage medium, and terminal of the present invention have the following beneficial effects: the present invention adds multiple columns of invalid data to the to-be-processed data matrix so that the number of columns of the matrix after the invalid data is added is a multiple of the data transmission parallelism, whereby the number of output convolution calculation result values is uniformly pv; the pipelined processing of artificial intelligence convolution can therefore be realized, greatly improving the operating efficiency of artificial intelligence convolution calculation and substantially improving convolution calculation performance.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart of an artificial intelligence convolution processing method in an embodiment of the present invention.
FIG. 2 is a schematic diagram of a to-be-processed data matrix in an embodiment of the present invention.
FIG. 3 is a schematic diagram of a data transmission module taking out to-be-processed data in an embodiment of the present invention.
FIG. 4 is a schematic diagram of a data transmission module taking out to-be-processed data in an embodiment of the present invention.
FIG. 5 is a schematic diagram of an artificial intelligence processing apparatus in an embodiment of the present invention.
Description of reference numerals
R1~R6         rectangular dashed boxes
D1~D3         pv*1 data
M1            third to-be-processed data matrix
M2            third to-be-processed data matrix
M3            third to-be-processed data matrix
M12           fourth to-be-processed data matrix
M23           fourth to-be-processed data matrix
L1            straight line
L2            straight line
T1            clock cycle
T2            clock cycle
T3            clock cycle
50            Programmable Logic side
51            first cache module
52            second cache module
53            data transmission module
54            processing module
55            matrix module
56            convolution module
57            external storage module
S101~S102     steps
DETAILED DESCRIPTION OF THE EMBODIMENTS
The embodiments of the present invention are described below by way of specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention may also be implemented or applied through other, different specific embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, where there is no conflict, the following embodiments and the features in the embodiments may be combined with one another.
It should be noted that the drawings provided in the following embodiments merely illustrate the basic concept of the present invention in a schematic way; the drawings therefore show only the components related to the present invention and are not drawn according to the number, shape, and size of the components in actual implementation. In actual implementation, the type, quantity, and proportion of each component may be changed arbitrarily, and the component layout may also be more complex.
As shown in FIG. 1, a flowchart of an artificial intelligence convolution processing method in an embodiment of the present invention is presented. The artificial intelligence convolution processing method is applied to a processing module, which may be, for example, an ARM module, an MCU module, or an SoC module. The artificial intelligence convolution processing method specifically includes:
S101: Add multiple columns of invalid data at the head of a first to-be-processed data matrix stored in a first cache module to form a second to-be-processed data matrix, where the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism.
The first cache module may be a RAM or a ROM memory, such as DDR3 or DDR4 SDRAM. The cache module stores to-be-processed data in matrix form; in this embodiment this is called the first to-be-processed data matrix.
As shown in FIG. 2, a schematic diagram of a to-be-processed data matrix in an embodiment of the present invention is presented. The first to-be-processed data matrix is set to a 34*34 matrix, and the data transmission parallelism is set to pv=8. The data transmission parallelism pv denotes the number of columns of to-be-processed data the data transmission module transfers each time, and its size is related to the efficiency of the artificial intelligence convolution processing method; the data transmission module may be, for example, a DMA controller, i.e., a DMA interface circuit, used for data transfer between the external memory and the Programmable Logic side.
After the processing module adds 6 columns of invalid data at the head of the first to-be-processed data matrix, a 34*40 second to-be-processed data matrix is formed; the number of columns of the second to-be-processed data matrix is 40, which is divisible by the data transmission parallelism. For ease of distinction, blank boxes in FIG. 2 represent valid data, and boxes filled with slashes represent the added invalid data. It should be noted, however, that the valid data may include zero-padding data; in the present invention, zero-padded data and non-zero-padded data are collectively referred to as valid data.
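By way of illustration only (the patent does not prescribe a software implementation; NumPy and the NaN invalid-data marker are our assumptions), the head padding of step S101 can be sketched as follows:

```python
import numpy as np

def pad_head(first_matrix: np.ndarray, pv: int) -> np.ndarray:
    """Add (pv - 2) columns of invalid data (modeled as NaN) at the head,
    so that together with the first 2 valid columns they form pv columns."""
    invalid = np.full((first_matrix.shape[0], pv - 2), np.nan)
    return np.hstack([invalid, first_matrix])

# Embodiment values: a 34*34 first matrix and pv = 8, so 34 + 6 = 40 columns.
# The padded width is a multiple of pv here because 34 = 4*8 + 2.
second = pad_head(np.random.rand(34, 34), pv=8)
print(second.shape)              # (34, 40)
print(second.shape[1] % 8 == 0)  # True: divisible by the parallelism
```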
S102: Cause the data transmission module to take the second to-be-processed data matrix out of the first cache module in a preset manner, to await the convolution operation.
Specifically, the data transmission module takes the second to-be-processed data matrix out of the first cache module row by row, in batches of data size pv*1, and places it into the second cache module. How the transmission module takes out the second to-be-processed data matrix is described below with reference to a specific illustration.
As shown in FIG. 3, a schematic diagram of the data transmission module taking out to-be-processed data in an embodiment of the present invention is presented. The data transmission module starts from the leftmost side of the to-be-processed data in the first row and takes out pv*1 data each time, until all the to-be-processed data of the first row has been taken out. On the same principle, the data transmission module continues with the second row, the third row, and so on, until the entire second to-be-processed data matrix has been taken out.
Specifically, taking the first row as an example, the first pv*1 datum comprises 6 invalid data and 2 valid data, and every pv*1 datum from the second one onward comprises 8 valid data. The data transmission module takes out the first pv*1 datum D1 and places it at address Addr=0 in the second cache module, takes out the second pv*1 datum D2 and places it at Addr=1, and takes out the third pv*1 datum D3 and places it at Addr=2; in the same way, the entire second to-be-processed data matrix is taken out of the first cache module and placed into the second cache module.
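A minimal sketch of this row-wise pv*1 transfer (again ours; the second cache module is modeled as a flat Python list whose index plays the role of Addr, and `second` is the padded matrix from the previous sketch):

```python
pv = 8
second_cache = []                        # index 0, 1, 2, ... stands in for Addr

for row in second:                       # row by row, as in FIG. 3
    for start in range(0, len(row), pv):
        second_cache.append(row[start:start + pv])   # one pv*1 transfer

# D1 lands at Addr=0, D2 at Addr=1, D3 at Addr=2, and so on.
print(len(second_cache))                 # 34 rows * (40/8) transfers = 170
```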
After the data transmission module has deposited the second to-be-processed data matrix into the second cache module, it takes the second to-be-processed data matrix out of the second cache module row by row, in batches of data size pv*k, and places it into the matrix module for data combination. Here k is the size of the convolution kernel matrix, the convolution kernel matrix being the weight matrix used for the convolution calculation; the convolution kernel matrix may be set as an odd-order matrix, and in this embodiment it is set as a 3*3 matrix.
As shown in FIG. 2, in each clock cycle the data transmission module takes a 3*8 third to-be-processed data matrix out of the first three rows of the 34*40 second to-be-processed data matrix, in order from left to right; that is, a total of five 3*8 third to-be-processed data matrices can be taken out of the first three rows. On the same principle, the data transmission module continues to take out the to-be-processed data of the subsequent rows after the first three rows have been taken. For the convenience of those skilled in the art, the third to-be-processed data matrices of the first three rows are indicated by the rectangular dashed boxes R1 to R5 in FIG. 2.
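Continuing the sketch (ours, with the same caveats), the per-clock-cycle pv*k fetches for the first group of k rows look as follows; for simplicity the data is read straight from the padded matrix rather than from the modeled second cache:

```python
k, pv = 3, 8
group = second[0:k, :]                  # first group: the first k rows

# One k*pv "third" to-be-processed matrix per clock cycle, left to right.
thirds = [group[:, s:s + pv] for s in range(0, group.shape[1], pv)]
print(len(thirds), thirds[0].shape)     # 5 fetches (R1..R5), each 3*8
```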
As shown in FIG. 4, a schematic diagram of the data transmission module taking out to-be-processed data in an embodiment of the present invention is presented. In the first clock cycle T1, the first third to-be-processed data matrix M1 taken out by the data transmission module comprises 6 columns of invalid data and 2 columns of valid data, and the convolution result of the third to-be-processed data matrix M1 is an invalid value.
In the second clock cycle T2, the data transmission module takes out the second third to-be-processed data matrix M2, and the third to-be-processed data matrix M2 is combined with the last two columns of the third to-be-processed data matrix M1 into the 3*10 fourth to-be-processed data matrix M12; in the figure, the straight line L1 indicates the to-be-processed data combined with each other. By combining with the last two columns of the data matrix M1, the data matrix M2 yields the data matrix M12 with 10 columns. Matrix extraction with a stride of 1 can be performed on the 3*10 fourth to-be-processed data matrix M12 to obtain 8 3*3 fifth to-be-processed data matrices; the 8 3*3 fifth to-be-processed data matrices are transmitted to the convolution module for convolution calculation with the 3*3 convolution kernel matrix, yielding 8 calculation result values.
The 8 3*3 fifth to-be-processed data matrices refer specifically to the following: starting from the matrix covered by the rectangular dashed box R6 shown in FIG. 4, the box moves to the right column by column with a stride of 1, and each move of one column produces a matrix of size 3*3. It can be seen that the rectangular dashed box R6 can move a total of 7 times within the 3*10 fourth to-be-processed data matrix M12, for a total of 8 3*3 matrices, i.e., pv k*k matrices.
Likewise, in the third clock cycle T3, the data transmission module takes out the third third to-be-processed data matrix M3, and the third to-be-processed data matrix M3 is combined with the last two columns of the third to-be-processed data matrix M2 into the 3*10 fourth to-be-processed data matrix M23; in the figure, the straight line L2 indicates the to-be-processed data combined with each other. By combining with the last two columns of the data matrix M2, the data matrix M3 yields the data matrix M23 with 10 columns. Matrix extraction with a stride of 1 can be performed on the 3*10 fourth to-be-processed data matrix M23 to obtain 8 3*3 fifth to-be-processed data matrices, which are transmitted to the convolution module for convolution calculation with the 3*3 convolution kernel matrix, yielding 8 calculation result values. In the same way, the data transmission module can finish processing the entire second to-be-processed data matrix after a number of clock cycles.
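The combination of cycles T2 and T3 and the subsequent stride-1 window extraction can be sketched as follows (ours; `thirds` comes from the previous sketch, and the helper name `fifth_matrices` is our own):

```python
import numpy as np

def fifth_matrices(prev_third, cur_third, k):
    """Combine the current k*pv "third" matrix with the last 2 columns of the
    previous one into a k*(pv+2) "fourth" matrix, then extract its k*k
    stride-1 windows -- always pv of them."""
    fourth = np.hstack([prev_third[:, -2:], cur_third])
    return [fourth[:, j:j + k] for j in range(fourth.shape[1] - k + 1)]

m12_windows = fifth_matrices(thirds[0], thirds[1], k=3)   # M1 + M2 -> M12
m23_windows = fifth_matrices(thirds[1], thirds[2], k=3)   # M2 + M3 -> M23
print(len(m12_windows), m12_windows[0].shape)             # 8 windows of (3, 3)
print(len(m23_windows))                                   # 8 again: always pv
```

Each extracted window would then be convolved with the 3*3 kernel matrix to produce one of the 8 calculation result values per cycle.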
It is worth noting that if no invalid data were added to the to-be-processed data matrix, the first 3*8 third to-be-processed data matrix read in could yield only 6 3*3 matrices, and the convolution would output 6 calculation result values. From the second 3*8 third to-be-processed data matrix onward, however, each can be combined with the last two columns of the previous third to-be-processed data matrix into a 3*10 matrix, from which 8 3*3 matrices are extracted in turn, so the convolution outputs 8 calculation result values. Consequently, without the added invalid data, convolving the entire second to-be-processed data matrix with the 3*3 convolution kernel matrix would produce results in a repeating pattern of 6 convolution calculation result values, then 8, then 8, and so on; the non-uniform number of result values makes pipelined processing of the artificial intelligence convolution impossible and thus greatly reduces the efficiency of the convolution calculation.
In the artificial intelligence convolution processing method provided by the present invention, the result of convolving the first three rows of the 34*40 second to-be-processed data matrix with the 3*3 convolution kernel matrix is: an invalid value, 8 convolution calculation result values, 8 convolution calculation result values, 8 convolution calculation result values, and 8 convolution calculation result values. In the same way, the convolution result of convolving the entire 34*40 second to-be-processed data matrix with the 3*3 convolution kernel matrix keeps cycling through an invalid value, 8 convolution calculation result values, 8 convolution calculation result values, and so on. It follows that the artificial intelligence convolution processing method provided by the present invention makes the number of output convolution calculation result values uniformly pv, so that pipelined processing of the artificial intelligence convolution can be realized, greatly improving the operating efficiency of the artificial intelligence convolution calculation and substantially improving convolution calculation performance.
As shown in FIG. 5, an artificial intelligence processing apparatus in an embodiment of the present invention is presented, comprising: a first cache module 51, a second cache module 52, a data transmission module 53, a processing module 54, and a matrix module 55. The first cache module 51, the second cache module 52, the data transmission module 53, the matrix module 55, and the convolution module 56 are jointly disposed on the Programmable Logic side 50 of the FPGA, commonly referred to as the PL side.
The first cache module 51 stores a first to-be-processed data matrix, the first to-be-processed data being fetched from an external storage module 57 by the data transmission module 53 via a system bus. The external storage module 57 is, for example, a DDR memory.
The processing module 54 is configured to add multiple columns of invalid data at the head of the first to-be-processed data matrix to form a second to-be-processed data matrix, where the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism; the data transmission module 53 is communicatively connected to and controlled by the processing module 54, and is configured to take the second to-be-processed data matrix out of the first cache module 51, to await the convolution operation.
The first cache module 51 may be, for example, a BRAM memory, i.e., Block RAM, which is a RAM storage resource of an FPGA (Field-Programmable Gate Array). The processing module 54 may be, for example, an ARM module, an MCU module, or an SoC module.
The implementation of the artificial intelligence processing apparatus is similar to that of the artificial intelligence convolution processing method and is therefore not repeated here; those skilled in the art should be able to understand the principle and implementation of the artificial intelligence processing apparatus on the basis of the artificial intelligence convolution processing method.
Those of ordinary skill in the art will understand that all or some of the steps for implementing the above method embodiments may be accomplished by hardware related to a computer program. The aforementioned computer program may be stored in a computer-readable storage medium. When executed, the program performs steps including those of the above method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The present invention further provides an artificial intelligence processing terminal, comprising a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so as to cause the terminal to execute the artificial intelligence convolution processing method.
The above memory may include a random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; or a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In summary, the artificial intelligence processing apparatus, method, readable storage medium, and terminal provided by the present invention add multiple columns of invalid data to the to-be-processed data matrix so that the number of columns of the matrix after the invalid data is added is a multiple of the data transmission parallelism, whereby the number of output convolution calculation result values is uniformly pv; the pipelined processing of artificial intelligence convolution can therefore be realized, greatly improving the operating efficiency of the artificial intelligence convolution calculation and substantially improving convolution calculation performance. The present invention therefore effectively overcomes various shortcomings in the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles and effects of the present invention by way of example and are not intended to limit the present invention. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (12)

  1. An artificial intelligence convolution processing method, characterized in that it is applied to a processing module, the method comprising:
    adding multiple columns of invalid data at the head of a first to-be-processed data matrix stored in a first cache module to form a second to-be-processed data matrix, wherein the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism; and
    causing a data transmission module to take the second to-be-processed data matrix out of the first cache module in a preset manner, to await a convolution operation.
  2. The artificial intelligence convolution processing method according to claim 1, wherein adding multiple columns of invalid data at the head of the first to-be-processed data matrix stored in the first cache module specifically comprises:
    letting the value of the data transmission parallelism be the parameter pv, adding (pv-2) columns of invalid data at the head of the first to-be-processed data matrix so that, together with the first 2 columns of valid data of the first to-be-processed data matrix, they form pv columns of data.
  3. The artificial intelligence convolution processing method according to claim 2, wherein causing the data transmission module to take the second to-be-processed data matrix out of the first cache module in a preset manner, to await the convolution operation, specifically comprises:
    causing the data transmission module to take the second to-be-processed data matrix out of the first cache module row by row, in batches of data size pv*1, and place it into a second cache module; and
    causing the data transmission module to take the second to-be-processed data matrix out of the second cache module row by row, in batches of data size pv*k, and place it into a matrix module for data combination, wherein k is the size of the convolution kernel matrix.
  4. The artificial intelligence convolution processing method according to claim 3, wherein causing the data transmission module to take the second to-be-processed data matrix out of the second cache module row by row, in batches of data size pv*k, and place it into the matrix module for data combination specifically comprises:
    treating every k rows of the second to-be-processed data matrix as a group of data; and
    causing the data transmission module to perform the following operation on each group of data in turn: in each clock cycle, take a third to-be-processed data matrix of data size pv*k out of the group in sequence and place it into the matrix module, until all the data of the group has been taken out.
  5. The artificial intelligence convolution processing method according to claim 4, wherein, in each group of data, the first third to-be-processed data matrix of data size pv*k taken out by the data transmission module comprises (pv-2) columns of invalid data and 2 columns of valid data, so that its calculation result value is an invalid value.
  6. The artificial intelligence convolution processing method according to claim 4, wherein, in each group of data, starting from the second third to-be-processed data matrix taken out by the data transmission module, each third to-be-processed data matrix is combined with the last 2 columns of the previous third to-be-processed data matrix to form a k*(pv+2) fourth to-be-processed data matrix; wherein matrix extraction with a stride of 1 can be performed on each k*(pv+2) fourth to-be-processed data matrix to obtain pv k*k fifth to-be-processed data matrices, and the k*k fifth to-be-processed data matrices are for transmission to the convolution module for convolution calculation with the convolution kernel matrix.
  7. An artificial intelligence processing apparatus, characterized by comprising:
    a first cache module, storing a first to-be-processed data matrix;
    a processing module, for adding multiple columns of invalid data at the head of the first to-be-processed data matrix to form a second to-be-processed data matrix, wherein the number of columns of the second to-be-processed data matrix is an integer multiple of the data transmission parallelism; and
    a data transmission module, communicatively connected to and controlled by the processing module, for taking the second to-be-processed data matrix out of the first cache module in a preset manner, to await a convolution operation.
  8. The artificial intelligence processing apparatus according to claim 7, wherein adding multiple columns of invalid data at the head of the first to-be-processed data matrix specifically comprises:
    if the data transmission parallelism is pv, the processing module adds (pv-2) columns of invalid data at the head of the first to-be-processed data matrix so that, together with the first 2 columns of valid data of the first to-be-processed data matrix, they form pv columns of data.
  9. The artificial intelligence processing apparatus according to claim 8, comprising:
    a second cache module, for storing the second to-be-processed data matrix coming from the first cache module; and
    a matrix module, for storing the second to-be-processed data matrix coming from the second cache module.
  10. The artificial intelligence processing apparatus according to claim 9, wherein:
    every k rows of the second to-be-processed data matrix act as a group of data, and the data transmission module performs the following operation on each group of data: in each clock cycle, pv*k third to-be-processed data matrices are taken out of the group in sequence until all the data of the group has been taken out;
    wherein the matrix module is further configured to, starting from the second third to-be-processed data matrix taken out of each group of data by the data transmission module, combine each third to-be-processed data matrix with the last 2 columns of the previous third to-be-processed data matrix to form a k*(pv+2) fourth to-be-processed data matrix; wherein matrix extraction with a stride of 1 can be performed on each k*(pv+2) fourth to-be-processed data matrix to obtain pv k*k fifth to-be-processed data matrices, and the k*k fifth to-be-processed data matrices are for transmission to the convolution module for convolution calculation with the convolution kernel matrix.
  11. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the artificial intelligence convolution processing method according to any one of claims 1 to 6.
  12. An artificial intelligence processing terminal, characterized by comprising: a processor and a memory;
    the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so as to cause the terminal to execute the artificial intelligence convolution processing method according to any one of claims 1 to 6.
PCT/CN2018/072665 2018-01-15 2018-01-15 人工智能卷积处理方法、装置、可读存储介质、及终端 WO2019136752A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2018/072665 WO2019136752A1 (zh) 2018-01-15 2018-01-15 人工智能卷积处理方法、装置、可读存储介质、及终端
CN201880002147.0A CN109313723B (zh) 2018-01-15 2018-01-15 人工智能卷积处理方法、装置、可读存储介质、及终端
US16/929,819 US11874898B2 (en) 2018-01-15 2020-07-15 Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/072665 WO2019136752A1 (zh) 2018-01-15 2018-01-15 人工智能卷积处理方法、装置、可读存储介质、及终端

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072663 Continuation-In-Part WO2019136751A1 (zh) 2018-01-15 2018-01-15 人工智能并行处理方法、装置、可读存储介质、及终端

Related Child Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2018/072663 Continuation-In-Part WO2019136751A1 (zh) 2018-01-15 2018-01-15 人工智能并行处理方法、装置、可读存储介质、及终端
US16/929,819 Continuation-In-Part US11874898B2 (en) 2018-01-15 2020-07-15 Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal

Publications (1)

Publication Number Publication Date
WO2019136752A1 true WO2019136752A1 (zh) 2019-07-18

Family

ID=65221785

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072665 WO2019136752A1 (zh) 2018-01-15 2018-01-15 人工智能卷积处理方法、装置、可读存储介质、及终端

Country Status (2)

Country Link
CN (1) CN109313723B (zh)
WO (1) WO2019136752A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160570A (zh) * 2019-12-31 2020-05-15 山东浪潮人工智能研究院有限公司 用于预测性维护的基于卷积算子的特征构造方法及系统

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220180176A1 (en) * 2020-12-08 2022-06-09 Huawei Technologies Co., Ltd. System, method and apparatus for intelligent caching
CN113704689B (zh) * 2021-08-25 2022-11-11 北京大学 一种基于昇腾ai处理器的矩阵乘算子的处理方法及装置
CN113705795B (zh) * 2021-09-16 2024-12-17 深圳思谋信息科技有限公司 卷积处理方法、装置、卷积神经网络加速器和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090132794A1 (en) * 2007-11-16 2009-05-21 Paul Michael Ebert Method and apparatus for performing complex calculations in a multiprocessor array
CN106875012A (zh) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 一种基于fpga的深度卷积神经网络的流水化加速系统
CN106909970A (zh) * 2017-01-12 2017-06-30 南京大学 一种基于近似计算的二值权重卷积神经网络硬件加速器计算模块
CN106970896A (zh) * 2017-03-30 2017-07-21 中国人民解放军国防科学技术大学 面向向量处理器的二维矩阵卷积的向量化实现方法

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100298327B1 (ko) * 1999-06-30 2001-11-01 구자홍 고속 컨벌루션 처리 방법 및 그 장치
CN2919379Y (zh) * 2005-11-22 2007-07-04 清华大学 一种采用直线轨迹扫描的图像重建装置
CN102446160B (zh) * 2011-09-06 2015-02-18 中国人民解放军国防科学技术大学 面向双精度simd部件的矩阵乘实现方法
CN106228240B (zh) * 2016-07-30 2020-09-01 复旦大学 基于fpga的深度卷积神经网络实现方法
CN107451654B (zh) * 2017-07-05 2021-05-18 深圳市自行科技有限公司 卷积神经网络的加速运算方法、服务器及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090132794A1 (en) * 2007-11-16 2009-05-21 Paul Michael Ebert Method and apparatus for performing complex calculations in a multiprocessor array
CN106909970A (zh) * 2017-01-12 2017-06-30 南京大学 一种基于近似计算的二值权重卷积神经网络硬件加速器计算模块
CN106875012A (zh) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 一种基于fpga的深度卷积神经网络的流水化加速系统
CN106970896A (zh) * 2017-03-30 2017-07-21 中国人民解放军国防科学技术大学 面向向量处理器的二维矩阵卷积的向量化实现方法

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160570A (zh) * 2019-12-31 2020-05-15 山东浪潮人工智能研究院有限公司 用于预测性维护的基于卷积算子的特征构造方法及系统

Also Published As

Publication number Publication date
CN109313723A (zh) 2019-02-05
CN109313723B (zh) 2022-03-15

Similar Documents

Publication Publication Date Title
US11423285B2 (en) Buffer addressing for a convolutional neural network
WO2019136752A1 (zh) 人工智能卷积处理方法、装置、可读存储介质、及终端
WO2019136751A1 (zh) 人工智能并行处理方法、装置、可读存储介质、及终端
CN107844828B (zh) 神经网络中的卷积计算方法和电子设备
CN108108811B (zh) 神经网络中的卷积计算方法和电子设备
WO2019136762A1 (zh) 人工智能处理器、及其所应用的处理方法
WO2019127517A1 (zh) 数据处理方法、设备、dma控制器及计算机可读存储介质
US11550586B2 (en) Method and tensor traversal engine for strided memory access during execution of neural networks
WO2019136750A1 (zh) 人工智能计算辅助处理装置、方法、存储介质、及终端
WO2019127507A1 (zh) 数据处理方法、设备、dma控制器及计算机可读存储介质
CN108710505A (zh) 一种基于fpga的可扩展稀疏矩阵向量乘处理器
CN114995782B (zh) 数据处理方法、装置、设备和可读存储介质
WO2021083101A1 (zh) 数据处理方法、装置及相关产品
CN107894957B (zh) 面向卷积神经网络的存储器数据访问与插零方法及装置
KR20210014561A (ko) 다수 컨벌루션 윈도우 중의 이미지 데이터를 추출하는 방법, 장치, 기기 및 컴퓨터 판독 가능한 저장매체
TW202024922A (zh) 存取張量資料的方法和裝置
CN112416433A (zh) 一种数据处理装置、数据处理方法及相关产品
CN111242286A (zh) 一种数据格式变换方法、装置及计算机可读存储介质
US11874898B2 (en) Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
WO2020103883A1 (zh) 执行矩阵乘法运算的方法、电路及soc
CN113806261A (zh) 一种面向向量处理器的池化向量化实现方法
CN111047026B (zh) 可执行人工智能运算的存储器芯片及其操作方法
WO2021082723A1 (zh) 运算装置
CN109741428B (zh) 一种适用于二维流体模拟的三阶高精度对流插值算法
CN116420136A (zh) 共享操作数的垂直和水平广播

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899801

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899801

Country of ref document: EP

Kind code of ref document: A1