
CN111967582B - CNN convolutional layer operation method and CNN convolutional layer operation accelerator - Google Patents


Info

Publication number: CN111967582B
Application number: CN202010791455.5A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN111967582A
Prior art keywords: image, matrix, read, processed, cnn
Inventor: 杨继林
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Application filed by Suzhou Inspur Intelligent Technology Co Ltd; published as CN111967582A; granted and published as CN111967582B
Legal status: Active (granted)

Classifications

    • G06N3/045 — Computing arrangements based on specific computational models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, using electronic means
    • G06T1/20 — General purpose image data processing; processor architectures; processor configuration, e.g. pipelining


Abstract


The present invention provides a CNN convolutional layer operation method and a CNN convolutional layer operation accelerator, both of which: read in the convolution kernel used to perform the CNN convolutional layer operation on the feature image to be processed, and convert the read convolution kernel into a weight matrix H(h_pq); read one block image of the feature image to be processed according to a preset image size threshold, and calculate the local CNN convolutional layer operation result corresponding to the currently read block image according to the weight matrix H(h_pq); judge whether the entire feature image to be processed has been read: if so, arrange the obtained local results according to the relative positional relationship between the block images and splice them into the CNN convolutional layer operation result corresponding to the entire feature image to be processed; otherwise, continue reading the next block image. The invention reduces the complexity of the CNN convolution operation, the pressure on storage bandwidth, and the cost of completing the CNN convolutional layer operation.


Description

CNN convolutional layer operation method and CNN convolutional layer operation accelerator
Technical Field
The invention relates to the field of convolution operation acceleration, in particular to a CNN convolution layer operation method and a CNN convolution layer operation accelerator.
Background
With the continuous development of CNNs (Convolutional Neural Networks), CNNs are applied more and more widely in the fields of image classification, image recognition, and the like.
The CNN convolutional layer performs a two-dimensional convolution. A common implementation uses sliding windows: a dedicated control module extracts, for a k × k convolution kernel, a two-dimensional feature-map window of the same size; this window slides over the feature map that requires convolutional-layer calculation, and multiply-add operations are performed between the window and the corresponding points of the convolution kernel. The sliding-window realization of two-dimensional convolution is intuitive, and the subsequent calculation is relatively simple as long as the correct two-dimensional feature-map window can be obtained. However, the control module that generates the two-dimensional window is relatively complex to implement. In addition, for a k × k convolution kernel and a feature map input line by line, an extra k-1 lines of storage are often required, which increases cost. Moreover, during a conventional CNN convolutional layer operation the weights of the convolution kernel must be read repeatedly many times, which increases storage-bandwidth pressure to a certain extent.
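As a point of reference, the prior-art sliding-window scheme described above can be sketched as follows (a minimal NumPy sketch, not the patent's implementation; the function name is invented for this example):

```python
import numpy as np

def conv2d_sliding_window(feature_map, kernel):
    """Valid-mode 2D convolution via explicit sliding windows.

    Illustrative model of the prior-art scheme: a k x k window is
    extracted at every position and multiplied point-wise with the
    kernel, then the products are accumulated.
    """
    m, n = feature_map.shape
    k = kernel.shape[0]                                # kernel is k x k
    out = np.empty((m - k + 1, n - k + 1))
    for i in range(m - k + 1):
        for j in range(n - k + 1):
            window = feature_map[i:i + k, j:j + k]     # 2D feature-map window
            out[i, j] = np.sum(window * kernel)        # multiply-accumulate
    return out
```

Note that every output point re-reads all k² weights, which is the repeated weight reading the background criticizes.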
Therefore, the present invention provides a CNN convolutional layer operation method and a CNN convolutional layer operation accelerator, which are used to solve the above problems.
Disclosure of Invention
In view of the above-mentioned deficiencies of the prior art, the present invention provides a CNN convolutional layer operation method and a CNN convolutional layer operation accelerator, which are used to reduce the complexity of the CNN convolution operation, to reduce storage-bandwidth pressure, and to reduce the cost of completing the CNN convolutional layer operation.
In a first aspect, the present invention provides a CNN convolutional layer operation method, including the steps of:
S1, reading the convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed, and converting the read convolution kernel into a weight matrix H(h_pq), where the convolution kernel is a k × k convolution kernel, h_pq is the (p, q) element of the weight matrix H(h_pq), p = 0, 1, 2, …, k-1, and q = 0, 1, 2, …, k-1;
S2, reading one block image of the feature image to be processed according to a preset image size threshold, and calculating, according to the weight matrix H(h_pq), the local CNN convolutional layer operation result corresponding to the currently read block image;
S3, judging whether the entire feature image to be processed has been read: if so, continuing to step S4; otherwise, repeating step S2;
S4, arranging the local CNN convolutional layer operation results obtained in step S2 according to the relative positional relationship between the block images, and splicing them to obtain the CNN convolutional layer operation result corresponding to the entire feature image to be processed;
wherein the implementation of calculating, in step S2, the local CNN convolutional layer operation result corresponding to the currently read block image according to the weight matrix H(h_pq) comprises the steps of:
P1, converting the currently read block image into an image matrix A(a_ij), where the block image is a digital image of m × n pixels, a_ij is the (i, j) element of the image matrix A(a_ij), i = 0, 1, 2, …, m-1, and j = 0, 1, 2, …, n-1;
P2, reading each element h_pq of the weight matrix H(h_pq), obtaining for each element h_pq the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiplying each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;
P3, calculating the sum of all the obtained local matrices, the sum being the local CNN convolutional layer operation result corresponding to the currently read block image;
wherein, the block images read in step S2 each time are different from each other;
in step S2, a block image on the feature image to be processed is read according to a preset image size threshold, and the reading method includes:
when reading the block image for the first time, reading a block image meeting the requirement of the image size threshold from the feature image to be processed according to a preset reading initial position;
when the block images are read again, each read block image contains k-1 rows or k-1 columns of pixels of its respective adjacent block image.
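The per-weight computation of steps P1–P3 can be modeled as follows (an illustrative NumPy sketch, not the patent's FPGA implementation; the function name is invented for this example). Each weight h_pq is multiplied by one (m-k+1) × (n-k+1) submatrix of the image matrix, and the k² resulting local matrices are summed:

```python
import numpy as np

def conv2d_per_weight(block_image, kernel):
    """Functional model of steps P1-P3: one product matrix per weight.

    For each weight h_pq, the product matrix is the (m-k+1) x (n-k+1)
    submatrix of the image matrix whose top-left element is a_pq, so no
    sliding two-dimensional-window generator is needed.
    """
    a = np.asarray(block_image, dtype=float)        # step P1: image matrix A(a_ij)
    m, n = a.shape
    k = kernel.shape[0]
    out = np.zeros((m - k + 1, n - k + 1))
    for p in range(k):                              # step P2: read each h_pq once
        for q in range(k):
            product_matrix = a[p:p + m - k + 1, q:q + n - k + 1]
            out += kernel[p, q] * product_matrix    # local matrix for h_pq
    return out                                      # step P3: sum of local matrices
```

The loop body runs only k² times regardless of the block-image size, so each weight is read exactly once per block image — the storage-bandwidth saving the invention claims.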
Further, the product matrix corresponding to the element h_pq involved in step P2 includes the following cases:
when p = 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;
when p = 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;
when p ≠ 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, …, m-1;
when p ≠ 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, …, m-1.
In every case, the product matrix is the (m-k+1) × (n-k+1) submatrix of A(a_ij) whose top-left element is a_pq.
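The four cases above all select the same kind of submatrix: deleting the listed rows and columns from A(a_ij) leaves the (m-k+1) × (n-k+1) block whose top-left element is a_pq. A small NumPy check of this equivalence (illustrative code, not from the patent):

```python
import numpy as np

def product_matrix_by_deletion(a, p, q, k):
    """Build h_pq's product matrix by row/column deletion as in the text:
    delete columns 0..q-1 and n-k+q+1..n-1, rows 0..p-1 and m-k+p+1..m-1."""
    m, n = a.shape
    keep_rows = [i for i in range(m) if p <= i <= m - k + p]
    keep_cols = [j for j in range(n) if q <= j <= n - k + q]
    return a[np.ix_(keep_rows, keep_cols)]

m, n, k = 7, 6, 3
a = np.arange(m * n).reshape(m, n)
for p in range(k):
    for q in range(k):
        deleted = product_matrix_by_deletion(a, p, q, k)
        sliced = a[p:p + m - k + 1, q:q + n - k + 1]  # equivalent simple slice
        assert deleted.shape == (m - k + 1, n - k + 1)
        assert np.array_equal(deleted, sliced)
```

The slice form is what a hardware address generator would typically implement directly.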
Further, the weight matrix H(h_pq) obtained by conversion in step S1 is stored in a cache, and the image matrix A(a_ij) obtained by conversion in step P1 is stored in a cache.
Further, the CNN convolutional layer operation method is realized based on an FPGA.
Further, in step P2 a multiplier array is used to multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.
In another aspect, the present invention provides a CNN convolutional layer arithmetic accelerator, including:
the first data pre-reading module is used for reading the convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed and converting the read convolution kernel into a weight matrix H(h_pq), where the convolution kernel is a k × k convolution kernel, h_pq is the (p, q) element of the weight matrix H(h_pq), p = 0, 1, 2, …, k-1, and q = 0, 1, 2, …, k-1;
the second data pre-reading module is used for reading a block image on the characteristic image to be processed according to a preset image size threshold;
a local operation module for calculating, according to the weight matrix H(h_pq), the local CNN convolutional layer operation result corresponding to the block image currently read by the second data pre-reading module;
the judging module is used for judging whether the whole characteristic image to be processed is read completely;
the convolutional layer operation result output module is used for, when the judging module judges that the entire feature image to be processed has been read, arranging the local CNN convolutional layer operation results obtained by the local operation module according to the relative positional relationship between the block images and splicing them to obtain and output the CNN convolutional layer operation result corresponding to the entire feature image to be processed;
the calling module is used for calling the data pre-reading module to continue executing when the judging module judges that the whole characteristic image to be processed is not read;
wherein, the local operation module comprises:
an image matrix conversion unit for converting the currently read block image into an image matrix A(a_ij), where the block image is a digital image of m × n pixels, a_ij is the (i, j) element of the image matrix A(a_ij), i = 0, 1, 2, …, m-1, and j = 0, 1, 2, …, n-1;
a local matrix acquisition unit for reading each element h_pq of the weight matrix H(h_pq), obtaining for each element h_pq the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiplying each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;
the local operation result acquisition unit is used for calculating the sum of all the obtained local matrixes, wherein the sum is the local operation result of the CNN convolution layer corresponding to the currently read block image;
the second data pre-reading module reads different block images of the characteristic image to be processed each time;
the second data pre-reading module reads a block image on the characteristic image to be processed according to a preset image size threshold, and the reading method comprises the following steps:
when reading the block image for the first time, reading a block image meeting the requirement of the image size threshold from the characteristic image to be processed according to a preset reading initial position;
when the block images are read again, each read block image contains k-1 rows or k-1 columns of pixels of its respective adjacent block image.
Further, the product matrix corresponding to the element h_pq involved in the local matrix acquisition unit includes the following cases:
when p = 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;
when p = 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;
when p ≠ 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, …, m-1;
when p ≠ 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, …, m-1.
Furthermore, the CNN convolution layer operation accelerator also comprises a cache;
the weight matrix H (H) converted by the first data pre-reading modulepq) Storing in a cache;
the image matrix A (a) converted by the image matrix conversion unitij) Stored in a cache.
Further, the CNN convolutional layer operation accelerator is realized based on an FPGA.
Further, the local matrix acquisition unit uses a multiplier array to multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.
The beneficial effects of the invention are as follows:
(1) the CNN convolutional layer operation method and the CNN convolutional layer operation accelerator provided by the invention avoid the use of a feature map two-dimensional window in the prior art, further avoid the use of a control module for generating the feature map two-dimensional window in the prior art, reduce the complexity of CNN convolutional operation to a certain extent and are convenient to realize.
(2) The CNN convolutional layer operation method and the CNN convolutional layer operation accelerator provided by the invention take each weight in the convolution kernel (corresponding to an element h_pq) as the starting point and directly obtain, for each weight, all feature points in the block image that need to be multiplied by that weight (corresponding to the product matrix formed from elements of the image matrix A(a_ij) converted from the block image); each weight in the convolution kernel is then multiplied by its corresponding product matrix to obtain the local matrix corresponding to that weight, and all local matrices corresponding to all weights of the convolution kernel are added together to obtain the local CNN convolutional layer operation result corresponding to each block image.
(3) The CNN convolutional layer operation method and the CNN convolutional layer operation accelerator provided by the invention store the weight matrix H(h_pq) and the image matrix A(a_ij) required during the operation in a cache. On the one hand this avoids extra storage and reduces, to a certain extent, the cost of completing the CNN convolutional layer operation; on the other hand it reduces the number of times the data required for the CNN convolutional layer operation is read from external memory, which helps to increase the rate of the CNN convolutional layer operation to some extent.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.
Fig. 2 is a schematic diagram of the distribution of the relative position relationship of the block image F1 in the feature image to be processed in the present invention.
Fig. 3 is a schematic diagram of the distribution of the relative positional relationship of the block image F2 in the feature image to be processed in the present invention.
Fig. 4 is a schematic diagram of the distribution of the relative positional relationship of the block image F3 in the feature image to be processed in the present invention.
Fig. 5 is a schematic diagram of the distribution of the relative positional relationship of the block image F4 in the feature image to be processed in the present invention.
FIG. 6 is a schematic diagram of the arrangement positions of the matrix C1, the matrix C2, the matrix C3 and the matrix C4 in the present invention.
FIG. 7 is a schematic block diagram of a system of one embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a schematic flow chart of a CNN convolutional layer operation method according to an embodiment of the present invention.
As shown in fig. 1, the CNN convolutional layer operation method includes:
Step S1, reading the convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed, and converting the read convolution kernel into a weight matrix H(h_pq), where the convolution kernel is a k × k convolution kernel, h_pq is the (p, q) element of the weight matrix H(h_pq), p = 0, 1, 2, …, k-1, and q = 0, 1, 2, …, k-1;
Step S2, reading one block image of the feature image to be processed according to the preset image size threshold, and calculating, according to the weight matrix H(h_pq), the local CNN convolutional layer operation result corresponding to the currently read block image;
step S3, judging whether the whole characteristic image to be processed is read completely, if so, continuing to execute step S4, otherwise, repeatedly executing step S2;
Step S4, arranging the local CNN convolutional layer operation results obtained in step S2 according to the relative positional relationship between the block images, and splicing them to obtain the CNN convolutional layer operation result corresponding to the entire feature image to be processed.
Wherein the implementation of calculating, in step S2, the local CNN convolutional layer operation result corresponding to the currently read block image according to the weight matrix H(h_pq) comprises the steps of:
Step P1, converting the currently read block image into an image matrix A(a_ij), where the block image is a digital image of m × n pixels, a_ij is the (i, j) element of the image matrix A(a_ij), i = 0, 1, 2, …, m-1, and j = 0, 1, 2, …, n-1;
Step P2, reading each element h_pq of the weight matrix H(h_pq), obtaining for each element h_pq the product matrix formed by all elements of the image matrix A(a_ij) that need to be multiplied by that element, and multiplying each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;
Step P3, calculating the sum of all the obtained local matrices, the sum being the local CNN convolutional layer operation result corresponding to the currently read block image.
Wherein, the block images of the characteristic image to be processed read each time in the step S2 are different from each other;
in step S2, a block image on the feature image to be processed is read according to a preset image size threshold, where the reading method includes:
when reading the block image for the first time, reading a block image meeting the requirement of the image size threshold from the characteristic image to be processed according to a preset reading initial position;
when the block images are read again, each read block image contains k-1 rows or k-1 columns of pixels of its respective adjacent block image.
Optionally, as an embodiment of the present invention, the product matrix corresponding to the element h_pq involved in step P2 includes the following cases:
when p = 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;
when p = 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, …, n-1 and rows m-k+1, m-k+2, m-k+3, …, m-1;
when p ≠ 0 and q = 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns n-k+1, n-k+2, n-k+3, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, …, m-1;
when p ≠ 0 and q ≠ 0, the product matrix corresponding to the element h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing columns 0, 1, 2, …, q-1 and n-k+q+1, n-k+q+2, …, n-1 and rows 0, 1, 2, …, p-1 and m-k+p+1, m-k+p+2, …, m-1.
Optionally, as an embodiment of the present invention, the weight matrix H(h_pq) obtained by conversion in step S1 is stored in a cache, and the image matrix A(a_ij) obtained by conversion in step P1 is stored in a cache.
Optionally, as an embodiment of the present invention, the CNN convolutional layer operation method is implemented based on an FPGA.
Optionally, as an embodiment of the present invention, in step P2 a multiplier array is used to multiply each element h_pq by its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.
In order to facilitate understanding of the present invention, the following describes the CNN convolutional layer operation method provided by the present invention further by using the principle of the CNN convolutional layer operation method of the present invention and combining the process of performing CNN convolutional layer operation on the feature image to be processed in the embodiment.
Specifically, the CNN convolutional layer operation method includes:
L1, reading the convolution kernel used for performing the CNN convolutional layer operation on the feature image to be processed, and converting the read convolution kernel into a weight matrix H(h_pq), where the convolution kernel is a k × k convolution kernel, h_pq is the (p, q) element of the weight matrix H(h_pq), p = 0, 1, 2, …, k-1, and q = 0, 1, 2, …, k-1.
The characteristic image to be processed is an image which needs to be subjected to CNN convolutional layer operation.
The feature image to be processed and the convolution kernel required for CNN convolution layer operation are both stored in an external DDR (Double Data Rate) in advance.
In the present embodiment, for convenience of description, k = 3 is taken as an example. Correspondingly, the convolution kernel for the CNN convolutional layer operation on the feature image to be processed in this embodiment is a 3 × 3 convolution kernel, and the corresponding weight matrix H(h_pq) is a third-order matrix; specifically,

H(h_pq) = | h_00  h_01  h_02 |
          | h_10  h_11  h_12 |
          | h_20  h_21  h_22 |

p, q = 0, 1, 2.
To make reading the weight matrix H(h_pq) more convenient, the third-order matrix obtained by conversion in step L1 is stored in a buffer; whenever an element (i.e., a weight) of the weight matrix H(h_pq) is needed, it can be read directly from the buffer.
Step L2, reading one block image of the feature image to be processed according to the preset image size threshold, and calculating, according to the weight matrix H(h_pq), the local CNN convolutional layer operation result corresponding to the currently read block image.
In step L2, the block images of the feature image to be processed read each time are different from each other.
In step L2, a block image on the feature image to be processed is read according to a preset image size threshold, and the reading method includes:
when reading the block image for the first time, reading a block image meeting the requirement of the image size threshold from the feature image to be processed according to a preset reading initial position;
when the block images are read again, each of the read block images contains two rows or two columns of pixels of its respective adjacent block image.
Therefore, in this embodiment, when the size of the feature image to be processed is smaller than or equal to the image size threshold, a block image that meets the requirement of the image size threshold on the feature image to be processed, which is read for the first time according to the preset image size threshold in step L2, is the feature image to be processed itself; when the size of the feature image to be processed is larger than the image size threshold, a block image which meets the requirement of the image size threshold on the feature image to be processed and is read for the first time according to the preset image size threshold in step L2 is a local image of the feature image to be processed.
In this embodiment, the size of the feature image to be processed is larger than the image size threshold, the preset image size threshold is 10 × 9 pixels, and the feature image to be processed is 15 × 12 pixels.
In this embodiment, in step L2, a block image F1 with a size of 10 × 9 pixels is read onto the feature image to be processed for the first time according to a preset image size threshold (a reading start position can be preset as a pixel point at the upper left corner of the feature image to be processed). The position of this block image F1 on the feature image to be processed (15 × 12 pixels) is shown in fig. 2. The broken line portion in fig. 2 represents the feature image to be processed of 15 × 12 pixels, each of the broken line boxes represents one pixel of the feature image to be processed, and the portion framed by the black rectangular box in fig. 2 represents the patch image F1.
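With the embodiment's numbers (a 15 × 12 feature image, 10 × 9 block images, k = 3, hence a k-1 = 2 row/column overlap between adjacent tiles), the block images F1–F4 and the stitching of their local results can be modeled as below. This is an illustrative NumPy sketch: how the last tile is clamped against the image edge is an assumption of this sketch, and the random data serves only to check consistency.

```python
import numpy as np

def tile_starts(total, tile, k):
    """Start offsets so consecutive tiles overlap by at least k-1 pixels,
    with the last tile clamped to the image edge (illustrative helper)."""
    starts, s = [], 0
    while True:
        s = min(s, total - tile)
        starts.append(s)
        if s + tile >= total:
            return starts
        s += tile - (k - 1)

def conv_valid(a, h):
    """Per-weight valid convolution (the method of steps P1-P3)."""
    k = h.shape[0]
    m, n = a.shape
    return sum(h[p, q] * a[p:p + m - k + 1, q:q + n - k + 1]
               for p in range(k) for q in range(k))

# Embodiment's numbers: 15 x 12 feature image, 10 x 9 block images, 3 x 3 kernel.
rng = np.random.default_rng(1)
img, ker, k = rng.random((15, 12)), rng.random((3, 3)), 3
full = conv_valid(img, ker)                 # reference over the whole image
out = np.zeros_like(full)
for r in tile_starts(15, 10, k):            # two row bands and two column
    for c in tile_starts(12, 9, k):         # bands -> four block images F1..F4
        tile = img[r:r + 10, c:c + 9]
        out[r:r + 8, c:c + 7] = conv_valid(tile, ker)  # local results stitched
assert np.allclose(out, full)               # stitched result matches
```

With these sizes the helper yields exactly four tiles, matching the four block images F1–F4 and the four local-result matrices C1–C4 shown in the figures.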
Wherein the step L2 is based on the weight matrix H (H)pq) The implementation method for calculating the local operation result of the CNN convolutional layer corresponding to the currently read block image F1 comprises the following steps:
Step L21: convert the currently read block image F1 into an image matrix A(a_ij), where a_ij is the (i, j) element of the image matrix A(a_ij); i = 0, 1, 2, ..., m-1; j = 0, 1, 2, ..., n-1. The block image F1 in this embodiment corresponds to m = 7, n = 6. The resulting 7 × 6 image matrix is denoted image matrix A1.
The image matrix A(a_ij) converted in step L21 is stored in a cache; whenever the image matrix A(a_ij) is needed, it can be taken directly from the cache.
Step L22: read each element h_pq of the weight matrix H(h_pq), obtain for each element h_pq its corresponding product matrix, formed by all elements of the image matrix A(a_ij) that need to be multiplied with that element, and multiply each element h_pq with its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.
The product matrix corresponding to the element h_pq in step L22 covers the following cases:
when p = 0 and q = 0, the product matrix corresponding to h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing the (n-k+1)-th, (n-k+2)-th, (n-k+3)-th, ..., (n-1)-th columns and the (m-k+1)-th, (m-k+2)-th, (m-k+3)-th, ..., (m-1)-th rows;
when p = 0 and q ≠ 0, the product matrix corresponding to h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing the 0th, 1st, 2nd, ..., (q-1)-th, (n-k+q+1)-th, (n-k+q+2)-th, (n-k+q+3)-th, ..., (n-1)-th columns and the (m-k+1)-th, (m-k+2)-th, (m-k+3)-th, ..., (m-1)-th rows;
when p ≠ 0 and q = 0, the product matrix corresponding to h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing the (n-k+1)-th, (n-k+2)-th, (n-k+3)-th, ..., (n-1)-th columns and the 0th, 1st, 2nd, ..., (p-1)-th, (m-k+p+1)-th, (m-k+p+2)-th, (m-k+p+3)-th, ..., (m-1)-th rows;
when p ≠ 0 and q ≠ 0, the product matrix corresponding to h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing the 0th, 1st, 2nd, ..., (q-1)-th, (n-k+q+1)-th, (n-k+q+2)-th, (n-k+q+3)-th, ..., (n-1)-th columns and the 0th, 1st, 2nd, ..., (p-1)-th, (m-k+p+1)-th, (m-k+p+2)-th, (m-k+p+3)-th, ..., (m-1)-th rows.
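Read together, the four cases above all select the same thing: for weight h_pq, the product matrix is the contiguous (m-k+1) × (n-k+1) submatrix of the image matrix starting at row p, column q. A minimal NumPy sketch of this equivalence (illustrative only; the patent targets an FPGA, and the function and variable names here are our own):

```python
import numpy as np

def product_matrix(A, p, q, k):
    """Product matrix for weight h_pq: the (m-k+1) x (n-k+1) block of A
    that remains after the row/column removals described in the text."""
    m, n = A.shape
    return A[p:p + m - k + 1, q:q + n - k + 1]

# 7 x 6 image matrix and a 3 x 3 kernel, as in the embodiment (m=7, n=6, k=3)
A = np.arange(42).reshape(7, 6)
k = 3

# Case p=0, q=0: remove columns n-k+1..n-1 (4, 5) and rows m-k+1..m-1 (5, 6)
case_00 = np.delete(np.delete(A, [4, 5], axis=1), [5, 6], axis=0)
assert np.array_equal(product_matrix(A, 0, 0, k), case_00)

# Case p=1, q=1: remove columns 0 and 5 and rows 0 and 6
case_11 = np.delete(np.delete(A, [0, 5], axis=1), [0, 6], axis=0)
assert np.array_equal(product_matrix(A, 1, 1, k), case_11)

print(product_matrix(A, 0, 0, k).shape)  # (5, 4)
```

Because each product matrix is just a shifted window of A, no per-weight copy is strictly needed; a hardware implementation can feed the same buffered block to all k² multipliers.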
In this embodiment, when the element read from the weight matrix H(h_pq) is h_00, p = 0 and q = 0, so the product matrix corresponding to h_00 is:
the (m-k+1) × (n-k+1) matrix formed by all rows and columns remaining after removing the (n-k+1)-th, ..., (n-1)-th columns and the (m-k+1)-th, ..., (m-1)-th rows of the image matrix A1, that is, the 5 × 4 matrix formed by splicing all rows and columns of A1 remaining after removing the 4th and 5th columns and the 5th and 6th rows; it is recorded as product matrix 00.
In this embodiment, when the element read from the weight matrix H(h_pq) is h_01, p = 0 and q = 1 ≠ 0, so the product matrix corresponding to h_01 is:
the 5 × 4 matrix formed by splicing all rows and columns of the image matrix A1 remaining after removing the 0th and 5th columns and the 5th and 6th rows; it is recorded as product matrix 01.
In this embodiment, when the element read from the weight matrix H(h_pq) is h_10, q = 0 and p = 1 ≠ 0, so the product matrix corresponding to h_10 is:
the 5 × 4 matrix formed by splicing all rows and columns of the image matrix A1 remaining after removing the 4th and 5th columns and the 0th and 6th rows; it is recorded as product matrix 10.
In this embodiment, when the element read from the weight matrix H(h_pq) is h_11, p = q = 1, so the product matrix corresponding to h_11 is:
the 5 × 4 matrix formed by splicing all rows and columns of the image matrix A1 remaining after removing the 0th and 5th columns and the 0th and 6th rows; it is recorded as product matrix 11.
When the other elements h_pq of the weight matrix H(h_pq) are read, the corresponding product matrices can be obtained in the same manner.
For the image matrix A1, after each element h_pq of the weight matrix H(h_pq) has been read and the product matrix corresponding to each element h_pq has been obtained, each element h_pq is multiplied with its corresponding product matrix to obtain the local matrix corresponding to each element h_pq.
It can be seen that the present invention takes each weight in the convolution kernel (corresponding to an element h_pq) as the starting point and directly obtains, for each weight, the product matrix formed by all feature points in the block image that need to be multiplied with the read weight. This reduces, to a certain extent, the number of times each weight in the convolution kernel is read from external storage (DDR), which helps relieve storage bandwidth pressure.
Step L23: calculate the sum of the local matrices obtained in step L22; this sum is the CNN convolutional layer local operation result corresponding to the currently read block image F1.
The sum of the local matrices is calculated by matrix addition.
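Steps L22 and L23 together amount to an ordinary valid (no-padding) convolution: summing, over all k² weights, h_pq times its product matrix gives the same result as sliding the kernel over the block. A sketch under that assumption, with made-up values, checked against a naive sliding-window reference:

```python
import numpy as np

def conv_by_local_matrices(A, H):
    """Local operation result for one block: sum over all weights h_pq of
    h_pq times its product matrix (a shifted submatrix of A)."""
    k = H.shape[0]
    m, n = A.shape
    out = np.zeros((m - k + 1, n - k + 1))
    for p in range(k):
        for q in range(k):
            out += H[p, q] * A[p:p + m - k + 1, q:q + n - k + 1]
    return out

def conv_naive(A, H):
    """Reference: direct sliding-window (valid) convolution."""
    k = H.shape[0]
    m, n = A.shape
    out = np.zeros((m - k + 1, n - k + 1))
    for i in range(m - k + 1):
        for j in range(n - k + 1):
            out[i, j] = np.sum(A[i:i + k, j:j + k] * H)
    return out

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(7, 6)).astype(float)
H = rng.integers(-2, 3, size=(3, 3)).astype(float)
assert np.allclose(conv_by_local_matrices(A, H), conv_naive(A, H))
```

The loop body is independent across (p, q), which is what makes a multiplier array followed by an accumulator array a natural hardware mapping.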
Step L3: judge whether the entire feature image to be processed has been read; if so, continue with step L4, otherwise repeat step L2.
In this embodiment, it is obvious that after the block image F1 is read, the entire feature image to be processed is not completely read, and it is necessary to continue reading the block images of the feature image to be processed.
In this embodiment, in step L2, after the whole feature image to be processed is completely read according to the preset image size threshold (10 × 9 pixels), the total separately read block images include, in addition to the block image F1, a block image F2, a block image F3, and a block image F4, where the schematic positions of the block image F2, the block image F3, and the block image F4 on the whole feature image to be processed are shown in fig. 3, 4, and 5 in sequence.
For each of the block image F2, the block image F3, and the block image F4, the CNN convolutional layer local operation result (which is also a matrix) corresponding thereto may be obtained with reference to the block image F1 in step L2.
It should be noted that, before each new block image is read, the last stored block image may be cleared.
Step L4: arrange the CNN convolutional layer local operation results obtained in step L2 according to the relative positional relationship between the block images, then splice them to obtain the CNN convolutional layer operation result corresponding to the entire feature image to be processed.
For example, the feature image to be processed with 15 × 12 pixels is to be read four times, corresponding to 4 block images, where the 4 block images sequentially include a block image F1, a block image F2, a block image F3, and a block image F4 according to the reading order, and a schematic diagram of a distribution of relative positions of the block image F1, the block image F2, the block image F3, and the block image F4 in the feature image to be processed is shown in fig. 2.
The CNN convolutional layer local operation results corresponding to the block image F1, the block image F2, the block image F3, and the block image F4 are recorded in turn as matrix C1, matrix C2, matrix C3, and matrix C4; a schematic diagram of the arrangement positions of matrix C1, matrix C2, matrix C3, and matrix C4 is shown in fig. 6. Splicing matrix C1, matrix C2, matrix C3, and matrix C4 according to their arrangement positions forms a spliced matrix, which is the CNN convolutional layer operation result corresponding to the entire feature image to be processed in this embodiment. Converting this CNN convolutional layer operation result into an image and outputting it yields the image of the entire feature image to be processed after the CNN convolutional layer operation of this embodiment.
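The whole flow — overlapping block reads, a per-block local result, and splicing — can be sketched end to end. Under the assumption of a valid convolution, and with an illustrative tile limit of 9 × 8 pixels (our choice, not the embodiment's exact threshold), the spliced result matches convolving the full 15 × 12 image at once:

```python
import numpy as np

def conv_valid(A, H):
    """Valid convolution of one block via weighted shifted submatrices."""
    k = H.shape[0]
    m, n = A.shape
    out = np.zeros((m - k + 1, n - k + 1))
    for p in range(k):
        for q in range(k):
            out += H[p, q] * A[p:p + m - k + 1, q:q + n - k + 1]
    return out

def tiled_conv(img, H, tile_rows, tile_cols):
    """Read overlapping block images (each sharing k-1 rows/columns with
    its neighbours), convolve each block, and splice the partial results."""
    k = H.shape[0]
    M, N = img.shape
    out = np.zeros((M - k + 1, N - k + 1))
    r = 0
    while r < M - k + 1:
        rows = min(tile_rows, M - r)
        c = 0
        while c < N - k + 1:
            cols = min(tile_cols, N - c)
            block = img[r:r + rows, c:c + cols]
            out[r:r + rows - k + 1, c:c + cols - k + 1] = conv_valid(block, H)
            c += cols - (k - 1)   # re-read k-1 columns of the neighbour
        r += rows - (k - 1)       # re-read k-1 rows of the neighbour
    return out

rng = np.random.default_rng(1)
img = rng.integers(0, 255, size=(15, 12)).astype(float)
H = rng.standard_normal((3, 3))
assert np.allclose(tiled_conv(img, H, 9, 8), conv_valid(img, H))
```

Because the output tiles are each (rows-k+1) × (cols-k+1) and the input tiles overlap by exactly k-1, the output tiles abut with no gap or overlap, so splicing is a plain write into the output matrix.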
The CNN convolutional layer operation method in this embodiment is implemented based on an FPGA.
In summary, the CNN convolutional layer operation method of the present invention stores the weight matrix H(h_pq) and the image matrix A(a_ij) required in the CNN convolutional layer operation in a cache, avoiding extra storage. This helps reduce the cost of completing the CNN convolutional layer operation to a certain extent, reduces the number of times data required by the operation is read from external storage (DDR), and helps increase the speed of the CNN convolutional layer operation.
In addition, the CNN convolution layer operation method also avoids the use of a feature map two-dimensional window in the prior art, further avoids the use of a control module for generating the feature map two-dimensional window in the prior art, reduces the complexity of CNN convolution operation to a certain extent, and is convenient to implement.
FIG. 7 is a diagram of an embodiment of a CNN convolutional layer arithmetic accelerator according to the present invention.
As shown in fig. 7, the CNN convolutional layer arithmetic accelerator 100 includes:
a first data pre-reading module 101, configured to read in a convolution kernel used for performing the CNN convolutional layer operation on a feature image to be processed and convert the read convolution kernel into a weight matrix H(h_pq), where the convolution kernel is a k × k convolution kernel, h_pq is the (p, q) element of the weight matrix H(h_pq), p = 0, 1, 2, ..., k-1, and q = 0, 1, 2, ..., k-1;
the second data pre-reading module 102 is configured to read a block image on the feature image to be processed according to a preset image size threshold;
a local operation module 103, configured to calculate, according to the weight matrix H(h_pq), the CNN convolutional layer local operation result corresponding to the block image currently read by the second data pre-reading module 102;
the judging module 104 is used for judging whether the whole characteristic image to be processed is read completely;
a convolutional layer operation result output module 105, configured to, when the determination module 104 determines that the entire feature image to be processed has been read completely, arrange the local operation results of each CNN convolutional layer obtained by the local operation module 103 according to the relative position relationship between the block images, and then splice to obtain and output a CNN convolutional layer operation result corresponding to the entire feature image to be processed;
the calling module 106 is configured to call the data pre-reading module to continue executing when the judging module 104 judges that the whole feature image to be processed is not read;
wherein, the local operation module 103 includes:
an image matrix conversion unit 1031, configured to convert the currently read block image into an image matrix A(a_ij), where the block image is a digital image of m × n pixels, a_ij is the (i, j) element of the image matrix A(a_ij), i = 0, 1, 2, ..., m-1, and j = 0, 1, 2, ..., n-1;
a local matrix acquisition unit 1032, configured to read each element h_pq of the weight matrix H(h_pq), obtain for each element h_pq its corresponding product matrix, formed by all elements of the image matrix A(a_ij) that need to be multiplied with that element, and multiply each element h_pq with its corresponding product matrix to obtain the local matrix corresponding to each element h_pq;
a local operation result obtaining unit 1033, configured to calculate a sum of the obtained local matrices, where the sum is a local operation result of the CNN convolution layer corresponding to the currently read block image.
The second data pre-reading module reads different block images each time.
The second data pre-reading module 102 reads a block image on the feature image to be processed according to a preset image size threshold, and the reading method includes:
when reading the block image for the first time, reading a block image meeting the requirement of the image size threshold from the characteristic image to be processed according to a preset reading initial position;
when the block images are read again, each read block image contains k-1 rows or k-1 columns of pixels of its respective adjacent block image.
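A small sketch of the read-window enumeration this describes (the 15 × 12 image and the 9 × 8 tile limit are illustrative values of ours, not fixed by the text): every window after the first re-reads k-1 rows or k-1 columns of its neighbour.

```python
def block_windows(M, N, tile_rows, tile_cols, k):
    """(row, col, height, width) of each block image read from an M x N
    feature image, with k-1 rows/columns of overlap between neighbours."""
    windows = []
    r = 0
    while r < M - k + 1:
        rows = min(tile_rows, M - r)
        c = 0
        while c < N - k + 1:
            cols = min(tile_cols, N - c)
            windows.append((r, c, rows, cols))
            c += cols - (k - 1)   # next window overlaps the last k-1 columns
        r += rows - (k - 1)       # next window overlaps the last k-1 rows
    return windows

# 15 x 12 feature image, 3 x 3 kernel: four block images, as in the embodiment
for w in block_windows(15, 12, 9, 8, 3):
    print(w)
```

With these example values the enumeration yields four windows, matching the four block images F1 to F4 of the embodiment.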
Optionally, as an embodiment of the present invention, the product matrix corresponding to the element h_pq in the local matrix acquisition unit 1032 covers the following cases:
when p = 0 and q = 0, the product matrix corresponding to h_pq is the (m-k+1) × (n-k+1) matrix formed by all rows and columns of the image matrix A(a_ij) remaining after removing the (n-k+1)-th, (n-k+2)-th, (n-k+3)-th, ..., (n-1)-th columns and the (m-k+1)-th, (m-k+2)-th, (m-k+3)-th, ..., (m-1)-th rows;
when p = 0 and q ≠ 0, the product matrix corresponding to h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing the 0th, 1st, 2nd, ..., (q-1)-th, (n-k+q+1)-th, (n-k+q+2)-th, (n-k+q+3)-th, ..., (n-1)-th columns and the (m-k+1)-th, (m-k+2)-th, (m-k+3)-th, ..., (m-1)-th rows;
when p ≠ 0 and q = 0, the product matrix corresponding to h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing the (n-k+1)-th, (n-k+2)-th, (n-k+3)-th, ..., (n-1)-th columns and the 0th, 1st, 2nd, ..., (p-1)-th, (m-k+p+1)-th, (m-k+p+2)-th, (m-k+p+3)-th, ..., (m-1)-th rows;
when p ≠ 0 and q ≠ 0, the product matrix corresponding to h_pq is the (m-k+1) × (n-k+1) matrix formed by splicing all rows and columns of the image matrix A(a_ij) remaining after removing the 0th, 1st, 2nd, ..., (q-1)-th, (n-k+q+1)-th, (n-k+q+2)-th, (n-k+q+3)-th, ..., (n-1)-th columns and the 0th, 1st, 2nd, ..., (p-1)-th, (m-k+p+1)-th, (m-k+p+2)-th, (m-k+p+3)-th, ..., (m-1)-th rows.
Optionally, as an embodiment of the present invention, the CNN convolutional layer operation accelerator further includes a cache;
the first data pre-reading module 101 stores the converted weight matrix H(h_pq) in the cache;
the image matrix conversion unit 1031 stores the converted image matrix A(a_ij) in the cache.
Optionally, as an embodiment of the present invention, the CNN convolutional layer arithmetic accelerator is implemented based on an FPGA.
Optionally, as an embodiment of the present invention, a block image that meets the image size threshold requirement is read from the feature image to be processed as follows:
when the size of the unread portion of the feature image to be processed is larger than the image size threshold, a local image equal in size to the image size threshold is read from the unread portion;
when the size of the unread portion of the feature image to be processed is smaller than or equal to the image size threshold, the entire unread portion of the feature image to be processed is read.
In particular implementations, the local matrix acquisition unit 1032 may employ a multiplier array to multiply each element h_pq with its corresponding product matrix to obtain the local matrix corresponding to each element h_pq; the local operation result acquisition unit 1033 may calculate the sum of the obtained local matrices using an accumulator array.
The same and similar parts in the various embodiments in this specification may be referred to each other.
In the embodiments provided in the present invention, it should be understood that the disclosed system and method can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules and the units is only one logical function division, and there may be other division ways in actual implementation.
Although the present invention has been described in detail with reference to the drawings in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and such modifications or substitutions fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1.一种CNN卷积层运算方法,其特征在于,包括步骤:1. a CNN convolution layer operation method, is characterized in that, comprises the steps: S1、读入用于对待处理特征图像进行CNN卷积层运算的卷积核,并将读入的卷积核转化为权重矩阵
Figure DEST_PATH_IMAGE001
,其中,卷积核为
Figure 680680DEST_PATH_IMAGE002
卷积核,
Figure DEST_PATH_IMAGE003
为权重矩阵
Figure 610590DEST_PATH_IMAGE001
Figure 712538DEST_PATH_IMAGE004
元素,
Figure DEST_PATH_IMAGE005
S1. Read in the convolution kernel used to perform the CNN convolution layer operation on the feature image to be processed, and convert the read convolution kernel into a weight matrix
Figure DEST_PATH_IMAGE001
, where the convolution kernel is
Figure 680680DEST_PATH_IMAGE002
convolution kernel,
Figure DEST_PATH_IMAGE003
is the weight matrix
Figure 610590DEST_PATH_IMAGE001
of
Figure 712538DEST_PATH_IMAGE004
element,
Figure DEST_PATH_IMAGE005
;
S2、按照预先设定的图像大小阈值,读取待处理特征图像上的一个分块图像,并依据所述的权重矩阵
Figure 124803DEST_PATH_IMAGE001
计算当前所读入的分块图像对应的CNN卷积层局部运算结果;
S2, according to the preset image size threshold, read a block image on the feature image to be processed, and according to the weight matrix
Figure 124803DEST_PATH_IMAGE001
Calculate the local operation result of the CNN convolution layer corresponding to the currently read block image;
S3、判断整幅待处理特征图像是否读取完成,若是,则继续执行步骤S4,否则转而重复执行步骤S2;S3, determine whether the entire feature image to be processed has been read, if so, continue to execute step S4; otherwise, repeat step S2; S4、按照各所述分块图像之间的相对位置关系,排布步骤S2中得到的各CNN卷积层局部运算结果,之后拼接得到整幅待处理特征图像对应的CNN卷积层运算结果;S4, arranging the local operation results of each CNN convolution layer obtained in step S2 according to the relative positional relationship between the block images, and then splicing to obtain the CNN convolution layer operation results corresponding to the entire feature image to be processed; 其中,步骤S2中依据所述的权重矩阵
Figure 72030DEST_PATH_IMAGE006
计算当前所读入的分块图像对应的CNN卷积层局部运算结果的实现方法包括步骤:
Wherein, in step S2, according to the weight matrix
Figure 72030DEST_PATH_IMAGE006
The implementation method of calculating the local operation result of the CNN convolution layer corresponding to the currently read block image includes the steps:
P1、将当前读入的分块图像转化为图像矩阵
Figure DEST_PATH_IMAGE007
,其中,分块图像为
Figure 122026DEST_PATH_IMAGE008
像素的数字图像,
Figure DEST_PATH_IMAGE009
为图像矩阵
Figure 362252DEST_PATH_IMAGE010
Figure DEST_PATH_IMAGE011
元素;其中,
Figure 497698DEST_PATH_IMAGE012
P1, convert the currently read block image into an image matrix
Figure DEST_PATH_IMAGE007
, where the block image is
Figure 122026DEST_PATH_IMAGE008
pixel digital image,
Figure DEST_PATH_IMAGE009
is the image matrix
Figure 362252DEST_PATH_IMAGE010
of
Figure DEST_PATH_IMAGE011
element; of which,
Figure 497698DEST_PATH_IMAGE012
;
P2、读取权重矩阵
Figure 310933DEST_PATH_IMAGE006
的每一个元素
Figure DEST_PATH_IMAGE013
、并分别获取每一个元素
Figure 549908DEST_PATH_IMAGE014
各自对应的由图像矩阵
Figure DEST_PATH_IMAGE015
中的所有需要与其做乘法运算的元素构成的乘积矩阵,并将每一个元素
Figure 462500DEST_PATH_IMAGE016
与其各自对应的乘积矩阵相乘得到每一个元素
Figure DEST_PATH_IMAGE017
各自对应的局部矩阵;
P2, read the weight matrix
Figure 310933DEST_PATH_IMAGE006
every element of
Figure DEST_PATH_IMAGE013
, and get each element separately
Figure 549908DEST_PATH_IMAGE014
The respective corresponding by the image matrix
Figure DEST_PATH_IMAGE015
The product matrix formed by all the elements that need to be multiplied in the
Figure 462500DEST_PATH_IMAGE016
Multiply each element by its corresponding product matrix
Figure DEST_PATH_IMAGE017
their corresponding local matrices;
P3、计算所得到的各局部矩阵的和,该和即为当前所读入的分块图像对应的CNN卷积层局部运算结果;P3. Calculate the sum of the obtained local matrices, and the sum is the local operation result of the CNN convolution layer corresponding to the currently read block image; 其中,步骤S2中每次读取到的待处理特征图像的分块图像互不相同;Wherein, the block images of the feature images to be processed that are read each time in step S2 are different from each other; 其中,步骤S2中按照预先设定的图像大小阈值,读取待处理特征图像上的一个分块图像,读取方法包括:Wherein, in step S2, according to a preset image size threshold, a block image on the feature image to be processed is read, and the reading method includes: 在首次读取分块图像时,按预先设定的读取起始位置从待处理特征图像上读取一个满足所述图像大小阈值要求的分块图像;When the segmented image is read for the first time, a segmented image that meets the image size threshold requirement is read from the feature image to be processed according to a preset reading starting position; 在再次读取分块图像时,每个所读取到的分块图像均包含其各相邻分块图像的k-1行或k-1列像素;When the block image is read again, each read block image contains k-1 rows or k-1 columns of pixels of its adjacent block images; 步骤P2中所涉及到的元素
Figure 819664DEST_PATH_IMAGE018
对应的乘积矩阵包括以下情况:
Elements involved in step P2
Figure 819664DEST_PATH_IMAGE018
The corresponding product matrix includes the following cases:
在p=0且q=0时,所涉及到的元素
Figure 935125DEST_PATH_IMAGE019
对应的乘积矩阵为图像矩阵
Figure DEST_PATH_IMAGE020
中除去第n-k+1、n-k+2、n-k+3、...、n-1列及除去第m-k+1、m-k+2、m-k+3、...、m-1行之后余下的所有的行和列形成
Figure 694133DEST_PATH_IMAGE021
矩阵;
When p=0 and q=0, the elements involved
Figure 935125DEST_PATH_IMAGE019
The corresponding product matrix is the image matrix
Figure DEST_PATH_IMAGE020
Remove the n-k+1, n-k+2, n-k+3, ..., n-1 columns and remove the m-k+1, m-k+2, m-k+3, ..., all remaining rows and columns after the m-1 row form
Figure 694133DEST_PATH_IMAGE021
matrix;
在p=0且
Figure DEST_PATH_IMAGE022
时,所涉及到的元素
Figure 777627DEST_PATH_IMAGE023
对应的乘积矩阵为将图像矩阵
Figure DEST_PATH_IMAGE024
中除去第0、1、2、...、q-1、n-k+q+1、n-k+q+2、n-k+q+3、...、n-1列以及除去第m-k+1、m-k+2、m-k+3、...、m-1行之后余下的所有的行和列拼接形成的
Figure 120622DEST_PATH_IMAGE025
矩阵;
at p=0 and
Figure DEST_PATH_IMAGE022
, the elements involved
Figure 777627DEST_PATH_IMAGE023
The corresponding product matrix is the image matrix
Figure DEST_PATH_IMAGE024
remove columns 0, 1, 2, ..., q-1, n-k+q+1, n-k+q+2, n-k+q+3, ..., n-1 and All remaining rows and columns after removing the m-k+1, m-k+2, m-k+3, ..., m-1 rows are formed by splicing
Figure 120622DEST_PATH_IMAGE025
matrix;
Figure 275659DEST_PATH_IMAGE026
且q=0时,所涉及到的元素
Figure DEST_PATH_IMAGE027
对应的乘积矩阵为,将图像矩阵
Figure 154754DEST_PATH_IMAGE028
中除去第n-k+1、n-k+2、n-k+3、...、n-1列及除去第0、1、2、...、p-1、m-k+p+1、m-k+p+2、m-k+p+3、...、m-1行之后余下的所有的行和列拼接形成的
Figure 940307DEST_PATH_IMAGE029
矩阵;
exist
Figure 275659DEST_PATH_IMAGE026
And when q=0, the elements involved
Figure DEST_PATH_IMAGE027
The corresponding product matrix is, the image matrix
Figure 154754DEST_PATH_IMAGE028
Remove the n-k+1, n-k+2, n-k+3, ..., n-1 columns and remove the 0, 1, 2, ..., p-1, m-k+ P+1, m-k+p+2, m-k+p+3, ..., all remaining rows and columns after the m-1 row are formed by concatenating
Figure 940307DEST_PATH_IMAGE029
matrix;
Figure 98494DEST_PATH_IMAGE030
Figure DEST_PATH_IMAGE031
时,所涉及到的元素
Figure 463747DEST_PATH_IMAGE032
对应的乘积矩阵为将图像矩阵
Figure 259665DEST_PATH_IMAGE033
中除去第0、1、2、...,q-1、n-k+q+1、n-k+q+2、n-k+q+3、...、n-1列及除去第0、1、2、...、p-1、m-k+p+1、m-k+p+2、m-k+p+3、...、m-1行之后余下的所有的行和列拼接形成的
Figure DEST_PATH_IMAGE034
矩阵。
exist
Figure 98494DEST_PATH_IMAGE030
and
Figure DEST_PATH_IMAGE031
, the elements involved
Figure 463747DEST_PATH_IMAGE032
The corresponding product matrix is the image matrix
Figure 259665DEST_PATH_IMAGE033
Remove the 0, 1, 2, ..., q-1, n-k+q+1, n-k+q+2, n-k+q+3, ..., n-1 columns and Remaining after removing lines 0, 1, 2, ..., p-1, m-k+p+1, m-k+p+2, m-k+p+3, ..., m-1 All rows and columns are concatenated to form
Figure DEST_PATH_IMAGE034
matrix.
2.根据权利要求1所述的CNN卷积层运算方法,其特征在于,2. CNN convolution layer operation method according to claim 1, is characterized in that, 步骤S1中将转化得到的权重矩阵
Figure 684961DEST_PATH_IMAGE035
存储在缓存中;
The weight matrix that will be transformed in step S1
Figure 684961DEST_PATH_IMAGE035
stored in the cache;
步骤P1中将转化为的图像矩阵
Figure DEST_PATH_IMAGE036
存储在缓存中。
The image matrix that will be converted to in step P1
Figure DEST_PATH_IMAGE036
stored in the cache.
3.根据权利要求1所述的CNN卷积层运算方法,其特征在于,该CNN卷积层运算方法基于FPGA实现。3. The CNN convolution layer operation method according to claim 1, wherein the CNN convolution layer operation method is realized based on FPGA. 4.根据权利要求1所述的CNN卷积层运算方法,其特征在于,步骤P2中采用乘法器阵列将每一个元素
Figure 2548DEST_PATH_IMAGE037
与其各自对应的乘积矩阵相乘得到每一个元素
Figure 499388DEST_PATH_IMAGE017
各自对应的局部矩阵。
4. CNN convolution layer operation method according to claim 1, is characterized in that, adopts multiplier array in step P2 to each element
Figure 2548DEST_PATH_IMAGE037
Multiply each element by its corresponding product matrix
Figure 499388DEST_PATH_IMAGE017
their respective local matrices.
5.一种CNN卷积层运算加速器,其特征在于,包括:5. A CNN convolutional layer operation accelerator, characterized in that, comprising: 第一数据预读模块,用于读入用于对待处理特征图像进行CNN卷积层运算的卷积核,并将读入的卷积核转化为权重矩阵
Figure DEST_PATH_IMAGE038
,其中,卷积核为
Figure 87495DEST_PATH_IMAGE039
卷积核,
Figure 11589DEST_PATH_IMAGE040
为权重矩阵
Figure DEST_PATH_IMAGE041
Figure 550892DEST_PATH_IMAGE042
元素,
Figure DEST_PATH_IMAGE043
Figure 523528DEST_PATH_IMAGE044
The first data pre-reading module is used to read in the convolution kernel used to perform the CNN convolution layer operation on the feature image to be processed, and convert the read convolution kernel into a weight matrix
Figure DEST_PATH_IMAGE038
, where the convolution kernel is
Figure 87495DEST_PATH_IMAGE039
convolution kernel,
Figure 11589DEST_PATH_IMAGE040
is the weight matrix
Figure DEST_PATH_IMAGE041
of
Figure 550892DEST_PATH_IMAGE042
element,
Figure DEST_PATH_IMAGE043
;
Figure 523528DEST_PATH_IMAGE044
;
第二数据预读模块,用于按照预先设定的图像大小阈值,读取待处理特征图像上的一个分块图像;The second data pre-reading module is used to read a block image on the feature image to be processed according to a preset image size threshold; 局部运算模块,用于依据所述的权重矩阵
Figure 294038DEST_PATH_IMAGE035
计算第二数据预读模块当前所读入的分块图像对应的CNN卷积层局部运算结果;
a local operation module for the weight matrix according to the
Figure 294038DEST_PATH_IMAGE035
Calculate the local operation result of the CNN convolution layer corresponding to the segmented image currently read by the second data pre-reading module;
判断模块,用于判断整幅待处理特征图像是否读取完成;The judgment module is used for judging whether the reading of the entire feature image to be processed is completed; 卷积层运算结果输出模块,用于在判断模块判断整幅待处理特征图像已读取完成时,按照各所述分块图像之间的相对位置关系,排布局部运算模块所得到的各CNN卷积层局部运算结果,之后拼接得到整幅待处理特征图像对应的CNN卷积层运算结果并输出;The convolution layer operation result output module is used for arranging the CNNs obtained by the partial operation module according to the relative positional relationship between the block images when the judgment module judges that the entire feature image to be processed has been read. The local operation results of the convolution layer are then spliced to obtain the CNN convolution layer operation results corresponding to the entire feature image to be processed and output; 调用模块,用于在判断模块判断整幅待处理特征图像未读取完成时调用数据预读模块继续执行;The calling module is used to call the data pre-reading module to continue executing when the judgment module judges that the entire feature image to be processed has not been read; 其中,所述局部运算模块包括:Wherein, the local operation module includes: 图像矩阵转化单元,用于将当前读入的分块图像转化为图像矩阵
Figure DEST_PATH_IMAGE045
,其中,分块图像为
Figure 559672DEST_PATH_IMAGE046
像素的数字图像,
Figure 681211DEST_PATH_IMAGE047
为图像矩阵
Figure 519854DEST_PATH_IMAGE048
Figure DEST_PATH_IMAGE049
元素;其中,
Figure 816975DEST_PATH_IMAGE050
Figure 784668DEST_PATH_IMAGE051
Image matrix conversion unit, used to convert the currently read block image into an image matrix
Figure DEST_PATH_IMAGE045
, where the block image is
Figure 559672DEST_PATH_IMAGE046
pixel digital image,
Figure 681211DEST_PATH_IMAGE047
is the image matrix
Figure 519854DEST_PATH_IMAGE048
of
Figure DEST_PATH_IMAGE049
element; of which,
Figure 816975DEST_PATH_IMAGE050
;
Figure 784668DEST_PATH_IMAGE051
;
局部矩阵获取单元,用于读取权重矩阵
Figure 127925DEST_PATH_IMAGE052
的每一个元素
Figure 770259DEST_PATH_IMAGE053
、并分别获取每一个元素
Figure 453044DEST_PATH_IMAGE054
各自对应的由图像矩阵
Figure DEST_PATH_IMAGE055
中的所有需要与其做乘法运算的元素构成的乘积矩阵,并将每一个元素
Figure 827525DEST_PATH_IMAGE056
与其各自对应的乘积矩阵相乘得到每一个元素
Figure 156613DEST_PATH_IMAGE057
各自对应的局部矩阵;
Local matrix acquisition unit for reading the weight matrix
Figure 127925DEST_PATH_IMAGE052
every element of
Figure 770259DEST_PATH_IMAGE053
, and get each element separately
Figure 453044DEST_PATH_IMAGE054
The respective corresponding by the image matrix
Figure DEST_PATH_IMAGE055
The product matrix formed by all the elements that need to be multiplied in the
Figure 827525DEST_PATH_IMAGE056
Multiply each element by its corresponding product matrix
Figure 156613DEST_PATH_IMAGE057
their corresponding local matrices;
局部运算结果获取单元,用于计算所得到的各局部矩阵的和,该和即为当前所读入的分块图像对应的CNN卷积层局部运算结果;The local operation result acquisition unit is used to calculate the sum of the obtained local matrices, and the sum is the local operation result of the CNN convolution layer corresponding to the currently read block image; 第二数据预读模块每次读取到的分块图像互不相同;The segmented images read each time by the second data pre-reading module are different from each other; 第二数据预读模块按照预先设定的图像大小阈值,读取待处理特征图像上的一个分块图像,读取方法包括:The second data pre-reading module reads a block image on the feature image to be processed according to a preset image size threshold, and the reading method includes: 在首次读取分块图像时,按预先设定的读取起始位置从待处理特征图像上读取一个满足所述图像大小阈值要求的分块图像;When the segmented image is read for the first time, a segmented image that meets the image size threshold requirement is read from the feature image to be processed according to a preset reading starting position; 在再次读取分块图像时,每个所读取到的分块图像均包含其各相邻分块图像的k-1行或k-1列像素;When the block image is read again, each read block image contains k-1 rows or k-1 columns of pixels of its adjacent block images; 所述局部矩阵获取单元中所涉及到的元素
Figure 337058DEST_PATH_IMAGE058
对应的乘积矩阵包括以下情况:
The elements involved in the local matrix acquisition unit
Figure 337058DEST_PATH_IMAGE058
The corresponding product matrix includes the following cases:
在p=0且q=0时,所涉及到的元素
Figure 874350DEST_PATH_IMAGE027
对应的乘积矩阵为图像矩阵
Figure 747628DEST_PATH_IMAGE059
中除去第n-k+1、n-k+2、n-k+3、...、n-1列及除去第m-k+1、m-k+2、m-k+3、...、m-1行之后余下的所有的行和列形成
Figure 3160DEST_PATH_IMAGE060
矩阵;
When p=0 and q=0, the elements involved
Figure 874350DEST_PATH_IMAGE027
The corresponding product matrix is the image matrix
Figure 747628DEST_PATH_IMAGE059
Remove the n-k+1, n-k+2, n-k+3, ..., n-1 columns and remove the m-k+1, m-k+2, m-k+3, ..., all remaining rows and columns after the m-1 row form
Figure 3160DEST_PATH_IMAGE060
matrix;
when p = 0 and 1 ≤ q ≤ k-1, the product matrix corresponding to the element w(p,q) is the (m-k+1)×(n-k+1) matrix formed by splicing together all rows and columns of the image matrix that remain after removing columns 0, 1, 2, ..., q-1 and columns n-k+q+1, n-k+q+2, n-k+q+3, ..., n-1, and after removing rows m-k+1, m-k+2, m-k+3, ..., m-1;
when 1 ≤ p ≤ k-1 and q = 0, the product matrix corresponding to the element w(p,q) is the (m-k+1)×(n-k+1) matrix formed by splicing together all rows and columns of the image matrix that remain after removing columns n-k+1, n-k+2, n-k+3, ..., n-1, and after removing rows 0, 1, 2, ..., p-1 and rows m-k+p+1, m-k+p+2, m-k+p+3, ..., m-1;
when 1 ≤ p ≤ k-1 and 1 ≤ q ≤ k-1, the product matrix corresponding to the element w(p,q) is the (m-k+1)×(n-k+1) matrix formed by splicing together all rows and columns of the image matrix that remain after removing columns 0, 1, 2, ..., q-1 and columns n-k+q+1, n-k+q+2, n-k+q+3, ..., n-1, and after removing rows 0, 1, 2, ..., p-1 and rows m-k+p+1, m-k+p+2, m-k+p+3, ..., m-1.
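All four row/column-removal cases above reduce to taking the shifted submatrix of the image matrix whose rows run from p to m-k+p and whose columns run from q to n-k+q. The following is a minimal NumPy sketch of this weight-decomposition view of convolution, not the claimed hardware design; the function name and the use of NumPy are illustrative assumptions:

```python
import numpy as np

def conv2d_by_weight_decomposition(image, kernel):
    """Valid-mode 2-D convolution computed as described in the claims:
    each weight element w(p, q) multiplies a shifted submatrix of the
    image (its 'product matrix'), and the k*k resulting local matrices
    are summed to give the local operation result."""
    m, n = image.shape
    k = kernel.shape[0]
    out = np.zeros((m - k + 1, n - k + 1))
    for p in range(k):
        for q in range(k):
            # Keeping rows p..m-k+p and columns q..n-k+q is equivalent
            # to the four row/column-removal cases enumerated above.
            product_matrix = image[p:m - k + p + 1, q:n - k + q + 1]
            out += kernel[p, q] * product_matrix  # local matrix
    return out
```

This form trades the usual per-output-pixel sliding window for k·k scalar-times-matrix operations, which maps naturally onto a multiplier array.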
6. The CNN convolutional layer operation accelerator according to claim 5, wherein the CNN convolutional layer operation accelerator further comprises a cache;
the weight matrix produced by the conversion in the first data pre-read module is stored in the cache;
the image matrix produced by the conversion in the image matrix conversion unit is stored in the cache.
7. The CNN convolutional layer operation accelerator according to claim 5, wherein the CNN convolutional layer operation accelerator is implemented on an FPGA.
8. The CNN convolutional layer operation accelerator according to claim 5, wherein the local matrix acquisition unit uses a multiplier array to multiply each element w(p,q) by its corresponding product matrix, obtaining the local matrix corresponding to each element.
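The k-1 row or column overlap between adjacent block images described in the claims is what makes the concatenated local results equal the whole-image result. Below is a small NumPy sketch of row-wise tiling under that overlap rule; the names `conv_valid`, `conv_by_row_tiles`, and `tile_rows` are assumptions for illustration, not terms from the patent:

```python
import numpy as np

def conv_valid(image, kernel):
    # Reference valid-mode 2-D convolution (cross-correlation form).
    m, n = image.shape
    k = kernel.shape[0]
    return np.array([[np.sum(image[i:i + k, j:j + k] * kernel)
                      for j in range(n - k + 1)]
                     for i in range(m - k + 1)])

def conv_by_row_tiles(image, kernel, tile_rows):
    """Convolve the feature image as a sequence of horizontal block
    images; each block after the first re-reads the last k-1 rows of
    its predecessor, so stacking the local results reproduces the
    whole-image convolution."""
    m, _ = image.shape
    k = kernel.shape[0]
    results, start = [], 0
    while start + k <= m:  # at least one full kernel window remains
        stop = min(start + tile_rows, m)
        results.append(conv_valid(image[start:stop], kernel))
        if stop == m:
            break
        start = stop - (k - 1)  # re-read k-1 shared boundary rows
    return np.vstack(results)
```

Without the k-1 overlap, output rows at every tile boundary would be lost; with it, each block contributes a disjoint, contiguous band of output rows.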
CN202010791455.5A 2020-08-07 2020-08-07 CNN convolutional layer operation method and CNN convolutional layer operation accelerator Active CN111967582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010791455.5A CN111967582B (en) 2020-08-07 2020-08-07 CNN convolutional layer operation method and CNN convolutional layer operation accelerator


Publications (2)

Publication Number Publication Date
CN111967582A CN111967582A (en) 2020-11-20
CN111967582B true CN111967582B (en) 2022-07-08

Family

ID=73365849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010791455.5A Active CN111967582B (en) 2020-08-07 2020-08-07 CNN convolutional layer operation method and CNN convolutional layer operation accelerator

Country Status (1)

Country Link
CN (1) CN111967582B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966729B (en) * 2021-02-26 2023-01-31 成都商汤科技有限公司 Data processing method and device, computer equipment and storage medium
CN112950656B (en) * 2021-03-09 2024-11-29 北京工业大学 Block convolution method for pre-reading data per channel based on FPGA platform
CN114327629B (en) * 2021-12-28 2025-03-14 北京航天自动控制研究所 A two-dimensional multi-channel convolution hardware accelerator based on FPGA
CN115860080B (en) * 2023-02-15 2023-05-09 苏州浪潮智能科技有限公司 Computing core, accelerator, computing method, device, equipment, medium and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549931A (en) * 2018-04-25 2018-09-18 济南浪潮高新科技投资发展有限公司 A kind of accelerator and method of convolutional neural networks
CN110321996A (en) * 2018-03-28 2019-10-11 华为技术有限公司 A kind of method and apparatus of the image procossing based on convolutional neural networks
CN110399591A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Data processing method and device based on convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11574694B2 (en) * 2018-10-11 2023-02-07 International Business Machines Corporation Kernel sets normalization with capacitor charge sharing



Similar Documents

Publication Publication Date Title
CN111967582B (en) CNN convolutional layer operation method and CNN convolutional layer operation accelerator
WO2019184657A1 (en) Image recognition method, apparatus, electronic device and storage medium
CN111951167B (en) Super-resolution image reconstruction method, super-resolution image reconstruction device, computer equipment and storage medium
CN112991142B (en) Matrix operation method, device, equipment and storage medium for image data
JP7461081B2 (en) Method and apparatus for deconvolution processing of feature data using convolution hardware
JP2005044098A (en) Image processor and image processing method
WO2020258491A1 (en) Universal character recognition method, apparatus, computer device, and storage medium
CN112580675B (en) Image processing method and device and computer readable storage medium
WO2018113224A1 (en) Picture reduction method and device
CN111986092A (en) Image super-resolution reconstruction method and system based on dual networks
JPS62262188A (en) Picture processor
CN110930306A (en) Depth map super-resolution reconstruction network construction method based on non-local perception
US20220343468A1 (en) Image processing method, electronic device and computer-readable storage medium
CN114266697A (en) Image processing and model training method and device, electronic equipment and storage medium
CN112184587A (en) Edge data enhancement model, and efficient edge data enhancement method and system based on model
CN116012588A (en) A Novel Feature Upsampling Method for Semantic Segmentation
CN110942425A (en) Reconstruction method and reconstruction system of super-resolution image and electronic equipment
CN107977923B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN115004220A (en) Neural network for raw low-light image enhancement
WO2022179075A1 (en) Data processing method and apparatus, computer device and storage medium
US20230252600A1 (en) Image size adjustment structure, adjustment method, and image scaling method and device based on streaming architecture
CN116434039B (en) Target detection method based on multiscale split attention mechanism
CN118537227A (en) Iterative interactive reference type stereo image super-resolution reconstruction method and system
CN113657587B (en) FPGA-based deformable convolution acceleration method and device
CN115147297A (en) An image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Building 9, No.1, guanpu Road, Guoxiang street, Wuzhong Economic Development Zone, Wuzhong District, Suzhou City, Jiangsu Province

Patentee after: Suzhou Yuannao Intelligent Technology Co.,Ltd.

Country or region after: China

Address before: Building 9, No.1, guanpu Road, Guoxiang street, Wuzhong Economic Development Zone, Wuzhong District, Suzhou City, Jiangsu Province

Patentee before: SUZHOU LANGCHAO INTELLIGENT TECHNOLOGY Co.,Ltd.

Country or region before: China
