WO2021115039A1 - FPGA platform, performance evaluation and design optimization method therefor, and storage medium

Info

Publication number: WO2021115039A1
Application number: PCT/CN2020/129156
Authority: WO
Inventors: 邵翠萍; 李慧云; 李青峰
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2019-12-02
Filing date: 2020-11-16
Publication date: 2021-06-17
Anticipated expiration: 2022-06-09
Also published as: CN111176962B; CN111176962A

Abstract

An FPGA platform, a performance evaluation and design optimization method therefor, and a storage medium. The method comprises: classifying data to be processed of an algorithm to be operated of the FPGA platform according to variables, the data corresponding to each variable being classified into a same data class, and the number of the data classes being equal to the number of the variables and not less than 2 (S101); calculating a calculation amount and a read amount required by each data class (S102); summing the calculation amounts and the read amounts of the data classes to obtain a total calculation amount and a total read amount of the algorithm to be operated (S103); and performing performance evaluation and/or design optimization on the FPGA platform on the basis of the total calculation amount and the total read amount (S104). The data to be processed is classified according to the variables on the basis of the algorithm to be operated, so as to visually reflect the contribution of the data to be processed in the data classes to the calculation amount and the read amount of the algorithm to be operated, so that the FPGA platform is analyzed on the basis of the total calculation amount and the total read amount to find performance bottlenecks of the FPGA platform.

Description

FPGA platform and its performance evaluation and design optimization methods and storage media

Technical field

This application relates to the technical field of high-performance computing, and in particular to an FPGA platform, a method for its performance evaluation and design optimization, and a storage medium.

Background

With the rapid development of big data and artificial intelligence, more and more data-intensive and compute-intensive algorithms have been proposed, and their larger calculation amounts and faster required processing speeds place higher demands on the performance of computing devices. Compared with commonly used computing devices such as the GPU (graphics processing unit), CPU (central processing unit) and ASIC (application-specific integrated circuit), an FPGA (Field Programmable Gate Array) platform offers good flexibility, excellent performance and low power consumption, so FPGA platforms are widely used in application scenarios that call for high performance, low power consumption and diverse algorithms.

Summary of the invention

An embodiment of the present application provides a method for performance evaluation and design optimization of an FPGA platform. The method includes: classifying the data to be processed by the algorithm to be run on the FPGA platform according to variables, where the data corresponding to each variable is assigned to the same data category, and the number of data categories equals the number of variables and is not less than 2; calculating the calculation amount and the read amount required by each data category; summing the calculation amounts and the read amounts of the data categories to obtain the total calculation amount and the total read amount of the algorithm to be run; and performing performance evaluation and/or design optimization of the FPGA platform based on the total calculation amount and the total read amount.

An embodiment of the present application also provides an FPGA platform. The FPGA platform includes a memory and a processor coupled to each other; the memory stores program data, and the processor executes the program data to implement the above method.

An embodiment of the present application further provides a computer storage medium for storing a computer program which, when executed by a processor, implements the above method.

The beneficial effect of this application is as follows. The method classifies the data to be processed according to variables, based on the algorithm to be run, so that the data corresponding to each variable is assigned to the same data category. This makes it easy to see how much the data in each category contributes to the calculation amount and the read amount of the algorithm. The FPGA platform can then be analyzed on the basis of the total calculation amount and the total read amount of the algorithm to find its performance bottleneck, which in turn guides the design optimization of the FPGA platform.

Description of the drawings

To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative work.

FIG. 1 is a schematic flowchart of an embodiment of the method for performance evaluation and design optimization of an FPGA platform provided by the present application;

FIG. 2 is a schematic flowchart of an embodiment of step S104 in FIG. 1;

FIG. 3 is a schematic structural diagram of an embodiment of the FPGA platform provided by the present application;

FIG. 4 is a schematic structural diagram of an embodiment of the computer storage medium provided by the present application.

Detailed description

The application is described in further detail below with reference to the drawings and embodiments. The following embodiments are only used to illustrate the application and do not limit its scope. Likewise, the following embodiments are only some, not all, of the embodiments of the present application; all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of the present application.

A reference to an "embodiment" in this application means that a specific feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. Occurrences of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described in this application can be combined with other embodiments.

After long-term research, the inventors of this application found that an important issue in using an FPGA platform is evaluating the performance achieved by an FPGA circuit design: the same algorithm can have multiple design schemes on an FPGA platform, and each scheme may achieve different performance. Evaluating the performance that a design scheme can achieve is therefore the key to predicting performance bottlenecks and guiding design optimization. In addition, most existing work builds a mathematical model for the CNN (Convolutional Neural Network) algorithm and analyzes the algorithm as a whole; it neither considers the different characteristics of the data corresponding to different variables within the algorithm, nor indicates a direction for optimizing memory reads for different variables. For this reason, this application proposes the following embodiments.

Refer to FIG. 1, which is a schematic flowchart of an embodiment of the method for performance evaluation and design optimization of an FPGA platform provided by the present application. The method includes:

S101: Classify the data to be processed by the algorithm to be run on the FPGA platform according to variables, where the data corresponding to each variable is assigned to the same data category, and the number of data categories equals the number of variables and is not less than 2.

In the embodiments of the present application, the algorithm to be run includes at least preset operations such as multiplication and/or addition of matrices or vectors. Here, addition may specifically cover both addition and subtraction between data; for other types of operations, multiplication may likewise cover both multiplication and division between data. Based on these preset operations, the FPGA platform processes the data to be processed to obtain the operation result.

For the large amount of data in an algorithm, the total number of times each datum participates in operations is not necessarily the same. Take the operation k*A_mn as an example, where k is a constant and A_mn is a matrix containing (m×n) data, so the operation involves (m×n+1) data in total. The number of variables is 2; the constant k participates in the operation (m×n) times in total, while each datum in the matrix A_mn participates only once. In other words, while the FPGA platform performs this operation, the constant k is read (m×n) times, whereas each datum in the matrix A_mn is read only once. The constant k can therefore be treated as one data category and the data in the matrix A_mn as another, so the number of data categories is 2.

In summary, the embodiments of the present application classify the data to be processed according to variables, based on the algorithm to be run, so that the data corresponding to each variable is assigned to the same data category. This makes it easy to see how much the data in each category contributes to the calculation amount and the read amount of the algorithm, and can indicate a direction for optimizing data reads, as explained in detail later.

S102: Calculate the calculation amount and the read amount required by each data category.

As described above, for a given algorithm the calculation amount and the read amount required by each data category are not necessarily the same. To better analyze how much each data category contributes to the calculation amount and the read amount while the algorithm runs, this application introduces two basic analysis indicators: the contribution C of a single datum to a single calculation, and the total number of participations N of a single datum.

On the one hand, when a datum in a data category participates in a single operation, the ratio of the number of operations involved in that single operation to the number of data participating in it is taken as the single-calculation contribution of that datum. The single-calculation contribution C of each datum can therefore be described by formula (1):

$$C = \frac{Ops}{Data} \tag{1}$$

In formula (1), Ops is the number of operations involved in a single operation, and Data is the number of data participating in that single operation. For the operation k*A_mn above, for example, a single operation can be regarded as the multiplication of two numbers, so the number of operations involved is 1 and the number of participating data is 2, which gives each datum a single-calculation contribution of 1/2.

On the other hand, the number of times a datum repeatedly participates in single operations can be taken as its total number of participations N. For the operation k*A_mn above, for example, the total number of participations of the constant k is (m×n), and the total number of participations of each datum in the matrix A_mn is 1.
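The two indicators for the k*A_mn example can be sketched in a few lines of Python. This is only an illustration under the definitions above; the function names are hypothetical and do not come from the application.

```python
from fractions import Fraction


def single_contribution(ops: int, data: int) -> Fraction:
    """Formula (1): C = Ops / Data for one single operation."""
    return Fraction(ops, data)


def participations_scalar_times_matrix(m: int, n: int) -> dict:
    """Total participations N per datum of each variable in k * A_mn."""
    # k is used once for every matrix element; each matrix element is used once.
    return {"k": m * n, "A": 1}


# One multiplication of two numbers: 1 operation, 2 participating data.
c = single_contribution(ops=1, data=2)
counts = participations_scalar_times_matrix(m=3, n=4)
print(c, counts)
```

Using `Fraction` keeps C exact, which is convenient when it is later multiplied by large participation counts.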

For each data category, the products of each datum's single-calculation contribution and its number of repetitions in single operations are averaged to obtain the average contribution of the data category. The average contribution C_A of each data category can therefore be described by formula (2):

$$C_A = \frac{1}{n}\sum_{i=1}^{n} C_i \times N_i \tag{2}$$

In formula (2), n is the number of data to be processed in the data category, C_i is the single-calculation contribution of the i-th datum, and N_i is the number of times the i-th datum repeatedly participates in single operations.

Further, the product of the number of data to be processed in a data category and its average contribution is taken as the calculation amount required by that data category. The calculation amount required by each data category can thus be written as (D_A × C_A), where D_A is the number of data to be processed in the data category.

For each data category, the repetition counts of its data are averaged to obtain the average number of repetitions of the data category. The average number of repetitions R_A of each data category can therefore be described by formula (3):

$$R_A = \frac{1}{n}\sum_{i=1}^{n} N_i \tag{3}$$

In formula (3), n is the number of data to be processed in the data category, and N_i is the number of times the i-th datum repeatedly participates in single operations.

Further, the product of the number of data to be processed in a data category, its average number of repetitions and its data bit width is taken as the read amount required by that data category. The read amount required by each data category can thus be written as (D_A × R_A × D_S), where D_S is the data bit width of the data category.

Note that for single-precision 32-bit floating-point numbers the width of each datum is 4 bytes, and for double-precision 64-bit floating-point numbers it is 8 bytes.
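The per-category quantities described above (average contribution, average number of repetitions, calculation amount and read amount) can be sketched as follows. This is a minimal illustration, assuming each category stores its per-datum values explicitly; the `DataCategory` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DataCategory:
    """One data category (hypothetical container, not from the application)."""
    contributions: Sequence[float]  # C_i: single-calculation contribution of datum i
    repetitions: Sequence[int]      # N_i: repetition count of datum i
    bytes_per_datum: int            # D_S: data width in bytes

    @property
    def count(self) -> int:
        """D_A: number of data in the category."""
        return len(self.contributions)

    def average_contribution(self) -> float:
        """C_A: mean of C_i * N_i over the category."""
        products = (c * n for c, n in zip(self.contributions, self.repetitions))
        return sum(products) / self.count

    def average_repetitions(self) -> float:
        """R_A: mean of N_i over the category."""
        return sum(self.repetitions) / self.count

    def calculation_amount(self) -> float:
        """D_A * C_A, in operations."""
        return self.count * self.average_contribution()

    def read_amount(self) -> float:
        """D_A * R_A * D_S, in bytes."""
        return self.count * self.average_repetitions() * self.bytes_per_datum


# k * A_mn with m = 2, n = 3 and 4-byte (single-precision) data.
m, n = 2, 3
k_cat = DataCategory([0.5], [m * n], 4)
a_cat = DataCategory([0.5] * (m * n), [1] * (m * n), 4)
print(k_cat.calculation_amount(), k_cat.read_amount())
print(a_cat.calculation_amount(), a_cat.read_amount())
```

In this small case the two categories contribute 3 operations each, which adds up to the m×n = 6 multiplications that k*A_mn actually performs.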

S103: Sum the calculation amounts and the read amounts of the data categories to obtain the total calculation amount and the total read amount of the algorithm to be run.

As described above, summing the expression (D_A × C_A) over the data categories gives the total calculation amount of the algorithm to be run, and summing the expression (D_A × R_A × D_S) gives its total read amount. The total calculation amount O_A of the algorithm to be run can therefore be described by formula (4):

$$O_A = \sum_{j=1}^{m} D_{Aj} \times C_{Aj} \tag{4}$$

In formula (4), m is the number of data categories in the algorithm to be run, D_Aj is the number of data to be processed in the j-th data category, and C_Aj is the average contribution of the j-th data category.

Further, the total read amount T_A of the algorithm to be run can be described by formula (5):

$$T_A = \sum_{j=1}^{m} D_{Aj} \times R_{Aj} \times D_{Sj} \tag{5}$$

In formula (5), m is the number of data categories in the algorithm to be run, D_Aj is the number of data to be processed in the j-th data category, R_Aj is the average number of repetitions of the j-th data category, and D_Sj is the data bit width of the j-th data category.

Note that the algorithm to be run may involve very large calculation and read amounts. The results of formulas (4) and (5) can therefore be rescaled using the approximation commonly used in the field (1G = 1000M ≈ 1024M = 2¹⁰M), so that the total calculation amount is expressed in GOps (billions of operations) and the total read amount in GB (gigabytes).
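The two totals and the unit conversion can be sketched as below. This is an illustration only: the tuple layout `(D_A, C_A, R_A, D_S)` and the function names are assumptions, and the conversion divides by 2³⁰ in line with the 1G ≈ 1024M approximation mentioned above.

```python
from typing import Sequence, Tuple

# Each category is described by (D_A, C_A, R_A, D_S); this layout is hypothetical.
Category = Tuple[int, float, float, int]


def total_calculation_amount(categories: Sequence[Category]) -> float:
    """O_A: sum of D_A * C_A over all categories, in operations."""
    return sum(d_a * c_a for d_a, c_a, _r_a, _d_s in categories)


def total_read_amount(categories: Sequence[Category]) -> float:
    """T_A: sum of D_A * R_A * D_S over all categories, in bytes."""
    return sum(d_a * r_a * d_s for d_a, _c_a, r_a, d_s in categories)


def to_giga(value: float) -> float:
    """Rescale to G units using the 1G = 2**30 approximation."""
    return value / 2**30


# k * A_mn with m = 2, n = 3 and 4-byte data: category k, then category A.
cats = [(1, 3.0, 6.0, 4), (6, 0.5, 1.0, 4)]
o_a = total_calculation_amount(cats)
t_a = total_read_amount(cats)
print(o_a, t_a, to_giga(2**31))
```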

The formulas above and their main parameters are briefly illustrated below, taking the multiplication of a 2048×1024 matrix A by a 1024×1024 matrix B as an example:

For this matrix multiplication, the (2048×1024+1024×1024) data to be processed can be divided into two categories according to variables: one consists of the (2048×1024) data contained in matrix A, and the other of the (1024×1024) data contained in matrix B. By the definition of matrix multiplication, each single operation can be regarded as multiplying the elements of one row of matrix A by the corresponding elements of one column of matrix B and then summing the products. Further, the data type of matrix A and matrix B can be single-precision 32-bit floating point, so the data bit width D_S of both matrices can be 4 bytes. The main analysis results are shown in the following table:

It follows that, for this matrix multiplication, each datum in matrix B contributes more to the calculation amount and the read amount of the algorithm than each datum in matrix A. For this reason, matrix B can either be placed in on-chip storage or be allocated memory with a larger bandwidth, so as to read the data in matrix B more efficiently, as explained in detail later.
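The matrix example can be worked through numerically with the formulas above. The sketch below is not a reproduction of the table in the application; it is a derivation under two stated assumptions: every multiplication and every addition counts as one operation, and each element of A is reused once per column of B while each element of B is reused once per row of A. All names are hypothetical.

```python
from fractions import Fraction

ROWS_A, INNER, COLS_B = 2048, 1024, 1024  # A is 2048x1024, B is 1024x1024
BYTES = 4                                  # D_S for single-precision data

# One single operation = one row of A times one column of B:
# INNER multiplications plus (INNER - 1) additions over 2 * INNER data.
c = Fraction(INNER + (INNER - 1), 2 * INNER)

categories = {
    "A": {"count": ROWS_A * INNER, "repetitions": COLS_B},
    "B": {"count": INNER * COLS_B, "repetitions": ROWS_A},
}

for cat in categories.values():
    # All data in a category behave identically here, so the averages in
    # formulas (2) and (3) reduce to the per-datum values.
    cat["avg_contribution"] = c * cat["repetitions"]
    cat["calculation"] = cat["count"] * cat["avg_contribution"]
    cat["read_bytes"] = cat["count"] * cat["repetitions"] * BYTES

total_ops = sum(cat["calculation"] for cat in categories.values())
total_bytes = sum(cat["read_bytes"] for cat in categories.values())

for name, cat in categories.items():
    print(name, float(cat["avg_contribution"]), cat["repetitions"])
print(float(total_ops) / 2**30, "GOps,", total_bytes / 2**30, "GB")
```

Under these assumptions each element of B has twice the repetition count and twice the average contribution of an element of A, which agrees with the conclusion that data in matrix B contribute more per datum.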

S104: Perform performance evaluation and/or design optimization of the FPGA platform based on the total calculation amount and the total read amount.

For an FPGA platform, the total calculation amount of the algorithm to be run reflects the load its processors must bear, and the total read amount reflects the load its memory must bear. In general, the larger the total calculation amount, the heavier the load on the processors and the higher the computing performance required of them; the larger the total read amount, the heavier the load on the memory and the higher the read performance required of it. The performance of the FPGA platform can therefore be evaluated based on the total calculation amount and the total read amount of the algorithm to be run in order to find its performance bottleneck, and the design of the FPGA platform can then be optimized based on that bottleneck.

Refer to FIG. 2, which is a schematic flowchart of an embodiment of step S104 in FIG. 1. This embodiment mainly explains how to find the performance bottleneck of the FPGA platform based on the total calculation amount and the total read amount of the algorithm to be run, and how to optimize the design of the FPGA platform based on that bottleneck.

In general, an FPGA platform can read data from two kinds of memory: DDR DRAM (Double Data Rate Dynamic Random Access Memory) and BRAM (Block Random Access Memory). DRAM is off-chip storage and BRAM is on-chip storage; the read performance of on-chip storage is better than that of off-chip storage. Further, an FPGA platform can use its internal DSPs (Digital Signal Processors) for multiplication and addition of data, so the theoretical maximum running speed is directly related to the number of processors used.

Therefore, for a given FPGA platform, both the theoretical maximum memory bandwidth and the theoretical maximum running speed are fixed. The theoretical maximum memory bandwidth B_max can be described by formula (6):

$$B_{max} = \sum_{i=1}^{n} W_{idth,i} \times f_{ram,i} \tag{6}$$

In formula (6), n is the number of kinds of memory used by the FPGA platform, W_idth,i is the memory bit width of the i-th kind of memory, and f_ram,i is its operating frequency. B_max can be expressed in GB/s (gigabytes per second).

Further, the theoretical maximum running speed P_max can be described by formula (7):

$$P_{max} = N_{DSP} \times f_{DSP} \times 2 \tag{7}$$

In formula (7), N_DSP is the number of processors used by the FPGA platform, f_DSP is the clock frequency of the processors, and the constant 2 indicates that each processor can perform both an addition and a multiplication in the same clock cycle. P_max can be expressed in GOps/s (billions of operations per second).
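The two theoretical limits can be sketched as below. The expression for B_max is an assumption based on the symbol descriptions in the text (a sum over memory kinds of bit width times operating frequency); the function names and the example hardware figures are hypothetical and do not describe any particular device.

```python
from typing import Sequence, Tuple


def theoretical_max_bandwidth(memories: Sequence[Tuple[int, int]]) -> float:
    """B_max in bytes/s, assuming a sum of (bit width * frequency) per memory kind.

    Each entry is (width_in_bits, frequency_in_hz); dividing by 8 converts
    bits to bytes.
    """
    return sum(width_bits * freq_hz / 8 for width_bits, freq_hz in memories)


def theoretical_max_speed(n_dsp: int, f_dsp_hz: int, ops_per_cycle: int = 2) -> int:
    """P_max in operations/s: formula (7) with the constant 2 as the default."""
    return n_dsp * f_dsp_hz * ops_per_cycle


# Hypothetical platform: one 64-bit off-chip interface at 800 MHz and an
# aggregate 4096-bit on-chip interface at 200 MHz; 1000 DSPs at 200 MHz.
b_max = theoretical_max_bandwidth([(64, 800_000_000), (4096, 200_000_000)])
p_max = theoretical_max_speed(n_dsp=1000, f_dsp_hz=200_000_000)
print(b_max / 1e9, "GB/s;", p_max / 1e9, "GOps/s")
```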

Based on the above description, the total calculation amount and the total read amount of the algorithm to be run can be calculated from formulas (4) and (5), respectively. Since the two are interrelated, this embodiment divides the total calculation amount O_A by the total read amount T_A to obtain the constant O_A/T_A, whose unit can be expressed in GOps/GB. This constant represents the calculation amount that one unit of read amount can support when the algorithm runs on the FPGA platform, and thus directly reveals the relationship between the data reading speed and the algorithm running speed. Further, suppose the FPGA platform reads an amount of data T₁ per unit time at memory bandwidth B. Because the data read is not necessarily fully used, the FPGA platform, running the algorithm at speed P, can perform at most T₁ × O_A/T_A operations per unit time. On this basis, the following relation (8) is obtained:

$$P \le \frac{O_A}{T_A} \times B \tag{8}$$

In formula (8), P is the actual running speed of the algorithm on the FPGA platform, and B is the memory bandwidth at which the FPGA platform reads the data to be processed while running the algorithm. This memory bandwidth includes both the bandwidth provided by on-chip storage and the bandwidth provided by off-chip storage during operation.

Further, since the data is not necessarily fully used, P/B ≤ a constant. There are two possible causes. The first is a memory bandwidth mismatch: for example, the memory bandwidth designed for one category of data is too large while that for another category is too small, so the data with the large bandwidth has to wait and that bandwidth cannot be fully exploited, which makes the actual running speed of the FPGA platform lower than the optimum. The second is that the computing speed of the processors is too low: data is processed more slowly than it is read, which again makes the actual running speed of the platform lower than the optimum.

If the cause is a memory bandwidth mismatch, then in order for formula (8) to hold with equality, this application proposes the equation

$$\frac{T_{A1}}{B_1} = \frac{T_{A2}}{B_2} = \cdots = \frac{T_{Am}}{B_m}$$

to guide how memory bandwidth is matched to different data. Here T_Aj is the read amount of the j-th data category, and B_j is the bandwidth of the memory allocated to the j-th data category. The equation means that the memory bandwidth of each data category is proportional to its read amount, so that all the data read while the algorithm runs can be fully used, which remedies the problem of bandwidth being too large or too small. In other words, this application allocates memory bandwidth among the data categories so that the ratios of each category's read amount to its allocated bandwidth tend to be equal, thereby optimizing the performance of the FPGA platform.

Further, this application sets the read priority of each data category, for transfers from the off-chip storage to the on-chip storage of the FPGA platform, according to the number of times the data in that category repeatedly participate in single operations. The larger the number of repetitions, the higher the read priority; this optimizes the read efficiency of the different data categories and thereby the performance of the FPGA platform. For a data category whose number of repetitions is 1, that is, whose data participate in only one single operation, the data of that category can be kept only in the off-chip DDR DRAM.

If instead the cause is the computing speed of the processors, so that formula (8) cannot hold with equality, then the performance bottleneck of the FPGA platform lies in the platform itself, and the running speed equals the computing speed of the processors actually used.
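The bandwidth-matching rule and the read-priority rule described above can be sketched as follows. This is one straightforward reading of the text, assuming a fixed total bandwidth is split in proportion to each category's read amount; the function names and example values are hypothetical.

```python
from typing import Dict, List


def allocate_bandwidth(read_amounts: Dict[str, float],
                       total_bandwidth: float) -> Dict[str, float]:
    """Split total_bandwidth so that T_Aj / B_j is the same for every category."""
    total_read = sum(read_amounts.values())
    return {name: total_bandwidth * t / total_read
            for name, t in read_amounts.items()}


def read_priority(avg_repetitions: Dict[str, float]) -> List[str]:
    """Order categories for off-chip to on-chip transfer: most repeated first."""
    return sorted(avg_repetitions, key=avg_repetitions.get, reverse=True)


# Hypothetical read amounts (bytes) and a total bandwidth of 8 bytes per time unit.
alloc = allocate_bandwidth({"x": 30.0, "y": 10.0}, total_bandwidth=8.0)
ratios = {name: amount / alloc[name]
          for name, amount in {"x": 30.0, "y": 10.0}.items()}
order = read_priority({"A": 1024, "B": 2048})
print(alloc, ratios, order)
```

With proportional allocation every category takes the same time to stream in, so no category's bandwidth sits idle waiting for another.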

基于上述的详细分析，步骤S104具体可以包括：Based on the above detailed analysis, step S104 may specifically include:

S1041：根据FPGA平台的内存的读取性能、总计算量及总读取量计算得到FPGA平台的第一运行速度。S1041: Calculate the first operating speed of the FPGA platform according to the read performance, total calculation amount, and total read amount of the memory of the FPGA platform.

本实施方式中，第一运行速度可以等于总计算量与总读取量的比值与内存的带宽的乘积结果。此时，第一运行速度P ₁可以用公式(9)进行描述： In this embodiment, the first operating speed may be equal to the product result of the ratio of the total calculation amount to the total read amount and the bandwidth of the memory. At this time, the first running speed P ₁ can be described by formula (9):

上述公式(9)中，总计算量O _A可以用运算次数为单位进行表示，总读取量T _A可以用字节为单位进行表示，内存的带宽B可以用字节/秒为单位进行表示，以使得第一运行速度P ₁可以用运算次数/秒为单位进行表示。 In the above formula (9), the total calculation amount O _A can be expressed in units of the number of operations, the total read amount T _A can be expressed in bytes, and the bandwidth B of the memory can be expressed in bytes per second. , So that the first running speed P ₁ can be expressed in units of calculation times/second.

S1042: Calculate the second operating speed of the FPGA platform according to the computing performance of the processor of the FPGA platform.

In this embodiment, the second operating speed may be equal to the product of the number of processors, the clock frequency, and the number of operations that can be performed synchronously within the same clock pulse, and is expressed in operations per second. The second operating speed $P_2$ can then be described by formula (10):

$$P_2 = N_{used} \times f_{DSP} \times N \tag{10}$$

In formula (10), $N_{used}$ represents the number of processors used in the FPGA platform, $f_{DSP}$ represents the clock frequency of the processors, and $N$ represents the number of operations that each processor can perform synchronously within the same clock pulse. For example, $N = 2$ means that the processor can perform two operations, an addition and a multiplication, within the same clock pulse.
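Formula (10) can be evaluated in the same way. The sketch below is only an illustration; the function and parameter names are assumptions, and the 130 MHz clock is the value implied by the verification figures given later in this description.

```python
def second_operating_speed(n_used, f_dsp_hz, ops_per_cycle):
    """Formula (10): P2 = N_used * f_DSP * N, returned in operations per second."""
    if n_used < 0 or f_dsp_hz < 0 or ops_per_cycle < 0:
        raise ValueError("all arguments must be non-negative")
    return n_used * f_dsp_hz * ops_per_cycle


# One processor and 32 processors, each clocked at 130 MHz with N = 2.
p2_single = second_operating_speed(1, 130e6, 2)
p2_many = second_operating_speed(32, 130e6, 2)
print(p2_single / 1e9, p2_many / 1e9)
```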

S1043: Compare the first operating speed with the second operating speed.

When the comparison in step S1043 shows that the first operating speed is less than the second operating speed, step S1044 is executed; when it shows that the second operating speed is less than the first operating speed, step S1045 is executed.

In this embodiment, the performance bottleneck of the FPGA platform can be found by comparing the first operating speed with the second operating speed, and the design of the FPGA platform can be further optimized based on the comparison result. The details are as follows:

S1044: Determine that the performance of the FPGA platform is restricted by the read performance of the memory.

In this embodiment, when the computing performance of the processor is fixed, the read performance of the memory can be adjusted so that the first operating speed becomes greater than or equal to the second operating speed, thereby optimizing the design of the FPGA platform.

S1045: Determine that the performance of the FPGA platform is restricted by the computing performance of the processor.

In this embodiment, when the read performance of the memory is fixed, the computing performance of the processor can be adjusted so that the second operating speed becomes greater than or equal to the first operating speed, thereby optimizing the design of the FPGA platform.
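The decision made in steps S1043 to S1045 amounts to a comparison of two numbers, as the following sketch shows. The function name and the returned labels are assumptions made for illustration. The text does not say what happens when the two speeds are equal, so the "balanced" result is also an assumption.

```python
def find_bottleneck(p1, p2):
    """Compare the memory-limited speed p1 with the processor-limited speed p2."""
    if p1 < p2:
        return "memory"      # step S1044: restricted by memory read performance
    if p2 < p1:
        return "processor"   # step S1045: restricted by processor computing performance
    return "balanced"        # assumption: equal speeds, neither side dominates


print(find_bottleneck(0.05, 0.26))
print(find_bottleneck(0.8, 0.26))
```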

The feasibility of this embodiment is verified below using the above example of multiplying the 2048×1024 matrix A by the 1024×1024 matrix B, and the main parameters are briefly explained:

This embodiment is based on the OpenCL 17.1 development environment. The FPGA platform is a Terasic C5P development board, which is connected to the computer motherboard through a PCIe interface and communicates with the host.

For the multiplication of the 2048×1024 matrix A by the 1024×1024 matrix B, the total calculation amount $O_A$ is 4 GOps and the total read amount $T_A$ is 16 GB, so the constant $O_A / T_A$ is 0.25 GOps/GB.

Further, this embodiment is verified three times in total. The first verification does not use on-chip storage: all the data to be processed is read from off-chip storage, that is, the data is stored in the off-chip DDR DRAM in advance. The second verification uses on-chip storage, that is, the data to be processed is stored in the on-chip BRAM in advance (the total bit width of the on-chip BRAM is 64 bits and its frequency is 400 MHz). The third verification also uses on-chip storage (the total bit width of the on-chip BRAM is 2048 bits and its frequency is 400 MHz). The main analysis results are shown in the following table:
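The text does not state how the memory bandwidth of each on-chip configuration is obtained from the quoted bit width and frequency. One plausible reading, assumed here, is that the BRAM delivers its full bit width once per clock cycle. The function name is hypothetical.

```python
def bram_bandwidth_bytes_per_s(total_bit_width, frequency_hz):
    """Assumed model: one transfer of total_bit_width bits per clock cycle."""
    if total_bit_width % 8 != 0:
        raise ValueError("bit width is expected to be a whole number of bytes")
    return total_bit_width / 8 * frequency_hz


narrow = bram_bandwidth_bytes_per_s(64, 400e6)     # second verification
wide = bram_bandwidth_bytes_per_s(2048, 400e6)     # third verification
print(narrow / 1e9, "GB/s", wide / 1e9, "GB/s")
```

Under this assumption the 64-bit configuration gives 3.2 GB/s, which agrees with the bandwidth tabulated for the second verification. The 2048-bit configuration gives 102.4 GB/s, while the table lists 100, so the tabulated figure may be rounded or obtained in a different way.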

| | $B$ (GB/s) | $P_1$ (GOps/s) | $N_{used}$ | $f_{DSP}$ (GHz) | $P_2$ (GOps/s) | $P$ (GOps/s) |
|---|---|---|---|---|---|---|
| Verification one | 0.2 | 0.05 | 1 | 0.13 | 0.26 | 0.05 |
| Verification two | 3.2 | 0.8 | 1 | 0.13 | 0.26 | 0.24 |
| Verification three | 100 | 25 | 32 | 0.13 | 8.32 | 8.24 |

Note that $P_1$ is the first operating speed of the FPGA platform calculated with formula (9), $P_2$ is the second operating speed of the FPGA platform calculated with formula (10), and $P$ is the actual operating speed of the FPGA platform.

It can be seen that verification one corresponds to the case in step S1043 where the first operating speed is less than the second operating speed: the performance bottleneck of the FPGA platform is the read performance of the memory, because the data to be processed is stored in the off-chip DDR DRAM and the memory bandwidth is therefore too low. To improve the performance of the FPGA platform in this case, the data to be processed can be stored in the on-chip BRAM in advance; for example, the data of matrix B can be placed in on-chip storage to increase the efficiency with which it is read. Verification two corresponds to the other case in step S1043, where the second operating speed is less than the first operating speed, and the actual operating speed of the FPGA platform is approximately equal to the theoretical operating speed of one processor. This indicates that the performance bottleneck lies in the computing performance of the processor: data is processed more slowly than it is read. To improve the performance in this case, the number of processors can be increased so that more processors take part in the computation. Verification three is similar to verification two. Although both the read performance and the computing performance of the FPGA platform have been improved to some extent, the bottleneck still lies in the computing performance of the processor. This is mainly because the improvement in read performance (memory optimization) is not matched by the improvement in computing performance (processor optimization), so the computing speed of the processors still cannot keep up with the read speed of the memory.
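The conclusions drawn for the three verifications can be checked by recomputing $P_1$ and $P_2$ from the tabulated inputs. The sketch below is illustrative only. It assumes that the bandwidth is given in GB/s, the clock frequency in GHz and the speeds in GOps/s, and that $N = 2$, which is the value consistent with the tabulated $P_2$; the names used are hypothetical.

```python
INTENSITY = 4 / 16      # O_A / T_A in GOps per GB, from the matrix example
OPS_PER_CYCLE = 2       # assumed N: one addition and one multiplication per clock

RUNS = [
    # (name, B in GB/s, N_used, f_DSP in GHz, measured P in GOps/s)
    ("Verification one", 0.2, 1, 0.13, 0.05),
    ("Verification two", 3.2, 1, 0.13, 0.24),
    ("Verification three", 100, 32, 0.13, 8.24),
]


def analyse(bandwidth, n_used, f_dsp):
    """Return (P1, P2, limiting resource) for one configuration."""
    p1 = INTENSITY * bandwidth                 # formula (9)
    p2 = n_used * f_dsp * OPS_PER_CYCLE        # formula (10)
    # A tie is reported as "processor" here; the text does not cover that case.
    bound = "memory" if p1 < p2 else "processor"
    return p1, p2, bound


for name, b, n, f, measured in RUNS:
    p1, p2, bound = analyse(b, n, f)
    print(f"{name}: P1={p1:.2f} P2={p2:.2f} "
          f"predicted={min(p1, p2):.2f} measured={measured:.2f} bound={bound}")
```

In each case the smaller of the two computed speeds is close to the measured speed, which is the agreement the paragraph above relies on.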

Based on the above detailed analysis, when optimizing the design of the FPGA platform, the difference between the first operating speed and the second operating speed can be made as small as possible, that is, the computing speed of the processor is matched to the read speed of the memory as closely as possible.
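Matching the two speeds can be turned into a simple sizing calculation by solving formulas (9) and (10) for the quantity being adjusted. The sketch below is a hedged illustration: the function names are assumptions, and the figures are those of verification three with speeds in GOps/s, frequency in GHz and amounts in GOps and GB.

```python
import math


def processors_to_match(p1, f_dsp, ops_per_cycle):
    """Smallest N_used for which P2 = N_used * f_DSP * N reaches P1."""
    return math.ceil(p1 / (f_dsp * ops_per_cycle))


def bandwidth_to_match(p2, total_ops, total_bytes):
    """Bandwidth B for which P1 = (O_A / T_A) * B equals P2."""
    return p2 * total_bytes / total_ops


# Verification three: P1 = 25, P2 = 8.32, f_DSP = 0.13, N = 2.
needed_processors = processors_to_match(25, 0.13, 2)
needed_bandwidth = bandwidth_to_match(8.32, total_ops=4, total_bytes=16)
print(needed_processors, needed_bandwidth)
```

Read this way, the third configuration would need roughly three times as many processors to use its memory bandwidth fully, or could reach the same speed with about a third of that bandwidth.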

Fig. 3 is a schematic structural diagram of an embodiment of the FPGA platform provided by the present application.

The FPGA platform 300 of this embodiment includes a memory 301 and a processor 302, which may be coupled via a data bus. The memory 301 may be off-chip storage and/or on-chip storage and is used to store program data. The processor 302 may be a digital signal processor and is used to execute the program data to implement the following method steps:

Classifying the data to be processed of the algorithm to be run on the FPGA platform according to variables, wherein the data corresponding to each variable is classified into the same data category, and the number of data categories is equal to the number of variables and not less than 2; calculating the calculation amount and the read amount required by each data category; summing the calculation amounts and the read amounts of the data categories to obtain the total calculation amount and the total read amount of the algorithm to be run; and performing performance evaluation and/or design optimization on the FPGA platform based on the total calculation amount and the total read amount.
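The classification and summation steps listed above can be sketched for the matrix-multiplication example, using the per-category rules stated in the claims (calculation amount as count times average contribution, read amount as count times average repetitions times data width). The class and field names are hypothetical. The sketch also assumes 32-bit (4-byte) elements and a multiply-accumulate counted as two operations shared by two operands, so that each participation of an element contributes one operation; these assumptions are not stated in this section.

```python
from dataclasses import dataclass


@dataclass
class DataCategory:
    name: str
    count: int                # number of data items belonging to this variable
    avg_contribution: float   # average operations contributed per item
    avg_repetitions: float    # average number of times an item is read
    bytes_per_item: int       # data bit width expressed in bytes

    def calc_amount(self):
        return self.count * self.avg_contribution

    def read_amount(self):
        return self.count * self.avg_repetitions * self.bytes_per_item


def totals(categories):
    """Sum the per-category amounts into the total calculation and read amounts."""
    if len(categories) < 2:
        raise ValueError("the method requires at least two data categories")
    total_ops = sum(c.calc_amount() for c in categories)
    total_bytes = sum(c.read_amount() for c in categories)
    return total_ops, total_bytes


# A is 2048x1024 and each element is used once per column of B (1024 times);
# B is 1024x1024 and each element is used once per row of A (2048 times).
matrix_a = DataCategory("A", 2048 * 1024, 1024, 1024, 4)
matrix_b = DataCategory("B", 1024 * 1024, 2048, 2048, 4)
total_ops, total_bytes = totals([matrix_a, matrix_b])
print(total_ops / 2**30, "GOps", total_bytes / 2**30, "GB")
```

Under these assumptions the totals come to 4 GOps and 16 GB in binary prefixes, matching the figures used in the verification, and their ratio is the 0.25 GOps/GB constant.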

It should be noted that the FPGA platform 300 of this embodiment is a physical terminal based on any of the foregoing method embodiments; its implementation principles and steps are similar and are not repeated here. Accordingly, when the program data is executed by the processor 302, the other method steps of any of the foregoing embodiments can also be implemented.

The computer storage medium 400 of this embodiment is used to store a computer program 401, and the computer program 401 is executed by a processor to implement the following method steps:

It should be noted that the method implemented by the computer program 401 of this embodiment is based on any of the foregoing method embodiments, and its implementation principles and steps are similar. Accordingly, when the computer program 401 is executed by the processor, the other method steps of any of the foregoing embodiments can also be implemented, which are not repeated here.

When the embodiments of the present application are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above descriptions are only some of the embodiments of this application and do not limit the scope of protection of this application. Any equivalent device or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

A method for performance evaluation and design optimization of an FPGA platform, characterized in that the method includes:

Classifying the data to be processed of the algorithm to be run on the FPGA platform according to variables, wherein the data corresponding to each variable is classified into the same data category, and the number of data categories is equal to the number of variables and not less than 2;

Calculating the calculation amount and the read amount required for each of the data categories;

Summing the calculation amounts and the read amounts of the data categories to obtain the total calculation amount and the total read amount of the algorithm to be run;

Performing performance evaluation and/or design optimization on the FPGA platform based on the total calculation amount and the total read amount.

The method according to claim 1, wherein the step of calculating the calculation amount and the read amount required for each of the data categories comprises:

Taking the product of the number of pieces of data to be processed in each of the data categories and the average contribution as the calculation amount required for that data category;

Taking the product of the number of pieces of data to be processed in each of the data categories, the average number of repetitions, and the data bit width as the read amount required for that data category.

The method according to claim 2, wherein the step of calculating the calculation amount and the read amount required for each of the data categories further comprises:

When each piece of data to be processed in each of the data categories participates in a single operation, taking the ratio of the number of operations involved in the single operation to the number of pieces of data to be processed participating in the single operation as the single-operation contribution of each piece of data to be processed;

For each of the data categories, averaging the product of the single-operation contribution of each piece of data to be processed and the number of repetitions with which that piece of data repeatedly participates in the single operation, to obtain the average contribution of that data category;

For each of the data categories, averaging the numbers of repetitions of the pieces of data to be processed to obtain the average number of repetitions of that data category.

The method according to claim 1, wherein the step of performing performance evaluation and/or design optimization of the FPGA platform based on the total calculation amount and the total read amount comprises:

Calculating the first operating speed of the FPGA platform according to the read performance of the memory of the FPGA platform, the total calculation amount, and the total read amount;

Calculating the second operating speed of the FPGA platform according to the computing performance of the processor of the FPGA platform;

Comparing the magnitude of the first operating speed and the second operating speed;

If the first operating speed is less than the second operating speed, it is determined that the performance of the FPGA platform is restricted by the read performance of the memory;

If the second operating speed is less than the first operating speed, it is determined that the performance of the FPGA platform is restricted by the computing performance of the processor.

The method according to claim 4, wherein the first operating speed is equal to the product of the bandwidth of the memory and the ratio of the total calculation amount to the total read amount; wherein the total calculation amount is expressed in operations, the total read amount is expressed in bytes, and the bandwidth of the memory is expressed in bytes per second;

The second operating speed is equal to the product of the number of processors, the clock frequency, and the number of operations that can be executed synchronously within the same clock pulse, and is expressed in operations per second.

The method according to claim 4, wherein the step of performing performance evaluation and/or design optimization of the FPGA platform based on the total calculation amount and the total read amount further comprises:

When the computing performance of the processor has been determined, adjusting the read performance of the memory so that the first operating speed is greater than or equal to the second operating speed;

When the read performance of the memory has been determined, adjusting the computing performance of the processor so that the second operating speed is greater than or equal to the first operating speed.

The bandwidth of the memory is allocated among the data categories, so that the ratio of the read amount of each data category to the allocated bandwidth tends to be equal.

The method according to claim 4, wherein the step of performing performance evaluation and/or design optimization of the FPGA platform based on the total calculation amount and the total read amount further comprises:

Setting the read priority level of each data category from the off-chip storage of the FPGA platform to the on-chip storage according to the number of repetitions with which the data to be processed in each data category repeatedly participates in a single operation, wherein the greater the number of repetitions, the higher the read priority level.

The method according to claim 1, wherein the algorithm to be run includes at least matrix or vector multiplication and/or addition operations.

An FPGA platform, characterized in that the FPGA platform includes a memory and a processor, the memory is coupled to the processor, the memory is used for storing program data, and the processor is used for executing the program data to implement the method according to any one of claims 1-9.

A computer storage medium for storing a computer program, wherein the computer program implements the method according to any one of claims 1-9 when the computer program is executed by a processor.