[go: up one dir, main page]

CN114237716B - High-performance implementation method of FIR filter based on domestic many-core processor - Google Patents

High-performance implementation method of FIR filter based on domestic many-core processor Download PDF

Info

Publication number
CN114237716B
CN114237716B CN202111519880.XA CN202111519880A CN114237716B CN 114237716 B CN114237716 B CN 114237716B CN 202111519880 A CN202111519880 A CN 202111519880A CN 114237716 B CN114237716 B CN 114237716B
Authority
CN
China
Prior art keywords
core
data
input data
fir filter
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111519880.XA
Other languages
Chinese (zh)
Other versions
CN114237716A (en
Inventor
刘鹏
杨昕瑶
郭俊
张鲁飞
吴东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202111519880.XA priority Critical patent/CN114237716B/en
Publication of CN114237716A publication Critical patent/CN114237716A/en
Application granted granted Critical
Publication of CN114237716B publication Critical patent/CN114237716B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03HIMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H17/00Networks using digital techniques
    • H03H17/02Frequency selective networks
    • H03H17/0211Frequency selective networks using specific transformation algorithms, e.g. WALSH functions, Fermat transforms, Mersenne transforms, polynomial transforms, Hilbert transforms
    • H03H17/0213Frequency domain filters using Fourier transforms
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03HIMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H17/00Networks using digital techniques
    • H03H2017/0072Theoretical filter design
    • H03H2017/0081Theoretical filter design of FIR filters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

The invention provides a high-performance implementation method of an FIR filter based on a domestic many-core processor, which is based on a platform of the domestic many-core processor, analog signals are subjected to analog-to-digital conversion to obtain input data, a control core distributes the input data into four core groups by using a message transfer interface, M-1 zero values are supplemented at the front end of the input data, then the result H of FFT of a twiddle factor W and a filter coefficient H [ M ] is calculated, and the twiddle factor W and the calculation result H are transmitted to each operation core by using direct memory access; and then transmitting single-round data to each operation core through direct memory access, and performing single-round FIR filtering calculation, wherein the operation results of each round are sequentially connected in a control core to obtain a final result. Compared with the method for directly calculating the FIR filter algorithm by utilizing a single core of a domestic processor, the method for optimizing the FIR filter algorithm improves the core parallelism, realizes the parallelization of data processing, and improves the algorithm speed.

Description

FIR filter high-performance implementation method based on domestic many-core processor
Technical Field
The invention relates to the field of finite-length impulse response filters, in particular to a method for realizing high performance of an FIR filter based on a domestic many-core processor.
Background
The digital filter is one of the most widely applied algorithms in the field of digital signal processing, and has many applications in the fields of voice and image processing, multimedia application, etc. because of its advantages of strong stability, strict linear phase, easy implementation of hardware, etc.
The calculation of the FIR filter is the process of completing convolution of the filter coefficient and the input signal, and the problem of calculation resource waste generated by direct convolution can be solved by utilizing an overlap preservation method because the length N of the input data is usually far longer than the length M of the filter coefficient. According to the overlap preservation method, the input data is divided into R data blocks with the block size L, the last M-1 points of the last sub-block are added into each sub-block, and R times of FIR filtering calculation with the single block length of F=L+M-1 are carried out (all the first M-1 points of the first sub-block are 0). According to the cyclic convolution theorem, the fast fourier transform (Fast Fourier Transform, FFT) of the two sequences of circumferential convolution results is equal to the product of the two sequences of fast fourier transform results. When the length of the single block FIR filter is F, the equivalent condition of linear convolution and circumferential convolution is satisfied, so that FFT calculation can be carried out on the filter coefficient and each block of input data, then the FFT results of the two sequences are multiplied, discrete Fourier inverse transformation is carried out on the products, the result of a single sub-block is obtained, and finally the sub-block results are spliced to obtain complete output.
The Shenwei processor 26010 adopts a 64-bit autonomous Shenwei instruction set, a full chip 260 core, a chip standard operating frequency of 1.5GHz and a peak operating speed of 3.168TFLOPS. The Shenwei processor has four core groups, each core group has 1 control core and 64 operation cores, the memory space of a single core group is 8 Gbyte, the local memory space of a single operation core is 16 Kbyte, and the data transmission can be carried out between the core groups by utilizing a message transmission interface (MESSAGE PASSING INTERFACE, MPI).
When the FIR filter is calculated, if the input data amount is large, the calculation speed of the Shenwei many cores of the domestic processor is slow, so that a method capable of solving the performance bottleneck of the FIR filter calculation is needed.
Disclosure of Invention
The invention aims to solve the technical problem of providing a high-performance implementation method of an FIR filter based on a domestic many-core processor, which is used for solving the problem of suitability of a digital signal processing function FIR filter and a Shenwei many-core processor and improving the calculation rate of the large-point FIR filter on the Shenwei many-core processor.
In order to solve the technical problems, the invention provides a high-performance implementation method of an FIR filter based on a domestic many-core processor, wherein the domestic many-core processor comprises four core groups, each core group comprises 1 control core and 64 operation cores, and the method comprises the following steps:
S1, analog signals are subjected to analog-to-digital A-D conversion to obtain input data, wherein the input data comprises a filter coefficient h M with the length of M, and the length of the input data is N;
S2, the control core is utilized to read in the input data in the step 1, the message passing interface MPI is utilized to distribute the input data into four core groups, and M-1 zero values are supplemented at the front end of the input data;
in a control core, calculating a twiddle factor W, carrying out final zero padding to 128 points on a filter coefficient H [ M ], carrying out Fast Fourier Transform (FFT) calculation to obtain a result H, and transmitting the twiddle factor W and the result H of the FFT calculation to all operation cores by using a direct memory access DMA;
S3, in each core group, dividing the input data of the step S1 into data blocks with R blocks of L, adding the last M-1 points of the previous sub-block, respectively carrying out R times of calculation of an FIR filter with the monolithic length of F=L+M-1, wherein all the first M-1 points of the first sub-block are 0, and transmitting continuous 4*F point data to each operation core according to the input sequence; transmitting M-1+4L points of data, wherein the data comprises 4*L FIR filtering calculation results for effectively calculating input data; after 64 operation cores in each core group successfully receive the data, the transmission of single-round data is completed;
S4, after each operation core receives the single-round time data transmitted in the step 3, starting to calculate a single-round FIR filter, and simultaneously writing back direct memory access DMA of the result of the previous-round FIR filter and reading in the direct memory access DMA of the input data of the next round; after each operation core receives data, utilizing single instruction stream multiple data stream SIMD technology, storing 4*F point data output in step 3 into vector type variable array, then simultaneously making calculation of 4 groups of fast Fourier transform FFT, then multiplying input signal and fast Fourier transform FFT result of filter coefficient h [ M ], and making inverse discrete Fourier transform so as to obtain single calculation result; after the calculation of the single-round FIR filter of the operation core in each core group is completed, discarding the first M-1 number of each sub-block with the length of F, and transmitting the output result y [4*L ] back to the control core by using Direct Memory Access (DMA);
S5, repeating the operation of the step S3 and the step S4 on the input data of the step 1 until the processing of all the input data is completed, and sequentially storing the operation result y [4*L ] of each round into y [ N ] in a control core to obtain the final operation result y [ N ] of the N-point filter.
As the improvement of the high-performance implementation method of the FIR filter based on the domestic many-core processor, the invention has the advantages that:
In step S3, the input data is divided into data blocks with R blocks of L, in order to divide the FIR filter with large points into a plurality of FIR filters with small points of the same size, 256 operation cores are called to calculate the FIR filters with small points in parallel, 128 points are selected as the length F of the FIR filters with small blocks, the effective input data length processed by each FIR filter block with length F is L, the total block number R of the input data division is determined by the input data amount, and the total round number to be calculated by the operation cores is
As a further improvement of the high-performance implementation method of the FIR filter based on the domestic many-core processor, the invention has the advantages that:
The four input data of length F in step S3 includes 4 x (M-1) data transmitted to avoid aliasing, and 4*L valid calculation input data FIR filter results.
As a further improvement of the high-performance implementation method of the FIR filter based on the domestic many-core processor, the invention has the advantages that:
When each operation core in step S4 calculates the single-pass FIR filter, firstly, data is stored from a standard double-precision floating point type array to the vector type variable array, the length of the vector type variable array is 128, the same number with the same relative position in four L-point data blocks is loaded into the same vector type variable of the vector type array, and simultaneously 4 sets of 128-point fourier transform FFT calculation which are not related to each other are performed.
As a further improvement of the high-performance implementation method of the FIR filter based on the domestic many-core processor, the invention has the advantages that:
the single instruction stream multiple data stream SIMD comprises the following specific steps: after receiving the data of M-1+4+L points transmitted from the control core, the operation core sequentially stores 4 blocks of continuous data of M-1+L points which are overlapped by the M-1 points into the vector variables, wherein the real part and the imaginary part of the data need to be respectively stored into the arrays of two vector variables according to the same sequence, and the real part and the imaginary part of the data are respectively: the imaginary part vector variable array xvi [ F ] and the imaginary part vector variable array xvr [ F ] are subjected to Fourier transform FFT calculation of the point F; the result of FFT calculation is stored in a vector variable array xvi [ F ] of an imaginary part and a vector variable array xvr [ F ] of the imaginary part, multiplied by a FFT calculation result H of a filter coefficient H [ M ] respectively, the products are stored in vector arrays YI [ F ] and YR [ F ] respectively, and finally YI and YR are subjected to IFFT calculation respectively to obtain a single-round calculation result y [4*L ] in an operation core;
When the data volume of the FIR filter calculated by a single operation core is larger than M-1+L64, direct memory access DMA read-write operation needs to be carried out for multiple rounds, and the communication time is hidden by utilizing the calculation time by utilizing a pipelining scheduling method; through the pipeline scheduling design, except the read-in of the first round and the write-back communication process of the last round, when the calculation core calculates the small block FIR filter of the data of the present round, the read-in of the data required by the calculation of the next round and the write-back communication of the calculation result of the last round are carried out, so that the communication time of the intermediate data is hidden.
As a further improvement of the high-performance implementation method of the FIR filter based on the domestic many-core processor, the invention has the advantages that:
The calculation formula of the FIR digital filter is as follows:
wherein y (n) is the output signal; h is a filter coefficient, h (k) represents a kth value of the filter coefficient; x (n) is input data, x (n-k) is the n-k value of the input data, M represents the total length of the filter coefficient h [ M ];
The fast Fourier transform FFT is the same as the discrete Fourier transform DFT in nature, and the calculation formula of the DFT is as follows:
wherein W is a twiddle factor, X (n) is the nth input data, X (k) is the kth output data value, n is an integer in the interval [0, N-1], A twiddle factor representing the length N, n of the input data and a particular value of k:
the beneficial effects of the invention are mainly as follows:
based on multiple cores of a many-core processor, the invention splits the large-point FIR filter into small-point FIR filters with the same size for parallel calculation, thereby improving the core parallelism;
the single instruction stream multiple data stream SIMD parallel instructions of the many-core processor are utilized to parallelize the processing data, so that the processing parallelism of the data stream is improved;
the invention adopts data rearrangement and running water scheduling to reduce redundant calculation and data transmission expenditure of the FIR algorithm, and improves the execution speed of the algorithm.
Drawings
The following describes the embodiments of the present invention in further detail with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method for realizing high performance of an FIR filter based on a domestic many-core processor;
FIG. 2 is a flow diagram of a single instruction stream multiple data stream SIMD operation in a single operation core;
FIG. 3 is a schematic diagram of the acceleration ratio results of the experiment in example 1.
Detailed Description
The invention will be further described with reference to the following specific examples, but the scope of the invention is not limited thereto:
Embodiment 1, a method for implementing high performance of an FIR filter based on a domestic many-core processor, as shown in fig. 1-2, can adapt the architecture and characteristics of the FIR filter algorithm and the Shenwei many-core processor.
The method for realizing the high performance of the FIR filter based on the domestic Shenwei processor 26010 comprises the following steps:
(1) Input data
Analog signals such as voice signals, image signals and the like are subjected to analog-to-digital (A-D) conversion to obtain input data calculated by an FIR filter, wherein the input data calculated by the FIR filter comprises a filter coefficient h [ M ], the length of the filter coefficient h [ M ] is M, and the length of the input data is N; the input data precision is 64-bit double-precision data, the conversion type is complex to complex conversion, namely the input data and the output data are complex; inputting values with the data scale of 60 ten thousand points and higher, wherein the data is double-precision complex data;
the filter types include a high pass filter, a low pass filter, a band pass filter, and a band stop filter, and the filter coefficient length M is supported in a range of one digit to two digits.
(2) The data are distributed to each core group by using a message passing interface MPI, FFT calculation results H of a twiddle factor W and a filter coefficient H [ M ] are calculated and transmitted to an operation core:
The control core is utilized to read in the input data in the step 1, and a message passing interface MPI is utilized to distribute the input data into four core groups, and M-1 zero values are supplemented at the front end of the input data to avoid the generation of aliasing; calculating a rotation factor W required by 128-point Fast Fourier Transform (FFT) according to a formula (3) in a control core, carrying out end zero padding on a filter coefficient H [ M ] to 128 points, then carrying out Fast Fourier Transform (FFT) calculation to obtain a result H, and transmitting the rotation factor W and the result H of the Fast Fourier Transform (FFT) to all operation cores by using Direct Memory Access (DMA);
The FIR digital filter is calculated as follows:
Wherein y (n) is the output signal; h is a filter coefficient, h (k) represents a kth value of the filter coefficient; x (n) is input data, and x (n-k) is the n-k value of the input data; n represents the length of the input data, and M represents the total length of the filter coefficient h [ M ].
The FIR filter calculation can be regarded as the process of convolving the filter coefficient with the input signal, and because the length N of the input data is usually much longer than the total length M of the filter coefficient h [ M ], the problem of calculation resource waste generated by direct convolution can be solved by utilizing an overlap reservation method. According to the overlap preservation method, the input data is divided into R data sub-blocks with the block size of L, the last M-1 points of the last sub-block are added into each data sub-block, so that a large-point FIR filter with the length of N is decomposed into R times of small-point FIR filter calculation with the single block length of F=L+M-1 (all the first M-1 points of the first sub-block are 0).
According to the cyclic convolution theorem, the fast fourier transform (Fast Fourier Transform, FFT) of the two sequences of circumferential convolution results is equal to the product of the two sequences of fast fourier transform results. When the length of the single block FIR filter is F, the equivalent condition of linear convolution and circumferential convolution is satisfied, so that Fast Fourier Transform (FFT) can be performed from zero padding at the tail of a filter coefficient to the point F, fast Fourier Transform (FFT) calculation is performed on each input data sub-block, then the Fast Fourier Transform (FFT) results of two sequences are multiplied and discrete Fourier inverse transform is performed on the product, the FIR filter result of the single sub-block is obtained, and finally the sub-block results are spliced to obtain complete output.
The nature of the Fast Fourier Transform (FFT) is the same as the discrete fourier transform (discrete Fourier transform, DFT for short). For input data with a one-dimensional length of N, the calculation formula of DFT is as follows:
Wherein W is a twiddle factor, A twiddle factor representing the length N, n of the input data and a particular value of k:
e ix = cos x + i sin x is obtained according to the euler formula, Is an imaginary unit, and has symmetry and periodicity. The algorithmic complexity of DFT is O (N 2), typically X (N) and X (k) are complex numbers, X (N) and X (k) represent the nth input data and kth output data values, respectively, and N ranges from integers within the [0, N-1] interval.
(2) The data are distributed to each core group by using message passing interface MPI, and the Fast Fourier Transform (FFT) calculation result H of the twiddle factor W and the filter coefficient H [ M ] is calculated and transmitted to the operation core
The control core is utilized to read in the input data in the step 1, and a message passing interface MPI is utilized to distribute the input data into four core groups, and M-1 zero values are supplemented at the front end of the input data to avoid the generation of aliasing;
Calculating a twiddle factor W required by 128-point Fast Fourier Transform (FFT) in a control core, carrying out end zero padding on a filter coefficient H [ M ] to 128 points, then carrying out Fast Fourier Transform (FFT) calculation to obtain a result H, and transmitting the twiddle factor W and the result H of the Fast Fourier Transform (FFT) to all operation cores by using Direct Memory Access (DMA);
(3) Direct Memory Access (DMA) transfers single round data to an operation core
In each kernel group, dividing the input data of the step 1 into data blocks with R blocks of L, adding the last M-1 points of the upper sub-block to avoid aliasing, respectively performing R times of FIR filter calculation with the single block length of F (F=L+M-1) (all the first M-1 points of the first sub-block are 0) according to a formula (1), and transmitting four continuous input data blocks with the length of F (i.e. 4*F point data) into each operation core according to an input sequence to transmit M-1+4 point data altogether, wherein 4*L effective calculation input data FIR filter results are contained. When 64 operation cores in each core group successfully receive the data, the transmission of single-round data is completed.
The 4*F point data transmitted to a single operation core in a single round includes 4 x (M-1) data transmitted to avoid aliasing, and 4*L effective calculation input data FIR filter results.
Dividing input data into R blocks with the size of L, namely dividing a large-point FIR filter into a plurality of small-point FIR filters with the same size, and calling 256 operation cores to calculate the small-point FIR filters in parallel by combining the parallel computing capability of a Shenwei many-core processor many-core architecture. In order to avoid the problem of inaccurate data caused by aliasing, the effective input data length processed by each FIR filter block with length F is L, so that an appropriate L value needs to be selected to ensure the speed of FIR filter calculation. The size of a local storage space in an operation core of the Shenwei many-core processor is limited, when an algorithm calculates by utilizing single instruction stream multiple data (SIMD), the input data quantity processed by a single operation core for a single round is 4*F, and a temporary variable occupation space is generated in the calculation process, so that the length selection upper limit of L is limited. Meanwhile, as the number of points supported by the Kuril-base fast Fourier transform in the cyclic convolution determination is the integer power of 2, if the number of points F of the selected single block FIR filter is 256 points, the local storage space of the operation core cannot accommodate 4*F point calculation processes using SIMD. Thus, combining the above factors, 128 points are selected as the length F of the small block FIR filter. The total block number R of the input data division is determined by the size of the input data quantity, and the total round number required to be calculated by the operation core is
When the input data and the filter coefficient h [ M ] are subjected to Fast Fourier Transform (FFT) by using the cyclic convolution theorem, the number of the Fast Fourier Transform (FFT) calculated in each round of each operation core is the same, and the numerical value of the twiddle factor W generated in the process of calculating the same number of the Fast Fourier Transform (FFT) is also the same, so that the value of the twiddle factor W of 128 points can be calculated in advance in a control core, and then the calculation result of the twiddle factor W is transmitted to all operation cores at one time by using Direct Memory Access (DMA). The computing core only needs to expend time cost of once receiving the calculated 128-point twiddle factor W result by DMA no matter how many rounds of FIR filters are calculated. Similarly, the FIR filter coefficients of each block remain the same after decomposing the FIR filter with large number of points into blocks with a plurality of small number of points, so the Fast Fourier Transform (FFT) result of the filter coefficient h M required for each calculation of each calculation core is the same. According to this feature, after performing the fast fourier transform calculation of the filter coefficient H [ M ] once in the control core in advance, the Fast Fourier Transform (FFT) calculation result H of the filter coefficient H [ M ] is directly transmitted to each operation core. Before fast Fourier transform FFT calculation of the filter coefficient h [ M ], the operations from zero padding at the tail to the point F are needed to be carried out, and then cyclic shift, complex multiplication and addition and other operations are carried out for a plurality of times; this optimization step reduces the number of repeated computations and increases the overall computation speed and thus the overall computation speed.
(4) Calculation core calculates single round FIR filter
After each operation core receives the single-round time data transmitted in the step 3, starting to calculate a single-round FIR filter, and simultaneously performing Direct Memory Access (DMA) write-back of the result of the previous-round FIR filter and Direct Memory Access (DMA) read-in of the input data of the next round; after each operation core receives data, using single instruction stream multiple data Stream (SIMD) technology, storing the input signal (i.e. 4*F point data output in step 3) into vector type variable array, then simultaneously calculating 4 groups of Fast Fourier Transform (FFT), then multiplying the input signal and the Fast Fourier Transform (FFT) result of the filter coefficient h [ M ], and then performing inverse discrete Fourier transform to obtain single calculation result; after the calculation of the single-round FIR filter of the operation core in each core group is completed, discarding the first M-1 number of each sub-block with the length of F, and transmitting an output result y [4*L ] back to the control core by using Direct Memory Access (DMA);
When the operation core receives data and carries out calculation of the single-pass FIR filter, the data is firstly stored into a vector type variable array from a standard double-precision floating point type array. The Shenwei many-core processor supports 256-bit single instruction stream multiple data Stream (SIMD) vector variables, so that 4 double-precision floating points can be operated at one time when calculating the single vector variable. Because four double-precision floating point numbers loaded into the same vector type variable can not generate related calculation, an array of vector type variables with the length of 128 is needed, the numbers with the same relative positions in four L-point data blocks are loaded into the same vector type variable of the vector type array, and meanwhile, 4 groups of Fourier transform FFT calculation with the independent positions of 128 points are performed.
The specific steps of single instruction stream multiple data Stream (SIMD) in a single operation core are shown in fig. 2, after the operation core receives data of M-1+4×l points transmitted from the control core, 4 blocks of continuous data of M-1+l points including M-1 point overlapping are sequentially stored into vector variables, wherein the real part and the imaginary part of the data need to be respectively stored into arrays of two vector variables in the same sequence respectively: the imaginary part vector variable array xvi [ F ] and the imaginary part vector variable array xvr [ F ] are subjected to Fourier transform (FFT) calculation of the point F; the result of the Fourier transform (FFT) calculation is also stored in a vector variable array xvi [ F ] of the imaginary part and a vector variable array xvr [ F ] of the real part, and multiplied by the result H of the FFT calculation of the filter coefficient H [ M ] respectively, the products are stored in vector arrays YI [ F ], YR [ F ] respectively, and finally YI and YR are respectively subjected to the fast Fourier transform (IFFT) calculation to obtain an operation result y [4*L ] of a single round in the operation core.
When the data volume of the FIR filter calculated by a single operation core is larger than M-1+L64×4, DMA read-write operation needs to be carried out for multiple rounds, and the communication time can be hidden by using the calculation time by using a pipelining scheduling method. Through the design of the pipeline scheduling, besides the reading-in of the first round and the write-back communication process of the last round, when the calculation core calculates the small block FIR filter of the data of the present round, the reading-in of the data required by the calculation of the next round and the write-back communication of the calculation result of the last round can be carried out, so that the communication time of the intermediate data is hidden.
(5) Repeating the operation of step (3) and step (4) on the input data in step 1 until the processing of all the input data is completed, and sequentially storing the operation result y [4*L ] of each round into y [ N ] in the control core to obtain the final operation result y [ N ] of the N-point filter.
Experiment 1:
experimental environment: experiments were performed based on a single "Shenwei 26010" processor, which contains 4 control cores and 256 arithmetic cores. The FIR filter algorithm was implemented using C language programming using compiler version SWCC Compilers:version5.421-sw-500.
Data sources: in the upper computer, the input data calculated by the FIR filter can be obtained after analog-digital conversion of the signals such as voice signals, image signals and the like, and the input data comprises a filter coefficient h [ M ] with the length of M. Because the bottleneck of the FIR filter algorithm occurs under the condition of large input signal quantity, the experimental input data is selected to be 60 ten thousand to 600 ten thousand points.
Experimental setup two groups of experimental contents:
The first group is a comparison group, 256 operation cores and 4 control cores of the Shenwei processor are called to calculate FIR filters in multiple rounds in parallel, and each operation core calculates a small-point FIR filter obtained by dividing large-point input data by using an overlap reservation method;
the second group is to adopt the method of the embodiment 1, and on the basis of achieving 256 operation cores to calculate the FIR filter in parallel by using the overlap preservation method, the optimization experiments of SIMD, data rearrangement and pipeline scheduling are realized at the same time.
In the experiment, the input data and the running environment adopted by the first group and the second group are consistent, the difference value of the beats at the end and the beginning of the program is calculated by utilizing the beat counter function to count the running beat number, and the scale of the input data and the beat number required by the running program are shown in the table 1.
Table 1: scale of data input by experiment and number of beats of running program
Input data volume Beat number of first group Beat number of second group Speed-up ratio
665663 58239938 1842384 31.61118312
998463 85867337 2597936 33.05214352
1331263 112582160 3357919 33.52735775
1664063 141503508.3 4127346 34.28438648
1996863 169485289.8 4894542 34.6274071
2329663 201013942.8 5650566 35.5741237
2662463 227322500.5 6423453 35.38945494
2995263 255556413.3 7191476 35.53601698
3328063 286874268.3 7956171 36.05682649
3660863 313502253.8 8709197 35.99668876
3993663 341866203.3 9476663 36.07453608
4326463 369981298.8 10234745 36.14953756
4659263 396691929.8 10997718 36.07038731
4992063 435803473 11763436 37.04729409
5324863 466555206.8 12517587 37.27197785
5657663 491176606.8 13285757 36.97016406
5990463 516828023.8 14047638 36.79109913
The acceleration ratio in table 1 is the ratio of the number of beats of the algorithm program operation of the FIR filter which is performed in parallel for multiple rounds by the 256 operation cores and 4 control cores of the Shenwei processor adopted in the first group to the number of beats of the algorithm program operation of the FIR filter which is performed in parallel for multiple rounds by the second group adopted in the embodiment 1, the acceleration ratio is taken as an ordinate, the input data amount is taken as an abscissa, as shown in fig. 3, the higher the acceleration ratio, the more the performance improvement is illustrated, and it can be seen that the average value of the acceleration ratio is 35 and the overall fluctuation is smaller when the input data amount is between 67 ten thousand and 600 ten thousand; meanwhile, as the data size increases, the acceleration ratio increases as well: the acceleration ratio of the input data with the size of 5.32 x 10 x 6 is improved by 17.35 percent compared with the acceleration ratio of the input data with the size of 0.67 x 10 x 6, which shows that the algorithm of the invention has better performance on larger-scale data. Therefore, the invention utilizes the many-core parallel structure of the Shenwei processor to carry out the optimization algorithm of the high performance of the FIR filter, thereby effectively improving the calculation speed of the large-point FIR filter.
Finally, it should also be noted that the above list is merely a few specific embodiments of the present invention. Obviously, the invention is not limited to the above embodiments, but many variations are possible. All modifications directly derived or suggested to one skilled in the art from the present disclosure should be considered as being within the scope of the present invention.

Claims (6)

1.基于国产众核处理器的FIR滤波器高性能实现方法,国产众核处理器包括四个核组,每个核组里有1个控制核心和64个运算核心,其特征在于:1. A high-performance implementation method of FIR filter based on a domestic many-core processor, wherein the domestic many-core processor includes four core groups, each of which has one control core and 64 computing cores, and is characterized by: 包括以下步骤:The following steps are involved: S1、将模拟信号进行模数A-D转换后得到输入数据,包括长度为M的滤波器系数h[M],输入数据长度为N;S1. Convert the analog signal to digital (A-D) to obtain input data, including filter coefficients h[M] with a length of M, and the input data length is N; S2、利用控制核心读入步骤1的输入数据并使用消息传递接口MPI将输入数据分配至四个核组中,在输入数据的前端补充M-1个零值;S2, using the control core to read the input data of step 1 and using the message passing interface MPI to distribute the input data to the four core groups, and adding M-1 zero values to the front end of the input data; 在控制核心中,计算旋转因子W,对滤波器系数h[M]进行末尾补零至128点后进行快速傅里叶变换FFT计算获得结果H,使用直接存储访问DMA将旋转因子W和快速傅里叶变换FFT的计算结果H传输至所有运算核心;In the control core, the rotation factor W is calculated, the filter coefficient h[M] is padded to 128 points at the end, and then the fast Fourier transform FFT calculation is performed to obtain the result H. The rotation factor W and the fast Fourier transform FFT calculation result H are transmitted to all the computing cores using direct memory access DMA; S3、每个核组中,将步骤S1输入数据划分成R块大小为L的数据块后,将上个子块的最后M-1个点加入,再分别进行R次单块长度为F=L+M-1的FIR滤波器计算,第一个子块的前M-1个点全部为0,按照输入顺序将连续的4*F点数据传入至各个运算核心;共传输M-1+4*L个点的数据,其中包含4*L个有效计算输入数据的FIR滤波计算结果;当每个核组中的64个运算核心都成功接收到数据后,单轮数据的传输完成;S3, in each core group, after dividing the input data of step S1 into R blocks of size L, add the last M-1 points of the previous sub-block, and then perform R times of FIR filter calculations with a single block length of F=L+M-1. The first M-1 points of the first sub-block are all 0, and the continuous 4*F point data are transmitted to each computing core in the input order; a total of M-1+4*L points of data are transmitted, including 4*L FIR filter calculation results of valid calculation input data; when all 64 computing cores in each core group successfully receive the data, the single round of data transmission is completed; S4、每个运算核心接收到步骤3所传输的单轮次数据后,开始计算单轮次FIR滤波器,同时进行上一轮次FIR滤波器结果的直接存储访问DMA写回和下一轮次输入数据的直接存储访问DMA读入;每个运算核心在接收到数据后,利用单指令流多数据流SIMD技术,会将步骤3中输出的4*F点数据存入向量型变量数组中,再同时进行4组快速傅里叶变换FFT的计算,然后将输入信号和滤波器系数h[M]的快速傅里叶变换FFT结果相乘后,进行离散傅里叶反变换得到单次计算的结果;每个核组中的运算核心单轮FIR滤波器计算完成后,舍弃每个长度为F的子分块的前M-1个数,将输出结果y[4*L]使用直接存储访问DMA传输回控制核心中;S4, after each computing core receives the single-round data transmitted in step 3, it starts to calculate the single-round FIR filter, and at the same time performs direct storage access DMA write-back of the previous round FIR filter result and direct storage access DMA read-in of the next round input data; after receiving the data, each computing core uses the single instruction stream multiple data stream SIMD technology to store the 4*F point data output in step 3 into the vector variable array, and then simultaneously performs 4 groups of fast Fourier transform FFT calculations, and then multiplies the input signal and the fast Fourier transform FFT result of the filter coefficient h[M], and then performs a discrete Fourier inverse transform to obtain the result of a single calculation; after the computing core in each core group completes the single-round FIR filter calculation, the first M-1 numbers of each sub-block of length F are discarded, and the output result y[4*L] is transmitted back to the control core using direct storage access DMA; S5、对步骤1的输入数据重复步骤S3和步骤S4的操作,直至完成对所有输入数据的处理,每轮次的运算结果y[4*L]在控制核心中按顺序存入y[N],得到N点滤波器最终的运算结果y[N]。S5. Repeat the operations of step S3 and step S4 for the input data of step 1 until all the input data are processed. The calculation result y[4*L] of each round is stored in y[N] in the control core in sequence to obtain the final calculation result y[N] of the N-point filter. 2.根据权利要求1所述的基于国产众核处理器的FIR滤波器高性能实现方法,其特征在于:2. The high-performance implementation method of FIR filter based on domestic multi-core processor according to claim 1 is characterized in that: 步骤S3中所述输入数据划分成R块大小为L的数据块,为将大点数的FIR滤波器划分成多个相同大小的小点数FIR滤波器,调用256个运算核心并行计算小点数FIR滤波器,选择128点作为小块FIR滤波器的长度F,每个长度为F的FIR滤波器分块所处理的有效输入数据长度为L,输入数据划分的总分块数R由输入数据量决定,运算核心需要计算的总轮次为 The input data in step S3 is divided into R blocks of size L. In order to divide the FIR filter with a large number of points into multiple FIR filters with a small number of points of the same size, 256 computing cores are called to parallelly calculate the FIR filter with a small number of points. 128 points are selected as the length F of the small block FIR filter. The effective input data length processed by each FIR filter block with a length of F is L. The total number of blocks R for the input data division is determined by the amount of input data. The total number of rounds that the computing core needs to calculate is 3.根据权利要求2所述的基于国产众核处理器的FIR滤波器高性能实现方法,其特征在于:3. The high-performance implementation method of FIR filter based on domestic multi-core processor according to claim 2 is characterized in that: 步骤S3中所述四个长度为F的输入数据,包含了4*(M-1)个避免混叠现象产生而传输的数据,以及4*L个有效计算输入数据FIR滤波器结果。The four input data of length F in step S3 include 4*(M-1) data transmitted to avoid aliasing, and 4*L valid calculation input data FIR filter results. 4.根据权利要求3所述的基于国产众核处理器的FIR滤波器高性能实现方法,其特征在于:4. The high-performance implementation method of FIR filter based on domestic multi-core processor according to claim 3 is characterized in that: 步骤S4中每个运算核心所述计算单轮次FIR滤波器时,首先将数据从标准的双精度浮点类型数组存储到所述向量型变量数组中,所述向量型变量数组的长度为128,将四个L点数据块中相对位置相同的数装入向量型数组的同一个向量型变量中,同时进行4组128点的互不相关的傅里叶变换FFT计算。When each computing core in step S4 calculates a single round of FIR filter, it first stores data from a standard double-precision floating-point type array into the vector variable array, where the length of the vector variable array is 128. Numbers with the same relative position in four L-point data blocks are loaded into the same vector variable of the vector array, and four groups of 128-point mutually unrelated Fourier transform FFT calculations are performed simultaneously. 5.根据权利要求4所述的基于国产众核处理器的FIR滤波器高性能实现方法,其特征在于:5. The high-performance implementation method of FIR filter based on domestic multi-core processor according to claim 4 is characterized in that: 所述单指令流多数据流SIMD具体步骤为:运算核心接收到从控制核心中传来的M-1+4*L个点的数据后,将4块连续且包含M-1点重叠的M-1+L个点的数据按顺序存入到所述向量型变量中,其中数据的实部和虚部需要按相同的顺序分别存入到两个向量型变量的数组当中,分别为:虚部的向量型变量数组xvi[F]和虚部的向量型变量数组xvr[F],再进行F点的傅里叶变换FFT计算;傅里叶变换FFT计算的结果保存在虚部的向量型变量数组xvi[F]和虚部的向量型变量数组xvr[F]中,并分别与滤波器系数h[M]的傅里叶变换FFT计算结果H相乘,将乘积分别存入向量型数组YI[F]、YR[F]中,最后将YI、YR分别进行快速傅里叶反变换IFFT计算,得到运算核心中单轮次的运算结果y[4*L];The specific steps of the single instruction stream multiple data stream SIMD are as follows: after the computing core receives the data of M-1+4*L points from the control core, 4 blocks of data of M-1+L points that are continuous and contain M-1 points overlapping are stored in the vector type variable in order, wherein the real part and the imaginary part of the data need to be stored in the same order in two vector type variable arrays, namely: the imaginary part vector type variable array xvi[F] and the imaginary part vector type variable array xvr[F], and then the Fourier transform FFT calculation of point F is performed; the result of the Fourier transform FFT calculation is stored in the imaginary part vector type variable array xvi[F] and the imaginary part vector type variable array xvr[F], and is multiplied with the Fourier transform FFT calculation result H of the filter coefficient h[M] respectively, and the product is stored in the vector type arrays YI[F] and YR[F] respectively, and finally YI and YR are respectively subjected to inverse fast Fourier transform IFFT calculation to obtain the single round calculation result y[4*L] in the computing core; 单个运算核心计算FIR滤波器的数据量大于M-1+L*64*4时,需要进行多轮次的直接存储访问DMA读写操作,运用流水调度的方法,利用计算时间隐藏通信时间;通过流水调度设计,除了第一轮次的读入和最后一轮次的写回通信过程以外,当运算核心在进行本轮次数据小块FIR滤波器计算的同时,进行下一轮次计算所需数据的读入以及上一轮次计算结果的写回通信,从而将中间数据通信时间进行隐藏。When the amount of data for calculating the FIR filter by a single computing core is greater than M-1+L*64*4, multiple rounds of direct storage access DMA read and write operations are required. The pipeline scheduling method is used to hide the communication time with the computing time. Through the pipeline scheduling design, in addition to the first round of reading and the last round of writing back communication processes, when the computing core is performing the small block FIR filter calculation of the current round of data, it reads the data required for the next round of calculation and writes back the calculation results of the previous round, thereby hiding the intermediate data communication time. 6.根据权利要求5所述的基于国产众核处理器的FIR滤波器高性能实现方法,其特征在于:6. The high-performance implementation method of FIR filter based on domestic multi-core processor according to claim 5 is characterized in that: 所述FIR数字滤波器的计算公式如下:The calculation formula of the FIR digital filter is as follows: 其中,y(n)是输出信号;h为滤波器系数,h(k)表示滤波器系数的第k个值;x(n)为输入数据,x(n-k)为输入数据的第n-k个值,M表示滤波器系数h[M]的总长度;Where y(n) is the output signal; h is the filter coefficient, h(k) represents the kth value of the filter coefficient; x(n) is the input data, x(n-k) is the n-kth value of the input data, and M represents the total length of the filter coefficient h[M]; 所述快速傅里叶变换FFT的本质与离散傅里叶变换DFT相同,DFT的计算公式为:The essence of the Fast Fourier Transform FFT is the same as the Discrete Fourier Transform DFT, and the calculation formula of DFT is: 其中,W为旋转因子,x(n)为第n个输入数据,X(k)为第k个输出数据值,n的范围为[0,N-1]区间内的整数,表示输入数据长度为N、n和k为特定值时的旋转因子:Where W is the rotation factor, x(n) is the nth input data, X(k) is the kth output data value, and n is an integer in the range [0, N-1]. Indicates the rotation factor when the input data length is N, n and k are specific values:
CN202111519880.XA 2021-12-13 2021-12-13 High-performance implementation method of FIR filter based on domestic many-core processor Active CN114237716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111519880.XA CN114237716B (en) 2021-12-13 2021-12-13 High-performance implementation method of FIR filter based on domestic many-core processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111519880.XA CN114237716B (en) 2021-12-13 2021-12-13 High-performance implementation method of FIR filter based on domestic many-core processor

Publications (2)

Publication Number Publication Date
CN114237716A CN114237716A (en) 2022-03-25
CN114237716B true CN114237716B (en) 2024-11-22

Family

ID=80755363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111519880.XA Active CN114237716B (en) 2021-12-13 2021-12-13 High-performance implementation method of FIR filter based on domestic many-core processor

Country Status (1)

Country Link
CN (1) CN114237716B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6526430B1 (en) * 1999-10-04 2003-02-25 Texas Instruments Incorporated Reconfigurable SIMD coprocessor architecture for sum of absolute differences and symmetric filtering (scalable MAC engine for image processing)
CN101031904A (en) * 2004-07-13 2007-09-05 3加1科技公司 Programmable processor system with two kinds of subprocessor to execute multimedia application
CN111461311B (en) * 2020-03-26 2023-04-07 中国科学技术大学 Convolutional neural network operation acceleration method and device based on many-core processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
适用于众核处理器的信号处理算法实现技术;杨昕瑶;《中国优秀硕士学位论文全文数据库 信息科技辑》;20220615;I136-52 *

Also Published As

Publication number Publication date
CN114237716A (en) 2022-03-25

Similar Documents

Publication Publication Date Title
KR102443546B1 (en) matrix multiplier
CN103440121B (en) A kind of triangular matrix multiplication vectorization method of vector processor-oriented
CN102200964B (en) Parallel-processing-based fast Fourier transform (FFT) device and method thereof
US8255446B2 (en) Apparatus and method for performing rearrangement and arithmetic operations on data
US20030088601A1 (en) Efficient complex multiplication and fast fourier transform (fft) implementation on the manarray architecture
CN103699516B (en) Single instruction multiple data (SIMD)-based parallel fast fourier transform/inverse fast fourier transform (FFT/IFFT) butterfly operation method and SIMD-based parallel FFT/IFFT butterfly operation device in vector processor
WO2010105887A2 (en) Processing array data on simd multi-core processor architectures
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
US7062523B1 (en) Method for efficiently computing a fast fourier transform
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
WO2023045516A1 (en) Fft execution method, apparatus and device
CN104731563B (en) Large integer multiplication SSA algorithm multi-core parallel concurrent implementation methods based on FFT
CN113496279A (en) Packet convolution for channel convolution engine using point-to-point connections
US7653676B2 (en) Efficient mapping of FFT to a reconfigurable parallel and pipeline data flow machine
CN202217276U (en) FFT device based on parallel processing
CN114237716B (en) High-performance implementation method of FIR filter based on domestic many-core processor
US20060075010A1 (en) Fast fourier transform method and apparatus
CN110750249B (en) Method and device for generating fast Fourier transform code
CN115269008B (en) A data processing method, device, medium and electronic equipment
CN102231624B (en) Vectorization Implementation Method of FIR of Floating Point Complex Number Block Oriented to Vector Processor
CN104615582B (en) The method calculated towards GPDSP one-dimensional FFT vectorizations of counting greatly
US7447722B2 (en) Low latency computation in real time utilizing a DSP processor
Qi et al. Mixed precision method for gpu-based fft
JP3709291B2 (en) Fast complex Fourier transform method and apparatus
Liu et al. Mod (2P-1) shuffle memory-access instructions for FFTs on vector SIMD DSPs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant