[go: up one dir, main page]

CN114237716A - High-performance implementation method of FIR filter based on domestic many-core processor - Google Patents

High-performance implementation method of FIR filter based on domestic many-core processor Download PDF

Info

Publication number
CN114237716A
CN114237716A CN202111519880.XA CN202111519880A CN114237716A CN 114237716 A CN114237716 A CN 114237716A CN 202111519880 A CN202111519880 A CN 202111519880A CN 114237716 A CN114237716 A CN 114237716A
Authority
CN
China
Prior art keywords
data
core
input data
calculation
fir filter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111519880.XA
Other languages
Chinese (zh)
Other versions
CN114237716B (en
Inventor
刘鹏
杨昕瑶
郭俊
张鲁飞
吴东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202111519880.XA priority Critical patent/CN114237716B/en
Publication of CN114237716A publication Critical patent/CN114237716A/en
Application granted granted Critical
Publication of CN114237716B publication Critical patent/CN114237716B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03HIMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H17/00Networks using digital techniques
    • H03H17/02Frequency selective networks
    • H03H17/0211Frequency selective networks using specific transformation algorithms, e.g. WALSH functions, Fermat transforms, Mersenne transforms, polynomial transforms, Hilbert transforms
    • H03H17/0213Frequency domain filters using Fourier transforms
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03HIMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H17/00Networks using digital techniques
    • H03H2017/0072Theoretical filter design
    • H03H2017/0081Theoretical filter design of FIR filters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

The invention provides a high-performance realization method of an FIR filter based on a domestic many-core processor, which is based on a domestic many-core processor platform, analog-to-digital conversion is carried out on an analog signal to obtain input data, a control core uses a message transmission interface to distribute the input data to four core groups, M-1 zero values are supplemented at the front end of the input data, then a result H of FFT of a twiddle factor W and a filter coefficient H [ M ] is calculated, and the twiddle factor W and the calculation result H are transmitted to each operation core by using direct storage access; and then, directly storing and accessing the data in a single round, transmitting the data to each operation core, performing FIR filtering calculation in the single round, and sequentially connecting the operation results in each round in the control core to obtain a final result. Compared with the method for directly calculating the FIR filter algorithm by using the single core of the domestic processor, the optimization of the FIR filter algorithm by the method improves the core parallelism, realizes the parallelization of data processing and further improves the algorithm speed.

Description

China-made many-core processor-based FIR filter high-performance implementation method
Technical Field
The invention relates to the field of finite long impulse response filters, in particular to a high-performance realization method of an FIR filter based on a domestic many-core processor.
Background
Digital filters are one of the most widely used algorithms in the field of digital signal processing, and have many applications in speech and image processing, multimedia applications, and the like due to their advantages of strong stability, strict linear phase, easy hardware implementation, and the like.
The FIR filter calculation is a process of completing convolution of the filter coefficient and the input signal, and because the length N of the input data is usually far longer than the length M of the filter coefficient, the problem of calculation resource waste caused by direct convolution can be solved by utilizing an overlap preservation method. According to the overlap-and-preserve method, input data is divided into R data blocks with the size of L, the last M-1 points of the previous sub-block are added into each sub-block, and FIR filtering calculation with the length of F ═ L + M-1 is carried out for R times (the first M-1 points of the first sub-block are all 0). According to the cyclic convolution theorem, the Fast Fourier Transform (FFT) of the result of the circular convolution of two sequences is equal to the product of the result of the Fast Fourier Transform of the two sequences. When the length of the single FIR filter is F, the equivalent conditions of linear convolution and circular convolution are satisfied, so that FFT calculation can be performed on the filter coefficient and each block of input data, then the FFT results of the two sequences are multiplied, the product is subjected to inverse discrete Fourier transform to obtain the result of a single sub-block, and finally the result of each sub-block is spliced to obtain complete output.
The Shenwei processor 26010 adopts a 64-bit autonomous Shenwei instruction set, a full chip 260 core, a chip standard working frequency of 1.5GHz and a peak operating speed of 3.168 TFLOPS. The Shenwei processor has four core groups in total, each core group has 1 control core and 64 operation cores, the size of the storage space of a single core group is 8 Gbytes, the local storage space of the single operation core is 16 Kbytes, and data can be transmitted among the core groups in a Message Passing Interface (MPI) mode.
When performing FIR filter calculation, if the input data size is large, the computational speed of the schweizhong kernel of the domestic processor is slow, and therefore a method capable of solving the performance bottleneck of FIR filter calculation is needed.
Disclosure of Invention
The invention aims to provide a high-performance realization method of an FIR filter based on a domestic many-core processor, which is used for solving the problem of adaptability of a digital signal processing function FIR filter and a Shenwei many-core processor and improving the calculation rate of the FIR filter with a large number of points on the Shenwei many-core processor.
In order to solve the technical problems, the invention provides a high-performance realization method of an FIR filter based on a domestic many-core processor, wherein the domestic many-core processor comprises four core groups, and each core group comprises 1 control core and 64 operation cores, and the method comprises the following steps:
s1, performing analog-to-digital (A-D) conversion on the analog signal to obtain input data, wherein the input data comprise a filter coefficient h [ M ] with the length of M, and the length of the input data is N;
s2, reading the input data in the step 1 by using a control core, distributing the input data to four core groups by using a Message Passing Interface (MPI), and supplementing M-1 zero values at the front end of the input data;
in the control core, calculating a twiddle factor W, carrying out end zero padding on a filter coefficient H [ M ] to 128 points, then carrying out Fast Fourier Transform (FFT) calculation to obtain a result H, and transmitting the twiddle factor W and the calculation result H of the FFT to all operation cores by using a Direct Memory Access (DMA);
s3, in each core group, dividing the input data in the step S1 into data blocks with the size of an R block being L, adding the last M-1 points of the previous sub-block, respectively calculating R times of FIR filters with the length of a single block being F ═ L + M-1, wherein the first M-1 points of the first sub-block are all 0, and transmitting continuous 4 x F point data to each operation core according to the input sequence; transmitting data of M-1+4 points in total, wherein the data comprises 4 FIR filtering calculation results of effective calculation input data; when 64 operation cores in each core group successfully receive the data, completing the transmission of the single-round data;
s4, after each operation core receives the single round data transmitted in step 3, starting to calculate a single round FIR filter, and simultaneously performing direct memory access DMA write-back of the FIR filter result of the previous round and direct memory access DMA read-in of the input data of the next round; after each operation core receives data, 4 x F point data output in the step 3 is stored in a vector type variable array by using a single instruction stream multiple data stream SIMD technology, then 4 groups of FFT (fast Fourier transform) calculation are carried out simultaneously, then an input signal is multiplied by the FFT result of the filter coefficient h [ M ], and then inverse discrete Fourier transform is carried out to obtain the result of single calculation; after the calculation of the operation core single-round FIR filter in each core group is completed, abandoning the first M-1 number of each sub-block with the length of F, and transmitting an output result y [ 4X L ] back to the control core by using direct memory access DMA;
and S5, repeating the operations of the step S3 and the step S4 on the input data of the step 1 until all the input data are processed, and sequentially storing the operation result y [4 x L ] of each turn into y [ N ] in the control core to obtain the final operation result y [ N ] of the N-point filter.
The improvement of the high-performance realization method of the FIR filter based on the domestic many-core processor is as follows:
in step S3, the input data is divided into R blocks with size L, in order to divide the FIR filter with large number of points into a plurality of FIR filters with small number of points with the same size, 256 computation cores are called to compute the FIR filters with small number of points in parallel, 128 points are selected as lengths F of the FIR filters with small number of points, the effective input data length processed by each FIR filter with length F is L, the total number R of blocks divided by the input data is determined by the input data amount, and the total round number needed to be computed by the computation cores is L
Figure BDA0003406919010000031
The FIR filter high-performance implementation method based on the domestic many-core processor is further improved as follows:
the four input data with length F in step S3 include 4 × L (M-1) data transmitted to avoid aliasing and 4 × L results of the FIR filter for effective calculation of the input data.
The FIR filter high-performance implementation method based on the domestic many-core processor is further improved as follows:
when each operation core calculates the single-round FIR filter in step 4, first, data is stored into the vector type variable array from a standard double-precision floating point type array, the length of the vector type variable array is 128, numbers with the same relative position in four L point data blocks are loaded into the same vector type variable of the vector type array, and simultaneously, 4 sets of 128-point orthogonal fourier transform FFT calculations are performed.
The FIR filter high-performance implementation method based on the domestic many-core processor is further improved as follows:
the single instruction stream multiple data stream SIMD comprises the following specific steps: after receiving the data of M-1+4 x L points transmitted from the control core, the operation core stores 4 continuous data containing M-1+ L points overlapped by M-1 points into the vector type variables in sequence, wherein the real part and the imaginary part of the data need to be stored into the arrays of the two vector type variables in the same sequence respectively, and the sequence is as follows: imaginary vector type variable array xvi [ F ] and imaginary vector type variable array xvr [ F ], and then performing Fourier transform FFT calculation of F point; the result of FFT calculation is stored in imaginary vector type variable array xvi [ F ] and imaginary vector type variable array xvr [ F ], and multiplied with the result H of FFT calculation of filter coefficient H [ M ], the products are stored in vector type arrays YI [ F ] and YR [ F ], finally, YI and YR are subjected to IFFT calculation to obtain the result y [4 x L ] of one-turn operation in the operation core;
when the data volume of the FIR filter calculated by a single operation core is more than M-1+ L64 x 4, multiple rounds of direct memory access DMA read-write operation are required, and the communication time is hidden by using the calculation time by using a flow scheduling method; through the flow scheduling design, except for the read-in process of the first round and the write-back communication process of the last round, when the operation core performs the calculation of the data small block FIR filter of the current round, the read-in of the data required by the calculation of the next round and the write-back communication of the calculation result of the previous round are performed, so that the communication time of the intermediate data is hidden.
The FIR filter high-performance implementation method based on the domestic many-core processor is further improved as follows:
the calculation formula of the FIR digital filter is as follows:
Figure BDA0003406919010000032
wherein y (n) is an output signal; h is the filter coefficient, h (k) represents the kth value of the filter coefficient; x (n) is input data, x (n-k) is the n-k value of the input data, and M represents the total length of the filter coefficients h [ M ];
the nature of the fast Fourier transform FFT is the same as that of the discrete Fourier transform DFT, and the calculation formula of the DFT is as follows:
Figure BDA0003406919010000041
wherein W is a twiddle factor, x (N) is nth input data, X (k) is kth output data value, and N is in the range of [0, N-1 ]]The integer number within the interval is such that,
Figure BDA0003406919010000042
denotes the twiddle factor when the input data length is N, n and k is a specific value:
Figure BDA0003406919010000043
the invention has the following beneficial effects:
according to the multi-core parallel computing method, based on the multi-core of the many-core processor, the large-point FIR filter is divided into the small-point FIR filters with the same size for parallel computing, and the core parallelism is improved;
the method has the advantages that the single instruction stream and multiple data stream SIMD parallel instruction of the many-core processor is utilized to parallelize the processing data and improve the data stream processing parallelism;
the invention reduces the redundant calculation and data transmission overhead of the FIR algorithm by adopting data rearrangement and stream scheduling, and improves the algorithm execution speed.
Drawings
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
FIG. 1 is a flow chart of the high performance implementation method of the FIR filter based on the domestic many-core processor of the present invention;
FIG. 2 is a flow diagram of a single instruction stream, multiple data streams SIMD operation in a single operation core;
FIG. 3 is a graph showing the results of the acceleration ratio of the experiment in example 1.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto:
embodiment 1, a method for implementing a high performance FIR filter based on a domestic many-core processor, as shown in fig. 1-2, can perform adaptation of FIR filter algorithm and architecture and characteristics of an Shenwei many-core processor.
The realization method of the FIR filter high performance based on the domestic Shenwei processor 26010 comprises the following steps:
(1) inputting data
Analog signals such as voice signals and image signals are subjected to analog-to-digital (A-D) conversion to obtain input data calculated by an FIR filter, wherein the input data calculated by the FIR filter comprise a filter coefficient h [ M ], the length of the filter coefficient h [ M ] is M, and the length of the input data is N; the input data precision is 64-bit double-precision data, and the transformation type is the transformation from complex number to complex number, namely the input and output data are all complex number; the input data scale is 60 ten thousand points and higher values, and the data is double-precision complex data;
the filter types comprise a high-pass filter, a low-pass filter, a band-pass filter and a band-stop filter, and the filter coefficient length M support range is from one digit to two digits.
(2) Distributing data to each kernel group by using a message passing interface MPI, calculating an FFT calculation result H of the twiddle factor W and the filter coefficient H [ M ] and transmitting the result H to an operation core:
reading in the input data in the step 1 by using a control core, distributing the input data to four core groups by using a Message Passing Interface (MPI), and supplementing M-1 zero values at the front end of the input data to avoid aliasing; calculating a twiddle factor W required by 128-point Fast Fourier Transform (FFT) in a control core according to a formula (3), carrying out end zero padding on a filter coefficient H [ M ] to 128 points, then carrying out Fast Fourier Transform (FFT) calculation to obtain a result H, and transmitting the twiddle factor W and the result H of the Fast Fourier Transform (FFT) to all operation cores by using Direct Memory Access (DMA);
the FIR digital filter has the following calculation formula:
Figure BDA0003406919010000051
wherein y (n) is an output signal; h is the filter coefficient, h (k) represents the kth value of the filter coefficient; x (n) is input data, and x (n-k) is the n-k value of the input data; n denotes the length of the input data and M denotes the total length of the filter coefficients h [ M ].
The FIR filter calculation can be regarded as a process of convolution of the filter coefficient and the input signal, and because the length N of the input data is usually much longer than the total length M of the filter coefficient h [ M ], the problem of calculation resource waste caused by direct convolution can be solved by using an overlap-and-hold method. According to the overlap-preserving method, input data is divided into R data sub-blocks with the size of L, the last M-1 point of the last sub-block is added into each data sub-block, and a large-point-number FIR filter with the length of N once is decomposed into R times of small-point-number FIR filtering calculation with the single block length of F ═ L + M-1 (the first M-1 points of the first sub-block are all 0).
According to the cyclic convolution theorem, the Fast Fourier Transform (FFT) of the results of the circular convolutions of two sequences is equal to the product of the results of the Fast Fourier transforms of the two sequences. When the length of the single FIR filter is F, the equivalent conditions of linear convolution and circular convolution are met, so that zero filling can be carried out to F point at the end of the filter coefficient, Fast Fourier Transform (FFT) calculation of the F point is carried out to each input data sub-block, then the Fast Fourier Transform (FFT) results of the two sequences are multiplied, the product is subjected to inverse discrete Fourier transform, the FIR filter result of a single sub-block is obtained, and finally, all sub-block results are spliced to obtain complete output.
The nature of the Fast Fourier Transform (FFT) is the same as the Discrete Fourier Transform (DFT). For input data with one-dimensional length N, the calculation formula of DFT is as follows:
Figure BDA0003406919010000052
wherein, W is a rotation factor,
Figure BDA0003406919010000061
denotes the twiddle factor when the input data length is N, n and k is a specific value:
Figure BDA0003406919010000062
e is obtained from Euler's formulaix=cos x+i sin x,
Figure BDA0003406919010000063
Is an imaginary unit with symmetry and periodicity. The algorithm complexity of DFT is O (N)2) Usually, x (N) and X (k) are complex numbers, x (N) and X (k) respectively represent the nth input data and the kth output data, N ranges from [0, N-1 ]]An integer within the interval.
(2) Distributing data to each kernel group by using a message passing interface MPI, calculating a Fast Fourier Transform (FFT) calculation result H of a twiddle factor W and a filter coefficient H [ M ] and transmitting the calculation result H to an operation core
Reading in the input data in the step 1 by using a control core, distributing the input data to four core groups by using a Message Passing Interface (MPI), and supplementing M-1 zero values at the front end of the input data to avoid aliasing;
calculating a twiddle factor W required by 128-point Fast Fourier Transform (FFT) in a control core, carrying out end zero padding on a filter coefficient H [ M ] to 128 points, then carrying out Fast Fourier Transform (FFT) calculation to obtain a result H, and transmitting the twiddle factor W and the result H of the Fast Fourier Transform (FFT) to all operation cores by using Direct Memory Access (DMA);
(3) direct Memory Access (DMA) transfers a single round of data to an arithmetic core
In each kernel group, the input data in step 1 is divided into R data blocks with a block size of L, in order to avoid aliasing, the last M-1 points of the last sub-block are added, then R times of FIR filter calculation with a single block length of F (F ═ L + M-1) are respectively carried out according to formula (1) (the first M-1 points of the first sub-block are all 0), four consecutive input data blocks with a length of F (namely 4F point data) are transmitted to each operation core according to the input sequence, and the data of M-1+4 x L points are transmitted together, wherein the data comprise 4 x L effective calculation input data FIR filter results. And when the 64 operation cores in each core group successfully receive the data, completing the transmission of the single round of data.
The 4 x F point data transmitted in a single round of transmission to a single operation core contains 4 x (M-1) data transmitted to avoid aliasing phenomenon, and 4 x L effective calculation input data FIR filter results.
The input data is divided into R data blocks with the size of L, namely, the FIR filter with large points is divided into a plurality of FIR filters with small points with the same size, and 256 operation cores are called to calculate the FIR filters with small points in parallel by combining the parallel computing capability of the multi-core architecture of the Shenwei multi-core processor. In order to avoid the problem of inaccurate data caused by aliasing, the effective input data processed by each FIR filter block with the length of F is L, so that an appropriate L value needs to be selected to ensure the calculation speed of the FIR filter. The size of a local storage space in an operation core of the Shenwei many-core processor is limited, and the algorithm utilizes a single fingerWhen a stream multiple data Stream (SIMD) is calculated, the input data amount processed by a single operation core in a single round is 4 × F, and a temporary variable occupation space is generated in the calculation process, so that the upper limit of the length selection of L is limited. Meanwhile, because the number of points supported by the kuri-radix fast fourier transform in the cyclic convolution theorem is an integer power of 2, if the number of points F of the selected single FIR filter is 256, the local storage space of the operation core cannot accommodate the calculation process of 4 x F points using SIMD. Therefore, combining the above factors, 128 points are selected as the length F of the small block FIR filter. The total block number R of the input data division is determined by the size of the input data volume, and the total round number required to be calculated by the operation core is
Figure BDA0003406919010000071
When the input data and the filter coefficient h [ M ] are subjected to Fast Fourier Transform (FFT) by using the cyclic convolution theorem, the number of Fast Fourier Transform (FFT) points calculated by each operation core in each round is the same, and the numerical value of the twiddle factor W generated in the calculation process of the Fast Fourier Transform (FFT) with the same number of points is also the same, so that the value of the 128-point twiddle factor W can be calculated in the control core in advance, and the calculation result of the twiddle factor W is transmitted to all the operation cores at one time by using Direct Memory Access (DMA). The operation core only needs to consume one time of the time overhead of receiving the calculated 128-point twiddle factor W result by the DMA no matter how many rounds of FIR filters are calculated. Similarly, after the large-point FIR filter is decomposed into a plurality of small-point blocks, the FIR filter coefficients of each block are still the same, so the Fast Fourier Transform (FFT) results of the filter coefficients h [ M ] required by each computation core for each round of computation are also the same. According to the characteristic, the fast Fourier transform calculation of the filter coefficient H [ M ] can be carried out in the control core in advance, and then the Fast Fourier Transform (FFT) calculation result H of the filter coefficient H [ M ] is directly transmitted to each operation core. Before Fast Fourier Transform (FFT) calculation of a filter coefficient h [ M ], the operation of zero padding to a point F at the tail end is required to be carried out, and then the operation of cyclic shift, complex multiplication and addition and the like is carried out for many times; this optimization step reduces the number of repeated calculations and increases the overall calculation speed and thus the overall calculation speed.
(4) And computing core calculation single round FIR filter
After receiving the single round data transmitted in step 3, each operation core starts to calculate a single round FIR filter, and simultaneously performs Direct Memory Access (DMA) write-back of the FIR filter result of the previous round and Direct Memory Access (DMA) read-in of the input data of the next round; after each operation core receives data, an input signal (namely 4 x F point data output in the step 3) is stored in a vector type variable array by using a single instruction stream multiple data Stream (SIMD) technology, then 4 groups of Fast Fourier Transform (FFT) are calculated at the same time, then the input signal is multiplied by a Fast Fourier Transform (FFT) result of a filter coefficient h [ M ], and then inverse discrete Fourier transform is carried out to obtain a single calculation result; after the calculation of the operation core single-round FIR filter in each core group is completed, abandoning the first M-1 number of each sub-block with the length of F, and transmitting an output result y [ 4X L ] back to the control core by using Direct Memory Access (DMA);
when the operation core receives data and carries out single-round computation of the FIR filter, the data are firstly stored into a vector type variable array from a standard double-precision floating point type array. The Shenwei many-core processor supports 256-bit single instruction multiple data Stream (SIMD) vector variables, so that 4 double-precision floating points can be operated at one time while a single vector variable is calculated. Because the four double-precision floating point numbers loaded with the same vector type variable cannot generate related calculation, an array of the vector type variable with the length of 128 is needed, the numbers with the same relative position in the four L point data blocks are loaded into the same vector type variable of the vector type array, and simultaneously, 4 groups of 128-point mutually-unrelated Fourier transform FFT calculation are carried out.
As shown in fig. 2, after receiving data of M-1+4 × L points transmitted from the control core, the operation core stores 4 blocks of continuous data including M-1+ L points overlapped by M-1 points into vector variables in sequence, where the real part and the imaginary part of the data need to be stored into the arrays of two vector variables in the same sequence, respectively: imaginary vector type variable array xvi [ F ] and imaginary vector type variable array xvr [ F ], and then Fourier transform (FFT) calculation of F point is carried out; the result of Fourier transform (FFT) calculation is also stored in the vector type variable array xvi [ F ] of the imaginary part and the vector type variable array xvr [ F ] of the real part, and is multiplied with the FFT calculation result H of the filter coefficient H [ M ], the products are stored in the vector type arrays YI [ F ] and YR [ F ] respectively, finally, YI and YR are subjected to Inverse Fast Fourier Transform (IFFT) calculation respectively, and the single round calculation result y [4 x L ] in the calculation core is obtained.
When the data volume of the FIR filter calculated by a single operation core is more than M-1+ L64 x 4, multiple rounds of DMA read-write operation are required, and the communication time can be hidden by using the calculation time by using a pipeline scheduling method. Through the flow scheduling design, except for the read-in process of the first round and the write-back communication process of the last round, when the operation core performs the calculation of the data small block FIR filter of the current round, the read-in of the data required by the calculation of the next round and the write-back communication of the calculation result of the previous round can be performed, so that the communication time of the intermediate data is hidden.
(5) And repeating the operations of the step (3) and the step (4) on the input data in the step 1 until all the input data are processed, and sequentially storing the operation result y [4 x L ] of each round into the control core to obtain the final operation result y [ N ] of the N-point filter.
Experiment 1:
the experimental environment is as follows: experiments were performed on a single-chip "Shenwei 26010" processor, which contained 4 control cores and 256 arithmetic cores. The FIR filter algorithm is realized by using C language programming, and the version of the compiler used is SWCC builders: version 5.421-sw-500.
The data source is as follows: in the upper computer, input data calculated by the FIR filter can be obtained after analog-to-digital conversion is carried out on signals such as voice signals, image signals and the like, and the input data comprise a filter coefficient h [ M ] with the length of M. Because the bottleneck of the FIR filter algorithm occurs when the input signal quantity is large, the range of experimental input data selection is 60 ten thousand to 600 ten thousand points.
Two sets of experimental content were set up:
the first group is a comparison group, 256 operation cores and 4 control cores of the Shenwei processor are called to carry out multi-round parallel FIR filter calculation, and each operation core calculates a small-point FIR filter obtained by segmenting large-point input data by using an overlap preservation method;
the second group adopts the method of embodiment 1, and realizes the optimization experiment of SIMD, data rearrangement and pipeline scheduling simultaneously on the basis of achieving 256 operation cores parallel computation FIR filters by using the overlap reservation method.
In the experiment, the input data adopted by the first group and the second group are consistent with the running environment, the running beat number is counted by calculating the difference value of the beats at the end and the beginning of the program by using a beat counter function, and the scale of the input data and the beat number required by the running program are shown in table 1.
Table 1: scale of data input by experiment and number of beats of running program
Input data volume Number of beats of first group Number of beats of the second group Acceleration ratio
665663 58239938 1842384 31.61118312
998463 85867337 2597936 33.05214352
1331263 112582160 3357919 33.52735775
1664063 141503508.3 4127346 34.28438648
1996863 169485289.8 4894542 34.6274071
2329663 201013942.8 5650566 35.5741237
2662463 227322500.5 6423453 35.38945494
2995263 255556413.3 7191476 35.53601698
3328063 286874268.3 7956171 36.05682649
3660863 313502253.8 8709197 35.99668876
3993663 341866203.3 9476663 36.07453608
4326463 369981298.8 10234745 36.14953756
4659263 396691929.8 10997718 36.07038731
4992063 435803473 11763436 37.04729409
5324863 466555206.8 12517587 37.27197785
5657663 491176606.8 13285757 36.97016406
5990463 516828023.8 14047638 36.79109913
The acceleration ratio in table 1 is the ratio of the number of beats of the algorithm program of the FIR filter in which 256 arithmetic cores and 4 control cores of the schwegian processor in the first group perform multiple rounds of parallel operations to the number of beats of the algorithm program of the FIR filter in which the second group performs many rounds of parallel operations using the many-core parallel algorithm program in embodiment 1, the acceleration ratio is taken as the ordinate, and the input data amount is taken as the abscissa, as shown in fig. 3, the higher the acceleration ratio, the more performance improvement is described, and it can be seen that, when the input data amount is between 67 to 600 ten thousand, the average value of the acceleration ratio is 35, and the total fluctuation is small; meanwhile, as the data size increases, the acceleration ratio also increases: the acceleration ratio of the input data scale of 5.32X 10^6 is improved by 17.35 percent compared with the acceleration ratio of the input data scale of 0.67X 10^6, which shows that the algorithm of the invention has better performance on larger-scale data. Therefore, the invention utilizes the many-core parallel structure of the Shenwei processor to carry out the high-performance optimization algorithm of the FIR filter, and effectively improves the calculation speed of the FIR filter with large number of points.
Finally, it is also noted that the above-mentioned lists merely illustrate a few specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (6)

1. A high-performance realization method of an FIR filter based on a domestic many-core processor comprises four core groups, wherein each core group comprises 1 control core and 64 operation cores, and the FIR filter is characterized in that:
the method comprises the following steps:
s1, performing analog-to-digital (A-D) conversion on the analog signal to obtain input data, wherein the input data comprise a filter coefficient h [ M ] with the length of M, and the length of the input data is N;
s2, reading the input data in the step 1 by using a control core, distributing the input data to four core groups by using a Message Passing Interface (MPI), and supplementing M-1 zero values at the front end of the input data;
in the control core, calculating a twiddle factor W, carrying out end zero padding on a filter coefficient H [ M ] to 128 points, then carrying out Fast Fourier Transform (FFT) calculation to obtain a result H, and transmitting the twiddle factor W and the calculation result H of the FFT to all operation cores by using a Direct Memory Access (DMA);
s3, in each core group, dividing the input data in the step S1 into data blocks with the size of an R block being L, adding the last M-1 points of the previous sub-block, respectively calculating R times of FIR filters with the length of a single block being F ═ L + M-1, wherein the first M-1 points of the first sub-block are all 0, and transmitting continuous 4 x F point data to each operation core according to the input sequence; transmitting data of M-1+4 points in total, wherein the data comprises 4 FIR filtering calculation results of effective calculation input data; when 64 operation cores in each core group successfully receive the data, completing the transmission of the single-round data;
s4, after each operation core receives the single round data transmitted in step 3, starting to calculate a single round FIR filter, and simultaneously performing direct memory access DMA write-back of the FIR filter result of the previous round and direct memory access DMA read-in of the input data of the next round; after each operation core receives data, 4 x F point data output in the step 3 is stored in a vector type variable array by using a single instruction stream multiple data stream SIMD technology, then 4 groups of FFT (fast Fourier transform) calculation are carried out simultaneously, then an input signal is multiplied by the FFT result of the filter coefficient h [ M ], and then inverse discrete Fourier transform is carried out to obtain the result of single calculation; after the calculation of the operation core single-round FIR filter in each core group is completed, abandoning the first M-1 number of each sub-block with the length of F, and transmitting an output result y [ 4X L ] back to the control core by using direct memory access DMA;
and S5, repeating the operations of the step S3 and the step S4 on the input data of the step 1 until all the input data are processed, and sequentially storing the operation result y [4 x L ] of each turn into y [ N ] in the control core to obtain the final operation result y [ N ] of the N-point filter.
2. The FIR filter high-performance implementation method based on the domestic many-core processor according to claim 1, characterized in that:
in step S3, the input data is divided into R blocks with size L, in order to divide the FIR filter with large number of points into a plurality of FIR filters with small number of points with the same size, 256 computation cores are called to compute the FIR filters with small number of points in parallel, 128 points are selected as lengths F of the FIR filters with small number of points, the effective input data length processed by each FIR filter with length F is L, the total number R of blocks divided by the input data is determined by the input data amount, and the total round number needed to be computed by the computation cores is L
Figure FDA0003406917000000021
3. The FIR filter high-performance implementation method based on the domestic many-core processor according to claim 2, characterized in that:
the four input data with length F in step S3 include 4 × L (M-1) data transmitted to avoid aliasing and 4 × L results of the FIR filter for effective calculation of the input data.
4. The FIR filter high-performance implementation method based on the domestic many-core processor according to claim 3, characterized in that:
when each operation core calculates the single-round FIR filter in step 4, first, data is stored into the vector type variable array from a standard double-precision floating point type array, the length of the vector type variable array is 128, numbers with the same relative position in four L point data blocks are loaded into the same vector type variable of the vector type array, and simultaneously, 4 sets of 128-point orthogonal fourier transform FFT calculations are performed.
5. The FIR filter high-performance implementation method based on the domestic many-core processor according to claim 4, characterized in that:
the single instruction stream multiple data stream SIMD comprises the following specific steps: after receiving the data of M-1+4 x L points transmitted from the control core, the operation core stores 4 continuous data containing M-1+ L points overlapped by M-1 points into the vector type variables in sequence, wherein the real part and the imaginary part of the data need to be stored into the arrays of the two vector type variables in the same sequence respectively, and the sequence is as follows: imaginary vector type variable array xvi [ F ] and imaginary vector type variable array xvr [ F ], and then performing Fourier transform FFT calculation of F point; the result of FFT calculation is stored in imaginary vector type variable array xvi [ F ] and imaginary vector type variable array xvr [ F ], and multiplied with the result H of FFT calculation of filter coefficient H [ M ], the products are stored in vector type arrays YI [ F ] and YR [ F ], finally, YI and YR are subjected to IFFT calculation to obtain the result y [4 x L ] of one-turn operation in the operation core;
when the data volume of the FIR filter calculated by a single operation core is more than M-1+ L64 x 4, multiple rounds of direct memory access DMA read-write operation are required, and the communication time is hidden by using the calculation time by using a flow scheduling method; through the flow scheduling design, except for the read-in process of the first round and the write-back communication process of the last round, when the operation core performs the calculation of the data small block FIR filter of the current round, the read-in of the data required by the calculation of the next round and the write-back communication of the calculation result of the previous round are performed, so that the communication time of the intermediate data is hidden.
6. The FIR filter high-performance implementation method based on the domestic many-core processor according to claim 5, characterized in that:
the calculation formula of the FIR digital filter is as follows:
Figure FDA0003406917000000031
wherein y (n) is an output signal; h is the filter coefficient, h (k) represents the kth value of the filter coefficient; x (n) is input data, x (n-k) is the n-k value of the input data, and M represents the total length of the filter coefficients h [ M ];
the nature of the fast Fourier transform FFT is the same as that of the discrete Fourier transform DFT, and the calculation formula of the DFT is as follows:
Figure FDA0003406917000000032
wherein W is a twiddle factor, x (N) is nth input data, X (k) is kth output data value, and N is in the range of [0, N-1 ]]The integer number within the interval is such that,
Figure FDA0003406917000000033
denotes the twiddle factor when the input data length is N, n and k is a specific value:
Figure FDA0003406917000000034
CN202111519880.XA 2021-12-13 2021-12-13 High-performance implementation method of FIR filter based on domestic many-core processor Active CN114237716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111519880.XA CN114237716B (en) 2021-12-13 2021-12-13 High-performance implementation method of FIR filter based on domestic many-core processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111519880.XA CN114237716B (en) 2021-12-13 2021-12-13 High-performance implementation method of FIR filter based on domestic many-core processor

Publications (2)

Publication Number Publication Date
CN114237716A true CN114237716A (en) 2022-03-25
CN114237716B CN114237716B (en) 2024-11-22

Family

ID=80755363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111519880.XA Active CN114237716B (en) 2021-12-13 2021-12-13 High-performance implementation method of FIR filter based on domestic many-core processor

Country Status (1)

Country Link
CN (1) CN114237716B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115603708A (en) * 2022-09-29 2023-01-13 展讯通信(天津)有限公司(Cn) FIR digital filtering apparatus, filtering method, medium and computer program product

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6526430B1 (en) * 1999-10-04 2003-02-25 Texas Instruments Incorporated Reconfigurable SIMD coprocessor architecture for sum of absolute differences and symmetric filtering (scalable MAC engine for image processing)
CN101031904A (en) * 2004-07-13 2007-09-05 3加1科技公司 Programmable processor system with two kinds of subprocessor to execute multimedia application
CN111461311A (en) * 2020-03-26 2020-07-28 中国科学技术大学 Convolutional neural network operation acceleration method and device based on many-core processor

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6526430B1 (en) * 1999-10-04 2003-02-25 Texas Instruments Incorporated Reconfigurable SIMD coprocessor architecture for sum of absolute differences and symmetric filtering (scalable MAC engine for image processing)
CN101031904A (en) * 2004-07-13 2007-09-05 3加1科技公司 Programmable processor system with two kinds of subprocessor to execute multimedia application
CN111461311A (en) * 2020-03-26 2020-07-28 中国科学技术大学 Convolutional neural network operation acceleration method and device based on many-core processor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOXIA ZOU;SHOGO MURAMATSU;HITOSHI KIYA: "The generalized overlap-add and overlap-save methods using discrete sine and cosine transforms for FIR filtering", 《PROCEEDINGS OF 1996 3RD INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING(ICSP\'96) VOLUME Ⅰ》, 14 October 1996 (1996-10-14), pages 91 - 94, XP010209658, DOI: 10.1109/ICSIGP.1996.566980 *
杨昕瑶: "适用于众核处理器的信号处理算法实现技术", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 June 2022 (2022-06-15), pages 136 - 52 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115603708A (en) * 2022-09-29 2023-01-13 展讯通信(天津)有限公司(Cn) FIR digital filtering apparatus, filtering method, medium and computer program product

Also Published As

Publication number Publication date
CN114237716B (en) 2024-11-22

Similar Documents

Publication Publication Date Title
CN103440121B (en) A kind of triangular matrix multiplication vectorization method of vector processor-oriented
KR102443546B1 (en) matrix multiplier
CN102200964B (en) Parallel-processing-based fast Fourier transform (FFT) device and method thereof
CN110415157A (en) A computing method and device for matrix multiplication
CN102340296B (en) A GPU-based implementation method of high-order digital FIR filter frequency domain parallel processing
WO2023045516A1 (en) Fft execution method, apparatus and device
Wang et al. Novel memory reference reduction methods for FFT implementations on DSP processors
CN115328440A (en) A method and device for realizing general sparse matrix multiplication based on 2D systolic array
CN114237716A (en) High-performance implementation method of FIR filter based on domestic many-core processor
CN110750249B (en) A method and device for generating fast Fourier transform codes
CN202217276U (en) FFT device based on parallel processing
US7246143B2 (en) Traced fast fourier transform apparatus and method
CN119848404A (en) Data processing method, device, equipment and medium for sparse matrix multiplication calculation
Maleki et al. Automatic hierarchical parallelization of linear recurrences
CN118227934A (en) Cooley-Tukey FFT algorithm high-performance optimization method of multi-GPU platform
JP4057729B2 (en) Fourier transform method and program recording medium
CN102231624B (en) Vectorization Implementation Method of FIR of Floating Point Complex Number Block Oriented to Vector Processor
CN117493748A (en) Implementation method and device for low bit width data matrix vector multiplication of vector processor
WO2019232091A1 (en) Radix-23 fast fourier transform for an embedded digital signal processor
Meyer-Baese Fourier transforms
US20180373676A1 (en) Apparatus and Methods of Providing an Efficient Radix-R Fast Fourier Transform
US7447722B2 (en) Low latency computation in real time utilizing a DSP processor
RU148684U1 (en) VECTOR SIGNAL FILTER DEVICE
Lv et al. A FPGA-based accelerator implementaion for YOLOv2 object detection using Winograd algorithm
US7409418B2 (en) Linearly scalable finite impulse response filter

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant