CN109615591A

CN109615591A - A 3D block matching noise reduction method based on GPU parallel acceleration

Info

Publication number: CN109615591A
Application number: CN201811426126.XA
Authority: CN
Inventors: 韩玉; 李磊; 闫镔; 荣利会; 陈健; 席晓琦; 梁宁宁; 孙艳敏; 王敬雨
Original assignee: Dongguan Letter Of Fusion Innovation Research Institute; PLA Information Engineering University
Current assignee: Dongguan Letter Of Fusion Innovation Research Institute; PLA Information Engineering University
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2019-04-12

Abstract

The present invention relates to the technical field of image processing and specifically discloses a three-dimensional block-matching noise reduction method based on GPU parallel acceleration. The method comprises: performing boundary symmetric extension preprocessing on the image to be processed; sending the preprocessed image data to the global memory of the GPU; creating a multithreaded grid and accelerating the matching and grouping of similar image blocks by combining coalesced global memory access with repeated cyclic reuse of shared memory; obtaining first-step noise reduction estimate data of the three-dimensional similarity matrix using a parallel-accelerated hard-threshold collaborative filtering kernel function; taking the first-step noise reduction estimate data as a reference, obtaining second-step noise reduction estimate data using a parallel-accelerated Wiener collaborative filtering kernel function; and removing the extended boundary pixels from the second-step noise reduction estimate data. The invention improves data access speed, reduces the latency of repeated accesses, effectively removes noise from the image, and facilitates real-time noise reduction of large images.

Description

A three-dimensional block-matching noise reduction method based on GPU parallel acceleration

Technical field

The present invention relates to the technical field of image processing and specifically discloses a three-dimensional block-matching noise reduction method based on GPU parallel acceleration.

Background art

Digital images are affected by the imaging device and the external environment during acquisition and transmission, so they usually contain a large amount of noise that degrades image quality. In medical applications in particular, low-dose tube voltage and current often introduce heavy noise into CT images, degrading image quality and hampering the physician's clinical diagnosis.

The widely used three-dimensional block-matching (BM3D) algorithm combines local, non-local, multi-scale sparse, and adaptive filtering, and is considered the best image denoising algorithm currently available. However, because the algorithm relies on collaborative filtering of similar image blocks, it has high algorithmic complexity and a heavy computational load. When processing large CT image data it takes a long time and has low processing efficiency, so it cannot meet practical needs.

Therefore, a method that solves the above problems is needed.

Summary of the invention

To overcome the shortcomings and deficiencies of the prior art, the object of the present invention is to provide a three-dimensional block-matching noise reduction method based on GPU parallel acceleration.

To achieve the above object, the present invention adopts the following scheme.

A three-dimensional block-matching noise reduction method based on GPU parallel acceleration, comprising:

Boundary symmetric extension preprocessing is performed on the image to be processed on the CPU side;

The preprocessed image data are sent from the CPU host to the global memory of the GPU;

A multithreaded grid is created, and the matching and grouping of similar image blocks is accelerated in parallel by combining coalesced global memory access with repeated cyclic reuse of shared memory;

First-step noise reduction estimate data of the three-dimensional similarity matrix are obtained using a parallel-accelerated hard-threshold collaborative filtering kernel function;

With the first-step noise reduction estimate data as a reference, second-step noise reduction estimate data are obtained using a parallel-accelerated Wiener collaborative filtering kernel function;

The second-step noise reduction estimate data are sent from the GPU to the CPU host, and the extended boundary pixels are removed to obtain the denoised image.

Further, creating the multithreaded grid comprises:

The image block matching process of each reference block in the image is assigned to one thread block (block), and the similarity matching between each candidate image block in the search window and the reference block is assigned to one thread (thread);

Reference image blocks are selected at a fixed pixel step, advancing successively along the row and column directions; the size of the thread grid (grid) is determined by the number of reference image blocks in the image, and the size of the thread block is determined by the number of image blocks in the search window of a reference image block.

Further, accelerating in parallel the matching and grouping of similar image blocks by combining coalesced global memory access with repeated cyclic reuse of shared memory comprises:

All threads in the same warp execute the same instruction to access contiguous units of global memory, thereby obtaining the coalesced access pattern;

The search window is divided into 4 partitions of size 32*32, and within each partition the similarity computation is performed with threads arranged as block(16,16) to obtain the similar blocks. Two image blocks are considered similar when d < τ_threod, where d, the distance between image blocks, is defined as the norm of the difference between corresponding elements of the two image blocks divided by the size of the image block, and τ_threod is a suitably chosen distance threshold;

The pixel data in the search window are copied into shared memory partition by partition in a loop, with threadIdx.x < 16 and threadIdx.y < 16 set;

The specified number of most similar image blocks is found using a parallel minimum-reduction strategy.

Further, finding the specified number of most similar image blocks using the parallel minimum-reduction strategy comprises:

The similar image blocks of the reference image block are gathered into a three-dimensional matrix in ascending order of similarity distance;

n threads are enabled, corresponding one-to-one to the n distance values D[n] obtained from the similarity computation;

The distance value of the i-th thread is compared with that of the (i+n/2)-th thread; the smaller value is placed in the left half and the larger value in the right half, so that the left half spans D[0] to D[n/2] and the right half spans D[n/2] to D[n];

After the parallel comparison completes, the number of comparing threads is halved and the same comparison is applied to the distance values in the left half; this is repeated, halving the thread count each time, until the left half has shrunk to D[0];

D[0] is taken as the minimum of the distance values; the starting access position of D[n] is then moved back by one and the above steps are repeated to find the next minimum, until the specified number of image blocks with the smallest distances has been found.

Further, obtaining the first-step noise reduction estimate data of the three-dimensional similarity matrix using the parallel-accelerated hard-threshold collaborative filtering kernel function comprises:

Instruction-mix optimization is used for acceleration: the three-dimensional forward transform of the three-dimensional similarity matrix, hard-threshold filtering, the three-dimensional inverse transform, and weighted estimation are integrated into a single hard-threshold collaborative filtering kernel function. The three-dimensional forward transform consists of a forward two-dimensional biorthogonal spline wavelet transform followed by a one-dimensional Walsh-Hadamard transform; the three-dimensional inverse transform consists of a one-dimensional Walsh-Hadamard transform followed by an inverse two-dimensional biorthogonal spline wavelet transform;

A certain number of similar image blocks is selected according to the size of the reference image block; the thread grid (grid) is kept unchanged and the size of the thread block (block) is set accordingly;

The data of the selected similar image blocks are copied from global memory to shared memory using coalesced global memory access, forming the three-dimensional similarity matrix;

The forward two-dimensional biorthogonal spline wavelet transform and the one-dimensional Walsh-Hadamard transform are applied to the three-dimensional matrix, and hard-threshold filtering is performed in the transform domain; after hard-threshold filtering, the one-dimensional Walsh-Hadamard transform and the inverse two-dimensional biorthogonal spline wavelet transform yield the first-step noise reduction estimate data of the image blocks;

The gray values in the hard-threshold-filtered image blocks are weighted and averaged, the weighted average is assigned to the individual pixel of the image block, and Kaiser window coefficients are introduced into the weighted average to optimize the weighting, yielding the first-step noise reduction value of the image;

Meanwhile, the four filter coefficients lpd, hpd, lpr, and hpr of the two-dimensional biorthogonal spline wavelet transform applied to the three-dimensional similarity matrix are stored in constant memory, and each thread uses registers to define private variables that hold intermediate results.

Further, obtaining the second-step noise reduction estimate data using the parallel-accelerated Wiener collaborative filtering kernel function, with the first-step noise reduction estimate data as a reference, comprises:

Instruction-mix optimization is used for acceleration: the three-dimensional forward transform of the three-dimensional similarity matrix, Wiener filtering, the three-dimensional inverse transform, and weighted estimation are integrated into a single Wiener collaborative filtering kernel function. The three-dimensional forward transform consists of a forward two-dimensional discrete cosine transform followed by a one-dimensional Walsh-Hadamard transform; the three-dimensional inverse transform consists of a one-dimensional Walsh-Hadamard transform followed by an inverse two-dimensional discrete cosine transform;

With the first-step noise reduction estimate data as a reference, the forward two-dimensional discrete cosine transform and the one-dimensional Walsh-Hadamard transform are applied to the original three-dimensional similarity matrix; Wiener filtering is performed, and the one-dimensional Walsh-Hadamard transform and the inverse two-dimensional discrete cosine transform are applied to the Wiener-filtered three-dimensional similarity matrix to obtain the second-step noise reduction estimate data;

The gray values in the Wiener-filtered image blocks are weighted and averaged, the weighted average is assigned to the individual pixel of the image block, and Kaiser window coefficients are introduced into the weighted average to optimize the weighting, yielding the second-step noise reduction value of the image;

Meanwhile, the transform coefficients of the two-dimensional discrete cosine transform are stored, using registers, in private variables defined by each thread.

Further, in the boundary symmetric extension preprocessing, the boundary column pixels on the left and right sides are symmetrically extended first, then the boundary row pixels on the top and bottom sides are symmetrically extended, and the pixel width of the boundary extension is determined by the radius of the search window.

Beneficial effects of the present invention: a three-dimensional block-matching noise reduction method based on GPU parallel acceleration is provided. Using coalesced access, the data required by each GPU thread block are read once from global memory into shared memory, and shared memory is then reused cyclically. This improves data access speed, reduces the latency of repeated accesses, greatly improves the overall performance of the algorithm, and raises computational efficiency. Combined with the hard-threshold collaborative filtering kernel function and the Wiener collaborative filtering kernel function, the method effectively removes noise from the image and facilitates real-time noise reduction of large images.

Brief description of the drawings

Fig. 1 is a flow diagram of an embodiment of the present invention.

Fig. 2 is a schematic diagram of the boundary symmetric extension preprocessing of an image in an embodiment of the present invention.

Fig. 3 is a schematic diagram of thread grid allocation in an embodiment of the present invention.

Fig. 4 is a schematic diagram of the shared memory usage mechanism in an embodiment of the present invention.

Fig. 5 is a schematic diagram of parallel minimum-reduction sorting in an embodiment of the present invention.

Fig. 6 is the original CT image of the head phantom used by the present invention.

Fig. 7 is the CT image of the head phantom after denoising by the present invention.

Fig. 8 is the original CT image of the body phantom used by the present invention.

Fig. 9 is the CT image of the body phantom after denoising by the present invention.

Specific embodiment

To aid the understanding of those skilled in the art, the present invention is further described below with reference to the embodiment and the accompanying drawings. The content of the embodiment does not limit the invention.

A three-dimensional block-matching noise reduction method based on GPU parallel acceleration, as shown in Fig. 1, comprises the following steps.

To prevent boundary pixels of the CT image from going out of bounds and being left unprocessed, boundary symmetric extension preprocessing is applied to the image to be processed. As shown in Fig. 2, parameters such as the noise variance, the number of similar blocks, and the search window radius must be set before the symmetric extension preprocessing; the pixel width of the boundary extension is determined by the search window radius and is typically 16 pixels.

After the boundary symmetric extension preprocessing is complete, the extended image data are sent from the host to the global memory of the GPU.
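The symmetric extension and its later removal can be sketched in a few lines. This is a minimal illustration in plain Python, not the CPU-side code of the method; the function names `symmetric_extend` and `crop_extension` are hypothetical, and the sketch assumes that the mirror repeats the boundary pixel itself, which the text does not specify.

```python
def symmetric_extend(image, pad):
    """Mirror-pad a 2-D image (list of rows) by `pad` pixels on every side.

    Assumption: the boundary pixel itself is repeated by the mirror.
    Requires 0 <= pad <= min(height, width).
    """
    if pad == 0:
        return [list(row) for row in image]
    # Step 1: extend the left and right boundary columns.
    rows = []
    for row in image:
        row = list(row)
        rows.append(row[:pad][::-1] + row + row[-pad:][::-1])
    # Step 2: extend the top and bottom boundary rows of the widened image.
    top = [list(r) for r in rows[:pad][::-1]]
    bottom = [list(r) for r in rows[-pad:][::-1]]
    return top + rows + bottom


def crop_extension(image, pad):
    """Remove the `pad`-pixel border added by symmetric_extend."""
    if pad == 0:
        return [list(row) for row in image]
    return [list(row[pad:-pad]) for row in image[pad:-pad]]


image = [[1, 2, 3],
         [4, 5, 6]]
extended = symmetric_extend(image, 1)
restored = crop_extension(extended, 1)
```

With the typical setting from the text, `pad` would be 16, and cropping with the same value after the second filtering step recovers an image of the original size.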

Threads must be allocated when the multithreaded grid is created, as shown in Fig. 3. Because the similar-block matching processes of different reference blocks are independent of one another, the image block matching process of each reference block is assigned to one thread block (block), and the similarity matching between each candidate image block in the search window and the reference block is assigned to one thread (thread). Reference image blocks are then selected along the row and column directions of the image, advancing by a step of 3 pixels. When the number of reference image blocks in the image is M*N, the size of the thread grid (grid) is (M, N). Since 32*32 candidate image blocks in the search window of each reference block undergo similarity matching, the size of the thread block (block) is fixed at (32,32).
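The sizing rule above can be illustrated with a small calculation. The sketch below is only an assumption about how M and N might be counted: the helper name `launch_config` is hypothetical, and it supposes that reference blocks start at every third pixel and must fit entirely inside the (already extended) image. The 8*8 block size is the reference block size given later in the description.

```python
def launch_config(height, width, patch=8, step=3, window=32):
    """Return the (grid, block) dimensions described in the text.

    Assumption: reference blocks start at every `step`-th pixel and must fit
    entirely inside the image.
    """
    m = (height - patch) // step + 1   # reference blocks along one direction
    n = (width - patch) // step + 1    # reference blocks along the other direction
    grid = (m, n)                      # one thread block per reference block
    block = (window, window)           # one thread per candidate block
    return grid, block


grid, block = launch_config(2048, 2048)
threads_per_block = block[0] * block[1]
```

Under these assumptions a 2048*2048 image gives a grid of (681, 681), and the (32,32) thread block uses exactly the 1024 threads that the text gives as the per-block maximum.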

Next, the matching and grouping of similar image blocks is accelerated in parallel by combining coalesced global memory access with repeated cyclic reuse of shared memory. As shown in Fig. 4, the best access pattern is obtained when all threads in the same warp execute the same instruction and access contiguous units of global memory. A warp has 32 threads, so the search window is divided into 4 partitions of size 32*32; this satisfies the requirements for coalesced global memory access and brings throughput as close as possible to the peak global memory bandwidth. Within each partition, the similarity computation is performed in parallel with threads allocated as block(16,16) to obtain the similar blocks.

During the similarity computation, a detection window of size N*N slides pixel by pixel within the search window of radius R centered on the reference image block P_k. The similarity between the image block selected by the detection window and the reference image block P_k is evaluated, and the similar image blocks are gathered to form the three-dimensional similar image block group of P_k. Two image blocks are considered similar when d < τ_threod and dissimilar otherwise, where d, the distance between the image blocks, is defined as the norm of the difference between corresponding elements of the two image blocks divided by the size of the image block, and τ_threod is a suitably chosen distance threshold.
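A serial sketch of this similarity test may help make the criterion concrete. It is a plain-Python illustration, not the GPU kernel: the names `block_distance`, `extract`, and `match_blocks` are hypothetical, the norm is assumed to be Euclidean, and the "size of the image block" is assumed to mean its pixel count. On the GPU, each iteration of the two inner loops would be handled by one thread.

```python
import math


def block_distance(a, b):
    """Norm of the element-wise difference divided by the block size.

    Assumptions: the norm is Euclidean and "size" means the pixel count.
    """
    pixels = len(a) * len(a[0])
    squared = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return math.sqrt(squared) / pixels


def extract(image, top, left, size):
    """Cut a size*size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def match_blocks(image, top, left, size, radius, tau):
    """Return (distance, row, col) for every block with distance < tau, nearest first."""
    ref = extract(image, top, left, size)
    last_row = len(image) - size
    last_col = len(image[0]) - size
    matches = []
    for r in range(max(0, top - radius), min(last_row, top + radius) + 1):
        for c in range(max(0, left - radius), min(last_col, left + radius) + 1):
            d = block_distance(ref, extract(image, r, c, size))
            if d < tau:               # the d < tau_threod criterion
                matches.append((d, r, c))
    return sorted(matches)


image = [[0] * 4 for _ in range(4)]
image[3][3] = 8                        # one outlier pixel in the corner
matches = match_blocks(image, 0, 0, size=2, radius=2, tau=1.0)
```

In this toy image only the block that contains the outlier pixel exceeds the threshold, so it is the only candidate left out of the group.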

After the parallel similarity matching is complete, the partition data are copied to shared memory. Each global memory access incurs a latency of 400 to 600 clock cycles, and the pixel data are accessed repeatedly many times during image block matching. The pixel data in the search window are therefore copied into shared memory, which acts as a cache, so that accesses to global memory are minimized. Because of GPU hardware limits, a thread block (block) can hold at most 1024 threads. If the pixel data of the 32*32 candidate blocks in the search window were copied from global memory to shared memory in a single pass, the total number of pixels would exceed the maximum thread count. To solve this problem, the pixel data in the search window are copied into shared memory partition by partition in a loop, which is also executed in parallel. In addition, to avoid repeated matching of image blocks where partitions overlap, threadIdx.x < 16 and threadIdx.y < 16 are set after the partition data have been copied to shared memory.
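The looped copy can be simulated serially to show why more than one pass is needed. The sketch below is an interpretation, not the patented kernel: the function name `cyclic_copy` is hypothetical, and the 39*39 region is an assumed example (32 candidate positions per axis plus an 8*8 block needs 32 + 8 - 1 = 39 pixels per axis, which exceeds 1024 pixels in total).

```python
def cyclic_copy(global_mem, n_threads=1024):
    """Copy `global_mem` into a shared buffer in rounds of at most `n_threads` elements.

    Within a round, thread `tid` copies element `base + tid`, so neighbouring
    threads touch neighbouring addresses (the coalesced pattern).
    """
    shared = [None] * len(global_mem)
    rounds = 0
    for base in range(0, len(global_mem), n_threads):   # one pass of the loop
        for tid in range(n_threads):                    # these run in parallel on a GPU
            index = base + tid
            if index < len(global_mem):                 # guard the final partial round
                shared[index] = global_mem[index]
        rounds += 1
    return shared, rounds


pixels = list(range(39 * 39))          # hypothetical 39*39 pixel region (1521 values)
shared, rounds = cyclic_copy(pixels)
```

With 1521 pixels and 1024 threads the copy takes two rounds, after which every later read during matching can be served from the shared buffer instead of global memory.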

The specified number of most similar image blocks is found using a parallel minimum-reduction strategy, as shown in Fig. 5. A certain number of the most similar image blocks is chosen, according to the actual situation, to take part in subsequent processing; that is, the image blocks with the smallest distances must be selected from the similar image blocks. The similar image blocks of the reference image block are to be gathered into a three-dimensional matrix in ascending order of similarity distance. First, n threads are enabled, corresponding one-to-one to the n distance values D[n] obtained from the similarity computation. The distance value of the i-th thread is compared with that of the (i+n/2)-th thread; the smaller value is placed in the left half and the larger in the right half, so the left half spans D[0] to D[n/2] and the right half spans D[n/2] to D[n]. After these n/2 threads have compared in parallel, the minimum must lie in the left half, D[0] to D[n/2]. The number of comparing threads is then halved to n/4 and the same comparison is applied to the left half, so the left half containing the minimum shrinks to D[0] to D[n/4]. These steps are repeated until the left half has shrunk to D[0], at which point D[0] is the minimum of all the distance values. Each time a reduction pass has found the minimum, the starting access position of the distance array D is moved back by one, so the array length becomes n-1, and the reduction is repeated in a loop until the specified number of image blocks with the smallest distances has been found. Because the number of comparing threads is halved at every reduction step, the sorting time is greatly reduced.
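The reduction described above can be traced with a serial simulation. This is a sketch under stated assumptions rather than the GPU implementation: the names `min_reduce_to_front` and `k_smallest` are hypothetical, each iteration of the inner loop stands in for one GPU thread, and odd lengths are handled by leaving the unpaired middle element in the left half, a detail the text does not cover.

```python
def min_reduce_to_front(d, start):
    """Rearrange d in place so that d[start] holds the minimum of d[start:]."""
    n = len(d) - start
    while n > 1:
        half = (n + 1) // 2                 # size of the left half after this round
        for i in range(n // 2):             # each iteration plays the role of one thread
            lo, hi = start + i, start + i + half
            if d[hi] < d[lo]:               # smaller value goes to the left half
                d[lo], d[hi] = d[hi], d[lo]
        n = half                            # only the left half is compared next


def k_smallest(distances, k):
    """Return the k smallest distances in ascending order."""
    d = list(distances)
    for start in range(min(k, len(d))):     # move the start position back by one each pass
        min_reduce_to_front(d, start)
    return d[:k]


nearest = k_smallest([5, 3, 8, 1, 9, 2, 7], 3)
```

To recover which image block each distance belongs to, the same routine can be run on `(distance, block_index)` tuples, since Python compares tuples element by element.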

Floating-point instructions, load instructions, and branch instructions all consume instruction-processing bandwidth, and the instruction-processing bandwidth of each SM is limited. Instruction-mix optimization is therefore used for acceleration: the three-dimensional forward transform of the three-dimensional similarity matrix (forward two-dimensional biorthogonal spline wavelet transform + one-dimensional Walsh-Hadamard transform), hard-threshold filtering, the three-dimensional inverse transform (one-dimensional Walsh-Hadamard transform + inverse two-dimensional biorthogonal spline wavelet transform), and weighted estimation are integrated into a single hard-threshold collaborative filtering kernel function. This reduces repeated loading and copy instructions for unnecessary intermediate variables and saves unnecessary time.

For thread allocation, the reference image block has size 8*8 and hard-threshold collaborative filtering uses 16 similar image blocks, so the size of the thread block (block) is set to (64,16) while the thread grid (grid) remains unchanged. Coalesced access is again used to copy the selected most similar image block data from global memory to shared memory, forming the three-dimensional similarity matrix.

The forward two-dimensional biorthogonal spline wavelet (BIOR) transform is applied to each image block of the three-dimensional matrix, followed by a one-dimensional Walsh-Hadamard transform across the blocks, and hard-threshold filtering is performed in the transform domain. After filtering, the one-dimensional Walsh-Hadamard transform and the inverse two-dimensional BIOR transform yield the first-step noise reduction estimates of all image blocks in the group. The process therefore consists of three stages applied in order: the three-dimensional forward transform (the two-dimensional BIOR transform of each image block in the group and the one-dimensional Walsh-Hadamard transform along the inter-block direction), the hard-threshold filtering γ, and the three-dimensional inverse transform of the hard-threshold-filtered three-dimensional matrix (the one-dimensional Walsh-Hadamard transform along the inter-block direction and the inverse two-dimensional BIOR transform of each image block).
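The one-dimensional Walsh-Hadamard transform along the inter-block direction can be sketched as follows. This is a generic fast Walsh-Hadamard transform in natural (Hadamard) ordering, written in plain Python for illustration; the function names are hypothetical, and the ordering and normalization used by the method itself are not stated in the text. The input would be the values at one pixel position taken across the 16 (or 32) stacked blocks.

```python
def walsh_hadamard(values):
    """Unnormalized fast Walsh-Hadamard transform; len(values) must be a power of 2."""
    out = list(values)
    h = 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b   # butterfly: sum and difference
        h *= 2
    return out


def inverse_walsh_hadamard(coeffs):
    """Inverse transform: apply the same butterflies, then divide by the length."""
    n = len(coeffs)
    return [c / n for c in walsh_hadamard(coeffs)]


stack = [1, 0, 1, 0]                  # one pixel position across 4 stacked blocks
coeffs = walsh_hadamard(stack)
recovered = inverse_walsh_hadamard(coeffs)
```

Because the transform uses only additions and subtractions, it is cheap to evaluate per thread, and the same routine serves for both the forward and the inverse direction apart from the final scaling.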

After the three-dimensional forward transform, noise tends to concentrate in the small transform-domain coefficients, while true detail information concentrates in the large ones. Hard-threshold filtering therefore sets the transform-domain coefficients that are smaller than the threshold to 0 and leaves the other coefficients unchanged:

γ(x) = 0 if |x| is below the threshold, and γ(x) = x otherwise,

where x is a transform coefficient of the three-dimensional matrix after the three-dimensional forward transform, and the threshold is set by the shrinkage parameter of the hard-threshold filter and the estimated noise deviation σ.
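A minimal sketch of the hard-threshold step follows. The function name `hard_threshold` is hypothetical, and the example assumes that the threshold is the product of the shrinkage parameter and σ, as in the standard BM3D formulation; the numeric values 2.7 and 10 are illustrative only. The function also returns the number of surviving coefficients, which the weighting step needs.

```python
def hard_threshold(coeffs, threshold):
    """Zero every coefficient whose magnitude is below `threshold`.

    Returns the filtered coefficients and the count of non-zero survivors.
    """
    kept = [c if abs(c) >= threshold else 0.0 for c in coeffs]
    n_nonzero = sum(1 for c in kept if c != 0.0)
    return kept, n_nonzero


sigma = 10.0        # illustrative noise deviation
shrinkage = 2.7     # illustrative shrinkage parameter (assumption)
kept, n_nonzero = hard_threshold([30.0, -5.0, 12.0, -40.0], shrinkage * sigma)
```

Here the two coefficients with magnitude below 27 are treated as noise and removed, while the two large coefficients are kept unchanged.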

After hard-threshold collaborative filtering, every pixel of every image block has an estimate. A given pixel i, however, may appear in several image blocks and therefore have several estimates, so the estimates from these overlapping blocks must be weighted and averaged to obtain the basic noise reduction estimate of pixel i. The basic weight is determined by the number of non-zero transform-domain coefficients in the three-dimensional matrix after hard-threshold filtering.

In the weighted averaging, the Kaiser window coefficients W_kaiser are additionally applied in the weighted aggregation to further reduce edge effects. The value of any pixel i in the basic noise reduction estimate image is then the average of all block estimates covering pixel i, each weighted by its basic weight together with the corresponding Kaiser window coefficient.
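The aggregation can be sketched with two accumulation buffers, one for the weighted sum and one for the sum of weights. This is an illustrative plain-Python version under assumptions: the names `bessel_i0`, `kaiser_window`, and `aggregate` are hypothetical, the basic weight is passed in rather than derived, the per-pixel weight is taken to be the product of the basic weight and the window coefficient, and the Kaiser shape parameter `beta` is an arbitrary example value.

```python
import math


def bessel_i0(x, terms=30):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        if k > 0:
            term *= (x / 2.0) / k
        total += term * term
    return total


def kaiser_window(n, beta):
    """1-D Kaiser window of length n (n >= 2)."""
    window = []
    for i in range(n):
        ratio = 2.0 * i / (n - 1) - 1.0
        arg = beta * math.sqrt(max(0.0, 1.0 - ratio * ratio))
        window.append(bessel_i0(arg) / bessel_i0(beta))
    return window


def aggregate(shape, estimates):
    """Weighted average of overlapping block estimates.

    `estimates` holds tuples (top, left, block, basic_weight, window), where
    `window` has the same shape as `block`.
    """
    height, width = shape
    num = [[0.0] * width for _ in range(height)]
    den = [[0.0] * width for _ in range(height)]
    for top, left, block, weight, window in estimates:
        for r, row in enumerate(block):
            for c, value in enumerate(row):
                k = weight * window[r][c]          # basic weight times window coefficient
                num[top + r][left + c] += k * value
                den[top + r][left + c] += k
    return [[n / d if d > 0 else 0.0 for n, d in zip(nr, dr)]
            for nr, dr in zip(num, den)]


w = kaiser_window(3, 2.0)
flat = [[1.0, 1.0]]
result = aggregate((1, 3), [(0, 0, [[2.0, 2.0]], 1.0, flat),
                            (0, 1, [[4.0, 4.0]], 3.0, flat)])
```

In the small example the middle pixel is covered by both blocks, so its value is pulled toward the estimate with the larger weight. A two-dimensional window for an N*N block can be formed as the outer product of `kaiser_window(N, beta)` with itself.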

Meanwhile, when the two-dimensional biorthogonal spline wavelet transform of the image blocks is applied to the three-dimensional similarity matrix, the four filter coefficients of the wavelet transform, lpd, hpd, lpr, and hpr, are constants that every thread accesses repeatedly and frequently. They are therefore stored in constant memory, which is cached, to speed up access and save computation time. Registers and shared memory are located on the GPU chip and are the two fastest memories. The two-dimensional biorthogonal spline wavelet transform involves iterative computation and needs variables to hold intermediate transform results, so registers are fully exploited: each thread defines private variables that hold the intermediate results, improving the efficiency of the iterative computation.

With the first-step noise reduction estimate image as a reference, Wiener filtering is applied to the original noisy image to obtain the noise reduction estimate. Instruction-mix optimization is again used first: the three-dimensional forward transform of the three-dimensional similarity matrix (forward two-dimensional discrete cosine transform + one-dimensional Walsh-Hadamard transform), Wiener filtering, the three-dimensional inverse transform (one-dimensional Walsh-Hadamard transform + inverse two-dimensional discrete cosine transform), and weighted estimation are integrated into a single Wiener collaborative filtering kernel function.

For thread allocation, Wiener filtering uses 32 similar image blocks, so the size of the thread block (block) is set to (64,32) while the thread grid (grid) remains unchanged. Coalesced access is again used to copy the data to shared memory, forming a new three-dimensional similarity matrix. At the same time, the three-dimensional similar-block matrix of the original image is obtained from the original noisy image. The forward two-dimensional discrete cosine transform (DCT) and the one-dimensional Walsh-Hadamard transform are applied to each of the two three-dimensional matrices, and the empirical Wiener coefficients are then computed from the three-dimensional forward transform of the basic estimate image.

The empirical Wiener coefficients are then used to apply Wiener filtering to the three-dimensional matrix of the original noisy image. After this processing, the inverse transform yields the noise reduction estimates of all image blocks in the group.

The estimates of overlapping pixels are weighted and averaged on the same principle as in the first step, except that the basic weight is determined by the empirical Wiener coefficients. This yields the value of any pixel i in the second-step noise reduction estimate image.
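The Wiener step can be sketched on a flat list of transform coefficients. The exact coefficient formula is not reproduced in this text, so the sketch assumes the standard BM3D form, in which each empirical Wiener coefficient equals b² / (b² + σ²) for the corresponding transform coefficient b of the basic estimate; the function name `wiener_shrink` and the sample numbers are hypothetical.

```python
def wiener_shrink(noisy_coeffs, basic_coeffs, sigma):
    """Scale noisy transform coefficients by empirical Wiener coefficients.

    Assumption: gain = b^2 / (b^2 + sigma^2), with b taken from the transform
    of the basic (first-step) estimate. Requires sigma > 0.
    """
    gains = [b * b / (b * b + sigma * sigma) for b in basic_coeffs]
    filtered = [g * z for g, z in zip(gains, noisy_coeffs)]
    return filtered, gains


filtered, gains = wiener_shrink(noisy_coeffs=[10.0, 1.0],
                                basic_coeffs=[3.0, 0.0],
                                sigma=4.0)
```

Coefficients where the basic estimate carries energy are attenuated only partly, while coefficients where the basic estimate is zero are suppressed entirely, which is how the first-step result guides the second step.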

Meanwhile, because computing the transform coefficients of the two-dimensional discrete cosine transform of an image block involves very expensive trigonometric functions, the coefficients are computed in advance and, making full use of registers, stored in private variables defined by each thread. The second-step noise reduction estimate data are then sent from the GPU to the CPU host, and the extended boundary pixels are removed to obtain the denoised image.
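Precomputing the cosine coefficients can be illustrated as follows. This is a plain-Python sketch with hypothetical function names; it assumes the orthonormal DCT-II basis, which the text does not specify. The table is built once, and every subsequent transform is a sequence of multiply-adds with no trigonometric calls.

```python
import math


def dct_matrix(n):
    """Precompute the n*n orthonormal DCT-II basis (assumed variant)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def dct_1d(values, basis):
    """Forward 1-D DCT using the precomputed table (no trigonometry here)."""
    return [sum(b * v for b, v in zip(row, values)) for row in basis]


def dct_2d(block, basis):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(row, basis) for row in block]
    cols = [dct_1d(list(col), basis) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


basis = dct_matrix(8)                       # computed once for 8*8 blocks
flat_block = [[1.0] * 8 for _ in range(8)]
spectrum = dct_2d(flat_block, basis)
```

For a constant 8*8 block, all the energy lands in the first coefficient of the spectrum, which gives a quick sanity check on the table.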

The three-dimensional block-matching noise reduction method based on GPU parallel acceleration provided by the present invention takes into account that image block detection involves a large amount of repeated data access across threads and that each global memory access incurs a latency of 400 to 600 clock cycles. The data required by each GPU thread block are therefore read from global memory into shared memory once, and shared memory is then reused, which avoids the latency of repeated accesses and improves computational efficiency. Coalesced access is used whenever data are read from global memory into shared memory, which raises data access speed and approaches the peak global memory bandwidth as closely as possible. The three-dimensional block-matching algorithm can thus be accelerated on the GPU, greatly improving overall performance and saving processing time: computation is nearly 80 times faster than the conventional serial algorithm. Combined with the hard-threshold collaborative filtering kernel function and the Wiener collaborative filtering kernel function, the method not only removes noise from the image effectively but also facilitates real-time noise reduction of large CT images.

More specifically, Fig. 6 is the noisy original CT image of the head phantom used by the present invention, in which the scanned region is the teeth and oral cavity of a human body, and Fig. 7 is the head phantom CT image after denoising with the method of the present invention. Fig. 8 is the heavily noisy original CT image of the body phantom used by the present invention, and Fig. 9 is the body phantom CT image after denoising with the method of the present invention. Comparing Figs. 6 and 7 and Figs. 8 and 9 shows that, visually, the method effectively removes the noise in the original images, the overall clarity of the images is good, and edge detail is well preserved. In terms of processing speed, the original serial algorithm needs 1.5 hours to process a CT image of size 2048*2048, whereas the method of the present invention needs only 69 seconds for the same image, a speedup of nearly 80 times.

The above is only a preferred embodiment of the present invention. Those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application, and the content of this specification should not be construed as limiting the invention.
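The reported speedup follows directly from the two timings in the paragraph above; the short calculation below only restates that arithmetic.

```python
serial_seconds = 1.5 * 3600      # 1.5 hours reported for the serial algorithm
gpu_seconds = 69                 # time reported for the GPU method
speedup = serial_seconds / gpu_seconds
print(f"speedup = {speedup:.1f}x")   # prints: speedup = 78.3x
```

A ratio of about 78 is consistent with the "nearly 80 times" stated in the text.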

Claims

1. A three-dimensional block-matching noise reduction method based on GPU parallel acceleration, characterized in that it comprises:

Performing boundary symmetric extension preprocessing on the image to be processed on the CPU side;

Sending the preprocessed image data from the CPU host to the global memory of the GPU;

Creating a multithreaded grid, and accelerating in parallel the matching and grouping of similar image blocks by combining coalesced global memory access with repeated cyclic reuse of shared memory;

Obtaining first-step noise reduction estimate data of the three-dimensional similarity matrix using a parallel-accelerated hard-threshold collaborative filtering kernel function;

Taking the first-step noise reduction estimate data as a reference, obtaining second-step noise reduction estimate data using a parallel-accelerated Wiener collaborative filtering kernel function;

Sending the second-step noise reduction estimate data from the GPU to the CPU host, and removing the extended boundary pixels to obtain the denoised image.

2. The three-dimensional block-matching noise reduction method based on GPU parallel acceleration according to claim 1, characterized in that creating the multithreaded grid comprises:

Assigning the image block matching process of each reference block in the image to one thread block (block), and assigning the similarity matching between each candidate image block in the search window and the reference block to one thread (thread);

Selecting reference image blocks at a fixed pixel step, advancing successively along the row and column directions, determining the size of the thread grid (grid) from the number of reference image blocks in the image, and determining the size of the thread block from the number of image blocks in the search window of a reference image block.

3. The three-dimensional block matching noise reduction method based on GPU parallel acceleration according to claim 1, characterized in that using coalesced global memory access and the acceleration strategy of repeatedly reusing shared memory to perform parallel accelerated processing of the matching and grouping of similar image blocks comprises:

All threads in the same warp execute the same instruction and access contiguous units in global memory, so that the accesses are coalesced;

Divide the search window into 4 blocks of size 32*32, and in each block perform the similarity calculation using a thread block of size block(16,16) to obtain the similar blocks by comparing the distance d with the threshold τ_threod, where d is the distance between image blocks, defined as the modulus of the difference between the corresponding elements of the two image blocks divided by the size of the image block, and τ_threod is the selected distance threshold;

Load the pixel data in the search window into shared memory block by block, reusing the shared memory cyclically, with threadIdx.x<16 and threadIdx.y<16;

Use a parallel minimum-reduction strategy to find a specified number of the most similar image blocks.
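As a non-limiting illustration of the similarity calculation in claim 3, the following sequential Python sketch computes the distance d for every candidate in a search window and keeps the candidates that pass the threshold. The helper names are hypothetical. Two points are assumptions of the sketch rather than statements of the claim: the "modulus" is taken as the squared Euclidean norm, and a candidate is kept when d is less than or equal to the threshold. The loop body corresponds to the work one GPU thread would perform.

```python
def get_block(img, y, x, size):
    """Extract the size*size block whose top-left corner is (y, x)."""
    return [row[x:x + size] for row in img[y:y + size]]


def block_distance(a, b):
    """Squared difference of corresponding elements, divided by the block size."""
    n = len(a) * len(a[0])
    total = sum((p - q) ** 2
                for row_a, row_b in zip(a, b)
                for p, q in zip(row_a, row_b))
    return total / n


def similar_blocks(img, ref_y, ref_x, size, radius, tau):
    """Return (d, y, x) for every candidate in the window with d <= tau."""
    ref = get_block(img, ref_y, ref_x, size)
    h, w = len(img), len(img[0])
    matches = []
    # Clamp the window so every candidate block lies inside the image.
    for y in range(max(0, ref_y - radius), min(h - size, ref_y + radius) + 1):
        for x in range(max(0, ref_x - radius), min(w - size, ref_x + radius) + 1):
            d = block_distance(ref, get_block(img, y, x, size))
            if d <= tau:
                matches.append((d, y, x))
    return matches


img = [[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 9, 9],
       [0, 0, 9, 9]]
print(similar_blocks(img, 0, 0, size=2, radius=2, tau=0.0))
```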

4. The three-dimensional block matching noise reduction method based on GPU parallel acceleration according to claim 3, characterized in that using the parallel minimum-reduction strategy to find the specified number of the most similar image blocks comprises:

Sort the similar image blocks of the reference image block into a three-dimensional matrix in ascending order of similarity distance;

Launch n threads, one for each of the n distance values D[n] obtained by the similarity calculation;

Compare the distance value of the i-th thread with the distance value of the (i+n/2)-th thread, and place the smaller value in the left part and the larger value in the right part, so that the left interval is D[0] to D[n/2] and the right interval is D[n/2] to D[n];

After the threads complete the parallel comparison, halve the number of comparing threads and repeat the above comparison on the distance values in the left interval; continue halving and comparing until the left interval is reduced to D[0];

Take D[0] as the minimum of the distance values, advance the starting access position of D[n] by one element, and repeat the above steps to find the next minimum, until the specified number of similar image blocks with the smallest distances have been found.
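As a non-limiting illustration of the reduction in claim 4, the following Python sketch emulates the compare-and-swap rounds sequentially. The names `min_to_front` and `k_smallest` are hypothetical. Each iteration of the inner loop stands for one GPU thread, and rounding the half size up for odd lengths is an assumption added here so that the sketch works for any n, not only powers of two.

```python
def min_to_front(d, start):
    """Move the smallest entry of d[start:] to d[start] by halving rounds."""
    active = len(d) - start
    while active > 1:
        half = (active + 1) // 2
        # On a GPU these comparisons would run in parallel, one per thread.
        for i in range(active - half):
            lo, hi = start + i, start + i + half
            if d[hi] < d[lo]:
                d[lo], d[hi] = d[hi], d[lo]  # smaller value goes to the left part
        active = half  # only the left part is compared in the next round


def k_smallest(distances, k):
    """Return the k smallest (distance, block_index) pairs in ascending order."""
    d = [(value, index) for index, value in enumerate(distances)]
    k = min(k, len(d))
    for start in range(k):
        # Advance the starting position by one element after each minimum.
        min_to_front(d, start)
    return d[:k]


print(k_smallest([5.0, 1.0, 4.0, 0.5, 3.0], 3))
# -> [(0.5, 3), (1.0, 1), (3.0, 4)]
```

Because values are swapped rather than overwritten, the remaining distances are preserved for the next pass, which is what allows the starting position to be advanced and the reduction repeated.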

5. The three-dimensional block matching noise reduction method based on GPU parallel acceleration according to claim 1, characterized in that obtaining the first step noise reduction estimation data of the three-dimensional similarity matrix by adopting the parallel acceleration strategy of the hard threshold collaborative filtering kernel function comprises:

Apply instruction-mix optimization for acceleration, integrating the three-dimensional forward transform, hard threshold filtering, three-dimensional inverse transform and weighted estimation of the three-dimensional similarity matrix into the hard threshold collaborative filtering kernel function; wherein the three-dimensional forward transform comprises sequentially performing a two-dimensional biorthogonal spline wavelet forward transform and a one-dimensional Walsh-Hadamard transform; and the three-dimensional inverse transform comprises sequentially performing a one-dimensional Walsh-Hadamard transform and a two-dimensional biorthogonal spline wavelet inverse transform;

According to the size of the reference image block, select a certain number of similar image blocks, keep the thread grid (grid) unchanged, and set the size of the thread block (block);

The selected similar image block data is sent from global memory to shared memory using coalesced global memory access to form the three-dimensional similarity matrix;

Perform the two-dimensional biorthogonal spline wavelet forward transform and the one-dimensional Walsh-Hadamard transform on the three-dimensional matrix, perform hard threshold filtering in the transform domain, and then apply the one-dimensional Walsh-Hadamard transform and the two-dimensional biorthogonal spline wavelet inverse transform to the hard-threshold-filtered data to obtain the first step noise reduction estimation data of the image block;

Perform a weighted average of the grayscale values in the image block after hard threshold filtering, assign the weighted average value to a single pixel of the image block, and apply Kaiser window coefficients in the weighted average to optimize the weighting, so as to obtain the first step image noise reduction value;

At the same time, the four filter coefficients lpd, hpd, lpr, hpr of the two-dimensional biorthogonal spline wavelet transform of the three-dimensional similarity matrix are stored in constant memory, and registers are used to define private variables in each thread to store intermediate results.
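As a non-limiting illustration of the transform-domain filtering in claim 5, the following Python sketch applies only the one-dimensional Walsh-Hadamard transform along the stacking dimension, zeroes small coefficients, and inverts the transform. The two-dimensional biorthogonal spline wavelet stage, the weighting and the Kaiser window are omitted. The function names and the threshold `thr` are hypothetical, and the sketch assumes that the number of stacked blocks is a power of two.

```python
def wht(v):
    """Unnormalised fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    h = 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b  # butterfly step
        h *= 2
    return out


def hard_threshold_stack(values, thr):
    """Filter one pixel position across the stack of similar blocks."""
    coeffs = wht(values)
    # Hard thresholding: keep a coefficient only if its magnitude exceeds thr.
    kept = [c if abs(c) > thr else 0.0 for c in coeffs]
    # The transform is its own inverse up to a factor of 1/len(values).
    return [c / len(values) for c in wht(kept)]


print(wht([1, 1, 1, 1]))  # -> [4, 0, 0, 0]
print(hard_threshold_stack([10.0, 10.2, 9.8, 10.0], thr=1.0))
```

In the second call only the sum coefficient survives the threshold, so the four slightly different values are replaced by their common mean.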

6. The three-dimensional block matching noise reduction method based on GPU parallel acceleration according to claim 1, characterized in that obtaining the second step noise reduction estimation data by taking the first step noise reduction estimation data as a reference and adopting the parallel acceleration strategy of the Wiener collaborative filtering kernel function comprises:

Apply instruction-mix optimization for acceleration, integrating the three-dimensional forward transform, Wiener filtering, three-dimensional inverse transform and weighted estimation of the three-dimensional similarity matrix into the Wiener collaborative filtering kernel function; wherein the three-dimensional forward transform comprises sequentially performing a two-dimensional discrete cosine transform and a one-dimensional Walsh-Hadamard transform; and the three-dimensional inverse transform comprises sequentially performing a one-dimensional Walsh-Hadamard transform and a two-dimensional inverse discrete cosine transform;

Taking the first step noise reduction estimation data as a reference, perform the two-dimensional discrete cosine transform and the one-dimensional Walsh-Hadamard transform on the original three-dimensional similarity matrix; perform Wiener filtering; and apply the one-dimensional Walsh-Hadamard transform and the two-dimensional inverse discrete cosine transform to the Wiener-filtered three-dimensional similarity matrix to obtain the second step noise reduction estimation data;

Perform a weighted average of the grayscale values in the image block after Wiener filtering, assign the weighted average value to a single pixel of the image block, and apply Kaiser window coefficients in the weighted average to optimize the weighting, so as to obtain the second step image noise reduction value;

At the same time, the two-dimensional discrete cosine transform coefficients are stored in private variables defined in each thread using registers.
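As a non-limiting illustration of how the first step estimate can serve as the reference in claim 6, the following Python sketch applies the empirical Wiener shrinkage commonly used in BM3D descriptions, where each coefficient of the noisy data is scaled by b²/(b² + σ²) and b is the corresponding coefficient of the first step estimate. The claim does not state this formula, so its use here is an assumption, as are the function name `wiener_shrink` and the noise standard deviation `sigma`. Both inputs are assumed to be already in the transform domain.

```python
def wiener_shrink(noisy_coeffs, basic_coeffs, sigma):
    """Scale noisy transform coefficients using the first step estimate."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    out = []
    for y, b in zip(noisy_coeffs, basic_coeffs):
        # Weight is close to 1 where the reference has high energy and
        # close to 0 where the reference coefficient is negligible.
        w = (b * b) / (b * b + sigma * sigma)
        out.append(w * y)
    return out


print(wiener_shrink([8.0, 2.0], [6.0, 0.0], sigma=2.0))
```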

7. The three-dimensional block matching noise reduction method based on GPU parallel acceleration according to claim 1, characterized in that, during the boundary symmetric extension preprocessing, the boundary column pixels on the left and right sides are first symmetrically extended, then the boundary row pixels on the upper and lower sides are symmetrically extended, and the width of the boundary extension in pixels is determined by the radius of the search window.
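As a non-limiting illustration of the preprocessing in claim 7 and of the final boundary removal in claim 1, the following Python sketch mirrors the columns first and the rows second, then crops the extension again. The function names are hypothetical, and the sketch assumes that the mirror includes the edge pixel itself; the claim does not say whether the edge pixel is repeated.

```python
def pad_symmetric(img, r):
    """Mirror r pixels on each side: left/right columns first, then top/bottom rows."""
    if r < 1 or r > len(img) or r > len(img[0]):
        raise ValueError("r must be between 1 and the image dimensions")
    # Step 1: extend every row with mirrored left and right columns.
    rows = [row[:r][::-1] + list(row) + row[-r:][::-1] for row in img]
    # Step 2: extend the widened image with mirrored top and bottom rows.
    top = [list(x) for x in rows[:r][::-1]]
    bottom = [list(x) for x in rows[-r:][::-1]]
    return top + rows + bottom


def crop(img, r):
    """Remove the r-pixel extension added by pad_symmetric (r >= 1)."""
    return [row[r:-r] for row in img[r:-r]]


padded = pad_symmetric([[1, 2], [3, 4]], 1)
print(padded)           # -> [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
print(crop(padded, 1))  # -> [[1, 2], [3, 4]]
```

With `r` set to the search window radius, every reference block near the original border has a full search window inside the extended image.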