CN109344818B - A salient object detection method in light field based on deep convolutional network - Google Patents
A salient object detection method in light field based on deep convolutional network
- Publication number
CN109344818B (application CN201811141315.2A)
- Authority
- CN
- China
- Prior art keywords
- light field
- layer
- image
- neural network
- field data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/10—Image acquisition
- G06V10/12—Details of acquisition arrangements; Constructional details thereof
- G06V10/14—Optical characteristics of the device performing the acquisition or on the illumination arrangements
- G06V10/145—Illumination specially adapted for pattern recognition, e.g. using gratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a light field salient object detection method based on a deep convolutional network, which comprises the following steps: 1, converting light field data obtained with a light field acquisition device into sub-aperture images of all viewing angles; 2, recombining the sub-aperture images at different viewing angles into a microlens image; 3, performing data enhancement on the microlens image; 4, building a salient object detection model that takes the microlens image as input, on the basis of the pre-trained weights of the DeepLab-V2 network, and training it with a data set; and 5, performing salient object detection on the light field data to be processed by using the trained salient object detection model. The method can effectively improve the accuracy of salient object detection for complex scene images.
Description
Technical Field
The invention belongs to the fields of computer vision and image processing and analysis, and particularly relates to a light field salient object detection method based on a deep convolutional network.
Background
Salient object detection mimics a perceptual capability of the human visual system. When observing an image, the visual system rapidly locates the regions and objects of interest; the process of locating them is salient object detection. With the development of computer technology and the internet and the popularization of mobile intelligent devices, the number of images people acquire from the outside world has grown explosively. Salient object detection selects a small part of the large amount of input visual information for subsequent complex processing, such as object detection and recognition, image retrieval and image segmentation, thereby effectively reducing the computation load of the visual system. At present, salient object detection has become one of the research hot spots in the field of computer vision.
Current salient object detection methods can be classified into three categories according to the image data available: two-dimensional salient object detection, three-dimensional salient object detection, and light field salient object detection.
Two-dimensional salient object detection methods acquire a two-dimensional image with a conventional camera and, using either traditional or learning-based approaches, extract and fuse features such as color, brightness, position and texture within a local or global contrast framework to distinguish salient from non-salient regions.
Three-dimensional salient object detection methods use a two-dimensional image together with the depth information of the scene. The depth information, acquired by a three-dimensional sensor, reflects the distance between an object and the observer and also plays an important role in the human visual system. Using depth for salient object detection compensates for the shortcomings of the traditional two-dimensional image: the final saliency map is obtained by exploiting the complementarity of color and depth, which improves detection accuracy to a certain extent.
Light field salient object detection methods process light field data acquired by a light field camera. As a new computational imaging technique, light field imaging records both the position and the viewing-angle information of light rays in a scene with a single exposure, and the acquired light field information reflects the geometry and reflectance characteristics of the natural scene. Existing methods improve detection performance in challenging scenes by fusing the salient features of different light field data.
Although salient object detection methods with excellent performance have appeared in the field of computer vision, they still have the following shortcomings:
1. In two-dimensional salient object detection, the two-dimensional image is the integral of the light projected onto the camera sensor and contains only the light intensity in a specific direction, so detection is overly sensitive to high-frequency content or noise and is easily affected by factors such as similar color and texture of foreground and background and cluttered backgrounds.
2. In three-dimensional salient object detection, the accuracy of the scene depth information depends on the depth camera, and existing depth cameras suffer from low resolution, narrow measurement range, high noise, inability to measure transmissive materials, and susceptibility to interference from sunlight and reflections off smooth surfaces.
3. In three-dimensional salient object detection, features such as color, depth and position are processed and fused independently, without fully considering their complementarity.
4. Most salient object detection methods based on two-dimensional and three-dimensional images rely on assumptions such as the object differing clearly from the background and the background being simple; as image data grows in scale and image content becomes more complex, these methods show clear limitations.
5. In light field salient object detection, research on using light field data for saliency is only just starting, the currently available data sets are few, and their image quality is poor. Existing methods that detect salient objects from light field data rely on traditional hand-crafted saliency computation and model multiple cues such as color, depth and refocusing separately, so they suffer from insufficient feature expression and a lack of robustness.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a light field salient object detection method based on a deep convolutional network, so that the spatial information and the viewing-angle information of the light field data can be fully exploited and the accuracy of salient object detection in complex scene images can be effectively improved.
The invention adopts the following technical scheme for solving the technical problems:
The invention relates to a light field salient object detection method based on a deep convolutional network, which is characterized by comprising the following steps:
Step 1.1, acquiring a light field file with a light field device and decoding it to obtain a light field data set, denoted L = (L_1, L_2, …, L_d, …, L_D), wherein L_d represents the d-th light field data and is written L_d(u, v, s, t); u and v represent any horizontal pixel and vertical pixel in the spatial information, and s and t represent any horizontal viewing angle and vertical viewing angle in the viewing-angle information; d ∈ [1, D], and D represents the total number of light field data;
Step 1.2, fixing a horizontal viewing angle s and a vertical viewing angle t, and traversing all horizontal and vertical pixels of the d-th light field data L_d(u, v, s, t) to obtain the sub-aperture image at the viewing angle in the s-th row and t-th column of L_d(u, v, s, t); its height and width are denoted V and U respectively, and v ∈ [1, V], u ∈ [1, U];
Step 1.3, traversing all horizontal viewing angles and vertical viewing angles of the d-th light field data L_d(u, v, s, t) to obtain the sub-aperture image set N_d under all viewing angles, wherein s ∈ [1, S], t ∈ [1, T]; S represents the row of the maximum horizontal viewing angle, and T represents the column of the maximum vertical viewing angle;
Step 1.4, defining the number of selected viewing angles as m × m, and using formula (1) to select, from the sub-aperture image set N_d under all viewing angles, the d-th image set M_d centered on the central viewing angle:
Step 1.5, according to x = (v − 1) × m + t and y = (u − 1) × m + s, obtaining the pixel I_d(x, y) in the x-th row and y-th column of the d-th microlens image I_d, thereby obtaining the d-th microlens image I_d with height H and width W, wherein x ∈ [1, H], y ∈ [1, W], H = V × m, W = U × m;
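The recombination in steps 1.2 to 1.5 can be illustrated with a short NumPy sketch. It assumes the decoded light field L_d is already available as an array of shape (S, T, V, U, C) holding the sub-aperture images of all viewing angles; the array layout, function name and 0-based indexing are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def build_microlens_image(L_d, m):
    """Recombine a decoded 4D light field into a microlens image (steps 1.2-1.5).

    L_d : ndarray of shape (S, T, V, U, C) -- sub-aperture images of all
          S x T viewing angles, each of spatial size V x U (assumed layout).
    m   : number of selected viewing angles per axis; the central m x m
          views are kept (step 1.4).
    """
    S, T, V, U, C = L_d.shape
    # Step 1.4: keep the m x m block of views centred on the central view.
    s0, t0 = (S - m) // 2, (T - m) // 2
    M_d = L_d[s0:s0 + m, t0:t0 + m]                  # (m, m, V, U, C)

    # Step 1.5 (0-based form of x = (v-1)*m + t, y = (u-1)*m + s):
    # every spatial position (v, u) expands into an m x m block of views.
    H, W = V * m, U * m
    I_d = np.zeros((H, W, C), dtype=L_d.dtype)
    for s in range(m):
        for t in range(m):
            I_d[t::m, s::m] = M_d[s, t]              # interleave the views
    return I_d
```

With m = 9 and sub-aperture images of 375 × 540 pixels as in the embodiment below, this yields a 3375 × 4860 microlens image.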
Step 3, performing data enhancement on the d-th microlens image I_d to obtain the d-th enhanced microlens image set I'_d; performing geometric transformation on the d-th real saliency map G_d to obtain the d-th transformed real saliency map set G'_d;
Step 4, repeating steps 1.2 to 3 to obtain the D enhanced microlens image sets in the light field data set L, denoted I' = (I'_1, I'_2, …, I'_d, …, I'_D), and the D transformed real saliency map sets, denoted G' = (G'_1, G'_2, …, G'_d, …, G'_D);
Step 5, constructing the salient object detection model of the d-th light field data L_d(u, v, s, t);
Step 5.1, acquiring a c-layer DeepLab-V2 convolutional neural network, the DeepLab-V2 convolutional neural network comprising convolutional layers, pooling layers and dropout (discarding) layers;
Step 5.2, modifying the c-layer DeepLab-V2 convolutional neural network to obtain a modified LFnet convolutional neural network;
Step 5.2.1, adding, before the first layer of the DeepLab-V2 convolutional neural network, a convolutional layer LF_conv1_1 with a convolution kernel of size m × m and a ReLU activation function LF_ReLU1_1;
setting the moving stride of the convolution kernel to m when the convolutional layer LF_conv1_1 performs the convolution operation;
the mathematical expression of the ReLU activation function LF_ReLU1_1 is φ(a) = max(0, a), wherein a represents the output of the convolutional layer LF_conv1_1 and is the input to the ReLU activation function LF_ReLU1_1, and φ(a) represents the output of the ReLU activation function LF_ReLU1_1;
Step 5.2.2, adding a dropout (discarding) layer after every other convolutional layer of the DeepLab-V2 convolutional neural network, except the convolutional layer LF_conv1_1 and those convolutional layers in the DeepLab-V2 convolutional neural network that are already connected to a dropout layer;
Step 5.2.3, setting the number of output channels of the (c − 1)-th layer of the DeepLab-V2 convolutional neural network to b, wherein b is the number of pixel classes;
Step 5.2.4, adding an upsampling layer after the c-th layer of the DeepLab-V2 convolutional neural network, and using the upsampling layer to upsample the feature map F_d(q, r, b) output by the c-th layer of the DeepLab-V2 convolutional neural network to obtain the upsampled feature map F'_d(q, r, b); wherein q, r and b represent the width, height and number of channels of the feature map F_d(q, r, b) respectively;
Step 5.2.5, adding a crop (shear) layer after the upsampling layer and, according to the height V and width U of the d-th real saliency map G_d, using the crop layer to crop the feature map F'_d(q, r, b) to obtain the pixel-class prediction probability map F''_d(q, r, b) of the microlens image I_d;
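A minimal PyTorch sketch of the modifications in steps 5.2.1 to 5.2.5 is given below. The `backbone` stands for the c-layer DeepLab-V2 network (with the extra dropout layers of step 5.2.2 assumed to be inserted inside it and its last layer assumed to output b = 2 channels); the class name, channel counts and the ×8 upsampling factor are illustrative assumptions, not values stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LFNet(nn.Module):
    """Sketch of the modifications of steps 5.2.1-5.2.5 around a DeepLab-V2 backbone."""

    def __init__(self, backbone, m=9, in_ch=3, mid_ch=3):
        super().__init__()
        # Step 5.2.1: m x m convolution with stride m fuses the m x m viewing
        # angles interleaved around every spatial position of the microlens image.
        self.lf_conv1_1 = nn.Conv2d(in_ch, mid_ch, kernel_size=m, stride=m)
        self.lf_relu1_1 = nn.ReLU(inplace=True)        # phi(a) = max(0, a)
        # Steps 5.2.2/5.2.3: the backbone is assumed to already contain the extra
        # dropout layers and to output b = 2 channels from its (c-1)-th layer.
        self.backbone = backbone

    def forward(self, x, out_size):
        a = self.lf_relu1_1(self.lf_conv1_1(x))        # (B, mid_ch, V, U)
        f = self.backbone(a)                           # F_d(q, r, b)
        # Step 5.2.4: upsample (the x8 factor is an assumption for DeepLab-V2).
        f_up = F.interpolate(f, scale_factor=8, mode='bilinear',
                             align_corners=False)      # F'_d(q, r, b)
        # Step 5.2.5: crop to the height V and width U of the real saliency map.
        return f_up[:, :, :out_size[0], :out_size[1]]  # F''_d(q, r, b)
```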
Step 5.3, taking the enhanced microlens image set I' as the input of the LFnet convolutional neural network and the transformed real saliency map set G' as the labels, using a cross-entropy loss function, and training the LFnet convolutional neural network with a gradient descent algorithm, thereby obtaining the salient object detection model of the light field data; salient object detection on light field data is then realized with the salient object detection model.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention uses a second-generation light field camera to collect light field data of complex and varied scenes, which include difficulties such as salient objects of various sizes, various light sources, similarity between the salient objects and the background, and cluttered backgrounds; this fully remedies the shortage of current light field saliency data in both quantity and difficulty and improves its quality.
2. The method exploits the powerful capability of deep convolutional networks in image processing to extract image features, fuses the spatial information and viewing-angle information of the light field data, captures the context of the microlens image with an atrous ('hole') pyramid network, and detects the salient objects in the image scene; this overcomes the inability of current two-dimensional and three-dimensional salient object detection methods to use viewing-angle information, and improves the precision and robustness of salient object detection in complex scenes.
3. The multi-view information in the microlens image reflects the spatial geometry of the scene. Feeding the microlens image directly into the convolutional neural network to perform salient object detection overcomes the separate handling of depth and color information in current light field salient object detection methods; depth perception and visual saliency are considered jointly, the complementarity of depth and color is exploited effectively, and the accuracy of salient object detection is improved.
Drawings
FIG. 1 is a flow chart of the salient object detection method of the present invention;
FIG. 2 is a sub-aperture image obtained by the method of the present invention;
FIG. 3 is a microlens image obtained by the method of the present invention;
FIG. 4 is a partial scene and a true saliency map of a data set acquired by the method of the present invention;
FIG. 5 is a detailed process diagram of the microlens image input network model according to the method of the present invention;
FIG. 6 is a diagram of the Deeplab-V2 model used in the method of the present invention;
FIG. 7 is a comparison graph of detection results of some salient objects obtained by the method of the present invention and other light field salient object detection methods on a data set collected by a second generation light field camera;
FIG. 8 is a quantitative comparison, on the data set acquired with the second-generation light field camera, between the method of the present invention and other current light field saliency extraction methods, using the recall/precision curve as the metric.
Detailed Description
In this embodiment, a light field salient object detection method based on a deep convolutional network, shown in FIG. 1, is performed according to the following steps:
Step 1.1, acquiring a light field file with a light field device and decoding it to obtain a light field data set, denoted L = (L_1, L_2, …, L_d, …, L_D), wherein L_d represents the d-th light field data and is written L_d(u, v, s, t); u and v represent any horizontal pixel and vertical pixel in the spatial information, and s and t represent any horizontal viewing angle and vertical viewing angle in the viewing-angle information; d ∈ [1, D], and D represents the total number of light field data;
In this embodiment, a second-generation light field camera is used to acquire the light field file, which is decoded with the Lytro Power Tools (beta) to obtain the light field data L_d(u, v, s, t). The light field data L_d(u, v, s, t) is expressed with the two-plane parameterization: in the four-dimensional (u, v, s, t) coordinate space, each light ray corresponds to one sample point of the light field; the (u, v) plane is the spatial information plane and the (s, t) plane is the viewing-angle information plane. In the experiments of the invention, 640 light field data were acquired and divided evenly into 5 folds; one fold is selected in turn as the test set and the other 4 folds are used as the training set. D in step 1.1 represents the size of the training data set, D = 512;
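The five-fold rotation of training and test sets described above can be sketched as follows; the shuffling and the generator interface are assumptions, and only the 640/5 split and the rotation of one fold into the test role come from the text.

```python
import numpy as np

def five_fold_splits(num_samples=640, num_folds=5, seed=0):
    """Yield (train, test) index arrays, rotating each fold through the test role."""
    rng = np.random.default_rng(seed)                # the shuffle is an assumption
    idx = rng.permutation(num_samples)
    folds = np.array_split(idx, num_folds)           # 5 folds of 128 light fields
    for k in range(num_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(num_folds) if j != k])
        yield train, test                             # 512 training / 128 test
```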
Step 1.2, fixing a horizontal viewing angle s and a vertical viewing angle t, and traversing all horizontal and vertical pixels of the d-th light field data L_d(u, v, s, t) to obtain the sub-aperture image at the viewing angle in the s-th row and t-th column of L_d(u, v, s, t); its height and width are denoted V and U respectively, and v ∈ [1, V], u ∈ [1, U]; in this experiment, V = 375, U = 540;
Step 1.3, traversing all horizontal viewing angles and vertical viewing angles of the d-th light field data L_d(u, v, s, t) to obtain the sub-aperture image set N_d under all viewing angles, wherein s ∈ [1, S], t ∈ [1, T]; S represents the row of the maximum horizontal viewing angle, and T represents the column of the maximum vertical viewing angle; in this embodiment, S = 14, T = 14. As shown in FIG. 2, the left image in FIG. 2 is the set of sub-aperture images of all viewing angles, and the right image in FIG. 2 is the sub-aperture image at the viewing angle in row 6, column 11.
Step 1.4, defining the number of selected viewing angles as m × m, and using formula (1) to select, from the sub-aperture image set N_d under all viewing angles, the d-th image set M_d centered on the central viewing angle; in this implementation, m = 9, so 81 view images are selected in total. Experiments show that more viewing angles provide more information and can further improve the performance of the salient object detection model; however, more viewing angles consume a large amount of storage and computation time and increase the difficulty of the experiments;
Step 1.5, according to x = (v − 1) × m + t and y = (u − 1) × m + s, obtaining the pixel I_d(x, y) in the x-th row and y-th column of the d-th microlens image I_d, thereby obtaining the d-th microlens image I_d with height H and width W, as shown in FIG. 3, wherein x ∈ [1, H], y ∈ [1, W], H = V × m, W = U × m; in this embodiment, H = 3375, W = 4860. The left image in FIG. 3 is the microlens image I_d and the right image in FIG. 3 is a partial enlargement of I_d; in the enlarged part, all pixels within one grid cell form the set of pixels that share the same spatial information but have different viewing-angle information.
Step 3, performing data enhancement on the d-th microlens image I_d to obtain the d-th enhanced microlens image set I'_d; performing geometric transformation on the d-th real saliency map G_d to obtain the d-th transformed real saliency map set G'_d. In this embodiment, the data enhancement of the d-th microlens image I_d is realized by rotation, flipping, increasing the chroma, increasing the contrast, increasing the brightness, decreasing the brightness and adding Gaussian noise; the data enhancement improves the generalization ability of the salient object detection model.
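A minimal sketch of the listed enhancements, assuming 8-bit RGB microlens images stored as NumPy arrays; the concrete factors (brightness ±20 %, contrast ×1.3, noise σ = 8) and the omission of the chroma adjustment are illustrative assumptions, and the geometric transforms (rotation, flip) would be applied identically to the real saliency map G_d.

```python
import numpy as np

def augment_microlens_image(img, rng):
    """Return a list of enhanced versions of an 8-bit RGB microlens image."""
    out = [img]
    out.append(np.rot90(img, k=1, axes=(0, 1)))              # rotation
    out.append(img[:, ::-1])                                 # horizontal flip
    out.append(np.clip(img * 1.2, 0, 255))                   # increase brightness
    out.append(np.clip(img * 0.8, 0, 255))                   # decrease brightness
    mean = img.mean(axis=(0, 1), keepdims=True)
    out.append(np.clip((img - mean) * 1.3 + mean, 0, 255))   # increase contrast
    out.append(np.clip(img + rng.normal(0.0, 8.0, img.shape), 0, 255))  # Gaussian noise
    return [a.astype(img.dtype) for a in out]
```

A call such as `augment_microlens_image(I_d, np.random.default_rng(0))` would yield the original image plus six enhanced copies.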
Step 4, repeating steps 1.2 to 3 to obtain the D enhanced microlens image sets in the light field data set L, denoted I' = (I'_1, I'_2, …, I'_d, …, I'_D), and the D transformed real saliency map sets, denoted G' = (G'_1, G'_2, …, G'_d, …, G'_D);
Step 5, constructing the salient object detection model of the d-th light field data L_d(u, v, s, t);
Step 5.1, acquiring a c-layer DeepLab-V2 convolutional neural network, which consists of 16 convolutional layers, 5 pooling layers, 2 dropout (discarding) layers and 1 merging layer and is used for semantic segmentation; its detailed structure is shown in FIG. 6. The atrous ('hole') pyramid structure contained in the DeepLab-V2 convolutional neural network captures the context of the image at multiple scales, enabling salient object detection at multiple scales.
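The 'hole' pyramid mentioned above can be sketched as parallel dilated convolutions whose outputs are summed; the dilation rates 6, 12, 18 and 24 follow the published DeepLab-V2 design and are an assumption here, not values given in the patent.

```python
import torch.nn as nn

class ASPPHead(nn.Module):
    """Parallel dilated ('hole') convolutions whose outputs are summed."""

    def __init__(self, in_ch, num_classes=2, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, num_classes, kernel_size=3, padding=r, dilation=r)
            for r in rates)

    def forward(self, x):
        # Each branch sees a different receptive field; summing fuses the
        # multi-scale context before the prediction is upsampled and cropped.
        return sum(branch(x) for branch in self.branches)
```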
Step 5.2, modifying the c-layer DeepLab-V2 convolutional neural network to obtain a modified LFnet convolutional neural network, the detailed structure of which is shown in FIG. 5;
Step 5.2.1, adding, before the first layer of the DeepLab-V2 convolutional neural network, a convolutional layer LF_conv1_1 with a convolution kernel of size m × m and a ReLU activation function LF_ReLU1_1;
setting the moving stride of the convolution kernel to m when the convolutional layer LF_conv1_1 performs the convolution operation; in this implementation, m = 9. When the microlens image I_d is constructed in steps 1.4 and 1.5, the number of selected viewing angles is 9 × 9; so that the network can better extract and fuse the multi-view information, the convolution kernel size of the layer LF_conv1_1 is set to 9 × 9 and the stride to 9;
the mathematical expression of the ReLU activation function LF_ReLU1_1 is φ(a) = max(0, a), wherein a represents the output of the convolutional layer LF_conv1_1 and is the input to the ReLU activation function LF_ReLU1_1, and φ(a) represents the output of the ReLU activation function LF_ReLU1_1;
Step 5.2.2, adding a dropout (discarding) layer after every other convolutional layer of the DeepLab-V2 convolutional neural network, except the convolutional layer LF_conv1_1 and those convolutional layers in the DeepLab-V2 convolutional neural network that are already connected to a dropout layer; in this embodiment, adding the dropout layers effectively prevents overfitting and improves the generalization ability of the salient object detection model;
Step 5.2.3, setting the number of output channels of the (c − 1)-th layer of the DeepLab-V2 convolutional neural network to b, wherein b is the number of pixel classes; in this specific embodiment, c − 1 = 23 and b = 2, since the salient object detection model classifies pixels into the salient and non-salient classes.
Step 5.2.4, adding an upsampling layer after the c-th layer of the DeepLab-V2 convolutional neural network, and using the upsampling layer to upsample the feature map F_d(q, r, b) output by the c-th layer of the DeepLab-V2 convolutional neural network to obtain the upsampled feature map F'_d(q, r, b); wherein q, r and b represent the width, height and number of channels of the feature map F_d(q, r, b) respectively;
Step 5.2.5, adding a crop (shear) layer after the upsampling layer and, according to the height V and width U of the d-th real saliency map G_d, using the crop layer to crop the feature map F'_d(q, r, b) to obtain the pixel-class prediction probability map F''_d(q, r, b) of the microlens image I_d;
Step 5.3, taking the enhanced microlens image set I' as the input of the LFnet convolutional neural network and the transformed real saliency map set G' as the labels, using a cross-entropy loss function, and training the LFnet convolutional neural network with a gradient descent algorithm, thereby obtaining the salient object detection model of the light field data; salient object detection on light field data is then realized with the salient object detection model.
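Step 5.3 reduces to a standard pixel-wise classification training loop. The sketch below assumes the LFNet module from the earlier sketch and a data loader yielding (I'_d, G'_d) batches; the optimizer settings and epoch count are assumptions, since the patent only specifies a cross-entropy loss and a gradient descent algorithm.

```python
import torch
import torch.nn as nn

def train_lfnet(model, loader, epochs=20, lr=1e-3):
    """Train the sketched LFNet with pixel-wise cross-entropy and SGD."""
    criterion = nn.CrossEntropyLoss()                     # b = 2 classes per pixel
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:                     # I'_d batches, G'_d labels
            logits = model(images, labels.shape[-2:])     # F''_d(q, r, b)
            loss = criterion(logits, labels.long())       # labels are 0/1 per pixel
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```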
The test set is processed according to steps 1.1 to 2 to obtain the microlens images of the test set, which are input into the salient object detection model to obtain the pixel-class prediction probability map F''_test(q, r, b) of the test set; the saliency map F''_s is extracted using equation (2), wherein F''_test(q, r, 2) in equation (2) represents the values of the second channel of the probability map F''_test(q, r, b); the saliency map F''_s is normalized to obtain the final saliency map F_s.
F''_s = F''_test(q, r, 2)    (2)
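Equation (2) and the normalization can be written as a small sketch, assuming the predicted probability map is a NumPy array of shape (q, r, b) with the salient class in the second channel.

```python
import numpy as np

def extract_saliency(prob_map):
    """Apply equation (2) and normalize: prob_map has shape (q, r, b), b = 2."""
    s = prob_map[:, :, 1]                                # F''_s = F''_test(q, r, 2)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)    # final saliency map F_s
```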
In order to evaluate the performance of the salient object detection model obtained by the method more fairly, the training set and the test set are rotated in turn, and the average of the 5 test results is taken as the final index for evaluating the performance of the salient object detection model.
FIG. 7 is a qualitative comparison between the salient object detection method based on the deep convolutional network of the present invention and other current light field salient object detection methods, wherein Ours represents the salient object detection method based on the deep convolutional network of the present invention; Multi-cue represents a light field salient object detection method based on focus flow, view flow, depth and color; DILF represents a light field salient object detection method based on color, depth and background prior; WSC represents a light field salient object detection method based on sparse coding theory; and LFS represents a salient object detection method based on object and background modeling. All four comparison methods were tested on the real-scene data set collected with the second-generation light field camera used in the present invention.
Table 1 is a quantitative comparison, on the data set acquired with the second-generation light field camera, between the salient object detection method based on the deep convolutional network of the present invention and other current light field salient object detection methods, using the F-measure, WF-measure, average precision (AP) and mean absolute error (MAE) as metrics. The F-measure is a statistic derived from the recall/precision curve: the closer its value is to 1, the better the salient object detection. The WF-measure is a statistic derived from the weighted recall/precision curve: the closer to 1, the better. AP measures the average precision of the detection results: the closer to 1, the better. MAE measures the mean absolute difference between the detection results and the ground truth: the closer to 0, the better.
FIG. 8 is a quantitative comparison between the salient object detection method based on the deep convolutional network of the present invention and other current light field salient object detection methods, using the precision-recall (PR) curve as the metric; if one PR curve is completely enclosed by another, the latter performs better than the former.
TABLE 1
Salient object detection method | Ours | Multi-cue | DILF | WSC | LFS
---|---|---|---|---|---
F-measure | 0.8118 | 0.6649 | 0.6395 | 0.6452 | 0.6108
WF-measure | 0.7541 | 0.5420 | 0.4844 | 0.5946 | 0.3597
AP | 0.9124 | 0.6593 | 0.6922 | 0.5960 | 0.6193
MAE | 0.0551 | 0.1198 | 0.1390 | 0.1093 | 0.1698
As can be seen from the quantitative analysis in Table 1, the F-measure, WF-measure and AP obtained by the method of the present invention are higher than those of the other light field salient object detection methods, and its MAE is lower. As can be seen from the PR curves in FIG. 8, the recall/precision curve of the method of the present invention lies close to the upper-right corner and encloses the PR curves of all other methods; at the same recall, its probability of false detection is lower.
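For reference, the MAE and F-measure used in Table 1 could be computed as sketched below; the adaptive threshold (twice the mean saliency) and β² = 0.3 are common choices in the saliency-detection literature, assumed here rather than taken from the patent.

```python
import numpy as np

def mae(saliency, gt):
    """Mean absolute error between a [0, 1] saliency map and a binary ground truth."""
    return float(np.abs(saliency - gt).mean())

def f_measure(saliency, gt, beta2=0.3):
    """F-measure after binarizing with an adaptive threshold (2 x mean saliency)."""
    thr = min(2.0 * float(saliency.mean()), 1.0)
    pred = saliency >= thr
    tp = np.logical_and(pred, gt > 0.5).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8))
```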
Claims (1)
1. A light field salient object detection method based on a deep convolutional network is characterized by comprising the following steps:
Step 1, obtaining a microlens image I_d;
Step 1.1, acquiring a light field file with a light field device and decoding it to obtain a light field data set, denoted L = (L_1, L_2, …, L_d, …, L_D), wherein L_d represents the d-th light field data and is written L_d(u, v, s, t); u and v represent any horizontal pixel and vertical pixel in the spatial information, and s and t represent any horizontal viewing angle and vertical viewing angle in the viewing-angle information; d ∈ [1, D], and D represents the total number of light field data;
Step 1.2, fixing a horizontal viewing angle s and a vertical viewing angle t, and traversing all horizontal and vertical pixels of the d-th light field data L_d(u, v, s, t) to obtain the sub-aperture image at the viewing angle in the s-th row and t-th column of L_d(u, v, s, t); its height and width are denoted V and U respectively, and v ∈ [1, V], u ∈ [1, U];
Step 1.3, traversing all horizontal viewing angles and vertical viewing angles of the d-th light field data L_d(u, v, s, t) to obtain the sub-aperture image set N_d under all viewing angles, wherein s ∈ [1, S], t ∈ [1, T]; S represents the row of the maximum horizontal viewing angle, and T represents the column of the maximum vertical viewing angle;
Step 1.4, defining the number of selected viewing angles as m × m, and using formula (1) to select, from the sub-aperture image set N_d under all viewing angles, the d-th image set M_d centered on the central viewing angle:
Step 1.5, according to x = (v − 1) × m + t and y = (u − 1) × m + s, obtaining the pixel I_d(x, y) in the x-th row and y-th column of the d-th microlens image I_d, thereby obtaining the d-th microlens image I_d with height H and width W, wherein x ∈ [1, H], y ∈ [1, W], H = V × m, W = U × m;
Step 2, selecting, from the d-th image set M_d, the sub-aperture image of the central viewing angle, recorded as the d-th central-view sub-aperture image; marking the salient region of the d-th central-view sub-aperture image, setting the pixels of the salient region to 1 and the pixels of the non-salient region to 0, thereby obtaining the real saliency map G_d of the d-th microlens image I_d, the height and width of the d-th real saliency map G_d being V and U respectively;
Step 3, performing data enhancement on the d-th microlens image I_d to obtain the d-th enhanced microlens image set I'_d; performing geometric transformation on the d-th real saliency map G_d to obtain the d-th transformed real saliency map set G'_d;
Step 4, repeating steps 1.2 to 3 to obtain the D enhanced microlens image sets in the light field data set L, denoted I' = (I'_1, I'_2, …, I'_d, …, I'_D), and the D transformed real saliency map sets, denoted G' = (G'_1, G'_2, …, G'_d, …, G'_D);
Step 5, constructing the salient object detection model of the d-th light field data L_d(u, v, s, t);
Step 5.1, acquiring a c-layer DeepLab-V2 convolutional neural network, the DeepLab-V2 convolutional neural network comprising convolutional layers, pooling layers and dropout (discarding) layers;
Step 5.2, modifying the c-layer DeepLab-V2 convolutional neural network to obtain a modified LFnet convolutional neural network;
Step 5.2.1, adding, before the first layer of the DeepLab-V2 convolutional neural network, a convolutional layer LF_conv1_1 with a convolution kernel of size m × m and a ReLU activation function LF_ReLU1_1;
setting the moving stride of the convolution kernel to m when the convolutional layer LF_conv1_1 performs the convolution operation;
the mathematical expression of the ReLU activation function LF_ReLU1_1 is φ(a) = max(0, a), wherein a represents the output of the convolutional layer LF_conv1_1 and is the input to the ReLU activation function LF_ReLU1_1, and φ(a) represents the output of the ReLU activation function LF_ReLU1_1;
Step 5.2.2, adding a dropout (discarding) layer after every other convolutional layer of the DeepLab-V2 convolutional neural network, except the convolutional layer LF_conv1_1 and those convolutional layers in the DeepLab-V2 convolutional neural network that are already connected to a dropout layer;
Step 5.2.3, setting the number of output channels of the (c − 1)-th layer of the DeepLab-V2 convolutional neural network to b, wherein b is the number of pixel classes;
Step 5.2.4, adding an upsampling layer after the c-th layer of the DeepLab-V2 convolutional neural network, and using the upsampling layer to upsample the feature map F_d(q, r, b) output by the c-th layer of the DeepLab-V2 convolutional neural network to obtain the upsampled feature map F'_d(q, r, b); wherein q, r and b represent the width, height and number of channels of the feature map F_d(q, r, b) respectively;
Step 5.2.5, adding a crop (shear) layer after the upsampling layer and, according to the height V and width U of the d-th real saliency map G_d, using the crop layer to crop the feature map F'_d(q, r, b) to obtain the pixel-class prediction probability map F''_d(q, r, b) of the microlens image I_d;
Step 5.3, taking the enhanced microlens image set I' as the input of the LFnet convolutional neural network and the transformed real saliency map set G' as the labels, using a cross-entropy loss function, and training the LFnet convolutional neural network with a gradient descent algorithm, thereby obtaining the salient object detection model of the light field data; salient object detection on light field data is then realized with the salient object detection model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811141315.2A CN109344818B (en) | 2018-09-28 | 2018-09-28 | A salient object detection method in light field based on deep convolutional network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811141315.2A CN109344818B (en) | 2018-09-28 | 2018-09-28 | A salient object detection method in light field based on deep convolutional network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109344818A CN109344818A (en) | 2019-02-15 |
CN109344818B true CN109344818B (en) | 2020-04-14 |
Family
ID=65307539
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811141315.2A Expired - Fee Related CN109344818B (en) | 2018-09-28 | 2018-09-28 | A salient object detection method in light field based on deep convolutional network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109344818B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111967288A (en) * | 2019-05-20 | 2020-11-20 | 万维数码智能有限公司 | Intelligent three-dimensional object identification and positioning system and method |
CN110441271B (en) * | 2019-07-15 | 2020-08-28 | 清华大学 | Light field high-resolution deconvolution method and system based on convolutional neural network |
CN111369522B (en) * | 2020-03-02 | 2022-03-15 | 合肥工业大学 | Light field significance target detection method based on generation of deconvolution neural network |
CN111445465B (en) * | 2020-03-31 | 2023-06-16 | 江南大学 | Method and equipment for detecting and removing snow or rain belt of light field image based on deep learning |
CN111931793B (en) * | 2020-08-17 | 2024-04-12 | 湖南城市学院 | Method and system for extracting saliency target |
CN113343822B (en) * | 2021-05-31 | 2022-08-19 | 合肥工业大学 | Light field saliency target detection method based on 3D convolution |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105701813A (en) * | 2016-01-11 | 2016-06-22 | 深圳市未来媒体技术研究院 | Significance detection method of light field image |
WO2018072858A1 (en) * | 2016-10-18 | 2018-04-26 | Photonic Sensors & Algorithms, S.L. | Device and method for obtaining distance information from views |
CN107993260A (en) * | 2017-12-14 | 2018-05-04 | 浙江工商大学 | A kind of light field image depth estimation method based on mixed type convolutional neural networks |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160203689A1 (en) * | 2015-01-08 | 2016-07-14 | Kenneth J. Hintz | Object Displacement Detector |
CN105913070B (en) * | 2016-04-29 | 2019-04-23 | 合肥工业大学 | A multi-cue saliency extraction method based on light field camera |
CN106981080A (en) * | 2017-02-24 | 2017-07-25 | 东华大学 | Night unmanned vehicle scene depth method of estimation based on infrared image and radar data |
2018-09-28: CN application CN201811141315.2A granted as patent CN109344818B (status: not active, Expired - Fee Related)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105701813A (en) * | 2016-01-11 | 2016-06-22 | 深圳市未来媒体技术研究院 | Significance detection method of light field image |
WO2018072858A1 (en) * | 2016-10-18 | 2018-04-26 | Photonic Sensors & Algorithms, S.L. | Device and method for obtaining distance information from views |
CN107993260A (en) * | 2017-12-14 | 2018-05-04 | 浙江工商大学 | A kind of light field image depth estimation method based on mixed type convolutional neural networks |
Non-Patent Citations (4)
Title |
---|
Hao Sheng et al., "Occlusion-aware depth estimation for light field using multi-orientation EPIs", Pattern Recognition, vol. 74, pp. 587-599, Feb. 2018 *
Jun Zhang et al., "Saliency Detection on Light Field: A Multi-Cue Approach", ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 13, no. 3, pp. 32:1-32:19, Aug. 2017 *
Wang Lijuan, "Research on calibration methods and depth estimation of light field cameras" (光场相机的标定方法及深度估计研究), Wanfang Data Knowledge Service Platform, Jul. 2018, pp. 1-49 *
Luo Yaoxiang, "Research on depth estimation of light field images based on convolutional neural networks" (基于卷积神经网络的光场图像深度估计技术研究), Wanfang Data Knowledge Service Platform, Aug. 2018, pp. 1-50 *
Also Published As
Publication number | Publication date |
---|---|
CN109344818A (en) | 2019-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109344818B (en) | A salient object detection method in light field based on deep convolutional network | |
CN107194872B (en) | Super-resolution reconstruction method of remote sensing images based on content-aware deep learning network | |
Qi et al. | Volumetric and multi-view cnns for object classification on 3d data | |
WO2018023734A1 (en) | Significance testing method for 3d image | |
Feng et al. | Benchmark data set and method for depth estimation from light field images | |
CN108596108B (en) | Aerial remote sensing image change detection method based on triple semantic relation learning | |
CN113343822B (en) | Light field saliency target detection method based on 3D convolution | |
CN110910437B (en) | A Depth Prediction Method for Complex Indoor Scenes | |
CN117409192B (en) | A data-enhanced infrared small target detection method and device | |
CN110827312B (en) | Learning method based on cooperative visual attention neural network | |
CN104850850A (en) | Binocular stereoscopic vision image feature extraction method combining shape and color | |
CN115410074B (en) | Remote sensing image cloud detection method and device | |
CN105913070A (en) | Multi-thread significance method based on light field camera | |
CN113436210A (en) | Road image segmentation method fusing context progressive sampling | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN115439926A (en) | Small sample abnormal behavior identification method based on key region and scene depth | |
CN114187520A (en) | Building extraction model and application method thereof | |
CN104463962B (en) | Three-dimensional scene reconstruction method based on GPS information video | |
CN116977895A (en) | Stain detection method and device for universal camera lens and computer equipment | |
CN107392211B (en) | Salient target detection method based on visual sparse cognition | |
Babu et al. | An efficient image dahazing using Googlenet based convolution neural networks | |
CN111680577A (en) | Face detection method and device | |
CN113569684B (en) | Short video scene classification method, system, electronic equipment and storage medium | |
Khoshboresh-Masouleh et al. | Robust building footprint extraction from big multi-sensor data using deep competition network | |
CN114926826A (en) | Scene text detection system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200414 |