Autonomous learning method in visual saliency detection
Technical Field
The invention relates to the technical field of computer vision, and in particular to an autonomous learning method for salient object detection based on a visual perception saturation mechanism.
Background
Visual attention and visual saliency are fundamental research topics in psychology, neurobiology, cognitive science, and computer vision. In recent decades, hundreds of fixation prediction (FP) and salient object detection (SOD) methods have been proposed for visual attention modeling. The best-performing methods at present are deep SOD models built on deep learning frameworks. However, the most difficult step in constructing a deep learning model is manually labeling a large number of training images at the pixel level, and the overall system performance depends on this manually labeled data. Manual labeling is time-consuming and expensive. Furthermore, models trained on finely labeled datasets tend to overfit and generalize poorly. A deep SOD model that can be trained autonomously, without manual intervention, has therefore become a research hotspot.
We note that most deep SOD models published to date process information in a single perception pathway. For example, in SOD based on a fully convolutional network (FCN), different features come from different convolutional layers, and feature fusion is always performed within the same FCN; a single deep SOD model rarely performs perceptual fusion at the decision level. Although a few multi-channel or multi-branch detection models can fuse the results of several perception channels, most of them aim only at extracting and fusing multi-scale features, and rarely reveal the interactions and interrelations within a multi-channel perception system.
Psychological and physiological experiments show that human intuition and memory can produce visual perception simultaneously and interact with each other. For example, the human binocular system forms two channels for processing visual information, and it presumably serves other functions in addition to forming stereoscopic vision. We believe that simulating human binocular perception by establishing two parallel, slightly different SOD perception channels may benefit the salient object detection task. A system with multiple perception channels can simultaneously generate a perception of the target and output the difference between the channels' perceptions. When the outputs of the channels are very similar, that is, when the difference is very small, the same target has been detected by several channels at once and perception tends toward saturation. A smaller multi-channel perceptual difference corresponds to a higher visual perception saturation, which can serve as a confidence level for the target detected by the multi-channel system. Using this mechanism, a multi-channel algorithm can automatically find salient objects with high reliability in an image, and the high-reliability target regions can be used as automatically labeled samples for iteratively updating the deep SOD models.
Disclosure of Invention
In view of this, the invention proposes a salient object detection framework with two perception channels that simulates human binocular vision. The framework finds target regions with high reliability by comparing the difference between the visual perceptions of the two channels, constructs a new training sample set from these high-reliability target regions, and continuously optimizes the detection models through iterative self-learning. The purpose of the invention is realized by the following technical scheme:
1) Two parallel visual perception channels are constructed from two different supervised deep SOD models, forming a salient object detection framework with dual visual information flows.
2) The perception saturation of the salient target region is judged by comparing the difference between the binary masks of the saliency maps that the two perception channels output for the same input. If the difference is small, the saturation is high; if the difference is large, the saturation is low.
3) When the perception saturation output in step 2) exceeds a preset empirical threshold, visual perception is considered to be close to saturation. The saliency maps output by the two channels can then be superposed to generate a final saliency map, and the binary mask of this saliency map is regarded as a target region with high reliability. A certain number of such automatically labeled high-reliability target regions are collected to form a self-labeled training sample set, which is used for further self-learning updates of the two supervised deep SOD models in step 1). If the perception saturation output in step 2) is below the preset threshold, visual perception is undersaturated and the credibility of the detected salient target region is low, so that region is not selected for the training sample set.
4) Once the SOD models of step 1) have been updated, the method obtains salient object detection results with better performance.
5) With a limited amount of test data, after the deep SOD models have been iteratively updated 1-2 times, the detection accuracy for salient objects no longer improves noticeably and the system performance tends toward saturation. To obtain better performance, the algorithm needs to process a larger-scale data set, collect more high-reliability training samples, and iteratively update the SOD models in the two channels.
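Steps 2) and 3) can be sketched as a small selection routine. This is a minimal illustration rather than the claimed implementation, under several assumptions. The two channels are represented by hypothetical callables that return binary masks as nested lists. The saturation score is intersection-over-union, used only as a simple stand-in for "small difference between masks" (the detailed description uses an F-measure instead). The "superposed" label is taken to be the union of the two masks, and the threshold of 0.9 is arbitrary.

```python
from typing import Any, Callable, List, Sequence, Tuple

Mask = List[List[int]]          # binary mask: 1 = salient pixel, 0 = background
Channel = Callable[[Any], Mask]  # hypothetical interface: image -> binary mask


def saturation(m1: Mask, m2: Mask) -> float:
    """Stand-in saturation score: intersection-over-union of two masks."""
    pairs = [(a, b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2)]
    union = sum(a | b for a, b in pairs)
    if union == 0:
        return 0.0  # neither channel found an object; treat as unsaturated
    return sum(a & b for a, b in pairs) / union


def collect_self_labels(
    images: Sequence[Any],
    channel_a: Channel,
    channel_b: Channel,
    threshold: float = 0.9,  # assumed value; the text only says "empirical"
) -> List[Tuple[Any, Mask]]:
    """Keep only images on which the two channels agree closely enough."""
    samples = []
    for image in images:
        m1, m2 = channel_a(image), channel_b(image)
        if saturation(m1, m2) > threshold:
            # Assumption: the superposed label is the union of both masks.
            label = [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
            samples.append((image, label))
    return samples


# Toy channels: precomputed masks looked up by image name.
masks_a = {"x": [[1, 1], [0, 0]], "y": [[1, 0], [0, 0]]}
masks_b = {"x": [[1, 1], [0, 0]], "y": [[0, 0], [0, 1]]}
selected = collect_self_labels(
    ["x", "y"], masks_a.__getitem__, masks_b.__getitem__
)
```

In this toy run the channels agree exactly on image "x" and disagree completely on image "y", so only "x" enters the self-labeled set. Retraining the two models on the collected samples is left out, since it depends on the chosen SOD models.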
Drawings
FIG. 1 is a block diagram of the autonomous learning method in visual saliency detection.
Detailed Description
The present invention is further illustrated by the following specific examples, but the present invention is not limited to these examples.
The invention is intended to cover any alternatives, modifications, and equivalents that may be included within the spirit and scope of the invention. In the following description of the preferred embodiments, specific details are set forth to provide a thorough understanding of the present invention; it will be apparent to those skilled in the art that the present invention may also be practiced without these specific details.
As shown in FIG. 1, the autonomous learning method in visual salient object detection of the present invention is implemented through the following steps:
1) A parallel two-channel salient object detection system is designed. The system consists of two detection channels with similar structures, simulates the human binocular system, and obtains high-reliability target perception through a visual saturation mechanism.
2) Each detection channel can adopt a recent deep SOD model, such as PiCANet or F3Net. First, the two deep SOD models are pre-trained offline on an initial training set to form the initial system. The initial system performs salient object detection on an image to obtain the saliency maps output by the two channels, and thresholding these saliency maps yields two binary mask images. The two mask regions are then compared, and the difference between the masks is measured with the F-measure (see formula (1)): the larger the F-measure value, the higher the perception saturation; the smaller the value, the lower the saturation.
In formula (1), β² is an empirical weighting factor, M1 is the binary mask output by one perception channel, and M2 is the binary mask output by the other perception channel.
3) When the F-measure value representing the perception saturation is larger than the threshold th, the saliency maps output by the two perception channels are superposed and fused into a new output map, and the binary mask of this map is used as an automatic annotation map and becomes a new training sample. By testing a large number of images, a certain number of high-reliability automatic annotation maps can be collected to form a new training sample set.
4) The new training sample set can be used to iteratively update the deep SOD models in both detection channels.
5) Because the test data is limited, the number of high-reliability labeled samples obtained automatically by the method stops increasing after reaching a certain level and tends toward saturation. This in turn limits the iterative updating of the deep SOD models: after 1-2 iterations, the detection accuracy for salient objects no longer improves noticeably, and the system performance tends toward saturation. To achieve better performance, the algorithm can process a larger-scale data set, generate more high-confidence training samples, and iteratively update the SOD models in the two channels.
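The mask comparison and fusion in steps 2) and 3) can be illustrated with a short sketch. Formula (1) is not reproduced in this text, so the code assumes the standard weighted F-measure, F = (1 + β²)·P·R / (β²·P + R), with P = |M1∩M2| / |M1| and R = |M1∩M2| / |M2|. Which mask plays the reference role is an assumption. The default β² = 0.3, the binarization level 0.5, and fusion by pixel-wise averaging are further illustrative assumptions, not values taken from the description.

```python
from typing import List

Mask = List[List[int]]            # binary mask: 1 = salient pixel
SaliencyMap = List[List[float]]   # per-pixel saliency in [0, 1]


def binarize(saliency: SaliencyMap, level: float = 0.5) -> Mask:
    """Threshold a saliency map into a binary mask (level is assumed)."""
    return [[1 if v >= level else 0 for v in row] for row in saliency]


def f_measure(m1: Mask, m2: Mask, beta2: float = 0.3) -> float:
    """Weighted F-measure between two masks, with m2 as the reference."""
    overlap = sum(a & b for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
    if overlap == 0:
        return 0.0  # also covers the case where either mask is empty
    precision = overlap / sum(map(sum, m1))
    recall = overlap / sum(map(sum, m2))
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


def fuse(s1: SaliencyMap, s2: SaliencyMap) -> SaliencyMap:
    """Superpose two saliency maps; pixel-wise averaging is an assumption."""
    return [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]


# Toy saliency maps from the two channels.
s1 = [[0.9, 0.8], [0.1, 0.2]]
s2 = [[0.7, 0.2], [0.1, 0.1]]
score = f_measure(binarize(s1), binarize(s2))
auto_label = binarize(fuse(s1, s2))
```

A caller would compare `score` with the threshold th and keep `auto_label` as a training sample only when the score is larger. With the assumed reference convention, the measure is not symmetric in M1 and M2 unless β² = 1.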
The foregoing is illustrative of the preferred embodiments of the present invention only and is not to be construed as limiting the claims. The present invention is not limited to the above embodiments, and the specific structure thereof is allowed to vary. In general, all changes which come within the scope of the invention as defined by the independent claims are intended to be embraced therein.