
CN115937613B - A plant temporal image contrast learning method embedding prior distance - Google Patents

A plant temporal image contrast learning method embedding prior distance

Info

Publication number
CN115937613B
CN115937613B (application CN202310033871A)
Authority
CN
China
Prior art keywords
distance
image
plant
image pairs
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310033871.2A
Other languages
Chinese (zh)
Other versions
CN115937613A (en)
Inventor
胡玲艳
许巍
郭睿雅
汪祖民
谷毛毛
陈鹏宇
徐国辉
郭占俊
李国强
秦山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN202310033871.2A priority Critical patent/CN115937613B/en
Publication of CN115937613A publication Critical patent/CN115937613A/en
Application granted granted Critical
Publication of CN115937613B publication Critical patent/CN115937613B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a plant time-series image contrast learning method that embeds prior distances. The method reads plant time-series images, obtains phenological period information, generates four types of image pairs, and records the corresponding prior distance owned by each type of image pair. An image pair x and y is input into a contrast model, and data enhancement is applied to obtain images v1 and v2, from which feature vectors h1 and h2 are extracted in an encoder. After the feature vectors h1 and h2 are extracted, a small neural network projection head maps the representations to the space where the contrast loss is applied, and the prior distances of the different types of image pairs are fused with the actual distances of the corresponding vectors z1 and z2 through a graded distance and a classification distance, so as to obtain the contrast loss used for further training. The invention uses contrast learning to obtain pre-training weights dedicated to crops, so that the self-supervised contrast learning method can be effectively applied to the pre-training of plant time-series images.

Description

Plant time sequence image contrast learning method embedded with priori distance
Technical Field
The invention relates to the technical field of plant time sequence image processing of deep learning, in particular to a plant time sequence image contrast learning method embedded with prior distance.
Background
Plants often exhibit different characteristics and traits during growth due to genetic differences and environmental influences. The measurable appearance and inherent physiological and biochemical characteristics of an organism, such as shape, structure, size and color, as determined by genotype and environment, are referred to as the plant phenotype. Understanding the phenotypic characteristics and traits of plants is an important proposition for biological research; without exhaustive phenotypic data, the complex effects of genomic and environmental factors on plant phenotypes cannot be deeply understood. Traditional plant phenotype analysis mainly measures various parameters manually and suffers from small analysis scale, low efficiency and large error.
The general research procedure for plant time-series images mainly comprises two stages: information extraction and time-series modeling. In the information extraction stage, digital image processing methods, in particular deep learning methods such as image classification, object detection and semantic segmentation, are generally adopted to extract phenotype data from single images. In the time-series modeling stage, information is accumulated along the time dimension, data from different growth stages are fused, a specific model is built, and joint analysis can be carried out with other external factors.
In the acquisition of plant time-series images, the imaging equipment usually photographs the plants in a specific area at fixed times, so plant time-series images are convenient to acquire. However, training a deep learning model for image information extraction often requires a large number of labeled images, and because plant images contain many fine details, such as the edges of flowers and leaves, manually labeling the data set is costly. This requires that the model obtain good training results with less manually labeled data.
Given this situation, contrast learning, which pre-trains on unlabeled data, is one way to achieve label-efficient training; examples include SimCLR, MoCo and SimSiam. These models learn representations by maximizing the consistency between different enhanced views of the same data instance, and transfer well to downstream tasks. Research applications on plant and crop images have also begun, such as plant phenotype segmentation, plant remote sensing, pest monitoring and seed classification.
However, the number of contrast learning studies in the plant field is far smaller than in other fields, and no study of self-supervised contrast learning on plant time-series images has been reported. Plants grow slowly, so the images of a sequence change little over a period of time and are highly similar; organs such as flowers, leaves and trunks occupy a large proportion of each image, and the semantic information is simple. For a traditional contrast learning model, although research images are easy to acquire, the labeling cost is high, and because of the semantic similarity between different images, during contrast training the model has difficulty judging whether two views are positive samples obtained from the same image by different data enhancements, or negative samples taken from different but similar images.
Disclosure of Invention
The invention aims to provide a plant time sequence image contrast learning method embedded with a priori distance, which can be effectively applied to the pre-training of plant time sequence images and has wide application prospects in the research of various computer vision plant phenotypes.
In order to achieve the above object, the present application provides a plant time sequence image contrast learning method embedded with a priori distance, comprising:
reading the plant time-series images to obtain phenological period information, and generating four types of image pairs, namely image pairs of the same sequence in the same period, the same sequence in different periods, different sequences in the same period, and different sequences in different periods;
recording the corresponding prior distance owned by each type of image pair;
After an image pair x and y is input into the contrast model, data enhancement is carried out. The purpose is to weaken the influence of color on model training, particularly the green of large areas of leaves, so that the model can attend to higher-level semantic information beyond color. For the same image, different views are generated in the training of different epochs, so that the model can better utilize similar images. The data-enhanced images are v1 and v2.
The images v1 and v2 are passed through an encoder to extract feature vectors h1 and h2; the encoder can be chosen freely according to the downstream task.
After extracting the feature vectors h1 and h2, the invention uses a small neural network projection head to map the representations to the space where the contrast loss is applied: a 2-layer MLP with ReLU and BN layers projects the feature vectors h1 and h2 to 256-dimensional vectors z1 and z2. The projection head does not participate in downstream task training.
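As a concrete illustration of the projection step, the following sketch implements a 2-layer MLP projection head in plain Python. It is a minimal sketch only: the BN layers are omitted for brevity, the weight shapes are hypothetical placeholders, and the real head projects to 256 dimensions.

```python
def linear(x, W, b):
    # W has one row per output unit; returns W.x + b
    return [sum(w * xi for w, xi in zip(row, x)) + bb for row, bb in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def projection_head(h, W1, b1, W2, b2):
    # z = W2 . ReLU(W1 . h + b1) + b2 -- a 2-layer MLP; BatchNorm omitted in this sketch
    return linear(relu(linear(h, W1, b1)), W2, b2)
```

Here h stands for the encoder feature vector and z for its projection used only by the contrast loss; as the text notes, the head is discarded for downstream tasks.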
In the stage of calculating the contrast loss, the invention provides two modes, a graded distance and a classification distance, which fuse the prior distances of the different types of image pairs with the actual distances of the corresponding vectors z1 and z2 to obtain the contrast loss.
Compared with the prior art, the advantage of this technical scheme is that important domain knowledge is converted into the prior distance of an image pair and used in contrast learning pre-training. The self-supervised contrast learning method can thus be effectively applied to the pre-training of plant time-series images, and has wide application prospects in various computer vision studies of plant phenotypes.
Drawings
FIG. 1 is a flow chart of a plant time sequence image contrast learning method embedded with a priori distance;
FIG. 2 is a schematic illustration of phenological period division and image extraction;
FIG. 3 is a schematic diagram of the distance metric and contrast loss acquisition process;
FIG. 4 is a schematic diagram of the pre-training and migration process.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the application, i.e., the embodiments described are merely some, but not all, of the embodiments of the application.
Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
Example 1
As shown in FIG. 1, the application provides a plant time sequence image contrast learning method embedded with a priori distance, which specifically comprises the following steps:
Phenological period information is acquired from the plant images. For the n image sequences of a certain plant, 3-5 example sequences are randomly extracted. For each example sequence, budding is set as the reference time day0, and the start time and duration of the different phenological periods can be obtained by manually interpreting the image sequence. Each time node of the example sequences is then averaged. Since a particular plant has a relatively fixed annual growth period, this average can be taken as approximately representing the phenological periods of all n image sequences.
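The averaging of the manually interpreted time nodes can be sketched as follows; the stage names and day offsets used here are illustrative, with day0 as the budding reference as in the text.

```python
def average_phenology(example_sequences):
    """example_sequences: list of dicts mapping a phenological stage name to its
    start day counted from budding (day0). Returns the per-stage average, taken
    as an approximation for all n image sequences of the plant."""
    stages = example_sequences[0].keys()
    n = len(example_sequences)
    return {stage: sum(seq[stage] for seq in example_sequences) / n for stage in stages}
```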
The required images are then extracted from the images whose phenological periods have been divided. The phenological period change is a gradual process, and at the end of one period and the beginning of the next, the corresponding images may contain similar semantic information. In order to extract images of different phenological periods automatically and accurately, and to maximize the semantic difference between images of different periods, images of adjacent periods near the boundary are discarded, and only images far from the time critical point are selected and recorded for classification.
The images are paired pairwise, i.e., the Cartesian product of the image set with itself is taken. According to the sequences and phenological periods of the two images, each obtained image pair can be recorded as one of four types: same sequence and same period, same sequence and different periods, different sequences and same period, and different sequences and different periods; the type is stored as a label by one-hot coding.
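The pairing and one-hot labelling step can be sketched as follows, assuming each image is identified by a sequence index and a phenological period index; the category ordering is an assumption of this sketch, not fixed by the text.

```python
from itertools import product

def pair_category(seq_a, period_a, seq_b, period_b):
    # 0: same sequence, same period      1: same sequence, different periods
    # 2: different sequences, same period  3: different sequences, different periods
    if seq_a == seq_b:
        return 0 if period_a == period_b else 1
    return 2 if period_a == period_b else 3

def one_hot(category, num_classes=4):
    label = [0.0] * num_classes
    label[category] = 1.0
    return label

def make_pairs(images):
    # images: list of (image_id, sequence_index, period_index);
    # Cartesian product of the image set with itself, each pair with its one-hot label
    return [((a[0], b[0]), one_hot(pair_category(a[1], a[2], b[1], b[2])))
            for a, b in product(images, images)]
```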
The contrast loss is calculated according to the prior distances of the different types of image pairs, and two modes are provided: a graded distance and a classification distance. The graded distance is shown in fig. 3(a); specifically:
For a class-l image pair, a distance coefficient k_l is first defined to characterize the relative distance between image pairs of that class. For any image pair x and y, its prior distance p_xy can be obtained as:
p_xy = k_1*a_1 + k_2*a_2 + k_3*a_3 + k_4*a_4
where (a_1, a_2, a_3, a_4) is the original one-hot label of the image pair. After this processing, the prior distance p_xy of any image pair x and y is marked as a number in [0, 1], and p_xy can be regarded as the probability that the two images of the pair carry the same meaning.
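A minimal sketch of the prior distance, assuming the one-hot label picks out one of four distance coefficients k_l; the coefficient values below are purely illustrative placeholders, since the patent does not fix them.

```python
def prior_distance(label, coeffs=(1.0, 0.7, 0.4, 0.1)):
    # p_xy = k1*a1 + k2*a2 + k3*a3 + k4*a4 for a one-hot label (a1..a4);
    # the k_l values here are hypothetical numbers in [0, 1]
    return sum(k * a for k, a in zip(coeffs, label))
```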
For the vectors z1 and z2 of an image pair x and y, the loss value loss is calculated as:
loss = -[ p_xy * log σ(Sim(z1, z2)/τ) + (1 - p_xy) * log(1 - σ(Sim(z1, z2)/τ)) ]
where Sim(z1, z2) is the similarity between z1 and z2 and σ is the Sigmoid function. The expression can be read as taking the similarity of z1 and z2, passing it through a Sigmoid, and computing the cross entropy with the probability represented by p_xy. τ is a temperature coefficient used to adjust the distribution of the similarity. The similarity may be measured with the negative cosine distance (negative cosine similarity) or the Euclidean distance. When the negative cosine distance is used:
Sim(z1, z2) = -(z1 · z2) / (||z1|| * ||z2||)
The Euclidean distance is the l2 norm:
Sim(z1, z2) = Euc(z1, z2) = ||z1 - z2||_2
For each mini-batch, if the batch size is n, the contrast loss is the average of the loss values over the n image pairs in the batch.
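A runnable sketch of the graded-distance loss described above, under the assumptions that Sim is the negative cosine similarity, that σ(Sim/τ) is compared with p_xy by binary cross entropy, and that the mini-batch loss is the mean over its pairs; the exact sign and averaging conventions are not fully specified in the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_cosine(z1, z2):
    # negative cosine similarity: -1 for identical directions, +1 for opposite
    dot = sum(a * b for a, b in zip(z1, z2))
    n1 = math.sqrt(sum(a * a for a in z1))
    n2 = math.sqrt(sum(b * b for b in z2))
    return -dot / (n1 * n2)

def pair_loss(z1, z2, p_xy, tau=0.5):
    # cross entropy between sigmoid(Sim(z1, z2)/tau) and the prior distance p_xy
    q = sigmoid(neg_cosine(z1, z2) / tau)
    eps = 1e-12  # numerical safety for log
    return -(p_xy * math.log(q + eps) + (1.0 - p_xy) * math.log(1.0 - q + eps))

def batch_loss(pairs, tau=0.5):
    # pairs: list of (z1, z2, p_xy) tuples for one mini-batch
    return sum(pair_loss(z1, z2, p, tau) for z1, z2, p in pairs) / len(pairs)
```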
The classification distance is shown in fig. 3(b). Specifically, the distance information of the contrast model is implicitly mapped onto a fully connected layer, and the loss is calculated directly against the classification of the different image pairs.
In calculating the Euclidean distance of z1 and z2, it can be further rewritten as:
Euc(z1, z2) = ||z1 - z2||_2 = sqrt( sum_i (z1_i - z2_i)^2 )
The Euclidean distance calculation yields a single number, which is not convenient to connect to a fully connected layer for classification. When the classification distance computes the contrast loss, z1 and z2 are instead subtracted element-wise, namely:
e = z1 - z2
e is a vector of the same dimension as z1 and z2, and the process of computing e is the core step of computing the Euclidean distance of z1 and z2, so e contains the prior distance information of z1 and z2. e is linearly projected to a fully connected layer t of 4 nodes, and the output o is obtained through softmax processing, namely:
t=eW
o=Softmax(t)
Cross entropy is used to calculate the error between the category information in o and the prior distance information in the image pair's label category, namely:
loss = - sum_c a_c * log(o_c)
where (a_1, ..., a_4) is the one-hot label of the image pair.
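The classification-distance branch can be sketched in plain Python as follows; the weight matrix W is a hypothetical stand-in for the learned fully connected layer.

```python
import math

def class_distance_forward(z1, z2, W):
    # e = z1 - z2; t = eW with W of shape dim x 4; o = Softmax(t)
    e = [a - b for a, b in zip(z1, z2)]
    t = [sum(e[i] * W[i][j] for i in range(len(e))) for j in range(4)]
    m = max(t)                      # numerically stabilized softmax
    exps = [math.exp(x - m) for x in t]
    s = sum(exps)
    return [x / s for x in exps]

def cross_entropy(o, label):
    # error between the predicted distribution o and the one-hot pair label
    return -sum(a * math.log(p + 1e-12) for a, p in zip(label, o))
```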
After the contrast learning pre-training is finished, the obtained weights can be transferred to various downstream supervised tasks. The invention carries out supervised training with a small number of labeled plant images on the basis of the already-trained weights. The trained semantic segmentation network is taken as the information extraction model, semantic segmentation results are recorded along the time dimension, and a plant growth model is established.
As shown in fig. 4, a U-Net semantic segmentation network is used as the information extraction model to segment the regions of stems, flowers, leaves, fruits, background and the like in a plant image at the pixel level. The U-Net network structure is a classical Encoder-Decoder structure, and various backbones can be adopted as the Encoder. The Encoder network yields 5 preliminary effective feature layers. The Decoder network uses deconvolution to up-sample the 5 feature layers and performs feature fusion to obtain an effective feature layer fusing all features. Classifying each feature point of the resulting feature layer yields the semantic segmentation result.
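To make the shape bookkeeping concrete: assuming each Encoder stage halves the spatial resolution — a common U-Net convention, not stated explicitly in the text — the sizes of the 5 feature layers for a square input can be sketched as:

```python
def encoder_feature_sizes(input_size, num_layers=5):
    # hypothetical: each encoder stage halves the spatial resolution
    sizes = []
    s = input_size
    for _ in range(num_layers):
        s //= 2
        sizes.append(s)
    return sizes
```

The Decoder then up-samples back through these scales, fusing each with the matching Encoder layer.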
Kalman filtering is adopted to reduce the influence of ambient light, climate, production activities and the like, remove noise and interference from the growth model system, and obtain more stable observation data.
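A minimal 1-D Kalman filter sketch for smoothing a growth observation series; the noise parameters q and r and the static-state model are illustrative assumptions, since the patent does not specify the filter design.

```python
def kalman_1d(observations, q=1e-3, r=1e-1):
    """Smooth a 1-D growth signal (e.g. a per-day segmented leaf-area ratio).
    q: assumed process noise, r: assumed measurement noise."""
    x, p = observations[0], 1.0   # initial state estimate and covariance
    smoothed = []
    for z in observations:
        p = p + q                 # predict step (static growth model)
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # update with the new observation
        p = (1.0 - k) * p
        smoothed.append(x)
    return smoothed
```

Each smoothed value blends the prediction with the new observation, so isolated noisy measurements are pulled toward the running estimate.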
The foregoing descriptions of specific exemplary embodiments of the present invention are presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application to thereby enable one skilled in the art to make and utilize the invention in various exemplary embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (4)

1. A plant time-series image contrast learning method embedding a prior distance, characterized by comprising:
reading plant time-series images, obtaining phenological period information, and generating four types of image pairs, namely image pairs of the same sequence in the same period, the same sequence in different periods, different sequences in the same period, and different sequences in different periods;
recording the corresponding prior distance owned by each type of image pair;
inputting an image pair x and y into a contrast model and performing data enhancement to obtain images v1 and v2, wherein for the same image, different views are generated in the training of different epochs;
extracting feature vectors h1 and h2 from the images v1 and v2 in an encoder;
after extracting the feature vectors h1 and h2, using a neural network projection head to map the representations to the space where the contrast loss is applied, obtaining vectors z1 and z2;
fusing the prior distances of the different types of image pairs with the actual distances of the corresponding vectors z1 and z2 through two modes, a graded distance and a classification distance, so as to obtain the contrast loss;
wherein the graded distance is specifically: first defining a distance coefficient k_l to characterize the relative distance between image pairs of a class; for an image pair x and y, obtaining its distance p_xy:
p_xy = k_1*a_1 + k_2*a_2 + k_3*a_3 + k_4*a_4
where (a_1, a_2, a_3, a_4) is the original label of the image pair; after this processing, the prior distance of any image pair x and y is marked as a number in [0, 1], and p_xy is regarded as the probability that the sample pair consists of images with the same meaning;
for the vectors z1 and z2 of the image pair x and y, calculating the loss value loss as:
loss = -[ p_xy * log σ(Sim(z1, z2)/τ) + (1 - p_xy) * log(1 - σ(Sim(z1, z2)/τ)) ]
where Sim(z1, z2) is the similarity of z1 and z2; the above formula is read as computing the similarity of z1 and z2 and, after a Sigmoid, taking the cross entropy with the probability represented by p_xy; τ is a temperature coefficient used to adjust the distribution of the similarity;
the similarity is measured by the negative cosine distance or the Euclidean distance; when the negative cosine distance is used:
Sim(z1, z2) = -(z1 · z2) / (||z1|| * ||z2||)
the Euclidean distance is the l2 norm:
Sim(z1, z2) = Euc(z1, z2) = ||z1 - z2||_2
for each mini-batch, if the batch size is n, the contrast loss is the average of the loss values over the n image pairs.
2. The plant time-series image contrast learning method embedding a prior distance according to claim 1, characterized in that the classification distance is specifically: implicitly mapping the distance information of the contrast model onto a fully connected layer, and calculating the loss directly against the classification of the different image pairs; in calculating the Euclidean distance of z1 and z2, the formula is rewritten as:
Euc(z1, z2) = ||z1 - z2||_2 = sqrt( sum_i (z1_i - z2_i)^2 )
when the classification distance calculates the contrast loss, z1 and z2 are directly subtracted element-wise, namely:
e = z1 - z2
the parameter e is a vector of the same dimension as z1 and z2; the process of calculating e is the core step of calculating the Euclidean distance of z1 and z2, so that e contains the prior distance information of z1 and z2.
3. The plant time-series image contrast learning method embedding a prior distance according to claim 2, characterized in that e is linearly projected to a fully connected layer t of 4 nodes and processed by softmax to obtain the output o, namely:
t = eW
o = Softmax(t)
cross entropy is used to calculate the error between the category information in o and the prior distance information in the image pair's label category, namely:
loss = - sum_c a_c * log(o_c)
4. The plant time-series image contrast learning method embedding a prior distance according to claim 1, characterized in that a U-Net semantic segmentation network is adopted as the information extraction model to perform pixel-level segmentation of the branch, flower, leaf, fruit and background regions in a plant image; the U-Net semantic segmentation network has an Encoder-Decoder structure, and the Encoder network obtains 5 feature layers; the Decoder network uses deconvolution to up-sample the 5 feature layers and performs feature fusion to obtain an effective feature layer fusing all features; classifying each feature point of the effective feature layer yields the semantic segmentation result.
CN202310033871.2A 2023-01-10 2023-01-10 A plant temporal image contrast learning method embedding prior distance Active CN115937613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310033871.2A CN115937613B (en) 2023-01-10 2023-01-10 A plant temporal image contrast learning method embedding prior distance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310033871.2A CN115937613B (en) 2023-01-10 2023-01-10 A plant temporal image contrast learning method embedding prior distance

Publications (2)

Publication Number Publication Date
CN115937613A CN115937613A (en) 2023-04-07
CN115937613B true CN115937613B (en) 2026-01-09

Family

ID=86700919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310033871.2A Active CN115937613B (en) 2023-01-10 2023-01-10 A plant temporal image contrast learning method embedding prior distance

Country Status (1)

Country Link
CN (1) CN115937613B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9342589B2 (en) * 2008-07-30 2016-05-17 Nec Corporation Data classifier system, data classifier method and data classifier program stored on storage medium
US8218869B2 (en) * 2009-03-29 2012-07-10 Mitsubishi Electric Research Laboratories, Inc. Image segmentation using spatial random walks
US9463132B2 (en) * 2013-03-15 2016-10-11 John Castle Simmons Vision-based diagnosis and treatment
CN108269244B (en) * 2018-01-24 2021-07-06 东北大学 An Image Dehazing System Based on Deep Learning and Prior Constraints
CN115424059B (en) * 2022-08-24 2023-09-01 珠江水利委员会珠江水利科学研究院 Remote sensing land utilization classification method based on pixel level contrast learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cherry growth modeling based on Prior Distance Embedding contrastive learning: Pre-training, anomaly detection, semantic segmentation, and temporal modeling; Wei Xu; Computers and Electronics in Agriculture; 2024-04-25; full text *

Also Published As

Publication number Publication date
CN115937613A (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN112070078B (en) Land use classification method and system based on deep learning
CN106198909B (en) A method for predicting water quality in aquaculture based on deep learning
CN111369498A (en) A Data Augmentation Method for Seedling Growth Viability Evaluation Based on Improved Generative Adversarial Networks
CN110889335B (en) Two-person interaction behavior recognition method based on multi-channel spatio-temporal fusion network human skeleton
CN115115830A (en) Improved Transformer-based livestock image instance segmentation method
CN112184734B (en) Animal long-time gesture recognition system based on infrared image and wearable optical fiber
CN110853070A (en) Underwater sea cucumber image segmentation method based on significance and Grabcut
CN114120359B (en) A method for measuring body size of group-raised pigs based on stacked hourglass network
CN114550164B (en) A research method for tomato leaf disease identification based on deep learning
CN106991666A (en) A kind of disease geo-radar image recognition methods suitable for many size pictorial informations
CN119580049B (en) System for realizing intelligent recognition of crops by multi-mode method based on CLIP
CN118072251B (en) Tobacco pest identification method, medium and system
CN114663791A (en) Branch recognition method for pruning robot in unstructured environment
CN117557914A (en) A method for identifying crop diseases and insect pests based on deep learning
CN117011272B (en) A method for anomaly detection in large-scale plant growth images by incorporating relative distance
An Xception network for weather image recognition based on transfer learning
CN116976392A (en) An unsupervised estimation method for large-scale brain functional connectivity networks
CN115937613B (en) A plant temporal image contrast learning method embedding prior distance
CN120931968A (en) Crop pest intelligent identification and early warning system based on multi-mode large language model
CN118072295B (en) Tobacco leaf identification method, system, storage medium, equipment and program product
CN119206341A (en) A plant disease identification method, system and storage medium based on unsupervised category incremental learning
Stanski et al. Flower detection using object analysis: new ways to quantify plant phenology in a warming tundra biome
CN118521915A (en) Self-adaptive method-based automatic unsupervised remote sensing field extraction method
Gao et al. Plant Event Detection from Time-Varying Point Clouds
Wang et al. Extracting the height of lettuce by using neural networks of image recognition in deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant