CN119091127A - Visual information processing method in railway shunting operation scene - Google Patents
- Publication number
- CN119091127A (application CN202411565510.3A)
- Authority
- CN
- China
- Prior art keywords
- map
- track
- railway
- characteristic
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Train Traffic Observation, Control, And Security (AREA)
Abstract
The invention discloses a visual information processing method for railway shunting operation scenes, belonging to the field of computer vision. To understand and identify tracks and railway targets efficiently and quickly, the invention implements an anchor line generator, a track detection head and a target detection head on top of a shared encoder, thereby accomplishing the track detection task and the railway target detection task efficiently. The method has low computing-resource requirements, performs railway target detection and track detection in real time, can track and identify rails in complex track scenes, and improves the reliability of the system.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a visual information processing method in a railway shunting operation scene.
Background
Currently, intelligent visual information processing is a very active research direction in the field of automatic driving. In this field, multi-task detection networks mainly focus on three detection tasks: lane segmentation, drivable-region segmentation and traffic-object detection. Road structure is complex, with many variable factors such as multiple lanes, intersections and pedestrian crossings; perception of these three tasks is therefore essential for automatic driving, especially when a high-definition map cannot be acquired and accurate positioning is unavailable.
In the railway shunting operation scene, because railway tracks are relatively regular, the technical emphasis falls on (a) the railway target detection task and (b) the track detection task, that is, accurately identifying and monitoring the fixed track path.
The basis of current railway video car-taking systems is to use various technical means, such as video acquisition, deep learning and visual ranging, to obtain the road conditions in front of the head car pushed during shunting operations and the position information of stabled cars.
With the rapid development of deep learning, convolutional neural networks are widely used in track detection and railway target detection. However, existing methods typically treat the track detection task and the railway target detection task separately. In a pushing-shunting video car-taking system, computing resources are limited; deploying two independent detection models on the equipment mounted on the push head car not only raises hardware requirements but also degrades real-time performance.
(A) The railway target detection task mainly performs object localization and classification by identifying stabled-car and/or coupler targets in a specific area and indicating object positions with bounding boxes. In the field of target detection, methods can be classified into two-stage and one-stage target detection methods.
A two-stage target detection method first selects candidate regions from the whole image, then searches for objects within the candidate regions, classifies the objects found, and generates the final bounding boxes. Although two-stage methods achieve higher accuracy, they consume a great deal of time and resources, which hinders real-time application of the network.
The one-stage target detection method improves on the two-stage method by treating localization and classification as a single regression problem, realizing end-to-end detection at high speed. Region-proposal selection and target localization are fused: anchor boxes laid over the whole image serve as region proposals, and class probabilities and confidences are computed for each anchor box to complete its classification, finally realizing localization and classification of the target. One-stage networks perform well in both speed and accuracy and are widely applied in academia and industry.
(B) For the track detection task, traditional methods segment the rail-line region through edge detection and filtering, then detect rail lines with the Hough transform, RANSAC and similar techniques. However, traditional methods require parameters to be tuned manually for the characteristics of each application scene, which limits their use, so deep-learning-based methods are favored. These can be divided into three types: methods based on semantic segmentation, methods based on row classification, and methods based on anchor lines.
Semantic-segmentation-based rail detection focuses only on the rails in the train's current track area; because rails are slender, the targets to be detected occupy few pixels in the scene. Segmenting such small targets is difficult and also wastes computing resources, failing to meet real-time detection requirements. Row-classification-based methods grid the input image; for each row, the model predicts the cells most likely to contain part of the track, and the track line is then fitted through key points. These methods are faster than segmentation but may perform poorly in occluded scenes. Anchor-line-based methods directly predict the geometric structure of the track line; they execute quickly, meet real-time processing requirements, and are more robust to occlusion because they mainly attend to the structure of the line rather than pixel-level color or texture information.
For the railway shunting operation scene, no technical solution has been disclosed that completes the railway target detection task and the track detection task simultaneously in real time while occupying few computing resources.
Based on the railway video car-taking system, the invention focuses on the track detection task and the railway target detection task in the railway shunting operation scene. The invention provides a brand-new multi-task detection network that can be carried on the resource-limited equipment of the push head car and realizes multi-target, end-to-end detection.
Disclosure of Invention
In order to alleviate or partially alleviate the above technical problem, the solution of the present invention is as follows:
A visual information processing method in a railway shunting operation scene: a picture obtained by acquiring visual information is input into an encoder comprising a convolutional neural network and a feature pyramid network, wherein the convolutional neural network comprises a plurality of segments, and the feature pyramid network, comprising a first feature map, a second feature map and a third feature map, is constructed by extracting the feature map of the last layer of several segments of the convolutional neural network. The second feature map is input into a first, a second, a third and a fourth prediction branch, which respectively predict the starting point, the starting-point offset, the angle and the dynamic kernel parameters of anchor lines, and a plurality of dynamic convolution kernel parameters are generated for each predicted starting point by a recurrent instance module. Region-of-interest features of each anchor line are extracted from the third feature map, a conditional convolution operation is performed on them with the dynamic kernel parameters, and the result is used as the input of a fully connected layer to obtain the horizontal offsets between the track and the anchor line; predicted track information is then obtained from the starting point, starting-point offset and angle of the anchor line together with these horizontal offsets. Furthermore, according to the feature pyramid network, lower-layer features are downsampled and fused with higher-layer features to generate a path aggregation network; a plurality of prior anchor boxes with different aspect ratios are allocated to each grid cell of the path aggregation network, and a target detection head predicts the position offset of the railway target, the height and width scaling of the prior anchor boxes, and the probability that the railway target belongs to each category.
Further, the first, second and third feature maps increase in size in order.
Further, the first feature map of the feature pyramid network is obtained by extracting the feature map of the last layer of the last segment of the convolutional neural network and applying the multi-level max-pooling operation of a spatial pyramid pooling module; the second feature map of the feature pyramid network is obtained by extracting the feature map of the last layer of the second-to-last segment of the convolutional neural network, applying a 1×1 convolution and adding the upsampled first feature map; and the third feature map of the feature pyramid network is obtained by extracting the feature map of the last layer of the third-to-last segment of the convolutional neural network, applying a 1×1 convolution and adding the upsampled second feature map.
Further, when predicting the starting point, the starting point offset, the angle and the dynamic kernel parameters about the anchor line, a starting point heat map, a starting point offset map, an angle map and a parameter map are generated respectively.
Further, a plurality of dynamic convolution kernel parameters are generated for each anchor line start point by a loop instance module.
Further, in the training phase, the weighted track detection loss and the railway target detection loss are taken as the total loss of the multi-task network for simultaneously processing the track detection task and the railway target detection task.
Further, the size of the feature map is not changed within each segment in the convolutional neural network.
Further, the railway target is a stabled car and/or a coupler.
Further, the size of the starting point heat map, the size of the starting point offset map and the size of the angle map are the same as the size of the second feature map.
Further, during the training phase, the track detection penalty includes an angle prediction penalty, a start point offset prediction penalty, a start point prediction penalty, and a cross entropy penalty, wherein the cross entropy penalty is a constraint on the state output result of the cycle instance module.
The technical scheme of the invention has one or more of the following beneficial technical effects:
(1) The invention focuses on the specific requirements of a video car-taking system and constructs an encoder-decoder multi-task network architecture to realize end-to-end training. The architecture detects and processes tracks and targets in real time with a single neural network, markedly reducing the number of parameters to be trained and stored, lowering the demand on computing resources and thereby improving the running efficiency of the system.
(2) Compared with the prior art, in which the track detection and railway target detection tasks are realized by two independent networks, processing multiple tasks with a single neural network improves processing speed, allowing the system to run in real time in resource-limited environments and meeting the requirements of real-time monitoring and processing.
(3) In view of the difficulty of accurately distinguishing and tracking each track line in prior-art methods based on single anchor-line generation, the present method ensures that each rail is tracked independently and accurately, improving accuracy and robustness at complex turnouts. This adaptability is particularly critical in complex or changing track environments and effectively improves the overall performance and reliability of the system.
(4) By predicting the starting point, the starting-point offset and the angle, information about the anchor lines can be output efficiently, reducing the difficulty of the track detection task.
Furthermore, other advantageous effects that the present invention has will be mentioned in the specific embodiments.
Drawings
FIG. 1 is a schematic diagram of the overall architecture of a multitasking detection network;
FIG. 2 is a schematic diagram of a multitasking network with respect to a track detection task architecture;
FIG. 3 is a schematic view of anchor lines;
fig. 4 is a schematic diagram of a multitasking network with respect to a railway target detection task architecture.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to clearly describe the technical solution of the embodiments of the present invention, in the embodiments of the present invention, the words "first", "second", etc. are used to distinguish the same item or similar items having substantially the same function and effect. Those skilled in the art will appreciate that the words "first," "second," and the like do not limit the number and order of execution.
Fig. 1 is a schematic diagram of the overall architecture of the multi-task detection network. Unlike the prior art, the multi-task detection network of the invention comprises, for given tasks 1, 2 and 3, an encoder together with a decoder 1 that processes task 1, a decoder 2 that processes task 2, and a decoder 3 that processes task 3.
The overall architecture of the multi-task detection network is divided into an encoder and a decoder. The encoder is based on a hard parameter sharing mechanism, and the encoder shares task characteristics in the encoding stage, is a shared bottom network in the framework and is positioned behind an input layer.
Above the encoder is a tower network for a plurality of tasks, the tower network of decoders is built on the output of the shared underlying network, i.e. on the output of the encoder. Decoder 1, decoder 2 and decoder 3 are a set of independent task-specific heads, predicting the output of all tasks, such as the output for task 1, the output for task 2 and the output for task 3, directly from the same input in one processing cycle.
Illustratively, the encoder of the present invention is a backbone network consisting of a series of convolutional layers for extracting features from an input. The input referred to by the invention is the image acquired or processed according to the image acquisition equipment.
FIG. 2 is a schematic diagram of a multitasking network with respect to a track detection task architecture. The multi-task detection network of the invention performs track detection tasks on the one hand and railway target detection tasks on the other hand with respect to a track detection task architecture.
As one of the technical characteristics of the invention, the invention simultaneously completes the track detection task and the railway target detection task in one network, reduces the hardware resource requirement and improves the real-time performance of information processing.
An encoder receives as input information captured pictures of a visual information gathering device, such as a camera device.
The encoder includes a convolutional neural network. The convolutional neural network includes a plurality of segments or stages, illustratively 5 segments in total, where the feature maps within each segment are the same size; in other words, the size of the feature map is not changed within each segment of the convolutional neural network. As the depth increases, the feature maps become smaller.
The feature map of the last layer of each of several segments of the convolutional neural network is extracted and used to construct a Feature Pyramid Network (FPN). Deep layers have stronger semantic features but weaker position information; shallow layers have stronger position information but weaker semantic features. Through the feature pyramid network, feature maps of different scales are therefore fused with one another, enhancing their representational power.
The encoder also includes a feature pyramid network. The size of an extracted feature map is enlarged by interpolation-based upsampling so that the enlarged map has the same size as the next extracted feature map; in this way, semantic features of the deep network are passed to the shallow network, enhancing semantic expression at multiple scales.

Further, upsampling means inserting new pixel values between the original pixel values of a feature map by an interpolation algorithm, thereby enlarging the feature map. Illustratively, the upsampling factor is 2.
Each bottom-up feature map is passed through a 1×1 convolution and added to the corresponding upsampled feature map; fusing them yields the feature maps of the different levels, namely the second feature map P4 and the third feature map P3. By fusing feature maps, the invention gives the feature maps of different levels richer information.
In other words, the second feature map P4 of the feature pyramid network is obtained by extracting the feature map of the last layer of the second-to-last segment of the convolutional neural network, passing it through a 1×1 convolution, and adding the upsampled first feature map P5.

The feature map of the last layer of the third-to-last segment of the convolutional neural network is extracted, passed through a 1×1 convolution, and added to the upsampled second feature map P4 to obtain the third feature map P3 of the feature pyramid network.
To overcome the image distortion introduced by cropping and scaling and to handle input pictures of different sizes, the feature map of the last layer of the last segment of the convolutional neural network is extracted and passed through the multi-level max-pooling operation of a Spatial Pyramid Pooling (SPP) module, which generates a fixed-size output for feature maps of different sizes. This strengthens the feature-extraction capability of the network and yields the first feature map P5 of the feature pyramid network.
In the present invention, the feature pyramid network includes a first feature map P5, a second feature map P4, and a third feature map P3, and the sizes of the first feature map P5, the second feature map P4, and the third feature map P3 are sequentially increased.
In the present application, for specific implementation details of the spatial pyramid pooling technique, reference may be made to prior art 1: He, Kaiming, et al., "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2015): 1904-1916, which is incorporated herein by reference in its entirety.
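For illustration, a minimal PyTorch-style sketch of this encoder is given below; the module names (`SPP`, `FPNEncoder`), channel counts and pooling sizes are assumptions chosen for exposition, not values prescribed by the invention. Here `c3`, `c4` and `c5` stand for the last-layer feature maps of the third-to-last, second-to-last and last backbone segments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel stride-1 max-pools at several kernel
    sizes ('same' padding keeps the spatial size), concatenated on channels."""
    def __init__(self, channels, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in pool_sizes)
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))

class FPNEncoder(nn.Module):
    """Builds P5 (SPP on the last stage) and P4/P3 (1x1 lateral convolution
    plus the upsampled deeper map), as described for the shared encoder."""
    def __init__(self, c3_ch, c4_ch, c5_ch, out_ch=256):
        super().__init__()
        self.spp = SPP(c5_ch)
        self.lat5 = nn.Conv2d(c5_ch, out_ch, 1)
        self.lat4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat3 = nn.Conv2d(c3_ch, out_ch, 1)

    def forward(self, c3, c4, c5):
        p5 = self.lat5(self.spp(c5))                            # first map
        p4 = self.lat4(c4) + F.interpolate(p5, scale_factor=2)  # second map
        p3 = self.lat3(c3) + F.interpolate(p4, scale_factor=2)  # third map
        return p3, p4, p5
```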
The number of decoders is 3, and they perform three tasks respectively: (1) anchor line generation, (2) track detection, and (3) railway target detection. The anchor line generation task is completed by the anchor line generator; the anchor lines provide the track with prior information about its position. On this basis, the track detection head completes the track detection task.
For example, the railway target in the railway target detection task is a stabled car and/or a coupler.
An anchor line can be regarded as a line that starts from the image boundary with a certain slope, so it can be parameterized as $(x_{start}, y_{start}, \theta)$, where $(x_{start}, y_{start})$ represents the coordinates of the starting point and $\theta$ represents the angle or slope.
Further, the second feature map P4 is input into three parallel prediction branches, namely the first, second and third prediction branches, which respectively predict the starting point, the starting-point offset and the angle of the anchor lines; an anchor line is then constructed dynamically for each track instance from the predicted results.

Further, the second feature map P4 is also input into the fourth prediction branch, which predicts dynamic kernel parameters and constructs a parameter map.
In the invention, the constructed anchor line is used as reference information of the track detection head, and finally the track is positioned through the predicted horizontal offset.
In the present invention, a track $P$ can be expressed as a sequence of two-dimensional coordinate points, $P = \{(x_1, y_1), (x_2, y_2), \ldots\}$. The ordinates (y-coordinates) of the point set are preset and sampled uniformly along the vertical axis of the image. The start index of the track on the vertical axis is denoted $s$ and its length is denoted $l$.
In the present invention, the first heat map is a start point heat map, the second heat map is a start point shift map, and the third heat map is an angle map.
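By way of illustration only, the four prediction branches over P4 can be realized as light convolutional heads; the sketch below assumes channel counts and the head name `AnchorLineHeads`, which are not specified by the invention.

```python
import torch.nn as nn

class AnchorLineHeads(nn.Module):
    """Four parallel branches over the second feature map P4: start-point
    heat map (1ch), start-point offset map (2ch), angle map (1ch), and a
    parameter map that feeds dynamic-kernel generation."""
    def __init__(self, in_ch=256, param_ch=64):
        super().__init__()
        def branch(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch, out_ch, 1))
        self.start_head = branch(1)         # probability a pixel is a start point
        self.offset_head = branch(2)        # sub-stride (dx, dy) refinement
        self.angle_head = branch(1)         # per-pixel anchor-line angle
        self.param_head = branch(param_ch)  # features for dynamic kernels

    def forward(self, p4):
        return (self.start_head(p4).sigmoid(), self.offset_head(p4),
                self.angle_head(p4), self.param_head(p4))
```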
(1) For the starting point, the first prediction branch may employ two stacked convolutional layers to estimate the starting-point locations. The input of the two convolutional layers is the second feature map P4, and the output is the first heat map $\hat{H} \in \mathbb{R}^{H_4 \times W_4}$, representing the probability that each pixel belongs to a starting point, where $H_4$ and $W_4$ are the dimensions of the second feature map P4.
In the training phase, a ground-truth first heat map is generated from the starting points of the tracks, where the real label $H_{ij}$ at position $(i, j)$ of the first heat map is:

$$H_{ij} = \max_{k} \exp\left(-\frac{(i - x_k)^2 + (j - y_k)^2}{2\sigma^2}\right),$$

where $(i, j)$ is the spatial index of the ground-truth heat map with $1 \le i \le H_4$ and $1 \le j \le W_4$, $H_4$ and $W_4$ are the dimensions of the second feature map P4, $\sigma$ is a standard deviation related to the input image size, $(x_k, y_k)$ is the given starting-point coordinate of a track, and $k$ is the index number of the track.
To address the imbalance between the start-point region and the non-start-point region, a focal loss is applied to the first heat map during prediction. Let $\hat{H}_{ij}$ and $H_{ij}$ be the predicted label and the real label at position $(i, j)$ of the first heat map; the starting-point prediction loss $L_{start}$ can be defined as:

$$L_{start} = -\frac{1}{N} \sum_{i=1}^{H_4} \sum_{j=1}^{W_4} \begin{cases} (1 - \hat{H}_{ij})^{\alpha} \log \hat{H}_{ij}, & H_{ij} = 1, \\ (1 - H_{ij})^{\beta}\, \hat{H}_{ij}^{\alpha} \log (1 - \hat{H}_{ij}), & \text{otherwise}, \end{cases}$$

where $\alpha$ and $\beta$ are adjustable hyperparameters, $N$ is the number of tracks in the image, and $H_4$ and $W_4$ are the dimensions of the second feature map P4.
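A compact sketch of the heat-map rendering and this focal loss is shown below; it assumes the CornerNet-style form reconstructed above, and the function and variable names are illustrative.

```python
import torch

def render_start_heatmap(starts, h4, w4, sigma=2.0):
    """Ground-truth heat map: one Gaussian bump per track start point,
    with start coordinates already mapped to the P4 grid."""
    ii, jj = torch.meshgrid(torch.arange(h4), torch.arange(w4), indexing="ij")
    hm = torch.zeros(h4, w4)
    for (si, sj) in starts:
        g = torch.exp(-((ii - si) ** 2 + (jj - sj) ** 2) / (2 * sigma ** 2))
        hm = torch.maximum(hm, g)  # per-pixel max over all tracks
    return hm

def start_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over the start-point heat map."""
    pos = gt.eq(1.0)
    pos_loss = ((1 - pred) ** alpha * torch.log(pred + eps))[pos].sum()
    neg_loss = ((1 - gt) ** beta * pred ** alpha
                * torch.log(1 - pred + eps))[~pos].sum()
    n = pos.sum().clamp(min=1)  # normalize by the number of start points
    return -(pos_loss + neg_loss) / n
```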
(2) For the starting-point offset: a peak in the first heat map gives only a coarse position estimate of the starting point, whose accuracy is limited by the stride $d$ of the second feature map P4. The second prediction branch therefore predicts an offset map $\hat{O} \in \mathbb{R}^{2 \times H_4 \times W_4}$, where $H_4$ and $W_4$ are the dimensions of the second feature map P4.
In the training phase, the square region within distance $r$ of a ground-truth starting point is selected as the effective region, and the smooth L1 loss function $\mathrm{SmoothL1}(\cdot)$ commonly used in the field constrains the predicted offset map inside this region. The starting-point offset prediction loss $L_{offset}$ can be defined as:

$$L_{offset} = \frac{1}{N} \sum_{k=1}^{N} \sum_{|\delta_x| \le r,\, |\delta_y| \le r} \mathrm{SmoothL1}\!\left(\hat{O}_{\tilde{p}_k + (\delta_x, \delta_y)},\; \frac{p_k}{d} - \tilde{p}_k\right),$$

where $N$ is the number of tracks in the image, $r$ is half the side length of the square, $\delta_x$ and $\delta_y$ are the offsets of an effective position relative to $\tilde{p}_k$ along the x-axis and y-axis, $\hat{O}$ is the offset map, $d$ is the stride of the second feature map P4, and $\tilde{p}_k$ is the position point obtained by downsampling the starting point $p_k$.
(3) For the angle, an angle map $\hat{\Theta} \in \mathbb{R}^{H_4 \times W_4}$ is predicted, where $H_4$ and $W_4$ are the dimensions of the second feature map P4. For a given track $L$, the average slope of all lines connecting the non-starting points with the starting point of the track is taken as the ground-truth slope or angle:

$$\theta = \frac{1}{l - 1} \sum_{i = s + 1}^{e} \arctan2\!\left(y_i - y_s,\; x_i - x_s\right),$$

where $\arctan2(\cdot)$ is the four-quadrant arctangent function, the track end index is $e = s + l - 1$, $s$ is the start index, $l$ is the track length, $(x_i, y_i)$ is a point of track $L$, and $(x_s, y_s)$ is the starting point of track $L$.
In the training phase, the same effective training region as for the offset estimation is selected, and the angle prediction loss $L_{angle}$ constrains the angle prediction:

$$L_{angle} = \frac{1}{N} \sum_{k=1}^{N} \sum_{|\delta_x| \le r,\, |\delta_y| \le r} \mathrm{L1Loss}\!\left(\hat{\Theta}_{\tilde{p}_k + (\delta_x, \delta_y)},\; \theta_k\right),$$

where $\theta_k$ is the ground-truth slope or angle, $k$ is the index number of the track, $\hat{\Theta}$ is the predicted angle map, $\mathrm{L1Loss}(\cdot)$ is the L1 loss function, $r$ is half the side length of the square, $\delta_x$ and $\delta_y$ are the offsets of an effective position relative to $\tilde{p}_k$ along the x-axis and y-axis, $N$ is the number of tracks in the image, and $\tilde{p}_k$ is the position point obtained by downsampling the starting point $p_k$.
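As a small worked illustration of the reconstructed angle definition (the function name and the list-of-tuples track representation are assumptions):

```python
import math

def track_angle(points, s, l):
    """Average four-quadrant angle of the lines joining each non-start point
    (x_i, y_i) to the start point (x_s, y_s) of a track."""
    xs, ys = points[s]
    e = s + l - 1  # track end index
    angles = [math.atan2(y - ys, x - xs) for (x, y) in points[s + 1:e + 1]]
    return sum(angles) / (l - 1)
```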
Fig. 3 is a schematic view of an anchor line. Finally, the anchor line produced by the anchor line generator is:

$$\ell = \left(\Phi(\hat{x}_{start} + \hat{o}_x),\; \Phi(\hat{y}_{start} + \hat{o}_y),\; \hat{\theta}\right),$$

where $(\hat{x}_{start}, \hat{y}_{start})$ is the predicted starting point, $(\hat{o}_x, \hat{o}_y)$ is the predicted starting-point offset at that position of the offset map $\hat{O}$, $\hat{\theta}$ is the predicted angle at that position of the angle map $\hat{\Theta}$, and $\Phi(\cdot)$ maps the coordinate values to the final scale of the model output, ensuring that the predicted coordinates are correctly located in the original image or target space.
In the foregoing scheme, each start point corresponds to one track instance. At a turnout of a track, however, it is often possible for multiple rails to share the same starting point.
To cope with this special situation, the invention adopts a Recurrent Instance Module (RIM), which generates dynamic kernel parameters through a recurrent neural network for use in the conditional convolution step. This method can generate multiple groups of parameters for multiple tracks sharing the same starting point, ensuring that each rail is tracked and identified independently and accurately, thereby improving recognition accuracy and robustness in complex track scenes.
The recurrent instance module is based on a Long Short-Term Memory (LSTM) module, which receives input feature vectors and recurrently predicts state vectors and kernel parameter vectors.
In the present application, for the recurrent instance module technique, reference may be made to prior art 2: Liu, Lizhe, et al., "CondLaneNet: a Top-to-down Lane Detection Framework Based on Conditional Convolution," 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 3753-3762, which is incorporated herein by reference in its entirety.
For each starting point predicted by the anchor line generator, a RIM is applied. In the inference stage, the RIM repeatedly predicts the kernel parameter vectors of the tracks bound to the same starting point until the state output indicates a stop; in this way, multiple dynamic convolution kernel parameters are generated for each anchor-line starting point from the starting point and the feature map around it.
The dynamic convolution kernel parameters may be regarded as sub-proposals of a starting point, with each sub-proposal corresponding to a candidate instance. Thus, each starting point may guide shape prediction for multiple track instances.
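A minimal sketch of such a recurrent instance module is shown below, assuming an LSTM cell that emits one kernel-parameter vector plus a continue/stop state per step; the layer sizes, step limit and stopping rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentInstanceModule(nn.Module):
    """Emits one dynamic-kernel parameter vector per track instance bound to
    a start point, looping until the predicted state signals 'stop'."""
    def __init__(self, feat_dim=64, hidden=128, kernel_dim=67, max_steps=4):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.kernel_fc = nn.Linear(hidden, kernel_dim)  # conditional-conv params
        self.state_fc = nn.Linear(hidden, 2)            # [continue, stop] logits
        self.max_steps = max_steps

    def forward(self, feat):
        h = feat.new_zeros(feat.size(0), self.cell.hidden_size)
        c = torch.zeros_like(h)
        kernels, states = [], []
        for _ in range(self.max_steps):
            h, c = self.cell(feat, (h, c))
            kernels.append(self.kernel_fc(h))
            states.append(self.state_fc(h))
            if not self.training and states[-1].softmax(-1)[:, 1].gt(0.5).all():
                break  # every start point in the batch signalled 'stop'
        return torch.stack(kernels, 1), torch.stack(states, 1)
```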
Illustratively, a cross-entropy loss $L_{state}$ can constrain the state output of the RIM:

$$L_{state} = -\frac{1}{N_s} \sum_{i=1}^{N_s} \left[\, y_i \log s_i + (1 - y_i) \log (1 - s_i) \,\right],$$

where $s_i$ is the output of the softmax operation for the $i$-th state, $y_i$ is the ground-truth label of the $i$-th state, and $N_s$ is the total number of state outputs in the batch.
In the track detection head, the anchor line generated by the anchor line generator is used as a reference, and finally, the track is accurately positioned.
First, in the RoI feature extraction module, region-of-interest (RoI) features of each anchor line are extracted from the third feature map P3. For a given generated anchor line $\ell = (x_{start}, y_{start}, \theta)$, $N_s$ points are sampled uniformly along the line; the feature of each sampled point is then computed with bilinear interpolation, and the features of all sampled points are concatenated into one RoI feature, where $(x_{start}, y_{start})$ are the starting-point coordinates and $\theta$ is the angle.
Then, the RIM module generates the dynamic convolution kernel parameters; with the RoI feature as input, a conditional convolution operation is performed on the RoI feature under these dynamic kernel parameters, producing more refined features that can adapt to different instances.
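The uniform-sampling step can be sketched with `torch.nn.functional.grid_sample`; the coordinate normalization and sampling range below are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def anchor_line_roi(p3, x0, y0, theta, n_pts=36):
    """Bilinearly sample n_pts features along an anchor line on P3 (batch
    size 1 assumed) and concatenate them into one RoI feature vector."""
    _, c, h, w = p3.shape
    t = torch.linspace(0, min(h, w) - 1, n_pts)  # travel distance along the line
    xs = x0 + t * math.cos(theta)                # sampled points on the line
    ys = y0 + t * math.sin(theta)
    # grid_sample expects coords in [-1, 1], grid shape (N, H_out, W_out, 2)
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    grid = grid.view(1, 1, n_pts, 2)
    feats = F.grid_sample(p3, grid, align_corners=True)  # (1, C, 1, n_pts)
    return feats.flatten(1)                              # (1, C * n_pts)
```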
Finally, in the prediction adjustment module, the RoI feature processed by the dynamic convolution kernels is used as the input of a fully connected layer, which generates the $N_{pts}$ horizontal offsets $\Delta\hat{x}_i$ between the track and the anchor line, together with the track length $l$ and the start index $s$.
For each fixed vertical-axis coordinate $y_i$, the corresponding predicted horizontal-axis coordinate $\hat{x}_i$ is:

$$\hat{x}_i = x_{start} + \frac{y_i - y_{start}}{\tan \theta} + \Delta \hat{x}_i,$$

where $\theta$ is the angle, $y_i$ is the vertical-axis coordinate, $(x_{start}, y_{start})$ are the starting-point coordinates, $\Delta \hat{x}_i$ is the $i$-th horizontal offset, and $\hat{x}_i$ is the predicted horizontal-axis coordinate. The predicted track information is obtained from the starting point, starting-point offset and angle of the anchor line together with the horizontal offsets between the track and the anchor line.
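A short decode helper following the reconstructed formula (all names and values are illustrative):

```python
import math

def decode_track(x0, y0, theta, offsets, ys):
    """Anchor-line x at each preset row plus the predicted horizontal offset."""
    return [x0 + (y - y0) / math.tan(theta) + dx
            for y, dx in zip(ys, offsets)]

# e.g. a near-vertical anchor at (320, 540), 80 degrees, small refinements:
xs = decode_track(320.0, 540.0, math.radians(80),
                  [0.5, -1.2, 0.3], [500.0, 460.0, 420.0])
```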
In the training phase, the track detection loss $L_{track}$ can be defined as:

$$L_{track} = \alpha L_{state} + \beta L_{angle} + \gamma L_{offset} + \eta L_{start},$$

where $\alpha$, $\beta$, $\gamma$ and $\eta$ are hyperparameters, $L_{state}$ is the cross-entropy loss described above, $L_{angle}$ is the angle prediction loss, $L_{offset}$ is the starting-point offset prediction loss, and $L_{start}$ is the starting-point prediction loss.
Fig. 4 is a schematic diagram of a multitasking network with respect to a railway target detection task architecture. The core of the object detection task is to identify the object in the image while precisely determining its position.
The invention adopts a target detection scheme based on a multi-scale anchor box mechanism. To extract features of different scales from the image, feature maps of different sizes are acquired by sampling, and these differently sized feature maps form an image pyramid.

Using multiple scale feature maps directly to predict target locations would incur heavy computation and memory usage. The invention therefore adopts the FPN structure: features are extracted bottom-up and the features of different levels are fused top-down, which avoids heavy computation while fully retaining the original image features.
However, the FPN structure only passes semantic information from top to bottom, and does not contain positioning information. To this end, the invention employs another feature pyramid to exclusively convey positioning information, namely a path aggregation network (Path Aggregation Network, PAN) structure. The PAN structure conveys multi-scale positioning information from bottom to top. The FPN transmits semantic features from top to bottom, and the PAN transmits positioning features from bottom to top, so that a better feature fusion effect can be obtained by combining the two features.
Illustratively, according to the feature pyramid network, the next-layer features or the lower-layer features are downsampled and fused with the previous-layer features or the higher-layer features to generate a path aggregation network. The path aggregation network comprises feature diagrams N5, N4 and N3, wherein the feature diagram N5 is positioned at the top layer, and the feature diagram N3 is positioned at the bottom layer. In other words, high-level features are generated from low-level features, which are a relative concept.
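A minimal bottom-up PAN sketch is given below; the layer names and the stride-2 convolution used for downsampling are assumptions.

```python
import torch.nn as nn

class PAN(nn.Module):
    """Bottom-up path aggregation over the FPN maps P3 (largest) .. P5
    (smallest): N3 = P3; N4 = P4 + down(N3); N5 = P5 + down(N4)."""
    def __init__(self, ch=256):
        super().__init__()
        self.down34 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.down45 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)

    def forward(self, p3, p4, p5):
        n3 = p3
        n4 = p4 + self.down34(n3)  # positioning features passed upward
        n5 = p5 + self.down45(n4)
        return n3, n4, n5
```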
Each grid cell of the multi-scale fused feature maps in the PAN is then assigned three prior anchor boxes of different aspect ratios; the target detection head predicts the position offset of the target, the height and width scaling of the prior anchor boxes, and the probability that the railway target belongs to each category.
The invention provides an anchor box matching criterion: first, the grid cell containing the centre point of the ground-truth target is determined; then the width and height ratios between the several differently sized anchor boxes generated for that cell and the ground-truth target bounding box are calculated. If the ratio is smaller than a ratio threshold, the anchor box is judged to match the ground-truth target and is marked as a positive sample; otherwise it is marked as a negative sample.
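A sketch of such a ratio-based matching rule, in the spirit of YOLOv5-style assignment, is shown below; the threshold value and function names are assumptions.

```python
def match_anchors(gt_w, gt_h, anchors, ratio_thr=4.0):
    """Mark anchors whose shape is within ratio_thr of the ground-truth box."""
    labels = []
    for aw, ah in anchors:
        rw, rh = gt_w / aw, gt_h / ah
        # worst-case ratio in either direction, over width and height
        worst = max(rw, 1 / rw, rh, 1 / rh)
        labels.append("positive" if worst < ratio_thr else "negative")
    return labels

print(match_anchors(60, 30, [(50, 25), (120, 240), (10, 10)]))
# ['positive', 'negative', 'negative']
```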
According to the anchor box matching criterion, after the predicted values output by the network are obtained, the target detection loss $L_{det}$ can be calculated as:

$$L_{det} = L_{box} + L_{obj} + L_{cls},$$

where $L_{box}$ is the bounding-box regression loss, $L_{obj}$ is the target confidence loss, and $L_{cls}$ is the classification loss.
Finally, the total loss $L_{total}$ of the multi-task network that completes the railway target detection task and the track detection task simultaneously is a weighted sum of the track detection loss $L_{track}$ and the target detection loss $L_{det}$:

$$L_{total} = \lambda_{track} L_{track} + \lambda_{det} L_{det},$$

where $\lambda_{track}$ and $\lambda_{det}$ are two hyperparameters, the track detection task weight and the target detection task weight; $L_{track}$ is the track detection loss and $L_{det}$ is the target detection loss.
Training the multi-task network by using a back propagation algorithm and the total loss of the multi-task network, obtaining network parameters which enable the total loss of the multi-task network to be minimum or nearly minimum, and taking the network parameters as parameters of multi-task network reasoning.
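A schematic training step under this weighted total loss might look as follows; the model interface (`model.track_loss`, `model.det_loss`, a model returning two decoder outputs) is an assumption for illustration.

```python
def train_step(model, batch, optimizer, lam_track=1.0, lam_det=1.0):
    """One back-propagation step on the weighted multi-task total loss."""
    images, track_gt, det_gt = batch
    track_out, det_out = model(images)  # shared encoder, two decoders
    loss = (lam_track * model.track_loss(track_out, track_gt)
            + lam_det * model.det_loss(det_out, det_gt))
    optimizer.zero_grad()
    loss.backward()                     # back-propagation algorithm
    optimizer.step()
    return loss.item()
```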
Numerous specific details are set forth in the above description in order to provide a better illustration of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A visual information processing method in a railway shunting operation scene is characterized in that:
inputting a picture obtained by acquiring visual information into an encoder, wherein the encoder comprises a convolutional neural network and a characteristic pyramid network, the convolutional neural network comprises a plurality of segments, and the characteristic pyramid network comprising a first characteristic map, a second characteristic map and a third characteristic map is constructed by extracting a characteristic map of the last layer in a plurality of segments in the convolutional neural network;
Inputting the second feature map into the first prediction branch, the second prediction branch, the third prediction branch and the fourth prediction branch in sequence, and respectively predicting a starting point, a starting point offset, an angle and dynamic kernel parameters related to an anchor line;
generating a plurality of dynamic convolution kernel parameters for each predicted starting point by a cyclic instance module;
extracting region-of-interest features of the anchor line from the third feature map, performing a conditional convolution operation on the region-of-interest features using the plurality of dynamic convolution kernel parameters, using the result as the input of a fully connected layer to obtain the horizontal offsets between the track and the anchor line, and obtaining predicted track information according to the starting point, the starting-point offset and the angle of the anchor line together with the horizontal offsets between the track and the anchor line; and
According to the feature pyramid network, downsampling the next layer of features, and fusing the next layer of features with the previous layer of features to generate a path aggregation network;
And allocating a plurality of prior anchor frames with different length-width ratios for each grid in the path aggregation network, and predicting the position offset of the railway target, the height and width scaling of the prior anchor frames and the probability that the railway target belongs to each category through the target detection head.
2. The visual information processing method in a railway shunting operation scene according to claim 1, characterized in that:
The first, second and third feature maps increase in size in sequence.
3. The visual information processing method in a railway shunting operation scene according to claim 1 or 2, characterized in that:
The method comprises the steps of obtaining a first characteristic diagram in a characteristic pyramid network by extracting a characteristic diagram of a last layer in a last first section of a convolutional neural network and performing multi-layer maximum pooling operation of a spatial pyramid pooling module;
Extracting a characteristic image of the last layer in the penultimate section of the convolutional neural network, adding the first characteristic image after 1×1 convolution, and obtaining a second characteristic image in the characteristic pyramid network;
And extracting a characteristic map of the last layer in the third last section in the convolutional neural network, and adding the up-sampled second characteristic map after 1×1 convolution to obtain a third characteristic map in the characteristic pyramid network.
4. A visual information processing method in a railway shunting operation scene according to claim 3, characterized in that:
and respectively generating a starting point heat map, a starting point offset map, an angle map and a parameter map when predicting the starting point, the starting point offset, the angle and the dynamic kernel parameters of the anchor line.
5. The visual information processing method in a railway shunting operation scene according to claim 4, characterized in that:
a plurality of dynamic convolution kernel parameters are generated for each anchor line start point by a loop instance module.
6. The visual information processing method in a railway shunting operation scene according to claim 5, characterized in that:
In the training stage, the weighted track detection loss and the railway target detection loss are taken as the total loss of the multi-task network for simultaneously processing the track detection task and the railway target detection task.
7. The visual information processing method in a railway shunting operation scene according to claim 1 or 6, characterized in that:
within each segment in the convolutional neural network, the size of the feature map is not changed.
8. The visual information processing method in a railway shunting operation scene according to claim 7, characterized in that:
The railway target is a stabled car and/or a coupler.
9. The visual information processing method in a railway shunting operation scene according to claim 4, characterized in that:
The size of the initial point heat map, the size of the initial point offset map and the size of the angle map are the same as those of the second characteristic map.
10. The visual information processing method in a railway shunting operation scene according to claim 6, characterized in that:
During the training phase, the track detection penalty includes an angle prediction penalty, a start point offset prediction penalty, a start point prediction penalty, and a cross entropy penalty, where the cross entropy penalty is a constraint on the state output results of the cycle instance module.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202411565510.3A | 2024-11-05 | 2024-11-05 | Visual information processing method in railway shunting operation scene
Publications (2)

Publication Number | Publication Date
---|---
CN119091127A | 2024-12-06
CN119091127B | 2025-04-08
Family
ID=93694396

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
---|---|---|---|---
CN202411565510.3A | Visual information processing method in railway shunting operation scene | 2024-11-05 | 2024-11-05 | Active

Country Status (1)

Country | Link
---|---
CN (1) | CN119091127B (en)
Patent Citations (5)

Publication number | Priority date | Publication date | Title
---|---|---|---
CN111914712A | 2020-07-24 | 2020-11-10 | A method and system for target detection in railway ground track scene
WO2022126377A1 | 2020-12-15 | 2022-06-23 | Traffic lane line detection method and apparatus, and terminal device and readable storage medium
EP4252148A1 | 2020-12-25 | 2023-10-04 | Lane line detection method based on deep learning, and apparatus
CN116665176A | 2023-07-21 | 2023-08-29 | Multi-task network road target detection method for vehicle automatic driving
CN117315465A | 2023-09-20 | 2023-12-29 | Method for constructing remote sensing image target detection network based on neural network
Non-Patent Citations (1)

Title
---
Sun Yunchang, "Research on Shunting Safety Risk Identification Method Based on Radar-Vision Fusion" (基于雷视融合的调车安全风险辨识方法研究), China Master's Theses Full-text Database, Engineering Science and Technology II, 15 June 2024.
Similar Documents
Publication | Title
---|---
Wang et al. | Data-driven based tiny-YOLOv3 method for front vehicle detection inducing SPP-net
Sirohi et al. | EfficientLPS: Efficient LiDAR panoptic segmentation
CN109934163B | Aerial image vehicle detection method based on scene prior and feature re-fusion
Uhrig et al. | Box2Pix: Single-shot instance segmentation by assigning pixels to object boxes
EP2265023B1 | Subject tracking device and subject tracking method
Bešić et al. | Dynamic object removal and spatio-temporal RGB-D inpainting via geometry-aware adversarial learning
Cai et al. | Guided attention network for object detection and counting on drones
CN112287906B | Template matching tracking method and system based on depth feature fusion
EP4235492A1 | A computer-implemented method, data processing apparatus and computer program for object detection
Bebeselea-Sterp et al. | A comparative study of stereovision algorithms
Rahman et al. | LVLane: Deep learning for lane detection and classification in challenging conditions
Zhang et al. | CCVO: Cascaded CNNs for fast monocular visual odometry towards the dynamic environment
Luo et al. | RGB-T tracking based on mixed attention
Zhang et al. | PMVC: Promoting multi-view consistency for 3D scene reconstruction
Zhong et al. | Online background discriminative learning for satellite video object tracking
Cheng et al. | G-Fusion: LiDAR and camera feature fusion on the ground voxel space
Wang et al. | Robust obstacle detection based on a novel disparity calculation method and G-disparity
Esfahani et al. | Towards utilizing deep uncertainty in traditional SLAM
CN119091127B | Visual information processing method in railway shunting operation scene
Farahnakian et al. | A comparative study of deep learning-based RGB-depth fusion methods for object detection
CN116958927A | Method and device for identifying short column based on BEV (bird's-eye-view) graph
CN111462177B | Multi-clue-based online multi-target tracking method and system
Tian et al. | PrFu-YOLO: A lightweight network model for UAV-assisted real-time vehicle detection towards an IoT underlayer
El Amrani Abouelassad et al. | Vehicle pose and shape estimation in UAV imagery using a CNN
Chaudhari et al. | Enhancing lane recognition in autonomous vehicles using cross-layer refinement network
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant