
CN119205719A - Intelligent machine vision detection method, system and storage medium based on image processing

Info

Publication number
CN119205719A
Authority
CN
China
Prior art keywords
processing
feature
features
image
carrying
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411500428.2A
Other languages
Chinese (zh)
Inventor
桂洪洋
桑万里
雷建岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Mingshi Security Prevention Engineering Co ltd
Original Assignee
Henan Mingshi Security Prevention Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Mingshi Security Prevention Engineering Co ltd
Priority to CN202411500428.2A
Publication of CN119205719A

Classifications

    • G06T7/0004 Industrial image inspection (inspection of images, e.g. flaw detection)
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06T7/11 Region-based segmentation
    • G06T7/136 Segmentation; Edge detection involving thresholding
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V10/811 Fusion of classification results where the classifiers operate on different input data, e.g. multi-modal recognition
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of image processing and discloses an intelligent machine vision detection method, system and storage medium based on image processing. The method performs candidate region generation and geometric constraint optimization on fusion features through a cascading bidirectional region suggestion network to obtain target candidate regions; performs instance segmentation on the target candidate regions through a graph convolution network and a boundary dynamic adjustment mechanism to obtain segmentation results; performs feature extraction on the segmentation results through a space-time joint attention network to obtain space-time context features; performs multi-target tracking on the space-time context features through depth association metrics and motion pattern matching to obtain track prediction results; performs abnormal pattern modeling on the track prediction results through hierarchical contrastive learning and multi-modal feature fusion to obtain abnormal features; and generates a visual detection report according to the abnormal features. The application improves the accuracy of intelligent machine vision detection based on image processing.

Description

Intelligent machine vision detection method, system and storage medium based on image processing
Technical Field
The present application relates to the field of image processing, and in particular, to an intelligent machine vision detection method, system and storage medium based on image processing.
Background
In industrial production, intelligent machine vision detection technology is widely applied in fields such as product quality inspection, industrial robot control and production process monitoring. Existing machine vision detection methods mainly comprise detection methods based on traditional image processing and detection methods based on deep learning. Traditional image processing methods realize target detection and defect identification through steps such as image preprocessing, feature extraction and pattern recognition, but adapt poorly to complex scenes and dynamic changes. Detection methods based on deep learning have strong feature learning capability, but often require large amounts of annotated data for training and place high demands on real-time performance and computing resources. In practical industrial applications, multiple links such as target tracking, anomaly detection and risk assessment must be considered simultaneously, yet existing methods often treat these tasks in isolation, lacking systematicness and continuity.
The prior art has the following defects. First, in the image preprocessing stage, a single processing method is difficult to adapt to the various interference factors in industrial scenes, such as illumination changes, noise and occlusion. Second, multi-scale and multi-modal information is not fully utilized in the feature extraction process, leaving the feature expression capability insufficient. Third, modeling of space-time context information is lacking in target detection and tracking, which limits the system's understanding of dynamic scenes. Finally, a hierarchical analysis framework is lacking in the anomaly detection and risk assessment links, making accurate anomaly localization and risk quantification difficult to achieve.
Disclosure of Invention
The application provides an intelligent machine vision detection method, an intelligent machine vision detection system and a storage medium based on image processing, which are used for improving the accuracy of intelligent machine vision detection based on image processing.
In a first aspect, the application provides an intelligent machine vision detection method based on image processing, which comprises: carrying out multi-scale decomposition and self-adaptive threshold segmentation processing on an original image to obtain a segmented image, and carrying out multi-layer cascade filtering and nonlinear weighted fusion processing on the segmented image to obtain a preprocessed image; carrying out complementary feature extraction processing on the preprocessed image through a parallel multi-branch heterogeneous neural network to obtain multi-dimensional features, and carrying out feature fusion processing on the multi-dimensional features through a hierarchical self-adaptive attention mechanism and cross-scale feature recalibration to obtain fusion features; carrying out candidate region generation and geometric constraint optimization processing on the fusion features through a cascading bidirectional region suggestion network to obtain a target candidate region, and carrying out instance segmentation processing on the target candidate region through a graph convolution network and a boundary dynamic adjustment mechanism to obtain a segmentation result; carrying out feature extraction processing on the segmentation result through a space-time joint attention network to obtain space-time context features; carrying out multi-target tracking processing on the space-time context features through depth association metrics and motion pattern matching to obtain a track prediction result; and carrying out abnormal pattern modeling processing on the track prediction result through hierarchical contrastive learning and multi-modal feature fusion to obtain abnormal features, and generating a visual detection report according to the abnormal features.
In a second aspect, the present application provides an intelligent machine vision inspection device based on image processing, which includes:
The segmentation module is used for carrying out multi-scale decomposition and self-adaptive threshold segmentation processing on the original image to obtain a segmented image, and carrying out multi-layer cascade filtering and nonlinear weighted fusion processing on the segmented image to obtain a preprocessed image;
The fusion module is used for carrying out complementary feature extraction processing on the preprocessed image through a parallel multi-branch heterogeneous neural network to obtain multi-dimensional features, and carrying out feature fusion processing on the multi-dimensional features through a hierarchical self-adaptive attention mechanism and cross-scale feature recalibration to obtain fusion features;
The optimization module is used for carrying out candidate region generation and geometric constraint optimization processing on the fusion features through a cascading bidirectional region suggestion network to obtain a target candidate region, and carrying out instance segmentation processing on the target candidate region through a graph convolution network and a boundary dynamic adjustment mechanism to obtain a segmentation result;
The extraction module is used for carrying out feature extraction processing on the segmentation result through a space-time joint attention network to obtain space-time context features;
The tracking module is used for carrying out multi-target tracking processing on the space-time context features through depth association metrics and motion pattern matching to obtain a track prediction result;
The generation module is used for carrying out abnormal pattern modeling processing on the track prediction result through hierarchical contrastive learning and multi-modal feature fusion to obtain abnormal features, and generating a visual detection report according to the abnormal features.
A third aspect of the present application provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the above-described intelligent machine vision detection method based on image processing.
In the technical scheme provided by the application, the combined processing of multi-scale decomposition and self-adaptive threshold segmentation, together with multi-layer cascade filtering and nonlinear weighted fusion, effectively improves the robustness and adaptability of image preprocessing and ensures that high-quality preprocessed images are obtained in complex industrial environments. The design of the parallel multi-branch heterogeneous neural network fully utilizes the advantages of different network structures and, together with the hierarchical self-adaptive attention mechanism and cross-scale feature recalibration, realizes complementary extraction and effective fusion of multi-dimensional features, significantly enhancing feature expression capability. The introduction of the cascading bidirectional region suggestion network improves the accuracy of target detection through candidate region generation and geometric constraint optimization, and the combination of the graph convolution network and the boundary dynamic adjustment mechanism further optimizes the accuracy of instance segmentation. Meanwhile, the application of the space-time joint attention network realizes effective modeling of time sequence information and enhances the space-time consistency of feature extraction. The multi-target tracking strategy of depth association metrics and motion pattern matching not only improves the accuracy of target tracking but also effectively handles multi-target interaction problems in complex scenes. Finally, the anomaly detection framework of hierarchical contrastive learning and multi-modal feature fusion realizes comprehensive anomaly analysis from low-level physical features to high-level semantic features, and provides intuitive and interpretable detection results through a structured visual detection report. The whole scheme constructs a complete intelligent machine vision detection flow through the organic combination of multiple innovative technical features, improving detection precision while ensuring the real-time performance and reliability of the system, and is particularly suitable for the practical application requirements of industrial sites.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained based on these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an embodiment of an intelligent machine vision detection method based on image processing in an embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of an intelligent machine vision inspection device based on image processing in an embodiment of the present application.
Detailed Description
The embodiment of the application provides an intelligent machine vision detection method, an intelligent machine vision detection system and a storage medium based on image processing. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, the specific flow of an embodiment of the present application is described below with reference to FIG. 1. An embodiment of the intelligent machine vision detection method based on image processing in the embodiment of the present application includes:
Step S101, performing multi-scale decomposition and self-adaptive threshold segmentation processing on an original image to obtain a segmented image, and performing multi-layer cascade filtering and nonlinear weighted fusion processing on the segmented image to obtain a preprocessed image;
Step S102, carrying out complementary feature extraction processing on the preprocessed image through a parallel multi-branch heterogeneous neural network to obtain multi-dimensional features, and carrying out feature fusion processing on the multi-dimensional features through a hierarchical self-adaptive attention mechanism and cross-scale feature recalibration to obtain fusion features;
Step S103, generating candidate regions and optimizing geometric constraints of the fusion features through a cascading bidirectional region suggestion network to obtain target candidate regions, and carrying out instance segmentation on the target candidate regions through a graph convolution network and a boundary dynamic adjustment mechanism to obtain segmentation results;
Step S104, carrying out feature extraction processing on the segmentation result through a space-time joint attention network to obtain space-time context features;
Step S105, performing multi-target tracking processing on the space-time context features through depth association metrics and motion pattern matching to obtain a track prediction result;
Step S106, carrying out abnormal pattern modeling processing on the track prediction result through hierarchical contrastive learning and multi-modal feature fusion to obtain abnormal features, and generating a visual detection report according to the abnormal features.
It is to be understood that the execution subject of the present application may be an intelligent machine vision detection device based on image processing, or may be a terminal or a server, which is not limited herein. The embodiment of the application is described by taking a server as the execution subject as an example.
Specifically, the intelligent machine vision detection method based on image processing first carries out multi-scale decomposition and self-adaptive threshold segmentation processing on the input original image. The multi-scale decomposition adopts a pyramid structure: the original image is scaled by scale factors of 0.5, 0.75 and 1.0 to generate three sub-images with different resolutions. For each sub-image, an entropy distribution map of the local area is calculated; the window size is set to 7×7 pixels, and the gray distribution probability at each position is counted through a sliding window to obtain the entropy map. Based on the entropy map, an iterative minimum cross entropy algorithm performs self-adaptive threshold selection: the initial threshold is set to the average gray value of the image, and iteration stops once the threshold change is smaller than a preset threshold of 0.5. The segmented image then undergoes multi-layer cascade filtering comprising three layers of wavelet decomposition and bilateral filtering; the wavelet decomposition adopts a db4 wavelet basis to obtain low-frequency approximation coefficients and high-frequency detail coefficients, and the bilateral filter simultaneously considers the spatial distance and gray value difference of pixels to realize edge-preserving denoising.

In the feature extraction stage, the parallel multi-branch heterogeneous neural network comprises three branches: the high-resolution branch adopts a VGG-16 structure with 13 convolution layers and 3 fully connected layers and an input size of 224×224 pixels; the medium-resolution branch adopts a ResNet-50 structure with 49 convolution layers and 1 fully connected layer and an input size of 112×112 pixels; and the low-resolution branch adopts a MobileNetV structure with depthwise separable convolutions and an input size of 56×56 pixels. The three branches realize feature interaction through cross-branch connections, with a feature exchange channel arranged in the middle layer of each branch. The hierarchical self-adaptive attention mechanism comprises two modules, channel attention and spatial attention: channel attention generates channel descriptors through global average pooling and global maximum pooling, while spatial attention generates a spatial weight map to guide the network to focus on important areas.

In the target detection and segmentation stage, the cascading bidirectional region suggestion network first generates feature maps through a forward feature extraction unit, comprising 5 feature layers of different scales, each provided with 3 anchor boxes of different sizes. The reverse feature extraction unit enhances the forward features, improving the semantic information of low-level features through top-down feature transfer. In the geometric constraint optimization process, a threshold limit between 0.5 and 2.0 is set for the aspect ratio of candidate boxes, and the position constraint requires that the center point of each candidate box lies within the effective area of the image.
The graph convolution network models a target region as a graph structure: nodes represent super-pixel blocks in the region, edges represent spatial relations among the super-pixel blocks, and feature propagation is carried out through 3 layers of graph convolution.
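As an illustration of this propagation step, a minimal NumPy sketch is given below; the symmetric normalization of the adjacency matrix and the ReLU nonlinearity are common design choices assumed here, not details fixed by the patent:

    import numpy as np

    def gcn_propagate(node_feats, adjacency, layer_weights):
        # node_feats: (N, D) super-pixel block features; adjacency: (N, N) 0/1
        # spatial-relation matrix; layer_weights: three weight matrices, one per
        # graph convolution layer. Each layer computes H' = ReLU(A_hat @ H @ W).
        a = adjacency + np.eye(adjacency.shape[0])             # add self-loops
        d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
        a_hat = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 A D^-1/2
        h = node_feats
        for w in layer_weights:                                # 3-layer propagation
            h = np.maximum(a_hat @ h @ w, 0.0)
        return h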
The space-time joint attention network uses a 15-frame timing window to align the features of the current frame with the historical frames through a cross-space encoder. The multi-head collaborative attention module sets 8 attention heads, each of which independently learns a different feature representation. The space-time pyramid pooling performs feature extraction on 4 spatial scales and 3 temporal scales, and the dual-stream attention fusion device processes the features of the spatial stream and the temporal stream respectively. In the multi-target tracking stage, the depth association metric first calculates a cosine similarity matrix of target features, with a similarity threshold of 0.75 set for target association. The motion pattern matching process covers four basic motion patterns: an acceleration motion pattern (speed change rate above 20%), a uniform motion pattern (speed change rate below 5%), a steering motion pattern (direction change above 30 degrees) and a stagnation motion pattern (speed close to 0). The composite motion patterns include periodic motion patterns (e.g., device reciprocation), abrupt motion patterns (e.g., an emergency stop) and alternating motion patterns (e.g., multi-object crossover).
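The four basic motion patterns reduce to threshold tests on speed and heading change; the sketch below uses the percentages and angles stated above (the stagnation epsilon and the fallback label for steps between the stated bands are assumptions):

    def classify_motion(speed, prev_speed, heading_change_deg, eps=1e-3):
        # Map one tracked step to a basic motion pattern using the thresholds
        # quoted in the text above.
        if speed < eps:
            return "stagnation"                  # speed close to 0
        if abs(heading_change_deg) > 30:
            return "steering"                    # direction change > 30 degrees
        rate = abs(speed - prev_speed) / max(prev_speed, eps)
        if rate > 0.20:
            return "acceleration"                # speed change rate > 20%
        if rate < 0.05:
            return "uniform"                     # speed change rate < 5%
        return "transitional"                    # assumed fallback between bands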
In the final abnormality detection stage, hierarchical contrastive learning extracts features at three levels: low-level motion features comprising physical quantities such as position, velocity and acceleration; middle-level behavior features comprising action sequences and track characteristics; and high-level semantic features comprising behavior intentions and scene semantics. The multi-modal feature fusion combines a temporal modality (time series data with a sampling frequency of 30 Hz), a spatial modality (spatial distribution information of a target) and a behavioral modality (action semantic information). The visual detection report of abnormal features contains six key components: anomaly type (e.g., trajectory anomaly, speed anomaly), anomaly location (image coordinates), anomaly time (timestamp), anomaly grade (grade 1-5), risk assessment (low, medium or high risk) and treatment recommendations.
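The six report components map naturally onto a small record type; a Python sketch with illustrative field names:

    from dataclasses import dataclass

    @dataclass
    class DetectionReport:
        # One entry of the visual detection report described above.
        anomaly_type: str      # e.g. "trajectory anomaly", "speed anomaly"
        location: tuple        # image coordinates (x, y)
        timestamp: float       # anomaly time in seconds
        grade: int             # anomaly grade, 1-5
        risk: str              # "low", "medium" or "high"
        suggestion: str        # treatment recommendation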
For example, a robot arm on an industrial production line performs a grabbing task, with an input image resolution of 2560×1440 pixels. Multi-scale decomposition yields three-scale images of 1280×720, 1920×1080 and 2560×1440. The local entropy calculation uses a 7×7 window to obtain an initial threshold of 138, which converges to a final threshold of 145 over 6 iterations. The signal-to-noise ratio after multi-layer cascade filtering increases from the original 18.3 dB to 26.7 dB. In the feature extraction stage, the VGG branch outputs 2048-dimensional features, the ResNet branch 1024-dimensional features, and the MobileNet branch 512-dimensional features. Target detection generates 312 initial candidate regions, of which 94 effective candidate regions are retained through geometric constraint optimization. The space-time feature extraction uses a 15-frame timing window, identifying 3 basic motion patterns of the mechanical arm (grabbing, moving and placing) and 2 composite motion patterns (cyclic operation and obstacle-avoidance adjustment). The system detects an abnormal behavior: at timestamp T = 1235.6 s, the mechanical arm undergoes a sudden speed change at coordinate (1280, 720) with an anomaly grade of 4, and the system generates a detection report containing detailed analysis and treatment suggestions.
In the embodiment of the application, the combined processing of multi-scale decomposition and self-adaptive threshold segmentation, together with multi-layer cascade filtering and nonlinear weighted fusion, effectively improves the robustness and adaptability of image preprocessing and ensures that high-quality preprocessed images are obtained in complex industrial environments. The design of the parallel multi-branch heterogeneous neural network fully utilizes the advantages of different network structures and, in cooperation with the hierarchical self-adaptive attention mechanism and cross-scale feature recalibration, realizes complementary extraction and effective fusion of multi-dimensional features, significantly enhancing feature expression capability. The introduction of the cascading bidirectional region suggestion network improves the accuracy of target detection through candidate region generation and geometric constraint optimization, and the combination of the graph convolution network and the boundary dynamic adjustment mechanism further optimizes the accuracy of instance segmentation. Meanwhile, the space-time joint attention network effectively models time sequence information and enhances the space-time consistency of feature extraction. The multi-target tracking strategy of depth association metrics and motion pattern matching not only improves the accuracy of target tracking but also effectively handles multi-target interaction problems in complex scenes. Finally, the anomaly detection framework of hierarchical contrastive learning and multi-modal feature fusion realizes comprehensive anomaly analysis from low-level physical features to high-level semantic features, providing intuitive and interpretable detection results through a structured visual detection report. Through the organic combination of multiple innovative technical features, the whole scheme constructs a complete intelligent machine vision detection flow that improves detection precision while ensuring the real-time performance and reliability of the system, and is particularly suitable for the practical application requirements of industrial sites.
In a specific embodiment, the process of executing step S101 may specifically include the following steps:
(1) Pyramid decomposition processing is carried out on the original image at scale ratios of 0.5, 0.75 and 1.0 to obtain three sub-images with different scales, and local entropy calculation processing is carried out on the three sub-images with different scales to obtain an initial threshold map;
(2) Performing threshold dynamic updating processing on the initial threshold map through an iterative minimum cross entropy algorithm to obtain an adaptive threshold map, and performing region segmentation processing on the adaptive threshold map to obtain a region segmentation image;
(3) Three-layer wavelet decomposition processing is carried out on the region segmentation image to obtain low-frequency and high-frequency component images, and noise suppression processing is carried out on the low-frequency and high-frequency component images through bilateral filtering to obtain a noise reduction image;
(4) Performing edge detection processing on the noise reduction image through a Sobel operator to obtain an edge feature image, and performing edge connection processing on the edge feature image through morphological operation to obtain an edge enhancement image;
(5) Performing feature fusion processing on the edge enhanced image and the noise reduction image to obtain a feature fusion image, and performing weight distribution processing on the feature fusion image through a Gaussian weight function to obtain a weight coefficient diagram;
(6) And carrying out weighted summation treatment on the weight coefficient graph and the characteristic fusion image to obtain a weighted fusion image, and carrying out gray level distribution adjustment treatment on the weighted fusion image through histogram equalization to obtain a preprocessed image.
Specifically, in the preprocessing stage of the intelligent machine vision detection method based on image processing, a pyramid decomposition technique is adopted to perform multi-scale processing on the original image. Pyramid decomposition is a multi-resolution analysis method that generates image sequences at different resolution levels by repeatedly downsampling the original image. The original image serves as the bottom layer of the pyramid (1.0×), and sub-images at 0.75× and 0.5× are generated upwards in turn. In operation, the original image is smoothed with a Gaussian blur kernel and then downsampled; downsampling is realized by interval sampling, i.e., selecting pixel points at regular intervals in the horizontal and vertical directions. For an original image with a resolution of 1920×1080, pyramid decomposition yields smaller sub-images of 1440×810 and 960×540. Local entropy is an important index for evaluating the information content of a local image area: a sliding window of size 7×7 is set for each sub-image, the probability distribution of pixel gray values within the window is calculated, and the entropy of the area is obtained. The sliding window starts from the upper left corner of the image and moves across the whole image with a step of 1 pixel, calculating an entropy value at each position to finally obtain an entropy map of the same size as the original image. The entropy calculation reflects the gray level distribution of the pixels: the larger the entropy, the more scattered the gray distribution of the region and the greater its information content.
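A minimal Python sketch of this step (OpenCV and NumPy; the brute-force entropy loop is written for clarity, and a production version would vectorize it):

    import cv2
    import numpy as np

    def gaussian_pyramid(img, scales=(1.0, 0.75, 0.5)):
        # Smooth with a Gaussian kernel, then resample to each scale factor.
        blurred = cv2.GaussianBlur(img, (5, 5), 0)
        return [cv2.resize(blurred, None, fx=s, fy=s) for s in scales]

    def local_entropy(gray, win=7):
        # gray: 8-bit single-channel image. Shannon entropy of the gray-value
        # distribution in each win x win sliding window.
        h, w = gray.shape
        pad = win // 2
        padded = np.pad(gray, pad, mode='reflect')
        ent = np.zeros((h, w), dtype=np.float32)
        for y in range(h):
            for x in range(w):
                patch = padded[y:y + win, x:x + win]
                hist = np.bincount(patch.ravel(), minlength=256).astype(np.float32)
                p = hist / hist.sum()
                p = p[p > 0]
                ent[y, x] = -np.sum(p * np.log2(p))
        return ent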
The iterative minimum cross entropy algorithm is a self-adaptive threshold selection method that minimizes the cross entropy between the foreground and background classes of the image by continuously adjusting the threshold. The algorithm first selects an initial threshold from the gray histogram of the image, then updates the threshold to the average of the mean gray values of the foreground and background regions under the current threshold, repeating this process until the threshold change is smaller than a preset value (e.g., 0.5) or the maximum number of iterations is reached. With the final threshold, pixels whose gray values exceed the threshold are marked as foreground and pixels below it as background, completing the region segmentation. Three-layer wavelet decomposition uses a Haar wavelet basis to decompose the image into a low-frequency approximation component (LL) and high-frequency detail components in three directions (LH, HL, HH). The low-frequency component contains the main energy and contour information of the image, while the high-frequency components contain edge and texture details. Wavelet decomposition divides the image into four subbands at each layer, yielding 10 subbands after three-layer decomposition. During noise suppression, the bilateral filter simultaneously considers the spatial distance weight and the gray-value similarity weight of pixels: the spatial weight decreases with increasing distance, and the gray-value similarity weight decreases with increasing gray difference, so noise is removed while edge features are preserved.
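The threshold-update loop is a simple fixed-point iteration; a NumPy sketch, assuming the 0.5 stopping criterion and a maximum iteration count as safeguards:

    import numpy as np

    def iterative_threshold(gray, eps=0.5, max_iter=100):
        # Start from the mean gray value; update the threshold to the average
        # of the foreground and background means until it moves less than eps.
        t = float(gray.mean())
        for _ in range(max_iter):
            fg, bg = gray[gray > t], gray[gray <= t]
            if fg.size == 0 or bg.size == 0:
                break
            t_new = 0.5 * (fg.mean() + bg.mean())
            converged = abs(t_new - t) < eps
            t = t_new
            if converged:
                break
        mask = (gray > t).astype(np.uint8) * 255   # foreground = 255
        return t, mask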
The Sobel operator is a first-order differential operator comprising convolution kernels in the horizontal and vertical directions, detecting the horizontal and vertical gradients of the image respectively. During edge detection, the gradient values in the two directions are calculated separately and the gradient magnitude is then computed as the edge intensity. Morphological operations include two basic operations, dilation and erosion, which process the image with designed structuring elements: dilation expands edges and fills edge gaps, while erosion refines edges and removes noise points. Feature fusion combines the edge-enhanced image and the noise reduction image by weighted averaging, with the weights determined by a Gaussian weight function. The Gaussian weight function calculates weight coefficients based on the spatial position and gray value of each pixel and has the characteristic of smooth transition. During weighted fusion, the pixel value at each position is multiplied by the corresponding weight coefficient and the results are summed to obtain the fused image. Finally, histogram equalization adjusts the gray distribution of the image, improving its contrast and detail expression.
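The edge detection, morphological connection, fusion and equalization steps chain together as below (an OpenCV sketch; the scalar fusion weight stands in for the per-pixel Gaussian weight map and is illustrative only):

    import cv2
    import numpy as np

    def edge_enhanced_fusion(denoised, edge_w=0.3):
        # denoised: 8-bit single-channel noise reduction image.
        gx = cv2.Sobel(denoised, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(denoised, cv2.CV_32F, 0, 1, ksize=3)
        edges = np.clip(cv2.magnitude(gx, gy), 0, 255).astype(np.uint8)
        kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
        edges = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)  # connect gaps
        fused = cv2.addWeighted(edges, edge_w, denoised, 1.0 - edge_w, 0)
        return cv2.equalizeHist(fused)             # stretch gray distribution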
For example, in a surface defect detection scene for an industrial product, an original image with a resolution of 2048×1536 is input, and pyramid decomposition yields smaller sub-images of 1536×1152 and 1024×768. The local entropy calculation adopts a 7×7 window, giving entropy values distributed between 0 and 5.7. The initial threshold of the iterative minimum cross entropy algorithm is set to 127, converging after 8 iterations to a final threshold of 143; the segmented target region accounts for 23.5% of the whole image. After three-layer wavelet decomposition, the energy ratio of the low-frequency component is 82.3%, and the energy ratios of the high-frequency components in the three directions are 7.2%, 6.8% and 3.7% respectively. The spatial-domain standard deviation of the bilateral filter is set to 3 and the range-domain standard deviation to 25. The edge intensity values obtained by Sobel detection are distributed between 0 and 255, and morphological processing uses a 3×3 cross-shaped structuring element. In the fusion stage the edge image has a weight of 0.3 and the noise reduction image a weight of 0.7, and the final histogram equalization remaps the image gray values to the full dynamic range of 0 to 255.
In a specific embodiment, the process of executing step S102 may specifically include the following steps:
(1) Respectively constructing a high-resolution branch comprising a VGG structure, a middle-resolution branch comprising a ResNet structure and a low-resolution branch comprising a MobileNet structure for the preprocessed image, performing feature extraction processing to obtain three groups of original feature images, and performing feature interaction processing to the three groups of original feature images through cross-branch connection to obtain complementary feature images;
(2) Carrying out channel weight calculation processing on the complementary feature map through a channel attention unit to obtain channel weighted features, and carrying out space weight distribution processing on the channel weighted features through a space attention unit to obtain multidimensional weighted features;
(3) Performing cross-layer feature stitching processing on the multi-dimensional weighted features through residual connection to obtain a multi-scale feature map, and performing feature resampling processing on the multi-scale feature map through pyramid pooling to obtain resampled features;
(4) Carrying out feature importance evaluation processing on resampled features through a self-adaptive feature selector to obtain a feature importance map, and carrying out feature screening processing on the feature importance map through a soft gating mechanism to obtain screened features;
(5) And performing dimension reduction transformation processing on the screening features through the multi-layer perceptrons to obtain dimension reduction features, and performing normalization processing on the dimension reduction features through feature standardization to obtain fusion features.
Specifically, in the characteristic extraction stage of the parallel multi-branch heterogeneous neural network, three deep neural network branches with different structures are adopted to process the preprocessed image. The high-resolution branch adopts a VGG structure and comprises 5 convolution blocks, each convolution block comprises 2-3 multiplied by 3 convolution layers and a maximum pooling layer, the input image size is 224 multiplied by 224 pixels, and the high-level semantic features are extracted through layer-by-layer convolution and pooling operations. The medium resolution branch adopts ResNet structure, introduces residual connection mechanism to avoid gradient vanishing problem of deep network, and comprises 4 residual blocks, each residual block comprises 3 convolution layers, and input size is 112×112 pixels. The low resolution branches adopt MobileNet structures, the calculated amount is reduced by using depth separable convolution, and the input size is 56×56 pixels. The cross-branch connection realizes the feature multiplexing through the inter-channel interaction of the feature map, and a connection channel is established between the middle layers of each branch. Specifically, after the number of channels is adjusted through 1×1 convolution, the third convolution block output feature map of the high-resolution branch is added with the second residual block output feature map of the middle-resolution branch, and the middle-resolution branch and the low-resolution branch are connected in the same way, so that feature complementation is formed.
The channel attention unit firstly generates two channel descriptors through global average pooling and global maximum pooling respectively, and sends the two descriptors into a shared multi-layer perceptron to obtain channel weights. The spatial attention unit then directs the network to focus on important areas in the image by generating an attention map of spatial dimensions. The channel attention and spatial attention results are multiplied to obtain the final attention weight. Residual connection is used for cross-layer feature stitching, shallow low-level features are directly added to deep high-level features, and feature expression is enriched. Pyramid pooling performs multi-scale pooling operation on the feature map, wherein the pooling operation comprises four scales of 1×1, 2×2, 3×3 and 6×6, and pooling results of each scale are spliced after channel number adjustment, so that fusion of multi-scale features is realized.
The adaptive feature selector assigns a weight to each channel by calculating an importance score for the feature channel. In implementation, statistics of the spatial dimensions, including the mean and variance, are calculated for the feature map of each channel, and importance scores are obtained through a fully connected mapping layer. The soft gating mechanism screens the feature channels according to the importance scores: channels with scores above the threshold retain their original values, while channels with scores below the threshold are attenuated. The multi-layer perceptron used for feature dimension reduction comprises two hidden layers, the first using a ReLU activation function and the second a linear activation function, mapping the high-dimensional features to the specified dimension. Feature standardization adopts batch normalization, adjusting the feature distribution to a standard distribution with mean 0 and variance 1.
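The channel-plus-spatial attention pairing described here follows the familiar squeeze-and-attend pattern; a minimal PyTorch sketch (the reduction ratio and the 7×7 spatial kernel are assumptions, not values taken from the patent):

    import torch
    import torch.nn as nn

    class ChannelSpatialAttention(nn.Module):
        # Channel weights from pooled descriptors via a shared MLP, then a
        # spatial weight map from channel-wise pooling, as described above.
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels))
            self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

        def forward(self, x):                        # x: (B, C, H, W)
            b, c, _, _ = x.shape
            avg = self.mlp(x.mean(dim=(2, 3)))       # global average pooling
            mx = self.mlp(x.amax(dim=(2, 3)))        # global max pooling
            x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
            s = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
            return x * torch.sigmoid(self.spatial(s))  # spatial weight map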
For example, in the quality detection of industrial products, the preprocessed input image is subjected to three branch processing, a 224×224 pixel image is input to a high-resolution branch VGG network, a 14×14×512 feature map is obtained after 5 convolution blocks, a 112×112 pixel image is input to a medium-resolution branch ResNet network, a 7×7×1024 feature map is obtained, a 56×56 pixel image is input to a low-resolution branch MobileNet, and a 4×4×256 feature map is obtained. After the cross-branch connection, the feature dimensions are adjusted to 512, 768, and 384 channels, respectively. The channel attention calculation yields weights ranging from 0.2 to 1.8, the spatial attention generating 28 x 28 attention map. Pyramid pooling performs feature resampling on four scales to obtain feature graphs of 1×1×2048, 2×2×1024, 3×3×512 and 6×6×256, and the feature vectors are spliced to obtain 3840-dimensional feature vectors. The feature selection threshold was set at 0.5, leaving about 70% of the feature channels after screening. Finally, the dimension of the features is reduced to 1024 dimensions through two layers of perceptrons, and the final fusion features are obtained through standardization.
In a specific embodiment, the process of executing step S103 may specifically include the following steps:
(1) Performing feature mapping processing on the fusion features through a forward feature extraction unit to obtain a forward feature map, and performing feature enhancement processing on the forward feature map through a reverse feature extraction unit to obtain cascading bidirectional features;
(2) Performing initial region generation processing on the cascade bidirectional features through a region suggestion generator to obtain initial candidate regions, and performing shape feature extraction processing on the initial candidate regions through a geometric feature extractor to obtain geometric features;
(3) Performing geometric constraint processing on the geometric features through a constraint rule base to obtain constraint candidate frames, and performing boundary optimization processing on the constraint candidate frames through a region optimizer to obtain target candidate regions;
(4) Node feature extraction processing is carried out on the target candidate region through a graph node constructor to obtain graph node features, and connection relation construction processing is carried out on the graph node features through a graph edge generator to obtain graph structure features;
(5) Carrying out feature propagation processing on the graph structural features through a graph convolution layer to obtain graph convolution features, and carrying out boundary position prediction processing on the graph convolution features through a boundary dynamic predictor to obtain a boundary prediction graph;
(6) And carrying out boundary fine adjustment processing on the boundary prediction graph through a dynamic adjustment unit to obtain an adjusted boundary, and carrying out region segmentation processing on the adjusted boundary through an instance segmenter to obtain a segmentation result.
Specifically, in the cascading bidirectional region suggestion network, the forward feature extraction unit firstly performs feature mapping on the fusion features, and adopts 5 feature layers with different scales for processing. Each feature layer contains a combination of 3 x 3 and 1 x 1 convolution layers, capturing local feature and global context information, respectively. The reverse feature extraction unit enhances feature expression through a top-down feature transfer path, and transfers high-level semantic information into a low-level feature map layer by layer to realize feature enhancement. The region proposal generator generates candidate regions based on the multi-scale feature map, sets 3 anchor frames with different sizes and 3 different length-width ratios on each feature layer, and performs positive and negative sample division according to a preset IoU threshold (set to 0.7). The geometric feature extractor calculates shape-related features, including geometric attributes such as area, perimeter, aspect ratio, compactness, etc., for the initial candidate region, forming feature vectors describing the candidate region shape features.
The constraint rule base defines a series of geometric constraints including the area range of the candidate frame (0.01% to 90% of the image area), the aspect ratio range (0.5 to 2.0), the boundary position constraint (the center point of the frame needs to be within the effective area of the image), etc. The region optimizer adjusts the candidate frames according to the constraint conditions, removes the candidate frames which do not meet the constraint, and fine-adjusts the positions and the sizes of the remaining candidate frames. The graph node constructor divides the target candidate region into super pixel blocks, each super pixel block is used as a node of the graph, and the characteristics of the color, texture, position and the like of the node are extracted as node characteristics. The graph edge generator constructs a connection relation based on the spatial adjacency relation and the feature similarity between the nodes, the spatial adjacency adopts 8 neighborhood connection, the feature similarity is measured through cosine distance, and the similarity threshold is set to be 0.8.
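Both halves of this paragraph are easy to make concrete; the sketch below applies the quoted constraint ranges to a candidate box and builds graph edges from an assumed precomputed 8-neighborhood adjacency matrix:

    import numpy as np

    def passes_constraints(box, img_w, img_h):
        # box = (x1, y1, x2, y2); apply the constraint rule base quoted above.
        w, h = box[2] - box[0], box[3] - box[1]
        if w <= 0 or h <= 0:
            return False
        area_ratio = w * h / float(img_w * img_h)
        cx, cy = box[0] + w / 2.0, box[1] + h / 2.0
        return (0.0001 <= area_ratio <= 0.9        # 0.01% to 90% of the image
                and 0.5 <= w / float(h) <= 2.0     # aspect ratio range
                and 0 <= cx < img_w and 0 <= cy < img_h)

    def build_graph_edges(feats, adjacency, sim_thresh=0.8):
        # feats: (N, D) node features; adjacency: (N, N) boolean 8-neighborhood
        # matrix. Keep an edge when nodes touch spatially AND their features
        # agree by cosine similarity.
        norm = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
        edges = adjacency & (norm @ norm.T >= sim_thresh)
        np.fill_diagonal(edges, False)             # no self-loops
        return edges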
The boundary position prediction uses the following formula:

P(x, y) = σ(F_gcn(x, y)·W_b + b_b)

where P(x, y) represents the probability that a boundary exists at position (x, y), F_gcn is the graph convolution feature, W_b and b_b are the prediction layer parameters, and σ is the sigmoid activation function. The adjusted boundary is then obtained as

B(x, y) = α·P(x, y) + (1-α)·E(x, y)

where B(x, y) is the final boundary position, E(x, y) is the edge feature map, and α is the balance factor.
The dynamic adjustment unit refines the boundary positions according to the boundary probability distribution of the local area, adjusting them step by step in an iterative manner. The instance segmenter takes the adjusted boundary as a constraint and adopts a region growing algorithm to segment the target region, obtaining the final segmentation result.
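The two formulas above combine into a few lines of NumPy (W_b and b_b are assumed to be learned prediction-layer parameters, and the default α here is an illustrative choice):

    import numpy as np

    def boundary_map(f_gcn, w_b, b_b, edge_map, alpha=0.5):
        # P = sigmoid(F_gcn . W_b + b_b); B = alpha*P + (1 - alpha)*E.
        logits = f_gcn @ w_b + b_b               # (H, W, D) @ (D,) -> (H, W)
        p = 1.0 / (1.0 + np.exp(-logits))        # boundary probability per pixel
        return alpha * p + (1.0 - alpha) * edge_map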
For example, in an industrial part inspection scenario, the input feature map size is 56×56×256, and forward feature extraction produces 5 scale feature maps of sizes 56×56×256, 28×28×512, 14×14×512, 7×7×256 and 4×4×256. After reverse feature enhancement, the feature dimensions remain unchanged but the feature expression is richer. The region suggestion generator generates a total of 1596 initial candidate boxes on the 5 feature layers, with 2048 candidate boxes retained after non-maximum suppression (IoU threshold 0.7). Geometric feature extraction computes the geometric features of these candidate boxes: areas ranging from 100 to 10000 pixels, aspect ratios from 0.5 to 2.0, and compactness (perimeter squared over area) from 12 to 20. After constraint-rule screening, 643 valid candidate boxes are retained. The graph structure construction divides each candidate region into 50 super-pixel nodes on average, each node providing a 128-dimensional feature vector, with an average of 180 connections among the nodes. The three graph convolution layers have output feature dimensions of 128, 256 and 512 respectively. The probability map obtained by boundary prediction has the same size as the input feature map, and after 3 rounds of iterative optimization the boundary localization reaches pixel-level accuracy. The final instance segmentation is completed within each candidate region, producing an accurate boundary for each target.
In a specific embodiment, the process of executing step S104 may specifically include the following steps:
(1) Carrying out time sequence framing treatment on the segmentation result through a time dimension slicer to obtain a time sequence feature sequence, and carrying out inter-frame association calculation treatment on the time sequence feature sequence through a dynamic self-attention unit to obtain a time sequence association matrix;
(2) Performing space-time feature coding processing on the time sequence incidence matrix through a cross space-time coder to obtain initial space-time features, and performing feature enhancement processing on the initial space-time features through a multi-head collaborative attention module to obtain enhanced space-time features;
(3) Carrying out multi-scale feature extraction processing on the enhanced space-time features through a space-time pyramid pooling unit to obtain multi-scale space-time features, and carrying out feature integration processing on the multi-scale space-time features through a double-flow attention fusion device to obtain fusion space-time features;
(4) Carrying out time sequence dependent modeling processing on the fused space-time features through a layering space-time memory network to obtain space-time memory features, and carrying out scene association processing on the space-time memory features through a context sensing module to obtain scene context features;
(5) And carrying out feature aggregation treatment on the scene context features through a dense connection aggregator to obtain aggregation features, and carrying out feature optimization treatment on the aggregation features through a space-time feature recalibration unit to obtain space-time context features.
Specifically, in the intelligent machine vision detection method based on image processing, time sequence feature extraction processes the segmentation result through a time dimension slicer: the time window length is set to 15 frames with a sliding step of 1 frame, segmenting the continuous image sequence into a number of time sequence fragments. The image features in each time sequence fragment are linearly projected to feature vectors of consistent dimension, forming the time sequence feature sequence. The dynamic self-attention unit adopts the following calculation mode:

Attention(Q, K, V) = softmax(QK^T/√d_k)·V
where Q, K and V represent the query matrix, key matrix and value matrix respectively, d_k is the feature dimension, and the softmax function normalizes the attention weights. The inter-frame association intensity matrix is obtained through this calculation. The cross space-time encoder performs space-time joint modeling on the features in the time sequence association matrix through a multi-layer encoder structure: each encoder layer comprises a self-attention layer and a feed-forward network layer, realizing the interactive fusion of time sequence information and spatial information. The multi-head collaborative attention module sets 8 attention heads, each of which independently learns a different feature representation; the outputs of the heads are then concatenated and linearly transformed to obtain the enhanced space-time features. The space-time pyramid pooling unit performs multi-scale feature extraction in both the temporal and spatial dimensions, adopting pooling windows of 1, 3 and 5 frames in the time dimension and pooling kernels of 1×1, 2×2 and 4×4 in the spatial dimension to generate feature maps of different scales. The dual-stream attention fusion device performs attention calculation on the temporal-stream and spatial-stream features respectively, then fuses the features of the two streams by weighted summation.
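The attention formula restored above is the standard scaled dot-product form; a minimal NumPy sketch over a sequence of T frame features:

    import numpy as np

    def scaled_dot_product_attention(q, k, v):
        # q, k, v: (T, d_k) matrices over T frames.
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
        d_k = q.shape[-1]
        scores = q @ k.T / np.sqrt(d_k)               # (T, T) association matrix
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v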
The hierarchical space-time memory network adopts a three-layer LSTM structure; the hidden layer dimension of each LSTM unit is 512, and time information is selectively memorized and forgotten through the gating mechanism. The context sensing module calculates the similarity between the current-time features and the historical features and aggregates them according to the similarity weights, realizing scene-level semantic understanding. The dense connection aggregator adopts a DenseNet-style dense connection structure that directly connects the features of different layers, ensuring full utilization of information. The space-time feature recalibration unit evaluates feature importance at both the channel level and the spatial level and dynamically adjusts the feature weights.
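A hedged PyTorch sketch of the three-layer LSTM memory stack is given below. Note that the source states a hidden dimension of 512 per layer here but 512, 256 and 128 in the worked example that follows; the sketch adopts the latter, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class HierarchicalMemory(nn.Module):
    """Three stacked LSTMs; each level consumes the sequence produced by
    the level below, and the gating inside nn.LSTM performs the selective
    memorization and forgetting described in the text."""
    def __init__(self, in_dim: int = 512, hidden=(512, 256, 128)):
        super().__init__()
        dims = (in_dim,) + tuple(hidden)
        self.levels = nn.ModuleList(
            nn.LSTM(dims[i], dims[i + 1], batch_first=True) for i in range(len(hidden))
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for lstm in self.levels:
            x, _ = lstm(x)
        return x[:, -1]                 # final memory state of the clip

seq = torch.randn(1, 15, 512)           # 15 frames of fused space-time features
print(HierarchicalMemory()(seq).shape)  # torch.Size([1, 128])
```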
For example, in a product quality detection scenario on an industrial production line, the input video frame rate is 30 fps, and 15-frame image sequences are processed at a time. The feature dimension of the segmentation result of each frame is 256×256×64, and a four-dimensional tensor of 15×256×256×64 is obtained after time dimension slicing. Dynamic self-attention calculation yields a 15×15 inter-frame association matrix, which reflects the degree of association between different time frames. The cross space-time encoder has a 4-layer structure with 8 attention heads per layer and a feature dimension of 512. Space-time pyramid pooling performs feature extraction on 3 time scales and 3 spatial scales, yielding a total of 9 feature representations at different scales. The weight ratio of the time stream to the space stream in the double-flow attention fusion device is 0.4:0.6. The hidden layer state dimensions of the three-layer LSTM are 512, 256 and 128 in sequence, and modeling the 15-frame sequence yields the time sequence dependency features. The final output space-time context feature dimension is 256, covering the target's time sequence change rule and spatial context information. Through this multi-level space-time feature extraction and fusion, the dynamic change characteristics of products in the production process can be accurately captured, providing a reliable feature representation for subsequent anomaly detection.
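The 9 multi-scale pooling outputs mentioned above can be illustrated as follows; interpreting the 1/3/5-frame windows and 1×1/2×2/4×4 kernels as adaptive output grids is an implementation assumption, as is the channels-first tensor layout.

```python
import torch
import torch.nn.functional as F

# (N, C, T, H, W): one clip of 15 frames with 256x256x64 per-frame features
clip = torch.randn(1, 64, 15, 256, 256)

pyramid = [
    F.adaptive_avg_pool3d(clip, (t, s, s))  # pool to a t x s x s grid
    for t in (1, 3, 5)                      # temporal scales
    for s in (1, 2, 4)                      # spatial scales
]
print(len(pyramid))          # 9 feature maps at different scales
print(pyramid[-1].shape)     # torch.Size([1, 64, 5, 4, 4])
```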
In a specific embodiment, the process of executing step S105 may specifically include the following steps:
(1) Performing feature mapping processing on the space-time context features through a depth embedding network to obtain a depth embedding vector, and performing association measurement processing on the depth embedding vector through a cosine similarity calculator to obtain a target association matrix;
(2) Performing target pairing processing on the target association matrix through a multi-target matcher to obtain a target tracking pair, and performing motion state extraction processing on the target tracking pair through a time sequence state estimator to obtain a motion state sequence;
(3) Basic motion characteristics of the motion state sequence are extracted and processed through a motion mode decomposer to obtain basic motion characteristics including an acceleration motion mode, a uniform motion mode, a steering motion mode and a stagnation motion mode, and composite motion characteristics of the basic motion characteristics are constructed and processed through a mode combiner to obtain composite motion characteristics including a periodic motion mode, a sudden change motion mode and an interactive motion mode;
(4) Performing motion pattern matching processing on the composite motion characteristics through a complementary pattern matcher to obtain a matching pattern sequence, and performing trend prediction processing on the matching pattern sequence through a motion trend analyzer to obtain motion trend characteristics;
(5) And carrying out track construction processing on the motion trend characteristics through a track generator to obtain an initial track, and carrying out track optimization processing on the initial track through a track smoother to obtain a track prediction result.
Specifically, in multi-target tracking and motion pattern analysis, the space-time context features are first non-linearly mapped through a deep embedding network. The deep embedding network consists of 4 fully connected layers with output dimensions of 1024, 512, 256 and 128 in sequence; a ReLU activation function is adopted, and batch normalization keeps the feature distribution stable. The cosine similarity calculator computes the pairwise similarity between targets based on the L2-normalized feature vectors and constructs an N×N target association matrix, where N is the number of targets detected in the current frame. The multi-target matcher adopts the Hungarian algorithm for target association optimization, converting the association problem into a bipartite-graph maximum-weight matching problem. A similarity threshold of 0.75 is set, and an association between targets is established only if their similarity exceeds this threshold. The time sequence state estimator tracks the motion state of each target with a Kalman filter; the state vector comprises the target's position coordinates (x, y), velocity (vx, vy) and acceleration (ax, ay), and the state estimate is updated once per frame.
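A minimal sketch of this association step, under the stated 0.75 threshold, might look as follows; the deep embedding network itself is assumed to exist upstream, and scipy's linear_sum_assignment stands in for the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_emb: np.ndarray, curr_emb: np.ndarray, thresh: float = 0.75):
    """Cosine similarity on L2-normalized embeddings, Hungarian matching,
    then gating: keep only pairs whose similarity exceeds the threshold."""
    prev = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    curr = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)
    sim = prev @ curr.T                       # N x N target association matrix
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= thresh]

prev = np.random.randn(5, 128)                # 5 targets, 128-d deep embeddings
curr = np.random.randn(5, 128)
print(associate(prev, curr))                  # list of (prev_idx, curr_idx) tracking pairs
```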
The motion pattern decomposer extracts four basic motion patterns from the state sequence: an acceleration motion pattern (acceleration greater than the 2 mm/s² threshold), a uniform motion pattern (acceleration less than the 0.5 mm/s² threshold), a steering motion pattern (direction change greater than 30 degrees) and a stagnation motion pattern (speed less than the 1 mm/s threshold). The pattern combiner identifies three composite motion patterns based on the time sequence combination rules of the basic motion patterns: a periodic motion pattern (a basic pattern repeats at fixed time intervals), a sudden change motion pattern (the basic pattern changes drastically) and an interactive motion pattern (the motion patterns of multiple targets have spatio-temporal correlations).
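The threshold-based decomposition of the basic motion patterns could be sketched as below; the unit conventions and the handling of a step that satisfies several patterns at once (the sketch reports only the first match, whereas the text pairs steering with acceleration in the example that follows) are assumptions.

```python
import numpy as np

def basic_patterns(vel: np.ndarray, acc: np.ndarray) -> list[str]:
    """vel, acc: (T, 2) per-frame velocity and acceleration; returns one
    label per transition using the thresholds quoted in the text."""
    headings = np.degrees(np.arctan2(vel[:, 1], vel[:, 0]))
    labels = []
    for t in range(1, len(vel)):
        speed = np.linalg.norm(vel[t])
        a = np.linalg.norm(acc[t])
        turn = abs((headings[t] - headings[t - 1] + 180.0) % 360.0 - 180.0)
        if speed < 1.0:                 # stagnation: speed below 1 mm/s
            labels.append("stagnation")
        elif turn > 30.0:               # steering: direction change above 30 degrees
            labels.append("steering")
        elif a > 2.0:                   # acceleration: above 2 mm/s^2
            labels.append("acceleration")
        elif a < 0.5:                   # uniform: below 0.5 mm/s^2
            labels.append("uniform")
        else:
            labels.append("transitional")  # between the two acceleration thresholds
    return labels

vel = np.array([[5.0, 0.0], [5.0, 0.0], [5.0, 5.0]])   # target 1 from the example below
acc = np.array([[0.0, 0.0], [0.1, 0.0], [2.2, 0.0]])
print(basic_patterns(vel, acc))                         # ['uniform', 'steering']
```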
The complementary pattern matcher adopts a dynamic time warping (DTW) algorithm to calculate the degree of match between the current motion sequence and the standard patterns in a preset pattern library, and selects the pattern with the highest matching degree as the motion pattern of the current state. The motion trend analyzer predicts the motion trend for the next 5 frames through an autoregressive model based on the motion sequence of the past 15 frames, considering the change rules of features such as velocity, acceleration and motion direction. The track generator generates a smooth track curve by cubic spline interpolation according to the predicted motion trend, with interpolation points selected based on the state estimates at key times. The track smoother removes noise from the track by sliding-window averaging and Gaussian filtering; the window size is set to 5 frames and the Gaussian kernel standard deviation to 1.5.
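A compact sketch of the DTW matching described above follows; the pattern library contents and the feature dimensionality are illustrative placeholders.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping with Euclidean local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

library = {"obstacle_avoidance": np.random.randn(15, 4),  # hypothetical standard patterns
           "steady_transport": np.random.randn(15, 4)}
query = np.random.randn(15, 4)                             # past-15-frame motion features
best = min(library, key=lambda k: dtw_distance(query, library[k]))
print(best)                                                # pattern with minimum warping distance
```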
For example, in an industrial robot collaboration scenario, the input space-time context feature dimension is 256, mapped to 128-dimensional feature vectors via the deep embedding network. Assuming the current frame detects 5 targets, a 5×5 target association matrix is constructed. State estimation yields the motion sequence of target 1: position (100, 100) and velocity (5, 0) at t=0, position (105, 100) and velocity (5, 0) at t=1, position (110, 100) and velocity (5, 5) at t=2, characterizing a transition from uniform linear motion to accelerated steering. Basic motion feature extraction shows that t=0 to t=1 is a uniform motion pattern (acceleration 0.1 mm/s²), while t=1 to t=2 is a steering motion pattern (direction change 45 degrees) combined with an acceleration motion pattern (acceleration 2.2 mm/s²). Pattern combination analysis identifies this as a typical sudden change motion pattern. The DTW algorithm computes a minimum warping distance of 0.82 against the preset patterns, confirming the robot obstacle-avoidance operation pattern. Motion trend analysis predicts the position sequence of the next 5 frames as (115, 105), (120, 110), (125, 115), (130, 120), (135, 125), with a corresponding velocity of (5, 5) at each frame. The track generator produces a smooth track based on the predicted points; after track smoothing, the deviation of each position does not exceed 2 pixels, giving the final track prediction result.
In a specific embodiment, the process of executing step S106 may specifically include the following steps:
(1) Carrying out hierarchical decomposition treatment on the track prediction result through a multi-level feature extractor to obtain a hierarchical feature set containing low-level motion features, middle-level behavior features and high-level semantic features, and carrying out feature enhancement treatment on the hierarchical feature set through a positive sample enhancer to obtain an enhanced positive sample;
(2) Performing feature coding processing on the enhanced positive samples and a preset abnormal sample library through a contrast feature encoder to obtain contrast feature vectors, and performing similarity calculation processing on the contrast feature vectors through a hierarchical measurement calculator to obtain a hierarchical similarity matrix;
(3) Performing time mode, space mode and behavior mode feature extraction processing on the hierarchical similarity matrix through a multi-mode feature extractor to obtain a multi-mode feature set, and performing feature alignment processing on the multi-mode feature set through a cross-mode aligner to obtain alignment features;
(4) Carrying out abnormality degree quantization processing on the alignment features through an abnormality degree calculator to obtain an abnormality degree score, and carrying out grading processing on the abnormality degree score through a multi-threshold classifier to obtain an abnormality grade;
(5) Carrying out abnormal event positioning processing on the abnormal level through a space-time positioner to obtain abnormal event description, and carrying out context analysis processing on the abnormal event description through a scene association analyzer to obtain a scene understanding result;
(6) And carrying out risk grade evaluation processing on the scene understanding result through a risk evaluator to obtain a risk evaluation result, and carrying out report template filling processing on the risk evaluation result through a report generator to obtain a visual detection report containing abnormality types, abnormality positions, abnormality times, abnormality grades, risk evaluation and processing suggestions.
The multi-level feature extractor hierarchically decomposes the track prediction result into three layers of features. The low-level motion features comprise the target's position coordinate sequence, velocity vector sequence and acceleration vector sequence, each covering 15 time points. The middle-level behavior features comprise the curvature change, speed change rate and direction change rate of the motion track, computed over a sliding window. The high-level semantic features comprise the motion pattern type, state transition sequence and target interaction relations, obtained by mapping through a semantic encoder. The positive sample enhancer applies data augmentation to normal samples, randomly rotating (±15 degrees), translating (±10 pixels) and scaling the speed (0.8-1.2 times), to expand the number of positive samples. The contrast feature encoder encodes the enhanced positive samples and the samples in the abnormal sample library into feature vectors of uniform dimension. The abnormal sample library contains various pre-labeled abnormal patterns, such as track mutation, speed abnormality and acceleration abnormality. The hierarchical measurement calculator computes inter-sample similarity at three levels: the lower layer uses Euclidean distance to measure differences in position and motion parameters, the middle layer uses the DTW algorithm to measure behavior-sequence similarity, and the upper layer uses cosine distance to measure semantic-feature similarity.
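The positive-sample augmentation quoted above (rotation ±15 degrees, translation ±10 pixels, speed scaling 0.8-1.2 times) might be sketched for a 2-D trajectory as follows; applying the speed scaling about the start point is an interpretation, and the seeded generator is illustrative.

```python
import numpy as np

def augment_trajectory(traj: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """traj: (T, 2) pixel positions of one normal sample; returns a new
    positive sample via random rotation, translation and speed scaling."""
    theta = np.radians(rng.uniform(-15.0, 15.0))           # rotation within +/- 15 degrees
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    shift = rng.uniform(-10.0, 10.0, size=2)               # translation within +/- 10 pixels
    scale = rng.uniform(0.8, 1.2)                          # speed scaling of 0.8-1.2 times
    origin = traj[0]
    return (traj - origin) @ R.T * scale + origin + shift  # scale displacements, keep start

rng = np.random.default_rng(0)
traj = np.stack([np.arange(15) * 5.0, np.full(15, 100.0)], axis=1)
print(augment_trajectory(traj, rng)[:3])
```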
The multi-modal feature extractor further analyzes the hierarchical similarity matrix, extracting features from the temporal modality (time series data at a 30 Hz sampling frequency), the spatial modality (the target's spatial distribution and movement range) and the behavioral modality (action semantics and interaction mode). The cross-modal aligner aligns and fuses the features of the different modalities through an attention mechanism, ensuring consistency of the feature representation. The abnormality degree calculator computes an anomaly score based on the aligned features, using an isolation forest algorithm to evaluate the degree of abnormality of a sample. The multi-threshold classifier divides the anomaly score into 5 classes: class 1 (score < 0.3, normal), class 2 (0.3-0.5, slight anomaly), class 3 (0.5-0.7, moderate anomaly), class 4 (0.7-0.9, severe anomaly) and class 5 (> 0.9, dangerous anomaly).
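A hedged sketch of the anomaly scoring and 5-level grading is given below; the min-max normalization of the isolation-forest output into [0, 1] is an assumption, since the source specifies only the level boundaries.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def grade(score: float) -> int:
    """Map an anomaly score in [0, 1] to levels 1 (normal) .. 5 (dangerous)."""
    for level, upper in enumerate((0.3, 0.5, 0.7, 0.9), start=1):
        if score < upper:
            return level
    return 5

aligned = np.random.randn(200, 64)                 # aligned multi-modal feature vectors
forest = IsolationForest(random_state=0).fit(aligned)
raw = -forest.score_samples(aligned)               # higher raw value = more anomalous
scores = (raw - raw.min()) / (raw.max() - raw.min() + 1e-9)  # assumed [0, 1] normalization
print([grade(s) for s in scores[:5]])
```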
The space-time locator determines the specific spatio-temporal position of an abnormal event according to the abnormality level, recording the timestamp and image coordinates of the abnormality. The scene association analyzer analyzes the cause and influence range of the abnormality in light of the abnormal event's context information. The risk evaluator assesses the risk level based on the abnormality level, influence range and duration, calculating a final risk score with a rule-based scoring mechanism: influence range score (1-5 points) × abnormality level (1-5 points) × duration weight (1-2 points). The report generator integrates all analysis results into a detection report in a standard format. The report contains six key parts: abnormality type (e.g., track abnormality, speed abnormality), abnormality position (pixel coordinates), abnormality time (timestamp), abnormality level (level 1-5), risk assessment (low, medium or high risk) and processing suggestions.
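The rule-based risk score can be illustrated directly; the low/medium/high cut-offs below are assumptions, calibrated only so that the score of 24 from the example that follows maps to high risk.

```python
def risk_score(impact: int, level: int, duration_s: float) -> tuple[float, str]:
    """impact (1-5) x anomaly level (1-5) x duration weight (1-2);
    the band cut-offs are assumptions, not from the source."""
    weight = 1.5 if duration_s > 2.0 else 1.0      # duration coefficient as in the example
    score = impact * level * weight
    band = "high" if score >= 20 else "medium" if score >= 10 else "low"
    return score, band

print(risk_score(impact=4, level=4, duration_s=2.5))   # (24.0, 'high'), matching the example
```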
For example, in an industrial robot collaboration scenario, the motion of a six-axis industrial robot performing a transport task is analyzed. Low-level feature extraction shows that the position sequence moves from workbench A (x=1200 mm, y=800 mm, z=500 mm) to workbench B (x=2000 mm, y=1500 mm, z=500 mm); the robot end speed changes from the standard running speed of 250 mm/s to 750 mm/s, and the instantaneous acceleration peak reaches 1800 mm/s², exceeding the set safe acceleration threshold of 1000 mm/s². Middle-level feature analysis finds that the attitude angle of the TCP (tool center point) changes drastically at the 8th frame of the robot's motion, deflecting more than 45 degrees from the vertical, and that the speed change rate of joints 3 and 4 reaches 180%, exceeding the normal range. The high-level semantic features identify a 'posture abnormality caused by load mutation' pattern. Comparing the feature codes with the abnormal sample library, the calculated similarity matrix shows a similarity of 0.87 to the workpiece-falling abnormality. Multi-modal analysis shows that the temporal modality detects a speed mutation at t=58.5 s, the spatial modality finds the TCP track deviating from the preset straight-line path by more than 150 mm, and the behavioral modality finds a violation of the uniform-speed stable-transport operation specification. The computed abnormality score is 0.83, determined as a level 4 abnormality. Spatio-temporal positioning places the abnormality at timestamp 58.5 s, position coordinates (1500 mm, 1200 mm, 500 mm), in the transition region between the two workbenches. Scene association analysis finds that the abnormal position lies on the crossing path of the multi-robot collaborative work area, close to the manual operation area.
The risk assessment matrix calculation, influence range score of 4 points (neighboring stations are affected) × abnormality level of 4 points (the workpiece may be damaged) × duration coefficient of 1.5 (abnormality duration > 2 s) = a total score of 24 points, determines a high risk level. The generated detection report comprises:
Abnormality type: abnormal workpiece carrying posture;
Abnormality position: TCP coordinates (1500 mm, 1200 mm, 500 mm);
Abnormality time: 2024-02-15:58:30.5;
Abnormality level: level 4 (severe abnormality);
Risk assessment: high risk (score 24 points);
Processing suggestion: immediately stop the carrying task of this work unit, inspect the end-effector grabbing mechanism, recalibrate the load parameters, and check whether the workpiece weight meets the specification.
The above describes the intelligent machine vision detection method based on image processing in the embodiment of the present application. The following describes the intelligent machine vision detection device based on image processing in the embodiment of the present application; referring to fig. 2, one embodiment of the intelligent machine vision detection device based on image processing in the embodiment of the present application includes:
the segmentation module 201 is configured to perform multi-scale decomposition and adaptive threshold segmentation on an original image to obtain a segmented image, and perform multi-layer cascade filtering and nonlinear weighted fusion on the segmented image to obtain a preprocessed image;
The fusion module 202 is configured to perform complementary feature extraction processing on the preprocessed image through a parallel multi-branch heterogeneous neural network to obtain multi-dimensional features, and perform feature fusion processing on the multi-dimensional features through a hierarchical self-adaptive attention mechanism and cross-scale feature recalibration to obtain fusion features;
The optimization module 203 is configured to perform candidate region generation and geometric constraint optimization processing on the fusion feature through a cascaded bidirectional region suggestion network to obtain a target candidate region, and perform instance segmentation processing on the target candidate region through a graph convolution network and a boundary dynamic adjustment mechanism to obtain a segmentation result;
an extracting module 204, configured to perform feature extraction processing on the segmentation result through a spatio-temporal joint attention network, so as to obtain a spatio-temporal context feature;
the tracking module 205 is configured to perform multi-target tracking processing on the spatio-temporal context feature through depth association measurement and motion pattern matching, so as to obtain a track prediction result;
The generating module 206 is configured to perform abnormal mode modeling processing on the track prediction result through hierarchical comparison learning and multi-mode feature fusion, obtain abnormal features, and generate a visual detection report according to the abnormal features.
Through the cooperative operation of the above components, the combined processing of multi-scale decomposition and adaptive threshold segmentation, together with multi-layer cascade filtering and nonlinear weighted fusion, effectively improves the robustness and adaptability of image preprocessing and ensures high-quality preprocessed images in complex industrial environments. The design of the parallel multi-branch heterogeneous neural network fully exploits the advantages of different network structures and, in combination with the hierarchical self-adaptive attention mechanism and cross-scale feature recalibration, realizes complementary extraction and effective fusion of multi-dimensional features, significantly enhancing feature expression capability. The introduction of the cascaded bidirectional region suggestion network, combining candidate region generation with geometric constraint optimization, further improves the accuracy of target detection. The application of the space-time joint attention network achieves effective modeling of time sequence information and enhances the accuracy of target tracking, while the multi-target tracking strategy matching depth association metrics with motion patterns can effectively handle multi-target interaction in complex environments. Finally, hierarchical contrast learning and multi-modal feature fusion enable reliable abnormal pattern modeling and the generation of a comprehensive visual detection report, so that the overall detection accuracy and degree of automation of the machine vision detection system are improved, making it particularly suitable for the practical application requirements of industrial sites.
The present application also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium or a volatile computer readable storage medium, in which instructions are stored; when the instructions run on a computer, they cause the computer to perform the steps of the intelligent machine vision detection method based on image processing.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The storage medium includes a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
While the application has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some of the features thereof, and that the modifications or substitutions do not depart from the spirit and scope of the embodiments of the application.

Claims (9)

1. The intelligent machine vision detection method based on image processing is characterized by comprising the following steps of:
performing multi-scale decomposition and self-adaptive threshold segmentation processing on an original image to obtain a segmented image, and performing multi-layer cascade filtering and nonlinear weighted fusion processing on the segmented image to obtain a preprocessed image;
Complementary feature extraction processing is carried out on the preprocessed image through a parallel multi-branch heterogeneous neural network to obtain multi-dimensional features, and feature fusion processing is carried out on the multi-dimensional features through a hierarchical self-adaptive attention mechanism and cross-scale feature recalibration to obtain fusion features;
generating candidate regions and optimizing geometric constraints of the fusion features through a cascading bidirectional region suggestion network to obtain target candidate regions, and carrying out instance segmentation on the target candidate regions through a graph convolution network and a boundary dynamic adjustment mechanism to obtain segmentation results;
Performing feature extraction processing on the segmentation result through a space-time joint attention network to obtain space-time context features;
performing multi-target tracking processing on the space-time context characteristics through depth correlation measurement and motion pattern matching to obtain a track prediction result;
And carrying out abnormal mode modeling processing on the track prediction result through hierarchical comparison learning and multi-mode feature fusion to obtain abnormal features, and generating a visual detection report according to the abnormal features.
2. The intelligent machine vision detection method based on image processing according to claim 1, wherein the performing multi-scale decomposition and adaptive threshold segmentation processing on an original image to obtain a segmented image, and performing multi-layer cascade filtering and nonlinear weighted fusion processing on the segmented image to obtain a preprocessed image comprises:
Pyramid decomposition processing is carried out on the original image according to scale proportions of 0.5 times, 0.75 times and 1 time to obtain three sub-images with different scales, and local entropy calculation processing is carried out on the three sub-images with different scales to obtain an initial threshold value diagram;
Performing threshold dynamic updating processing on the initial threshold map through an iterative minimum cross entropy algorithm to obtain an adaptive threshold map, and performing region segmentation processing on the adaptive threshold map to obtain a region segmentation image;
Three-layer wavelet decomposition processing is carried out on the region segmentation image to obtain low-frequency and high-frequency component images, and noise suppression processing is carried out on the low-frequency and high-frequency component images through bilateral filtering to obtain a noise reduction image;
Performing edge detection processing on the noise reduction image through a Sobel operator to obtain an edge feature image, and performing edge connection processing on the edge feature image through morphological operation to obtain an edge enhancement image;
performing feature fusion processing on the edge enhanced image and the noise reduction image to obtain a feature fusion image, and performing weight distribution processing on the feature fusion image through a Gaussian weight function to obtain a weight coefficient diagram;
And carrying out weighted summation processing on the weight coefficient graph and the characteristic fusion image to obtain a weighted fusion image, and carrying out gray level distribution adjustment processing on the weighted fusion image through histogram equalization to obtain the preprocessing image.
3. The intelligent machine vision detection method based on image processing according to claim 1, wherein the performing complementary feature extraction processing on the preprocessed image through a parallel multi-branch heterogeneous neural network to obtain multi-dimensional features, performing feature fusion processing on the multi-dimensional features through a hierarchical self-adaptive attention mechanism and cross-scale feature recalibration to obtain fusion features, and comprises:
Respectively constructing a high-resolution branch comprising a VGG structure, a middle-resolution branch comprising a ResNet structure and a low-resolution branch comprising a MobileNet structure for the preprocessed image to obtain three groups of original feature images, and performing feature interaction processing on the three groups of original feature images through cross-branch connection to obtain complementary feature images;
carrying out channel weight calculation processing on the complementary feature map through a channel attention unit to obtain channel weighted features, and carrying out space weight distribution processing on the channel weighted features through a space attention unit to obtain multidimensional weighted features;
Performing cross-layer feature stitching processing on the multi-dimensional weighted features through residual connection to obtain a multi-scale feature map, and performing feature resampling processing on the multi-scale feature map through pyramid pooling to obtain resampled features;
performing feature importance evaluation processing on the resampled features through a self-adaptive feature selector to obtain a feature importance map, and performing feature screening processing on the feature importance map through a soft gating mechanism to obtain screening features;
And performing dimension reduction transformation processing on the screening features through a multi-layer perceptron to obtain dimension reduction features, and performing normalization processing on the dimension reduction features through feature normalization to obtain the fusion features.
4. The intelligent machine vision detection method based on image processing according to claim 1, wherein the generating candidate regions and optimizing geometric constraints on the fusion features through a cascaded bidirectional region suggestion network to obtain target candidate regions, and performing instance segmentation processing on the target candidate regions through a graph convolution network and a boundary dynamic adjustment mechanism to obtain segmentation results, includes:
Performing feature mapping processing on the fusion features through a forward feature extraction unit to obtain a forward feature map, and performing feature enhancement processing on the forward feature map through a reverse feature extraction unit to obtain cascading bidirectional features;
Performing initial region generation processing on the cascading bidirectional features through a region suggestion generator to obtain initial candidate regions, and performing shape feature extraction processing on the initial candidate regions through a geometric feature extractor to obtain geometric features;
Performing geometric constraint processing on the geometric features through a constraint rule base to obtain constraint candidate frames, and performing boundary optimization processing on the constraint candidate frames through a region optimizer to obtain the target candidate regions;
node feature extraction processing is carried out on the target candidate region through a graph node constructor to obtain graph node features, and connection relation construction processing is carried out on the graph node features through a graph edge generator to obtain graph structure features;
Carrying out feature propagation processing on the graph structure features through a graph convolution layer to obtain graph convolution features, and carrying out boundary position prediction processing on the graph convolution features through a boundary dynamic predictor to obtain a boundary prediction graph;
And carrying out boundary fine adjustment processing on the boundary prediction graph through a dynamic adjustment unit to obtain an adjustment boundary, and carrying out region division processing on the adjustment boundary through an instance divider to obtain the division result.
5. The intelligent machine vision detection method based on image processing according to claim 1, wherein the feature extraction processing is performed on the segmentation result through a spatio-temporal joint attention network to obtain a spatio-temporal context feature, and the method comprises:
carrying out time sequence framing treatment on the segmentation result through a time dimension slicer to obtain a time sequence feature sequence, and carrying out inter-frame association calculation treatment on the time sequence feature sequence through a dynamic self-attention unit to obtain a time sequence association matrix;
performing space-time feature coding processing on the time sequence association matrix through a cross space-time coder to obtain initial space-time features, and performing feature enhancement processing on the initial space-time features through a multi-head collaborative attention module to obtain enhanced space-time features;
carrying out multi-scale feature extraction processing on the enhanced space-time features through a space-time pyramid pooling unit to obtain multi-scale space-time features, and carrying out feature integration processing on the multi-scale space-time features through a double-flow attention fusion device to obtain fusion space-time features;
Performing time sequence dependent modeling processing on the fused space-time features through a hierarchical space-time memory network to obtain space-time memory features, and performing scene association processing on the space-time memory features through a context sensing module to obtain scene context features;
and carrying out feature aggregation treatment on the scene context features through a dense connection aggregator to obtain aggregation features, and carrying out feature optimization treatment on the aggregation features through a space-time feature recalibration unit to obtain the space-time context features.
6. The intelligent machine vision detection method based on image processing according to claim 1, wherein the performing multi-objective tracking processing on the spatio-temporal context features through depth correlation measurement and motion pattern matching to obtain a track prediction result comprises:
Performing feature mapping processing on the space-time context features through a depth embedding network to obtain a depth embedding vector, and performing association measurement processing on the depth embedding vector through a cosine similarity calculator to obtain a target association matrix;
Performing target pairing processing on the target association matrix through a multi-target matcher to obtain a target tracking pair, and performing motion state extraction processing on the target tracking pair through a time sequence state estimator to obtain a motion state sequence;
Basic motion characteristics of the motion state sequence are extracted and processed through a motion mode decomposer to obtain basic motion characteristics including an acceleration motion mode, a uniform motion mode, a steering motion mode and a stagnation motion mode, and composite motion characteristics of the basic motion characteristics are constructed and processed through a mode combiner to obtain composite motion characteristics including a periodic motion mode, a sudden change motion mode and an interactive motion mode;
performing motion pattern matching processing on the composite motion characteristics through a complementary pattern matcher to obtain a matching pattern sequence, and performing trend prediction processing on the matching pattern sequence through a motion trend analyzer to obtain motion trend characteristics;
And carrying out track construction processing on the motion trend characteristics through a track generator to obtain an initial track, and carrying out track optimization processing on the initial track through a track smoother to obtain the track prediction result.
7. The intelligent machine vision detection method based on image processing according to claim 1, wherein the performing abnormal mode modeling processing on the track prediction result through hierarchical contrast learning and multi-mode feature fusion to obtain abnormal features, and generating a vision detection report according to the abnormal features comprises:
carrying out hierarchical decomposition treatment on the track prediction result through a multi-level feature extractor to obtain a hierarchical feature set containing low-level motion features, middle-level behavior features and high-level semantic features, and carrying out feature enhancement treatment on the hierarchical feature set through a positive sample enhancer to obtain an enhanced positive sample;
Performing feature coding processing on the enhanced positive samples and a preset abnormal sample library through a contrast feature encoder to obtain contrast feature vectors, and performing similarity calculation processing on the contrast feature vectors through a hierarchical measurement calculator to obtain a hierarchical similarity matrix;
Performing time mode, space mode and behavior mode feature extraction processing on the hierarchical similarity matrix through a multi-mode feature extractor to obtain a multi-mode feature group, and performing feature alignment processing on the multi-mode feature group through a cross-mode aligner to obtain alignment features;
performing abnormality degree quantization processing on the alignment features through an abnormality degree calculator to obtain an abnormality degree score, and performing grading processing on the abnormality degree score through a multi-threshold classifier to obtain an abnormality grade;
carrying out abnormal event positioning processing on the abnormal level through a space-time positioner to obtain abnormal event description, and carrying out context analysis processing on the abnormal event description through a scene association analyzer to obtain a scene understanding result;
And carrying out risk grade evaluation processing on the scene understanding result through a risk evaluator to obtain a risk evaluation result, and carrying out report template filling processing on the risk evaluation result through a report generator to obtain the visual detection report containing abnormality types, abnormality positions, abnormality time, abnormality grades, risk evaluation and processing suggestions.
8. An intelligent machine vision detection device based on image processing for implementing the intelligent machine vision detection method based on image processing according to any one of claims 1 to 7, characterized in that the intelligent machine vision detection device based on image processing comprises:
The segmentation module is used for carrying out multi-scale decomposition and self-adaptive threshold segmentation processing on the original image to obtain a segmented image, and carrying out multi-layer cascade filtering and nonlinear weighted fusion processing on the segmented image to obtain a preprocessed image;
The fusion module is used for carrying out complementary feature extraction processing on the preprocessed image through a parallel multi-branch heterogeneous neural network to obtain multi-dimensional features, and carrying out feature fusion processing on the multi-dimensional features through a hierarchical self-adaptive attention mechanism and cross-scale feature recalibration to obtain fusion features;
The optimization module is used for carrying out candidate region generation and geometric constraint optimization processing on the fusion features through a cascading bidirectional region suggestion network to obtain a target candidate region, and carrying out instance segmentation processing on the target candidate region through a graph convolution network and a boundary dynamic adjustment mechanism to obtain a segmentation result;
the extraction module is used for carrying out feature extraction processing on the segmentation result through a space-time joint attention network to obtain space-time context features;
the tracking module is used for carrying out multi-target tracking processing on the space-time context characteristics through depth association measurement and motion pattern matching to obtain a track prediction result;
The generation module is used for carrying out abnormal mode modeling processing on the track prediction result through hierarchical comparison learning and multi-mode feature fusion to obtain abnormal features, and generating a visual detection report according to the abnormal features.
9. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the intelligent machine vision detection method based on image processing of any one of claims 1-7.
CN202411500428.2A 2024-10-25 2024-10-25 Intelligent machine vision detection method, system and storage medium based on image processing Pending CN119205719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411500428.2A CN119205719A (en) 2024-10-25 2024-10-25 Intelligent machine vision detection method, system and storage medium based on image processing

Publications (1)

Publication Number Publication Date
CN119205719A (en) 2024-12-27

Family

ID=94058290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411500428.2A Pending CN119205719A (en) 2024-10-25 2024-10-25 Intelligent machine vision detection method, system and storage medium based on image processing

Country Status (1)

Country Link
CN (1) CN119205719A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119418174A (en) * 2025-01-07 2025-02-11 北京网藤科技有限公司 Robot alarm processing method and system based on target detection algorithm and cloud platform
CN119559187A (en) * 2025-02-07 2025-03-04 国网浙江省电力有限公司营销服务中心 Electric energy meter appearance defect detection method and system based on neural network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination