
CN117409077B - Chip attitude detection method based on multi-scale residual UNet segmentation

Chip attitude detection method based on multi-scale residual UNet segmentation

Info

Publication number
CN117409077B
Authority
CN
China
Prior art keywords
chip
pose
image
average
relative
Prior art date
Legal status
Active
Application number
CN202311347754.XA
Other languages
Chinese (zh)
Other versions
CN117409077A (en)
Inventor
王萍
吴静静
安聪颖
李天贺
Current Assignee
Wuxi Jiuxiao Technology Co ltd
Original Assignee
Wuxi Jiuxiao Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Wuxi Jiuxiao Technology Co ltd filed Critical Wuxi Jiuxiao Technology Co ltd
Priority to CN202311347754.XA priority Critical patent/CN117409077B/en
Publication of CN117409077A publication Critical patent/CN117409077A/en
Application granted granted Critical
Publication of CN117409077B publication Critical patent/CN117409077B/en


Classifications

    • G06T 7/74: Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06T 7/001: Industrial image inspection using an image reference approach
    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/28: Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V 10/454: Integrating biologically inspired filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/50: Extraction of image features by performing operations within image blocks or by using histograms, e.g. histogram of oriented gradients [HoG]
    • G06V 10/751: Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V 10/764: Recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Recognition using pattern recognition or machine learning, using neural networks
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30148: Industrial image inspection; Semiconductor; IC; Wafer
    • G06V 2201/06: Recognition of objects for industrial automation


Abstract

The invention discloses a chip attitude detection method based on multi-scale residual UNet segmentation, belonging to the technical field of digital image processing. The invention provides a multi-stage pose detection algorithm: for chips with large pose changes, a salient pose detection method based on an SVM classifier is provided; for chips with small pose changes, a precise positioning method based on salient feature points is provided, in which the relative distance between the chip and a tray key point is calculated to realize micro-pose detection. In addition, a lightweight multi-scale residual UNet semantic segmentation network, MR-UNet, is provided, which accurately segments chips of different scales and the tray under varying poses, and unifies the images of color-sprayed chips with complex appearance so that only the outline is retained. Test results show that the detection accuracy of the invention reaches 99.804%, with good robustness and real-time performance, meeting the requirement of mass production on the industrial site.

Description

Chip attitude detection method based on multi-scale residual UNet segmentation
Technical Field
The invention relates to a chip attitude detection method based on multi-scale residual UNet segmentation, belonging to the technical field of digital image processing.
Background
In the chip manufacturing process, chips need to be flipped onto a metal tray, and a color identification pattern is printed on their surface. To prevent chips from being crushed by the flipping device, which would add high production costs for the manufacturer, it is necessary to ensure that each chip is in the correct pose in the tray. At present, manual visual inspection is generally adopted; it is inefficient and its accuracy is easily affected by subjective factors. Research on a machine-vision-based automatic chip pose detection method to replace manual visual inspection is therefore a necessary trend.
Existing pose detection methods can be broadly divided into three categories: methods based on correspondence, on template matching, and on network regression. Correspondence-based pose detection extracts a certain class of key points from the image; when objects overlap or occlude each other, such key-point-based detection often performs poorly. Template-matching-based pose detection uses a CAD model of the object to render template images under varying distances and viewing angles, extracts template library features, and at detection time matches the template closest to the target to obtain the object's pose; it is robust to interference such as occlusion and clutter, but its accuracy drops when the surface features of the object vary in complex ways. Network-regression-based pose detection automatically extracts target features through a neural network and then detects the pose with the trained model; however, if only a neural network is used, model performance is easily affected by the ambiguity of similar target poses, leading to inaccurate detection results.
Therefore, a pose detection algorithm is needed that overcomes the above drawbacks, achieves high accuracy, and supports automatic, rapid model changeover for multi-model chips with complex surface features.
Disclosure of Invention
In order to improve the accuracy of chip pose detection, the invention provides a chip pose detection method based on multi-scale residual UNet segmentation, which comprises the following steps:
step 1: acquiring an image of a chip to be detected, and coarsely positioning the salient feature points of the chip to be detected and the key points of the tray according to the fixed coordinates;
step 2: after coarse positioning, image segmentation is carried out on the chip and the material tray by utilizing a multi-scale residual error UNet model MR-UNet;
step 3: after segmentation, the pose of the chip is accurately detected in real time by the multi-stage pose detection method, using an SVM classifier and a template matching algorithm.
Optionally, the MR-UNet model in step 2 replaces the second of the two 3×3 convolution layers in each original UNet encoder stage with a multi-scale residual convolution module MRC; before each 3×3 convolution, a pixel-filling operation with padding equal to 1 is performed around the image; a BN layer is introduced after each convolution operation to standardize the input data so that the data of each layer follows a distribution with mean 0 and variance 1;
then activating the feature map after the BN layer through a ReLU function;
finally halving the number of downsampling channels;
the multi-scale residual convolution module MRC first extracts features from the input image in parallel using four convolution kernels with receptive fields of different sizes, applying padding pixel-filling to the image;
then, the extracted features are each passed through a batch normalization operation and a ReLU activation function and fused, after which a 1×1 convolution is applied to the fused feature map to reduce the number of channels to that of the input image;
finally, the input feature map and the dimension-reduced feature map are added channel-wise, pixel by pixel, and the summed feature map is output after ReLU activation.
Optionally, the multi-stage pose detection method includes:
step 31: count the number of foreground pixels in the segmented image; if it is smaller than a preset threshold, judge the slot as the few-material pose; if it is greater than the preset threshold, continue detection with step 32;
step 32: classify the chip image to be detected with an SVM classifier, predicting the category from the classification model; if the output is 0, preliminarily judge it as the normal pose and continue detection with step 33; if the output is 1, the judgment result is the severely tilted pose;
step 33: for an image preliminarily judged as the normal pose in step 32, calculate the relative distance between the salient chip feature point and the tray key point; if the relative distance lies within the preset tolerance range, continue detection with step 34; if it exceeds the tolerance range, directly output the detection result as the offset pose;
step 34: for an image still preliminarily judged as the normal pose in step 33, obtain the minimum circumscribed rectangle of the image, calculate its rotation angle, and finally judge the chip as the rotated pose or the normal pose from the angle value.
Optionally, the calculation of the preset threshold in step 31 comprises: first selecting more than 100 segmented chip images in the few-material pose, counting the number of pixels with value equal to 255 in each image, summing these counts and dividing by the number of images to obtain the average foreground pixel count, which is used as the threshold parameter distinguishing the few-material pose from other poses.
Optionally, the training process of the SVM classifier in the step 32 includes:
firstly, the pictures of chips in the normal pose and the severely tilted pose are placed into two folders respectively; the pictures to be trained in the two folders are read and HOG feature descriptors are extracted; feature vectors of normal-pose chips are marked with class-0 labels and feature vectors of severely tilted chips with class-1 labels; finally, a classification model is output by combining the labels with the SVM classifier.
Optionally, the method based on the relative distance in step 33 includes:
firstly, precisely positioning a chip significant feature point p (x, y) and a tray key point q (x, y) by using a template matching algorithm;
then, the absolute coordinates of the chip salient feature point p(x, y) relative to the upper-left corner O1(x1, y1) of the chip region are added to the absolute coordinates of O1(x1, y1) relative to the image origin O(x, y), giving the absolute coordinates of p(x, y) relative to O(x, y); likewise, the absolute coordinates of the tray key point q(x, y) relative to the upper-left corner O2(x2, y2) of the tray region are added to the absolute coordinates of O2(x2, y2) relative to the image origin O(x, y), giving the absolute coordinates of q(x, y) relative to O(x, y);
finally, the absolute x and y coordinates of q(x, y) are subtracted from the absolute x and y coordinates of the chip salient feature point p(x, y) to obtain the relative distances x_relative and y_relative between the two;
the relative x and y distances at the same row-column position in all images are obtained by the above steps, summed, and divided by the number of images to obtain the average relative distances x_average and y_average;
because the chip can shake slightly in the clamping groove, a tolerance of n pixels is added on either side of x_average and y_average, so the tolerance ranges in the x and y directions are [x_average − n, x_average + n] and [y_average − n, y_average + n] respectively; the above steps are repeated to obtain the tolerance ranges for the remaining row-column positions in the image;
in the detection process, if both the x and y relative distances are within the tolerance ranges, the result is preliminarily judged as the normal pose; if either exceeds its tolerance range, the detection result "offset pose" is output directly.
Optionally, the step 34 specifically includes:
calculating the minimum circumscribed rectangle of the image and obtaining its rotation angle α, α ∈ (−90°, 0°);
computing the judgment angle β, β ∈ [0°, 45°];
when β > 3.5°, the chip is judged as the rotated pose; when β < 3.5°, as the normal pose.
Optionally, the tolerance n is set to 5 pixels.
A second object of the present invention is to provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement any one of the above chip pose detection methods.
The invention has the beneficial effects that:
the invention proposes the MR-UNet network, which effectively segments the tray and the chip to obtain a binary image with a clear edge contour. It further proposes a multi-stage pose detection method: few-material chips are detected by counting the foreground pixels of the segmented image, severely tilted chips are detected with a HOG+SVM classifier, offset chips are detected from relative distances, and finally normal and slightly rotated chips are distinguished using the minimum circumscribed rectangle, greatly improving detection accuracy. The effectiveness of the proposed MR-UNet model is demonstrated by a segmentation performance comparison against other deep learning networks; the feasibility and robustness of the chip pose detection method are demonstrated by an accuracy verification test of the multi-stage pose detection algorithm, which can basically meet the requirements of automated batch production of chips.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a chip.
FIG. 2 is a complex and varied chip image during production, wherein (a) chips with different characters on the surface are shown; (b) trays of different types; (c) chips exhibiting different poses.
FIG. 3 is a coarse positioning result diagram, wherein (a) shows a chip salient feature coarse positioning result diagram; (b) a map of the results of the coarse positioning of the key points of the tray is shown.
Fig. 4 is a flowchart of a chip gesture detection method based on multi-scale residual UNet segmentation of the present invention.
Fig. 5 is a diagram of the multi-scale residual UNet network model structure of the present invention.
Fig. 6 is a block diagram of a multi-scale residual convolution of the present invention.
Fig. 7 is a flowchart of the severely tilted chip detection of the present invention.
FIG. 8 is a schematic diagram of the offset chip pose detection of the present invention.
Fig. 9 is a visual representation of the segmentation effect of the present invention and other prior art methods.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
Embodiment one:
This embodiment provides a chip multi-stage pose detection method based on multi-scale residual UNet segmentation. First, a multi-scale residual convolution module is introduced into the UNet semantic segmentation model and the number of channels is halved, so that chips with changeable poses and rich appearances, as well as the complex tray background, are effectively segmented while keeping the model lightweight. Second, a multi-stage pose detection method is proposed that accurately detects the chip pose in real time from the segmented image using an SVM classifier and a template matching algorithm, with strong real-time performance and robustness.
Referring to fig. 4, the flow of the chip pose detection method of the embodiment includes the following steps:
step 1: acquiring an image of a chip to be detected, and coarsely positioning the salient feature points of the chip to be detected and the key points of the tray according to the fixed coordinates;
step 2: after coarse positioning, image segmentation is carried out on the chip and the material tray by utilizing a multi-scale residual error UNet model MR-UNet;
step 3: after segmentation, the pose of the chip is accurately detected in real time by the multi-stage pose detection method, using an SVM classifier and a template matching algorithm.
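To make the step-3 flow concrete, the following is a minimal Python sketch of the four-stage decision cascade detailed below in embodiment two; the threshold value and the three helper callables are assumptions supplied by calibration and training, and the returned pose labels are illustrative:

```python
from typing import Callable
import numpy as np

def multi_stage_pose(mask: np.ndarray,
                     few_material_threshold: float,
                     svm_predict: Callable[[np.ndarray], int],
                     in_tolerance: Callable[[np.ndarray], bool],
                     rotation_angle: Callable[[np.ndarray], float],
                     angle_limit: float = 3.5) -> str:
    """Decide the chip pose from a segmented binary mask (foreground = 255)."""
    # Stage 1: few-material detection by foreground pixel count.
    if np.count_nonzero(mask == 255) < few_material_threshold:
        return "few-material pose"
    # Stage 2: HOG + SVM classifier flags severely tilted chips (label 1).
    if svm_predict(mask) == 1:
        return "severely tilted pose"
    # Stage 3: relative distance between chip feature point and tray key point.
    if not in_tolerance(mask):
        return "offset pose"
    # Stage 4: minimum circumscribed rectangle angle separates rotation from normal.
    if rotation_angle(mask) > angle_limit:
        return "rotated pose"
    return "normal pose"
```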
Embodiment two:
This embodiment provides a chip multi-stage pose detection method based on multi-scale residual UNet segmentation. First, a multi-scale residual convolution module is introduced into the UNet semantic segmentation model and the number of channels is halved, so that chips with changeable poses and rich appearances, as well as the complex tray background, are effectively segmented while keeping the model lightweight. Second, a multi-stage pose detection method is proposed that accurately detects the chip pose in real time from the segmented image using an SVM classifier and a template matching algorithm, with strong real-time performance and robustness.
Referring to fig. 4, the flow of the chip pose detection method of the embodiment includes the following steps:
step 1: coarse positioning is carried out on the remarkable characteristic points of the chip to be detected and the key points of the tray according to the fixed coordinates;
step 2: dividing the roughly positioned chip from the material tray based on the MR-UNet of the multi-scale residual UNet division;
step 3: after segmentation, the pose of the chip is accurately detected in real time by the multi-stage pose detection method, using an SVM classifier and a template matching algorithm.
In step 2, a lightweight UNet model, MR-UNet, fused with a multi-scale residual convolution module (Multiscale Residual Convolution Module, MRC) is proposed.
The model structure of MR-UNet is shown in fig. 5: the second of the two 3×3 convolution layers in each encoder stage of the original UNet is replaced by the multi-scale residual convolution module MRC proposed in this embodiment, in order to enrich and purify the multi-scale feature information of the image.
Before each 3×3 convolution, a pixel-filling operation with padding equal to 1 is performed around the image, so that the image size is unchanged after each convolution and partial loss of image information is avoided. A BN layer is introduced after each convolution operation to standardize the input data so that the data of each layer follows a distribution with mean 0 and variance 1; the feature map after BN is then activated by a ReLU function. Finally, the number of downsampling channels is halved from the original [64, 128, 256, 512, 1024] to [32, 64, 128, 256, 512], keeping the model lightweight while achieving a better segmentation effect.
The structure of the multi-scale residual convolution module MRC of this embodiment is shown in fig. 6. The MRC module first extracts features from the input image in parallel using convolution kernels with receptive fields of 1×1, 3×3, 5×5 and 7×7, applying padding pixel-filling of 0, 1, 2 and 3 respectively for the four kernels.
Regarding the choice of the four convolution kernel receptive fields and paddings, kernels of other sizes may also be selected, subject to the following considerations: 1. feature map size: the chosen kernel size should match the feature map size to ensure effective feature extraction; larger kernels can capture a larger range of features but also increase the amount of computation; 2. computing resources: larger kernels raise the computational cost, so available resources must be weighed against performance requirements.
The extracted features are each passed through batch normalization and a ReLU activation function and then fused; a 1×1 convolution is then applied to the fused feature map to reduce the number of channels to that of the input image. Finally, the input feature map and the dimension-reduced feature map are added channel-wise, pixel by pixel, and the summed feature map is output after ReLU activation, as sketched below.
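As a concrete reference, a PyTorch sketch of the MRC module under the above description follows; fusing the branch outputs by channel concatenation and all layer names are assumptions, since the text does not state the fusion operator explicitly:

```python
import torch
import torch.nn as nn

class MRC(nn.Module):
    """Multi-scale residual convolution module: four parallel branches,
    fusion, 1x1 channel reduction, and a residual addition."""
    def __init__(self, channels: int):
        super().__init__()
        # Four parallel kernels 1x1/3x3/5x5/7x7 with padding 0/1/2/3,
        # each followed by BN + ReLU, so spatial size is preserved.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=k, padding=p),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for k, p in [(1, 0), (3, 1), (5, 2), (7, 3)]
        ])
        # 1x1 convolution reduces the fused channels back to the input count.
        self.reduce = nn.Conv2d(4 * channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([b(x) for b in self.branches], dim=1)  # feature fusion
        out = self.reduce(fused)           # back to the input channel count
        return self.relu(out + x)          # pixel-wise residual addition
```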
Step 3 provides a multi-stage chip pose detection method, as shown in fig. 4, which applies a different detection method to chips in each pose class, specifically:
(1) For few-material slots, the background pixels of the segmented image account for a large proportion of the whole image, whereas for chips in other poses the foreground pixels account for a large proportion. This embodiment uses this property to detect the few-material pose. First, 100 segmented few-material chip images are selected, the number of pixels with value equal to 255 in each image is counted, and the counts are summed and divided by 100 to obtain the average foreground pixel count, which is used as the threshold parameter distinguishing the few-material pose from other poses. If the foreground count is below this parameter, the few-material pose is judged; if above it, the subsequent flow continues, as sketched below.
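A minimal sketch of this threshold computation and check, assuming the segmented masks are 8-bit binary images whose foreground pixels equal 255:

```python
import cv2
import numpy as np

def few_material_threshold(mask_paths: list[str]) -> float:
    """Average foreground pixel count over >= 100 few-material sample masks."""
    counts = []
    for path in mask_paths:
        mask = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        counts.append(int(np.count_nonzero(mask == 255)))
    return float(np.mean(counts))

def is_few_material(mask: np.ndarray, threshold: float) -> bool:
    """Stage-1 check: below the average foreground count means few material."""
    return np.count_nonzero(mask == 255) < threshold
```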
(2) A severely tilted chip varies in position and rotation angle within the slot and has distinctive features, so it is detected with a classification algorithm from machine learning. Among the many classification algorithms, the combination of HOG features (Histogram of Oriented Gradients, HOG) with an SVM classifier (Support Vector Machine, SVM) is widely used because it is easy to train and suitable for small samples, so this combination is used in this embodiment to detect severely tilted chips.
In the training stage, the pictures of chips in the normal pose and the severely tilted pose are first placed into two folders respectively; the pictures to be trained are read from the two folders, HOG feature descriptors are extracted, feature vectors of normal-pose chips are marked with class-0 labels and feature vectors of severely tilted chips with class-1 labels; finally, a classification model is output by combining the labels with the SVM classifier. In the prediction stage, the trained SVM classification model and the picture to be predicted are read, HOG features are extracted, and the category is predicted from the classification model. If the output is 0, the normal pose is preliminarily judged; if the output is 1, the result "severely tilted pose" is output directly. A sketch of both stages follows.
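A sketch of both stages using scikit-image and scikit-learn; the HOG parameters and the linear kernel are illustrative assumptions, and all input images are assumed to be equally sized grayscale crops:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(images: list[np.ndarray]) -> np.ndarray:
    """Extract a HOG descriptor per grayscale image (all images same size)."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

def train_tilt_classifier(normal: list[np.ndarray],
                          tilted: list[np.ndarray]) -> SVC:
    """Train on class-0 (normal) and class-1 (severely tilted) samples."""
    X = np.vstack([hog_features(normal), hog_features(tilted)])
    y = np.concatenate([np.zeros(len(normal)), np.ones(len(tilted))])
    clf = SVC(kernel="linear")
    clf.fit(X, y)
    return clf

def predict_pose(clf: SVC, image: np.ndarray) -> int:
    """Return 0 for a preliminary normal pose, 1 for severely tilted."""
    return int(clf.predict(hog_features([image]))[0])
```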
(3) For an offset chip the features are not obvious and the SVM classification model is prone to misclassification, so this embodiment proposes a detection method based on relative distance.
Firstly, precisely positioning a chip significant feature point p (x, y) and a tray key point q (x, y) by using a template matching algorithm;
then, using the chip salient feature point p (x, y) relative to O 1 (x 1 ,y 1 ) Absolute coordinates plus O 1 (x 1 ,y 1 ) Absolute coordinates of p (x, y) relative to O (x, y) are obtained relative to absolute coordinates of the image origin O (x, y); with respect to O by the tray key q (x, y) 2 (x 2 ,y 2 ) Absolute coordinates plus O 2 (x 2 ,y 2 ) Absolute coordinates of q (x, y) relative to O (x, y) are obtained relative to absolute coordinates of the image origin O (x, y);
finally, subtracting the x and y absolute coordinate values of q (x, y) from the x and y absolute coordinate values of the chip salient feature point p (x, y) to obtain an x relative distance x between the two relative And y relative distance y relative
Obtaining the relative distances x and y of the same row and column positions of all the images by the steps, adding the relative distances x and y and dividing the relative distances x by the number of the images to obtain the average relative distance x and the average relative distance y average And y average
Because the chip can shake slightly in the clamping groove, the chip can shake slightly in x average And y average Increasing the tolerance of n pixels back and forth on the basis of (a), then the tolerance ranges in x and y directions are respectively: [ x ] average -n,x average +n]、[y average -n,y average +n]Repeating the steps to obtain the tolerance ranges of the rest row and column positions in the image;
in the detection process, if both the x and y relative distances are within the tolerance ranges, the result is preliminarily judged as the normal pose; if either exceeds its tolerance range, the detection result "offset pose" is output directly, as sketched below.
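A sketch of this stage with OpenCV template matching; the region origins O1 and O2 and the per-position averages come from calibration and are passed in as assumptions:

```python
import cv2
import numpy as np

def locate(region: np.ndarray, template: np.ndarray) -> tuple[int, int]:
    """Top-left match position of `template` inside `region` (normalized CC)."""
    result = cv2.matchTemplate(region, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(result)
    return max_loc                                     # (x, y) within the region

def relative_distance(chip_region, chip_tpl, chip_origin,
                      tray_region, tray_tpl, tray_origin):
    """x_relative, y_relative between feature point p and key point q."""
    px, py = locate(chip_region, chip_tpl)
    qx, qy = locate(tray_region, tray_tpl)
    # Convert to absolute image coordinates by adding the region origins O1, O2.
    p = (px + chip_origin[0], py + chip_origin[1])
    q = (qx + tray_origin[0], qy + tray_origin[1])
    return p[0] - q[0], p[1] - q[1]

def in_tolerance(x_rel, y_rel, x_avg, y_avg, n=5):
    """Tolerance windows [avg - n, avg + n] in both directions (n = 5 px)."""
    return abs(x_rel - x_avg) <= n and abs(y_rel - y_avg) <= n
```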
(4) Because the rotation angle of a slightly rotated chip is small, the x and y relative distances between the matched chip salient feature point and the tray key point may still lie within the tolerance ranges, so slight rotations can be missed. Therefore, this embodiment obtains the minimum circumscribed rectangle of the image, calculates its rotation angle, and judges from the angle value whether the chip is in a slightly rotated pose. From the parameters returned by RectMin, the minimum circumscribed rectangle of the original image is solved and its rotation angle α, α ∈ (−90°, 0°), is obtained; the judgment angle β, β ∈ [0°, 45°], is then computed according to formula (1), as sketched below.
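A sketch of this stage with OpenCV; since formula (1) is not reproduced in the text, the fold of α into β ∈ [0°, 45°] below is a plausible reconstruction, and the handling of the newer OpenCV angle convention is an added assumption:

```python
import cv2
import numpy as np

def judgment_angle(mask: np.ndarray) -> float:
    """Judgment angle beta of the mask's minimum circumscribed rectangle."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    points = np.vstack([c.reshape(-1, 2) for c in contours])
    alpha = cv2.minAreaRect(points)[2]   # (-90, 0] in the older OpenCV convention
    if alpha > 0:                        # OpenCV >= 4.5 reports (0, 90]; remap
        alpha -= 90.0
    return min(-alpha, 90.0 + alpha)     # fold alpha into beta in [0, 45]

def judge_rotation(mask: np.ndarray, limit: float = 3.5) -> str:
    """beta > 3.5 degrees means rotated pose, otherwise normal pose."""
    return "rotated pose" if judgment_angle(mask) > limit else "normal pose"
```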
The experimental setup and training process of this embodiment are as follows:
step 1: the training set and the test set of the semantic segmentation model are divided. The input is a coarse positioned chip image, size 384 x 288. The data set comprises elements such as different material tray backgrounds, surfaces with different characters, different poses and the like, and is divided into 3040 training sets and 760 test sets, wherein each picture corresponds to a labeled label binary image (GT). Meanwhile, a cross-validation method is used in the training process, and 15% of training images are randomly extracted as a validation set during each training process so as to conduct finer training performance supervision.
Step 2: divide the training and test sets for the SVM. For the SVM data set, 200 normal-pose and 400 severely tilted chip images segmented by the MR-UNet model are selected and input to the classifier for training to obtain the classification model.
Step 3: define the evaluation indexes. In this embodiment, the model segmentation performance is evaluated by the Dice coefficient, the intersection-over-union (IoU), the F1-Score and the Sensitivity (SE), with calculation formulas (2) to (5). The parameter count (Params) and the computational cost (FLOPs) are used as model complexity indexes.
TP indicates that a chip foreground region output by the network is a real foreground region; TN indicates that a tray background region output by the network is a real background region; FP indicates that a chip foreground region output by the network is not a real foreground region, i.e. a background region erroneously segmented as foreground; FN indicates that a tray background region output by the network is not a real background region, i.e. a foreground region erroneously segmented as background. The larger these evaluation indexes, the better the segmentation effect of the model, and together they comprehensively reflect the quality of the network model.
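Formulas (2) to (5) are not reproduced in this text; assuming they follow the standard definitions of these metrics in terms of TP, TN, FP and FN, they read:

$$\mathrm{Dice}=\frac{2\,TP}{2\,TP+FP+FN} \tag{2}$$
$$\mathrm{IoU}=\frac{TP}{TP+FP+FN} \tag{3}$$
$$F1=\frac{2PR}{P+R},\qquad P=\frac{TP}{TP+FP},\quad R=\frac{TP}{TP+FN} \tag{4}$$
$$SE=\frac{TP}{TP+FN} \tag{5}$$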
To further illustrate the image segmentation performance of the MR-UNet model proposed in this embodiment, a comparative experiment was performed against several methods from the literature: UNet, UNet++, MSRD-ANet, CDWB-ASPP-UNet (CA-UNet) and RA-UNet, all run in the same hardware environment on the same data set. The test results are shown in Table 1 and fig. 9, with the best results in bold.
As can be seen from Table 1, the proposed model is optimal on all four segmentation performance indexes, with Dice, IoU, F1-Score and SE of 0.9851, 0.9708, 0.9875 and 0.9884 respectively. Compared with UNet, UNet++, MSRD-ANet, RA-UNet and CA-UNet, MR-UNet improves the Dice coefficient by 0.66%, 4.15%, 3.09%, 7.03% and 0.37% respectively; the IoU by 1.16%, 7.23%, 5.63%, 8.7% and 0.47%; the F1-Score by 1.27%, 2.39%, 6.71%, 4.88% and 0.43%; and the SE by 1.53%, 2.27%, 6.68%, 4.74% and 0.69%. In terms of complexity, the MR-UNet model has the smallest computational cost of the compared models, 50.42G, owing to the halved channel count, but because of the parallel convolutions in the MRC module its parameter count is 7.22M and 1.76M higher than UNet++ and RA-UNet respectively. Overall, the proposed MR-UNet achieves the highest comprehensive accuracy and the best overall performance.
Table 1 Performance comparison of different segmentation methods
To verify the accuracy and robustness of the proposed multi-stage pose detection algorithm, the poses of 3100 chips segmented by MR-UNet were identified and verified; the images include different tray backgrounds, different characters and other variations. The recognition results for 1000 normal-pose, 1000 severely tilted, 500 offset, 300 slightly rotated and 300 few-material chips are shown in Table 2.
Table 2 Statistics of pose detection results
The proposed multi-stage pose detection algorithm achieves over 99% detection accuracy for chips in the normal and few-material poses; for severely tilted, offset and slightly rotated chips the accuracy reaches 98.3%, 97.6% and 95.7% respectively, and no tilted pose was erroneously detected as a normal pose, effectively preventing chips from being crushed during production.
The effectiveness of the proposed MR-UNet model is demonstrated by the segmentation performance comparison against other deep learning networks; the feasibility and robustness of the proposed chip pose detection method are demonstrated by the accuracy verification of the multi-stage pose detection algorithm, which can basically meet the requirements of automated batch production of chips.
Some steps in the embodiments of the present invention may be implemented by using software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed; any modifications, equivalents and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.

Claims (6)

1. A method for detecting a chip pose, the method comprising:
step 1: acquiring an image of a chip to be detected, and coarsely positioning the salient feature points of the chip to be detected and the key points of the tray according to the fixed coordinates;
step 2: after coarse positioning, image segmentation is carried out on the chip and the material tray by utilizing a multi-scale residual error UNet model MR-UNet;
step 3: after segmentation, the pose of the chip is accurately detected in real time by the multi-stage pose detection method, using an SVM classifier and a template matching algorithm;
the MR-UNet model in step 2 replaces the second of the two 3×3 convolution layers in each original UNet encoder stage with a multi-scale residual convolution module MRC; before each 3×3 convolution, a pixel-filling operation with padding equal to 1 is performed around the image; a BN layer is introduced after each convolution operation to standardize the input data so that the data of each layer follows a distribution with mean 0 and variance 1;
then activating the feature map after the BN layer through a ReLU function;
finally halving the number of downsampling channels;
the multi-scale residual convolution module MRC first extracts features from the input image in parallel using four convolution kernels with receptive fields of different sizes, applying padding pixel-filling to the image;
then, the extracted features are each passed through a batch normalization operation and a ReLU activation function and fused, after which a 1×1 convolution is applied to the fused feature map to reduce the number of channels to that of the input image;
finally, the input feature map and the dimension-reduced feature map are added channel-wise, pixel by pixel, and the summed feature map is output after ReLU activation;
the multi-stage pose detection method comprises the following steps:
step 31: count the number of foreground pixels in the segmented image; if it is smaller than a preset threshold, judge the slot as the few-material pose; if it is greater than the preset threshold, continue detection with step 32;
step 32: classify the chip image to be detected with an SVM classifier, predicting the category from the classification model; if the output is 0, preliminarily judge it as the normal pose and continue detection with step 33; if the output is 1, the judgment result is the severely tilted pose;
step 33: for an image preliminarily judged as the normal pose in step 32, calculate the relative distance between the salient chip feature point and the tray key point; if the relative distance lies within the preset tolerance range, continue detection with step 34; if it exceeds the tolerance range, directly output the detection result as the offset pose;
step 34: for an image still preliminarily judged as the normal pose in step 33, obtain the minimum circumscribed rectangle of the image, calculate its rotation angle, and finally judge the chip as the rotated pose or the normal pose from the angle value;
the relative distance-based method in step 33 includes:
firstly, precisely positioning a chip significant feature point p (x, y) and a tray key point q (x, y) by using a template matching algorithm;
then, the absolute coordinates of the chip salient feature point p(x, y) relative to the upper-left corner O1(x1, y1) of the chip region are added to the absolute coordinates of O1(x1, y1) relative to the image origin O(x, y), giving the absolute coordinates of p(x, y) relative to O(x, y); likewise, the absolute coordinates of the tray key point q(x, y) relative to the upper-left corner O2(x2, y2) of the tray region are added to the absolute coordinates of O2(x2, y2) relative to the image origin O(x, y), giving the absolute coordinates of q(x, y) relative to O(x, y);
finally, the absolute x and y coordinates of q(x, y) are subtracted from the absolute x and y coordinates of the chip salient feature point p(x, y) to obtain the relative distances x_relative and y_relative between the two;
the relative x and y distances at the same row-column position in all images are obtained by the above steps, summed, and divided by the number of images to obtain the average relative distances x_average and y_average;
because the chip can shake slightly in the clamping groove, a tolerance of n pixels is added on either side of x_average and y_average, so the tolerance ranges in the x and y directions are [x_average − n, x_average + n] and [y_average − n, y_average + n] respectively; the above steps are repeated to obtain the tolerance ranges for the remaining row-column positions in the image;
in the detection process, if both the x and y relative distances are within the tolerance ranges, the result is preliminarily judged as the normal pose; if either exceeds its tolerance range, the detection result "offset pose" is output directly.
2. The chip pose detection method according to claim 1, wherein the calculation of the preset threshold in step 31 comprises: first selecting more than 100 segmented chip images in the few-material pose, counting the number of pixels with value equal to 255 in each image, summing these counts and dividing by the number of images to obtain the average foreground pixel count, which is used as the threshold parameter distinguishing the few-material pose from other poses.
3. The method according to claim 1, wherein the training process of the SVM classifier in the step 32 includes:
firstly, the pictures of chips in the normal pose and the severely tilted pose are placed into two folders respectively; the pictures to be trained in the two folders are read and HOG feature descriptors are extracted; feature vectors of normal-pose chips are marked with class-0 labels and feature vectors of severely tilted chips with class-1 labels; finally, a classification model is output by combining the labels with the SVM classifier.
4. The chip pose detection method according to claim 1, wherein step 34 specifically comprises:
calculating the minimum circumscribed rectangle of the image and obtaining its rotation angle α, α ∈ (−90°, 0°);
computing the judgment angle β, β ∈ [0°, 45°];
when β > 3.5°, the chip is judged as the rotated pose; when β < 3.5°, as the normal pose.
5. The chip pose detection method of claim 1, wherein the tolerance n is set to 5 pixels.
6. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the chip pose detection method according to any one of claims 1 to 5.
CN202311347754.XA 2023-10-18 2023-10-18 Chip attitude detection method based on multi-scale residual UNet segmentation Active CN117409077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311347754.XA CN117409077B (en) 2023-10-18 2023-10-18 Chip attitude detection method based on multi-scale residual UNet segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311347754.XA CN117409077B (en) 2023-10-18 2023-10-18 Chip attitude detection method based on multi-scale residual UNet segmentation

Publications (2)

Publication Number Publication Date
CN117409077A (en) 2024-01-16
CN117409077B (en) 2024-04-05

Family

ID=89486465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311347754.XA Active CN117409077B (en) 2023-10-18 2023-10-18 Chip attitude detection method based on multi-scale residual UNet segmentation

Country Status (1)

Country Link
CN (1) CN117409077B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020025696A1 (en) * 2018-07-31 2020-02-06 Deutsches Krebsforschungszentrum Stiftung des öffentlichen Rechts Method and system for augmented imaging using multispectral information
CN111862126A (en) * 2020-07-09 2020-10-30 北京航空航天大学 Non-cooperative target relative pose estimation method based on deep learning and geometric algorithm
CN113705521A (en) * 2021-09-05 2021-11-26 吉林大学第一医院 Head pose estimation method combined with YOLO-MobilenetV3 face detection
WO2022074643A1 (en) * 2020-10-08 2022-04-14 Edgy Bees Ltd. Improving geo-registration using machine-learning based object identification
WO2022101361A2 (en) * 2019-11-12 2022-05-19 Astrazeneca Ab Automated assessment of wound tissue
CN114972968A (en) * 2022-05-19 2022-08-30 长春市大众物流装配有限责任公司 Tray identification and pose estimation method based on multiple neural networks
EP4068220A1 (en) * 2021-03-30 2022-10-05 Canon Kabushiki Kaisha Image processing device, image processing method, moving device, and storage medium
CN115170804A (en) * 2022-07-26 2022-10-11 无锡九霄科技有限公司 Surface defect detection method, device, system and medium based on deep learning
CN115187666A (en) * 2022-07-14 2022-10-14 武汉大学 Deep learning and image processing combined side-scan sonar seabed elevation detection method
WO2022221147A1 (en) * 2021-04-15 2022-10-20 Intrinsic Innovation Llc Systems and methods for six-degree of freedom pose estimation of deformable objects
CN115588043A (en) * 2022-09-23 2023-01-10 湖南省国土资源规划院 Excavator operation pose monitoring method based on vision
CN115661943A (en) * 2022-12-22 2023-01-31 电子科技大学 Fall detection method based on lightweight attitude assessment network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于深度学习的哺乳期猪只目标检测与姿态识别";俞燃;《中国优秀硕士学位论文全文数据库农业科技辑》;20220315;第D050-248页 *

Also Published As

Publication number Publication date
CN117409077A (en) 2024-01-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant