
CN106803071B - Method and device for detecting object in image - Google Patents

Method and device for detecting object in image

Info

Publication number
CN106803071B
Authority
CN
China
Prior art keywords
grid
image
central point
size
neural network
Prior art date
Legal status
Active
Application number
CN201611249792.1A
Other languages
Chinese (zh)
Other versions
CN106803071A (en)
Inventor
杨松林
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN201611249792.1A
Publication of CN106803071A
Priority to PCT/CN2017/107043 (WO2018121013A1)
Priority to EP17886017.7A (EP3545466A4)
Priority to US16/457,861 (US11113840B2)
Application granted
Publication of CN106803071B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a method and a device for detecting an object in an image, which are used for improving the real-time performance of target detection. According to the method, an image to be detected is divided into a plurality of grids according to a preset dividing mode, and the divided image is input into a convolutional neural network trained in advance. A feature vector corresponding to each grid of the image is obtained from the output of the convolutional neural network, and the maximum value of the category parameters in each feature vector is identified; when the maximum value is larger than a set threshold value, the position information of an object of the category corresponding to that category parameter is determined according to the central point position parameter and the outline dimension parameter in the feature vector. In the embodiment of the invention, the category and the position of the object in the image are determined through the pre-trained convolutional neural network, so the position and the category of the object can be detected simultaneously and no plurality of feature regions needs to be selected, which saves detection time, improves the real-time performance and efficiency of detection, and facilitates overall optimization.

Description

Method and device for detecting object in image
Technical Field
The invention relates to the technical field of machine learning, in particular to a method and a device for detecting an object in an image.
Background
With the development of video monitoring technology, intelligent video monitoring is applied in more and more scenes, such as traffic, shopping malls, hospitals, communities and parks. The application of intelligent video monitoring lays a foundation for target detection through images in these various scenes.
In the prior art, when target detection is performed on an image, a Region-based Convolutional Neural Network (R-CNN) and its extensions Fast R-CNN and Faster R-CNN are generally adopted. Fig. 1 is a schematic flow chart of object detection using R-CNN, and the detection process includes: receiving an input image, extracting candidate regions (region proposals) in the image, computing the Convolutional Neural Network (CNN) features of each candidate region, and determining the category and position of the object by classification and regression. In this process, 2000 candidate regions need to be extracted from the image, and the whole extraction takes 1-2 s; then, for each candidate region, its CNN features need to be computed, and since many candidate regions overlap, much of the CNN feature computation is repeated. The detection process further includes feature learning on the proposals, correction of the determined object position, false-alarm elimination and the like, so the whole detection process may take 2-40 s, which greatly affects the real-time performance of object detection.
In addition, in the process of detecting objects with R-CNN, the candidate regions are extracted by selective search, the CNN features are then computed by a convolutional neural network, and finally a Support Vector Machine (SVM) is used for classification to determine the position of the target. These three steps are mutually independent methods, so the whole detection process cannot be optimized as a whole.
Fig. 2 is a schematic diagram of the process of object detection using Faster RCNN, which uses a convolutional neural network in which each sliding window generates 256-dimensional data in an intermediate layer; the category of the object is detected in a classification layer (cls layer) and the position of the object is detected in a regression layer (reg layer). The detection of the object category and the object position is performed in two independent steps, each of which must separately process the 256-dimensional data, which increases the detection time and affects the real-time performance of object detection.
Disclosure of Invention
The embodiment of the invention discloses a method and a device for detecting an object in an image, which are used for improving the real-time performance of object detection and facilitating the overall optimization of object detection.
In order to achieve the above object, an embodiment of the present invention discloses a method for detecting an object in an image, which is applied to an electronic device, and the method includes:
dividing an image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size;
inputting the divided images into a convolutional neural network trained in advance, and acquiring a plurality of feature vectors of the images output by the convolutional neural network, wherein each grid corresponds to one feature vector;
and identifying, for the feature vector corresponding to each grid, the maximum value of the category parameters in the feature vector, and determining, when the maximum value is larger than a set threshold value, the position information of the object of the category corresponding to the category parameter according to the central point position parameter and the outline dimension parameter in the feature vector.
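By way of illustration only, the three steps above can be sketched as follows in Python with NumPy (the network object, the grid count S = 7, the 20 categories and the threshold value 0.4 are assumptions for illustration, not limitations of the method):

    import numpy as np

    S, NUM_CLASSES, THRESHOLD = 7, 20, 0.4   # assumed values for illustration

    def detect(image, network):
        # Step 1: the grid division is implicit -- the network outputs one
        # feature vector per grid, so no explicit cropping is needed.
        # Step 2: forward pass; assumed output shape (S, S, 25), laid out as
        # (confidence, cls1..cls20, x, y, w, h) per grid.
        features = network.forward(image)
        detections = []
        # Step 3: per-grid decision based on the maximum category parameter.
        for i in range(S):
            for j in range(S):
                vec = features[i, j]
                cls_scores = vec[1:1 + NUM_CLASSES]
                c = int(np.argmax(cls_scores))
                if cls_scores[c] > THRESHOLD:
                    x, y, w, h = vec[1 + NUM_CLASSES:]
                    detections.append((c, x, y, w, h))
        return detections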
Further, before dividing the image to be detected into a plurality of grids according to a preset dividing manner, the method further includes:
judging whether the size of the image is a target size;
and if not, adjusting the size of the image to the target size.
Further, the training process of the convolutional neural network comprises:
aiming at each sample image in the sample image set, adopting a rectangular frame to mark a target object;
dividing each sample image into a plurality of grids according to a preset dividing mode, determining a characteristic vector corresponding to each grid, wherein the size of each sample image is a target size, when the grid contains a central point of a target object, setting a value of a category parameter corresponding to the category in the characteristic vector corresponding to the grid to be a preset maximum value according to the category of the target object, determining a value of a central point position parameter in the characteristic vector according to the position of the central point in the grid, determining a value of an outline dimension parameter in the characteristic vector according to the size of a marked rectangular frame of the target object, and when the grid does not contain the central point of the target object, setting the value of each parameter in the characteristic vector corresponding to the grid to be zero;
the convolutional neural network is trained from each sample image for which a feature vector for each mesh is determined.
Further, before dividing each sample image into a plurality of grids according to a preset dividing manner, the method further includes:
judging whether the size of each sample image is a target size or not;
and if not, adjusting the size of the sample image to the target size.
Further, the training the convolutional neural network according to each sample image for which the feature vector of each mesh is determined includes:
selecting sub-sample images from the sample image set, wherein the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set;
and training the convolutional neural network by adopting each selected subsample image.
Further, the preset dividing manner includes:
dividing the image and the sample image into a plurality of grids with the same number of rows and columns; or,
dividing the image and the sample image into a plurality of grids with different numbers of rows and columns.
Further, the method further comprises:
determining the error of the convolutional neural network according to the prediction of the convolutional neural network on the position and the type of the object in the subsample image and the information of the target object marked in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:
$$
\begin{aligned}
loss ={} & \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\
& +\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_{i}(c)-\hat{P}_{i}(c)\right)^{2}
\end{aligned}
$$

wherein S is the number of rows or columns of the divided grids (the number of rows and the number of columns being the same); B is the preset number of rectangular frames predicted per grid, generally 1 or 2; $x_i$ is the abscissa of the labeled target object's center point in grid i and $\hat{x}_i$ is the abscissa of the predicted object's center point in grid i; $y_i$ is the ordinate of the labeled target object's center point in grid i and $\hat{y}_i$ is the ordinate of the predicted object's center point in grid i; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object, and $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i and $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object in grid i belongs to category c and $\hat{P}_i(c)$ is the predicted probability that the object in grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object is predicted to exist in grid i and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object is predicted to exist in grid i and 0 otherwise; wherein $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(\mathrm{Object})\cdot P_r(\mathrm{Class}_c\mid\mathrm{Object})$$

wherein $P_r(\mathrm{Object})$ is the predicted probability of whether an object currently exists in grid i, and $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ is the conditional probability that an object within the predicted grid i belongs to category c.
Further, the determining the position information of the object of the category corresponding to the category parameter according to the center point position parameter and the outline dimension parameter in the feature vector includes:
determining the position information of the central point in the grid according to the position parameter of the central point;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
Further, the determining the position information of the central point in the grid according to the position parameter of the central point includes:
using the set points of the grid as reference points; and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
The embodiment of the invention discloses a device for detecting an object in an image, which comprises:
the dividing module is used for dividing the image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size;
the detection module is used for inputting the divided images into a convolutional neural network which is trained in advance, and acquiring a plurality of feature vectors of the images output by the convolutional neural network, wherein each grid corresponds to one feature vector;
and the determining module is used for identifying the maximum value of the category parameters in the feature vector aiming at the feature vector corresponding to each grid, and determining the position information of the object of the category corresponding to the category parameters according to the central point position parameters and the outline dimension parameters in the feature vector when the maximum value is larger than a set threshold value.
Further, the apparatus further comprises:
the judging and adjusting module is used for judging whether the size of the image is a target size; and if not, adjusting the size of the image to the target size.
Further, the apparatus further comprises:
the training module is used for adopting a rectangular frame to mark a target object aiming at each sample image in the sample image set; dividing each sample image into a plurality of grids according to a preset dividing mode, determining a characteristic vector corresponding to each grid, wherein the size of each sample image is a target size, when the grid contains a central point of a target object, setting a value of a category parameter corresponding to the category in the characteristic vector corresponding to the grid to be a preset maximum value according to the category of the target object, determining a value of a central point position parameter in the characteristic vector according to the position of the central point in the grid, determining a value of an outline dimension parameter in the characteristic vector according to the size of a marked rectangular frame of the target object, and when the grid does not contain the central point of the target object, setting the value of each parameter in the characteristic vector corresponding to the grid to be zero; the convolutional neural network is trained from each sample image for which a feature vector for each mesh is determined.
Further, the training module is further configured to determine, for each sample image, whether the size of the sample image is a target size; and if not, adjusting the size of the sample image to the target size.
Further, the training module is specifically configured to select sub-sample images from the sample image set, where the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set; and training the convolutional neural network by adopting each selected subsample image.
Further, the apparatus further comprises:
the error calculation module is used for determining the error of the convolutional neural network according to the prediction of the position and the category of the object in the subsample image by the convolutional neural network and the information of the target object marked in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:
$$
\begin{aligned}
loss ={} & \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\
& +\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_{i}(c)-\hat{P}_{i}(c)\right)^{2}
\end{aligned}
$$

wherein S, B, $x_i$, $\hat{x}_i$, $y_i$, $\hat{y}_i$, $w_i$, $\hat{w}_i$, $h_i$, $\hat{h}_i$, $C_i$, $\hat{C}_i$, $P_i(c)$, $\hat{P}_i(c)$, $\lambda_{coord}$, $\lambda_{noobj}$, $\mathbb{1}_{ij}^{obj}$, $\mathbb{1}_{i}^{obj}$ and $\mathbb{1}_{i}^{noobj}$ have the same meanings as defined above for the method, and $\hat{P}_i(c)=P_r(\mathrm{Object})\cdot P_r(\mathrm{Class}_c\mid\mathrm{Object})$, wherein $P_r(\mathrm{Object})$ is the predicted probability of whether an object currently exists in grid i and $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ is the conditional probability that an object within the predicted grid i belongs to category c.
Further, the determining module is specifically configured to determine, according to the position parameter of the central point, the position information of the central point in the grid;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
Further, the determination module is specifically configured to use a set point of the grid as a reference point; and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
The embodiment of the invention provides a method and a device for detecting an object in an image. The method comprises: dividing an image to be detected, whose size is a target size, into a plurality of grids according to a preset dividing mode; inputting the divided image into a convolutional neural network trained in advance and obtaining a plurality of feature vectors of the image output by the convolutional neural network, wherein each grid corresponds to one feature vector; and identifying the maximum value of the category parameters in each feature vector, and determining, when the maximum value is larger than a set threshold value, the position information of the object of the category corresponding to the category parameter according to the central point position parameter and the outline dimension parameter in the feature vector. In the embodiment of the invention, each feature vector corresponding to the image is determined through the convolutional neural network trained in advance, and the category and the position of the object in the image are determined according to the category parameter and the position-related parameters in the feature vector, so the position and the category of the object can be detected simultaneously, which facilitates overall optimization. In addition, since the position and the category of the object are determined from the feature vector corresponding to each grid, no plurality of feature regions needs to be selected, which saves detection time and improves the real-time performance and efficiency of detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative efforts.
FIG. 1 is a schematic view of a process for object detection using R-CNN;
FIG. 2 is a schematic diagram of an object detection process using fast RCNN;
FIG. 3 is a schematic diagram of an object detection process in an image according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a detailed implementation process of object detection in an image according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process of a convolutional neural network according to an embodiment of the present invention;
FIGS. 6A-6D are schematic diagrams illustrating labeling results of a target object according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating a process for constructing the cube structure of FIG. 6D;
fig. 8 is a schematic structural diagram of an object detection apparatus in an image according to an embodiment of the present invention.
Detailed Description
In order to effectively improve the efficiency of object detection, improve the real-time performance of object detection and facilitate the overall optimization of object detection, the embodiment of the invention provides a method and a device for detecting an object in an image.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 3 is a schematic diagram of an object detection process in an image according to an embodiment of the present invention, where the process includes the following steps:
step S301: dividing an image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size.
The embodiment of the invention is applied to the electronic equipment, and the electronic equipment can be a desktop computer, a notebook computer, other intelligent equipment with processing capacity and the like.
After an image to be detected of the target size is obtained, the image to be detected is divided into a plurality of grids according to a preset dividing mode, wherein the preset dividing mode is the same as the dividing mode used on the images when training the convolutional neural network. For example, the image may be divided into a plurality of rows and columns, and the intervals between rows and between columns may be equal or unequal. Of course, the image may also be divided into a plurality of irregular grids, as long as the image to be detected and the images used for convolutional neural network training adopt the same grid division mode.
When the image is divided into a plurality of rows and a plurality of columns, the image may be divided into a plurality of grids having the same number of rows and columns, or may be divided into a plurality of grids having different numbers of rows and columns, and the aspect ratio of each grid after division may be the same or different.
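As a concrete illustration of mapping a point to its grid under an equally spaced division (a sketch; the helper function below is hypothetical):

    def grid_of_point(px, py, img_w, img_h, s_rows, s_cols):
        # Returns the (row, column) index of the grid that contains the
        # point (px, py), assuming an equally spaced division in pixels.
        row = min(int(py / (img_h / s_rows)), s_rows - 1)
        col = min(int(px / (img_w / s_cols)), s_cols - 1)
        return row, col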
Step S302: and inputting the divided image into a convolutional neural network trained in advance, and acquiring a plurality of feature vectors of the image output by the convolutional neural network, wherein each grid corresponds to one feature vector.
In order to detect the category and the position of an object in an image, in the embodiment of the present invention a convolutional neural network is trained, and the feature vector corresponding to each grid is obtained through the trained convolutional neural network. For example, the image may be divided into 49 grids of 7 × 7; after the divided image is input into the trained convolutional neural network, 49 feature vectors are output, each corresponding to one grid.
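For example, for a 7 × 7 division the network output can be viewed as a 7 × 7 × 25 array holding one 25-dimensional feature vector per grid (a sketch; the flat output layout and the reshape are assumptions for illustration):

    # raw_output: flat network output of length 7 * 7 * 25 = 1225 (assumed layout)
    feature_map = raw_output.reshape(7, 7, 25)
    vec = feature_map[3, 4]   # the feature vector of the grid in row 3, column 4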
Step S303: and identifying the maximum value of the category parameters in the feature vector aiming at the feature vector corresponding to each grid, and determining the position information of the object of the category corresponding to the category parameters according to the central point position parameters and the outline dimension parameters in the feature vector when the maximum value is larger than a set threshold value.
Specifically, the feature vector obtained in the embodiment of the present invention is a multidimensional vector that at least includes a category parameter and a position parameter, wherein the category parameter comprises a plurality of parameters and the position parameter comprises a central point position parameter and an outline dimension parameter. After the feature vector corresponding to each grid is obtained, whether the grid detects an object is judged according to that feature vector: if the maximum value of the plurality of category parameters in the feature vector corresponding to the grid is larger than the set threshold value, the grid detects an object, the category corresponding to that category parameter is the category of the object, and the position of the object can be determined according to the feature vector corresponding to the grid.
Since the position parameters in the feature vectors employed in the convolutional neural network training are determined according to a set method, the position of the object can be determined according to the set method.
In the embodiment of the invention, each feature vector corresponding to the image is determined through the convolutional neural network trained in advance, and the category and the position of the object in the image are determined according to the category parameter and the position-related parameters in the feature vector, so the position and the category of the object can be predicted simultaneously, which facilitates overall optimization. In addition, since the position and the category of the object are determined from the feature vector corresponding to each grid, no plurality of feature regions needs to be selected, which saves detection time and improves the real-time performance and efficiency of detection.
The object detection in the embodiment of the present invention is performed on an image of the target size, where the target size is the uniform size of the images used in training the convolutional neural network. The target size may be any size, as long as the image used in object detection has the same size as the images used in training the convolutional neural network. The target size may be, for example, 1024 × 1024 or 256 × 512, etc.
Therefore, in an embodiment of the present invention, in order to ensure that the images input into the convolutional neural network are all images of a target size, before dividing the image to be detected into a plurality of grids according to a preset dividing manner, the method further includes:
judging whether the size of the image is a target size;
and if not, adjusting the size of the image to the target size.
When the image to be detected is of the target size, subsequent processing is performed on the image directly; when it is of a non-target size, the image to be detected is adjusted to the target size. Adjusting image size belongs to the prior art, and the process is not described in detail in the embodiment of the present invention.
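A minimal sketch of this check-and-resize step, assuming OpenCV is available and a hypothetical 448 × 448 target size (any resampling routine and target size may be used):

    import cv2

    def ensure_target_size(image, target_w=448, target_h=448):
        # Resize only when the image is not already at the target size.
        # The 448 x 448 default is an assumption for illustration.
        h, w = image.shape[:2]
        if (w, h) != (target_w, target_h):
            image = cv2.resize(image, (target_w, target_h))
        return image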
Specifically, in the embodiment of the present invention, determining the position information of the object of the category corresponding to the category parameter according to the center point position parameter and the outline dimension parameter in the feature vector includes:
determining the position information of the central point in the grid according to the position parameter of the central point;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
Wherein the determining the position information of the central point in the grid according to the position parameter of the central point comprises:
using the set points of the grid as reference points; and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
Fig. 4 is a schematic diagram of a detailed implementation process of object detection in an image according to an embodiment of the present invention, where the process includes the following steps:
step S401: an image to be detected is received.
Step S402: and judging whether the size of the image is the target size, if so, performing step S404, and otherwise, performing step S403.
Step S403: and adjusting the size of the image to a target size.
Step S404: dividing an image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size.
Step S405: and inputting the divided image into a convolutional neural network trained in advance, and acquiring a plurality of feature vectors of the image output by the convolutional neural network, wherein each grid corresponds to one feature vector.
Step S406: and identifying the maximum value of the category parameter in the feature vector aiming at the feature vector corresponding to each grid.
Step S407: and when the maximum value is larger than a set threshold value, taking the set point of the grid as a reference point, and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
Step S408: and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
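Steps S407 and S408 can be sketched as follows (a sketch under the assumptions that the reference point is the grid's upper-left corner and that the position parameters are expressed in pixels; the function name is hypothetical):

    def decode_box(row, col, x, y, w, h, cell_w, cell_h):
        # (x, y): offset of the central point from the grid's upper-left
        # corner (the reference point); (w, h): outline dimensions of the
        # rectangular frame. Pixel units are assumed for illustration.
        cx = col * cell_w + x
        cy = row * cell_h + y
        left, top = cx - w / 2.0, cy - h / 2.0
        return left, top, w, h   # position information of the rectangular frame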
The target detection is performed based on the trained convolutional neural network, and in order to detect the object, the convolutional neural network needs to be trained. In the embodiment of the invention, when the convolutional neural network is trained, a sample image with a target size is divided into a plurality of grids, and if the central point of a certain target object is located in a certain grid, the grid is responsible for detecting the target object, including detecting the type and the corresponding position (bounding box) of the target object.
Fig. 5 is a schematic diagram of a training process of a convolutional neural network according to an embodiment of the present invention, where the training process includes the following steps:
step S501: and marking the target object by adopting a rectangular frame aiming at each sample image in the sample image set.
In the embodiment of the invention, a large number of sample images are adopted to train the convolutional neural network, and then the large number of sample images form a sample image set. A rectangular frame is used to mark the target object in each sample image.
Specifically, as shown in fig. 6A to fig. 6D, the labeling result of the target object is illustrated schematically, and 3 target objects, namely, a dog, a bicycle, and a car, exist in the sample image in fig. 6A. When labeling each target object, the vertices of each target object in four directions, i.e., up, down, left, and right (with respect to the up, down, left, and right directions shown in fig. 6A) are identified in the sample image, and if the vertices are the up and down vertices, two lines parallel to the upper and lower bottom sides of the sample image passing through the up and down vertices are defined as two sides of the rectangular frame, and if the vertices are the left and right vertices, two lines parallel to the left and right sides of the sample image passing through the left and right vertices are defined as the other two sides of the rectangular frame. Such as the rectangular boxes of dogs, bicycles and cars marked with dashed lines in fig. 6A.
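A minimal sketch of deriving the labeled rectangular frame from the four extreme vertices, assuming the target object's outline is given as a list of points (the helper below is hypothetical):

    import numpy as np

    def label_rectangle(contour_points):
        # contour_points: (N, 2) array of (x, y) points on the target object.
        # The rectangle's sides pass through the leftmost/rightmost and
        # topmost/bottommost vertices, parallel to the image borders.
        xs, ys = contour_points[:, 0], contour_points[:, 1]
        left, right = xs.min(), xs.max()
        top, bottom = ys.min(), ys.max()
        return left, top, right - left, bottom - top   # (x, y, w, h)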
Step S502: dividing each sample image into a plurality of grids according to a preset dividing mode, determining a characteristic vector corresponding to each grid, wherein the size of each sample image is a target size, when the grid contains a central point of a target object, setting a value of a category parameter corresponding to the category in the characteristic vector corresponding to the grid to be a preset maximum value according to the category of the target object, determining a value of a central point position parameter in the characteristic vector according to the position of the central point in the grid, determining a value of an outline dimension parameter in the characteristic vector according to the size of a marked rectangular frame of the target object, and when the grid does not contain the central point of the target object, setting the value of each parameter in the characteristic vector corresponding to the grid to be zero.
In the embodiment of the present invention, the sample image may be divided into a plurality of grids according to a preset division manner, where the division manner of the sample image is the same as the division manner of the image to be detected in the detection process.
For example, the image may be divided into a plurality of rows and columns, and the intervals between rows and between columns may be equal or unequal. Of course, the image may also be divided into a plurality of irregular grids, as long as the image to be detected and the images used for convolutional neural network training adopt the same grid division mode.
When the image is divided into a plurality of rows and a plurality of columns, it may be divided into grids having the same number of rows and columns, or into grids having different numbers of rows and columns, and the aspect ratios of the divided grids may be the same or different. For example, the sample image may be divided into 12 × 10, 15 × 15 or 6 × 6 grids, etc. When the grids are of equal size, the grid size may be normalized. As shown in fig. 6B, in the embodiment of the present invention, the sample image is divided into 7 rows and 7 columns of grids, and each grid is square, so the size of each grid after normalization can be regarded as 1 × 1.
Each grid in the sample image corresponds to a feature vector. The feature vector is a multidimensional vector that at least includes a category parameter and a position parameter, wherein the category parameter comprises a plurality of parameters and the position parameter comprises a central point position parameter and an outline dimension parameter.
Step S503: the convolutional neural network is trained from each sample image for which a feature vector for each mesh is determined.
Specifically, in the embodiment of the present invention, the convolutional neural network may be trained by using all sample images in the sample image set. However, because the sample image set includes a large number of sample images, in order to improve the training efficiency, in the embodiment of the present invention, training the convolutional neural network according to each sample image for which the feature vector of each mesh is determined includes:
selecting sub-sample images from the sample image set, wherein the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set;
and training the convolutional neural network by adopting each selected subsample image.
By randomly selecting a number of subsample images far smaller than the total number of sample images, the convolutional neural network is trained and its parameters are continuously updated until the error between the information of the object predicted by each grid and the information of the labeled target object converges.
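A minimal sketch of this random subsample selection, assuming NumPy and an illustrative batch size:

    import numpy as np

    def sample_batch(sample_images, labels, batch_size=64):
        # Randomly select far fewer images than the whole set for one
        # parameter-update step; batch_size = 64 is an assumed value.
        idx = np.random.choice(len(sample_images), size=batch_size, replace=False)
        return [sample_images[k] for k in idx], [labels[k] for k in idx]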
Similarly, in the embodiment of the present invention, when training the convolutional neural network, a sample image of a target size is used, and therefore, in the embodiment of the present invention, in order to ensure that the sample images input into the convolutional neural network are all of the target size, before dividing each sample image into a plurality of grids according to a preset dividing manner, the method further includes:
judging whether the size of each sample image is a target size or not;
and if not, adjusting the size of the sample image to the target size.
When the sample image is of the target size, subsequent processing is performed on the sample image directly; when it is not, the sample image is adjusted to the target size. Adjusting image size belongs to the prior art, and the process is not described in detail in the embodiment of the present invention.
In the above process, either the sample image may be adjusted to the target size first, or the rectangular frame may be labeled in the sample image first. Labeling the rectangular frame first ensures that the target object can be accurately labeled when the sample image is large; adjusting to the target size first ensures that the target object can be accurately labeled when the sample image is small.
In the above labeling process, a feature vector corresponding to each grid in the sample image is determined. In an embodiment of the present invention, the feature vector corresponding to each grid may be represented as (confidence, cls1, cls2, cls3, …, cls20, x, y, w, h), where confidence is a probability parameter, cls1, cls2, cls3, …, cls20 are category parameters, and x, y, w and h are position parameters, among which x and y are central point position parameters and w and h are outline dimension parameters. When the grid contains the central point of a target object, the value of each parameter in the feature vector corresponding to the grid is determined as described below; when the grid does not contain the central point of a target object, the value of each parameter in the feature vector corresponding to the grid is 0.
Specifically, since each target object is labeled by using a rectangular frame in the sample image, the center point of the rectangular frame may be considered as the center point of the target object, such as the center points of the three rectangular frames shown in fig. 6C. When the grid includes the center point of the target object, then during labeling, the probability parameter in the feature vector corresponding to the grid may be considered to be 1, that is, the probability that the target object exists in the grid is 1 at present.
Since the sample images contain target objects of a plurality of categories, the category parameter cls is used in the embodiment of the present invention to represent target objects of different categories, namely cls1, cls2, …, clsn. For example, n may be 20, i.e., there are 20 categories of objects in total; the category represented by cls1 is car, the category represented by cls2 is dog, and the category represented by cls3 is bicycle. When the grid contains the central point of a target object, the category parameter value corresponding to that target object's category is set to a maximum value, where the maximum value is greater than the set threshold; for example, the maximum value may be 1 and the threshold may be 0.4, etc.
For example, as shown in fig. 6C, from bottom to top (top and bottom shown in fig. 6C), in the feature vector corresponding to the grid where each central point is located, cls2 in the class parameter in the feature vector corresponding to the first central point is 1, the other class parameters are 0, cls3 in the class parameter in the feature vector corresponding to the second central point is 1, the other class parameters are 0, cls1 in the class parameter in the feature vector corresponding to the third central point is 1, and the other class parameters are 0.
The feature vector further includes the position parameters x, y, w and h of the target object. Among them, x and y are the central point position parameters, whose values are the horizontal and vertical coordinates of the central point of the target object relative to a set point. The set points corresponding to the grids may be the same or different: for example, the upper left corner of the sample image may be taken as the set point, i.e., the origin of coordinates; since each grid is normalized, the coordinates of each position in each grid are then uniquely determined. Of course, to simplify the process and reduce the amount of calculation, each grid may also have its own set point, with each grid regarded as an independent unit whose upper left corner is the set point, i.e., the origin of coordinates. In that case, when labeling, the values of x and y in the feature vector corresponding to the grid can be determined according to the offset of the central point relative to the upper left corner of the grid where it is located. Determining the x and y values from a relative-position offset belongs to the prior art and is not described in detail in the embodiment of the present invention. Among the position parameters, w and h are the outline dimension parameters, whose values are the width and height of the rectangular frame where the target object is located.
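Putting the above together, the following sketch encodes one labeled sample into per-grid target vectors (assumptions for illustration: a 7 × 7 division, 20 categories, a 448 × 448 image, the grid's upper-left corner as set point, and pixel units):

    import numpy as np

    def encode_target(boxes, s=7, num_classes=20, img_w=448, img_h=448):
        # boxes: list of (class_index, cx, cy, w, h) in pixels (assumed units).
        # Layout per grid: (confidence, cls1..cls20, x, y, w, h).
        target = np.zeros((s, s, 5 + num_classes), dtype=np.float32)
        cell_w, cell_h = img_w / s, img_h / s
        for cls, cx, cy, w, h in boxes:
            col = min(int(cx / cell_w), s - 1)
            row = min(int(cy / cell_h), s - 1)
            target[row, col, 0] = 1.0           # central point present in grid
            target[row, col, 1 + cls] = 1.0     # category parameter set to max
            # central point offsets relative to the grid's upper-left corner
            target[row, col, 21] = cx - col * cell_w
            target[row, col, 22] = cy - row * cell_h
            target[row, col, 23] = w            # outline dimension parameters
            target[row, col, 24] = h
        return target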
Because the feature vector is a multidimensional vector, in order to accurately represent the feature vector corresponding to each mesh, in the embodiment of the present invention, the cubic structure shown in fig. 6D is constructed according to the construction method shown in fig. 7, and the mesh is correspondingly processed in the convolutional layer, the max pooling layer, the full connection layer, and the output layer, so as to generate a cubic mesh structure, where the depth of the cubic mesh in the Z-axis direction is determined according to the dimension of the feature vector. In the present embodiment, the depth of the cubic grid in the Z-axis direction is 25. The above process of performing corresponding processing in each layer of the convolutional neural network to generate the cubic grid structure belongs to the prior art, and is not described in detail in the embodiment of the present invention.
After a large number of sample images are labeled in the above manner, the labeled sample images are used to train the convolutional neural network. Specifically, in the embodiment of the present invention, a plurality of subsample images are used to train the convolutional neural network. In the training process, for each subsample image, a convolution feature map of the subsample image is obtained through the convolutional neural network; the convolution feature map comprises the feature vector (confidence, cls1, cls2, cls3, …, cls20, x, y, w, h) corresponding to each grid, which contains the predicted position parameters and category parameters of the object in the grid as well as the probability parameter confidence, where confidence represents the degree of overlap between the rectangular frame in which the grid predicts the object to be located and the labeled rectangular frame of the target object.
In the training process, for each subsample image, the network parameters of the convolutional neural network are adjusted by calculating the error between the prediction information and the labeling information; each time, a number (batch) of subsample images far smaller than the total number of sample images is randomly selected to train the convolutional neural network, and the network parameters are updated until the error between the prediction information and the labeling information of each grid converges. Training the convolutional neural network on the subsample images and adjusting its network parameters until training is complete belongs to the prior art and is not described in detail in the embodiment of the invention.
In the training process of the convolutional neural network, in order to accurately predict the position and category information of the object, in the embodiment of the present invention the last fully-connected layer of the convolutional neural network uses a logistic activation function, while the convolutional layers and the other fully-connected layers use a Leaky ReLU function, where the Leaky ReLU function is:

$$\phi(x)=\begin{cases}x, & x>0\\ 0.1x, & \text{otherwise}\end{cases}$$
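A minimal sketch of the two activation functions described above (the 0.1 slope of the Leaky ReLU is an assumption for illustration):

    import numpy as np

    def leaky_relu(x, slope=0.1):
        # Used in the convolutional and hidden fully-connected layers;
        # the 0.1 slope value is an assumption for illustration.
        return np.where(x > 0, x, slope * x)

    def logistic(x):
        # Logistic activation used by the last fully-connected layer.
        return 1.0 / (1.0 + np.exp(-x))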
in order to complete the training of the convolutional neural network and make it converge in the embodiment of the present invention, when training the convolutional neural network, the method further includes:
determining the error of the convolutional neural network according to the prediction of the position and the type of the target object in the subsample image by the convolutional neural network and the information of the target object marked in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:
$$
\begin{aligned}
loss ={} & \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\
& +\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_{i}(c)-\hat{P}_{i}(c)\right)^{2}
\end{aligned}
$$

wherein S is the number of rows or columns of the divided grids (the number of rows and the number of columns being the same); B is the preset number of rectangular frames predicted per grid, generally 1 or 2; $x_i$ is the abscissa of the labeled target object's center point in grid i and $\hat{x}_i$ is the abscissa of the predicted object's center point in grid i; $y_i$ is the ordinate of the labeled target object's center point in grid i and $\hat{y}_i$ is the ordinate of the predicted object's center point in grid i; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object, and $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i and $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object in grid i belongs to category c and $\hat{P}_i(c)$ is the predicted probability that the object in grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object is predicted to exist in grid i and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object is predicted to exist in grid i and 0 otherwise; wherein $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(\mathrm{Object})\cdot P_r(\mathrm{Class}_c\mid\mathrm{Object})$$

wherein $P_r(\mathrm{Object})$ is the predicted probability of whether an object currently exists in grid i, and $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ is the conditional probability that an object within the predicted grid i belongs to category c.
The square roots of the width and the height are used in the above loss function so that the same prediction error contributes less to the loss for a large rectangular frame than for a small one; for this reason, the above loss function is adopted in the embodiment of the present invention.
As shown in fig. 6B, in an embodiment of the present invention, each sample image is divided into 49 grids of 7 × 7, and each grid can detect 20 categories, so one sample image generates 980 detection probabilities, most of which are 0, which would make the training unstable. A variable is therefore introduced to solve this problem: the probability of whether an object is present in a given grid. Thus, in addition to the 20 category parameters, the network also predicts the probability $P_r(\mathrm{Object})$ that an object is currently present in the grid; the probability that a target object within a certain grid belongs to category c is the product of $P_r(\mathrm{Object})$ and the conditional probability $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ that an object in the predicted grid belongs to category c. At each grid, $P_r(\mathrm{Object})$ is updated, while $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ is updated only when objects exist in the grid.
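For B = 1 predicted rectangular frame per grid, the above loss can be sketched as follows (a sketch, not the patented implementation; the (S, S, 25) tensor layout and the weight values λ_coord = 5 and λ_noobj = 0.5 are assumptions for illustration):

    import numpy as np

    def grid_loss(pred, truth, lam_coord=5.0, lam_noobj=0.5):
        # pred, truth: (S, S, 25) arrays laid out as
        # (confidence, cls1..cls20, x, y, w, h); B = 1 frame per grid assumed.
        obj = truth[..., 0]          # 1 where a labeled center point exists
        noobj = 1.0 - obj
        coord = lam_coord * np.sum(obj * ((truth[..., 21] - pred[..., 21]) ** 2 +
                                          (truth[..., 22] - pred[..., 22]) ** 2))
        # square roots damp the contribution of large rectangular frames;
        # np.abs guards against negative raw width/height predictions
        size = lam_coord * np.sum(obj * (
            (np.sqrt(truth[..., 23]) - np.sqrt(np.abs(pred[..., 23]))) ** 2 +
            (np.sqrt(truth[..., 24]) - np.sqrt(np.abs(pred[..., 24]))) ** 2))
        conf_err = (truth[..., 0] - pred[..., 0]) ** 2
        conf = np.sum(obj * conf_err) + lam_noobj * np.sum(noobj * conf_err)
        cls = np.sum(obj[..., None] * (truth[..., 1:21] - pred[..., 1:21]) ** 2)
        return coord + size + conf + cls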
Fig. 8 is a schematic structural diagram of an apparatus for detecting an object in an image according to an embodiment of the present invention, where the apparatus is located in an electronic device, and the apparatus includes:
the dividing module 81 is configured to divide an image to be detected into a plurality of grids according to a preset dividing manner, where the size of the image to be detected is a target size;
the detection module 82 is configured to input the divided image into a convolutional neural network trained in advance, and obtain a plurality of feature vectors of the image output by the convolutional neural network, where each grid corresponds to one feature vector;
the determining module 83 is configured to identify, for the feature vector corresponding to each grid, the maximum value of the category parameters in the feature vector, and determine, when the maximum value is larger than a set threshold value, the position information of the object of the category corresponding to the category parameter according to the central point position parameter and the outline dimension parameter in the feature vector.
The device further comprises:
a judgment adjustment module 84, configured to judge whether the size of the image is a target size; and if not, adjusting the size of the image to the target size.
The device further comprises:
a training module 85, configured to label a target object with a rectangular frame for each sample image in the sample image set; dividing each sample image into a plurality of grids according to a preset dividing mode, determining a characteristic vector corresponding to each grid, wherein the size of each sample image is a target size, when the grid contains a central point of a target object, setting a value of a category parameter corresponding to the category in the characteristic vector corresponding to the grid to be a preset maximum value according to the category of the target object, determining a value of a central point position parameter in the characteristic vector according to the position of the central point in the grid, determining a value of an outline dimension parameter in the characteristic vector according to the size of a marked rectangular frame of the target object, and when the grid does not contain the central point of the target object, setting the value of each parameter in the characteristic vector corresponding to the grid to be zero; the convolutional neural network is trained from each sample image for which a feature vector for each grid is determined.
The training module 85 is further configured to determine, for each sample image, whether the size of the sample image is a target size; and if not, adjusting the size of the sample image to the target size.
The training module 85 is specifically configured to select sub-sample images from the sample image set, where the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set; and training the convolutional neural network by adopting each selected subsample image.
The device further comprises:
an error calculation module 86, configured to determine an error of the convolutional neural network according to the prediction of the position and the category of the object in the subsample image by the convolutional neural network and information of the target object labeled in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 + \lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_i(c)-\hat{P}_i(c)\right)^2
\end{aligned}
$$

wherein S is the number of rows of the divided grids (equal to the number of columns); B is the preset number of predicted rectangular frames per grid, generally taking 1 or 2; $x_i$ and $y_i$ are the abscissa and ordinate, within grid i, of the labeled center point of the target object; $\hat{x}_i$ and $\hat{y}_i$ are the abscissa and ordinate, within grid i, of the predicted center point of the object; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object; $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i; $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object within grid i belongs to category c; $\hat{P}_i(c)$ is the predicted probability that the object within grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i, and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object exists in grid i, and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object exists in grid i, and 0 otherwise. $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(Class\mid Object)\times P_r(Object)$$

wherein $P_r(Object)$ is the predicted probability of whether an object currently exists in grid i, and $P_r(Class\mid Object)$ is the conditional probability that an object within grid i belongs to class c.
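For illustration, the loss above can be transcribed directly into NumPy for the case B = 1; the (S, S, 25) tensor layout ([20 class probabilities, x, y, w, h, C] per grid) is an assumption carried over from the sketches above, not part of the patent:

```python
import numpy as np

def detection_loss(truth, pred, lam_coord=5.0, lam_noobj=0.5):
    """Squared-error loss over (S, S, 25) tensors laid out as
    [20 class probs, x, y, w, h, C] per grid -- an assumed layout
    with B = 1 predicted rectangular frame per grid.
    """
    obj = truth[..., 24] > 0          # 1_i^obj: grid holds a labeled center point
    noobj = ~obj                      # 1_i^noobj

    d = truth - pred
    coord = lam_coord * np.sum(obj * (d[..., 20] ** 2 + d[..., 21] ** 2))
    # Square roots damp the influence of size errors on large rectangles;
    # predicted w and h are assumed nonnegative here.
    size = lam_coord * np.sum(obj * (
        (np.sqrt(truth[..., 22]) - np.sqrt(pred[..., 22])) ** 2
        + (np.sqrt(truth[..., 23]) - np.sqrt(pred[..., 23])) ** 2))
    conf = (np.sum(obj * d[..., 24] ** 2)
            + lam_noobj * np.sum(noobj * d[..., 24] ** 2))
    cls = np.sum(obj[..., None] * d[..., :20] ** 2)
    return coord + size + conf + cls
```

Per the error calculation module, training would be judged complete once successive values of this loss converge.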
The determining module 83 is specifically configured to: determine the position information of the central point in the grid according to the central point position parameter; and determine the central point according to the position information, take the central point as the center of a rectangular frame, determine the position information of the rectangular frame according to the outline dimension parameter, take the position information of the rectangular frame as the position information of the object, and take the object category corresponding to the category parameter as the category of the object.
The determining module 83 is further specifically configured to take a set point of the grid as a reference point, and to determine the position information of the central point in the grid according to the reference point and the central point position parameters.
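A sketch of the decoding performed by the determining module, assuming the grid's top-left corner as the "set point" reference and normalized offsets; the patent does not fix these conventions:

```python
def decode_box(grid_i, grid_j, vec, S=7, C=20):
    """Turn one grid's feature vector into absolute rectangle coordinates.

    The grid's top-left corner serves as the reference (set) point; the
    center point is reference + predicted offset, and the rectangle is
    centered there with the predicted outline size. All values normalized.
    """
    x_off, y_off, w, h = vec[C:C + 4]
    cx = (grid_j + x_off) / S          # center point abscissa
    cy = (grid_i + y_off) / S          # center point ordinate
    # Rectangle position information: corners of the box centered on (cx, cy).
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```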
The embodiment of the invention provides a method and a device for detecting an object in an image. The method comprises: dividing an image to be detected, whose size is a target size, into a plurality of grids according to a preset dividing mode; inputting the divided image into a convolutional neural network trained in advance and obtaining a plurality of feature vectors of the image output by the convolutional neural network, each grid corresponding to one feature vector; and, for each feature vector, identifying the maximum value of the category parameters and, when that maximum value is larger than a set threshold, determining the position information of the object of the corresponding category according to the central point position parameter and the outline size parameter in the feature vector. In the embodiment of the invention, the feature vectors of the image are determined by the convolutional neural network trained in advance, and the category and the position of the object are determined from the category parameter and the position-related parameters in each feature vector, so the position and the category of the object are detected simultaneously, which facilitates overall optimization. Moreover, since the position and the category are determined from the feature vector of each grid, there is no need to select a plurality of candidate feature regions, which saves detection time and improves detection real-time performance and detection efficiency.
For the system/apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A method for detecting an object in an image, applied to an electronic device, comprising:
dividing an image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size;
inputting the divided images into a convolutional neural network trained in advance, and acquiring a plurality of feature vectors of the images output by the convolutional neural network, wherein each grid corresponds to one feature vector;
identifying the maximum value of the category parameters in the feature vector aiming at the feature vector corresponding to each grid, and determining the position information of the object of the category corresponding to the category parameters according to the central point position parameters and the overall dimension parameters in the feature vector when the maximum value is larger than a set threshold value;
wherein, the preset dividing mode comprises:
dividing the image and the sample image into a plurality of grids with the same row number and column number;
the method further comprises the following steps:
determining the error of the convolutional neural network according to the prediction of the position and the type of the object in the subsample image by the convolutional neural network and the information of the target object marked in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 + \lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_i(c)-\hat{P}_i(c)\right)^2
\end{aligned}
$$

wherein S is the number of rows of the divided grids (equal to the number of columns); B is the preset number of predicted rectangular frames per grid, taking 1 or 2; $x_i$ and $y_i$ are the abscissa and ordinate, within grid i, of the labeled center point of the target object; $\hat{x}_i$ and $\hat{y}_i$ are the abscissa and ordinate, within grid i, of the predicted center point of the object; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object; $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i; $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object within grid i belongs to category c; $\hat{P}_i(c)$ is the predicted probability that the object within grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i, and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object exists in grid i, and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object exists in grid i, and 0 otherwise, wherein $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(Class\mid Object)\times P_r(Object)$$

wherein $P_r(Object)$ is the predicted probability of whether an object currently exists in grid i, and $P_r(Class\mid Object)$ is the conditional probability that an object within grid i belongs to class c.
2. The method according to claim 1, wherein before the dividing the image to be detected into a plurality of grids according to the preset dividing manner, the method further comprises:
judging whether the size of the image is a target size;
and if not, adjusting the size of the image to the target size.
3. The method of claim 1, wherein the training process of the convolutional neural network comprises:
aiming at each sample image in the sample image set, adopting a rectangular frame to mark a target object;
dividing each sample image into a plurality of grids according to the preset dividing mode and determining a feature vector corresponding to each grid, wherein the size of each sample image is the target size; when a grid contains the center point of a target object, setting the value of the category parameter corresponding to the category of the target object in the feature vector of that grid to a preset maximum value, determining the value of the center point position parameter in the feature vector according to the position of the center point in the grid, and determining the value of the outline dimension parameter in the feature vector according to the size of the labeled rectangular frame of the target object; and when a grid does not contain the center point of any target object, setting the value of each parameter in the feature vector of that grid to zero;
the convolutional neural network is trained from each sample image for which a feature vector for each mesh is determined.
4. The method of claim 3, wherein before the dividing each sample image into a plurality of grids according to the preset dividing manner, the method further comprises:
judging whether the size of each sample image is a target size or not;
and if not, adjusting the size of the sample image to the target size.
5. The method of claim 3, wherein training the convolutional neural network based on each sample image for which a feature vector for each mesh is determined comprises:
selecting sub-sample images from the sample image set, wherein the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set;
and training the convolutional neural network by adopting each selected subsample image.
6. The method according to claim 1, wherein the determining, according to the center point position parameter and the outline size parameter in the feature vector, the position information of the object of the category corresponding to the category parameter comprises:
determining the position information of the central point in the grid according to the position parameter of the central point;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
7. The method of claim 6, wherein the determining the position information of the center point in the grid according to the position parameter of the center point comprises:
using the set points of the grid as reference points; and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
8. An apparatus for detecting an object in an image, the apparatus comprising:
the dividing module is used for dividing the image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size;
the detection module is used for inputting the divided images into a convolutional neural network which is trained in advance, and acquiring a plurality of feature vectors of the images output by the convolutional neural network, wherein each grid corresponds to one feature vector;
the determining module is used for identifying the maximum value of the category parameters in the feature vector aiming at the feature vector corresponding to each grid, and determining the position information of the object of the category corresponding to the category parameters according to the central point position parameters and the outline dimension parameters in the feature vector when the maximum value is larger than a set threshold;
wherein the apparatus further comprises:
the error calculation module is configured to determine the error of the convolutional neural network according to the prediction of the position and the category of the object in the subsample image by the convolutional neural network and the information of the target object labeled in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 + \lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_i(c)-\hat{P}_i(c)\right)^2
\end{aligned}
$$

wherein S is the number of rows of the divided grids (equal to the number of columns); B is the preset number of predicted rectangular frames per grid, taking 1 or 2; $x_i$ and $y_i$ are the abscissa and ordinate, within grid i, of the labeled center point of the target object; $\hat{x}_i$ and $\hat{y}_i$ are the abscissa and ordinate, within grid i, of the predicted center point of the object; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object; $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i; $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object within grid i belongs to category c; $\hat{P}_i(c)$ is the predicted probability that the object within grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i, and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object exists in grid i, and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object exists in grid i, and 0 otherwise, wherein $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(Class\mid Object)\times P_r(Object)$$

wherein $P_r(Object)$ is the predicted probability of whether an object currently exists in grid i, and $P_r(Class\mid Object)$ is the conditional probability that an object within grid i belongs to class c.
9. The apparatus of claim 8, further comprising:
the judging and adjusting module is used for judging whether the size of the image is a target size; and if not, adjusting the size of the image to the target size.
10. The apparatus of claim 8, further comprising:
the training module is configured to: label the target object with a rectangular frame for each sample image in the sample image set; divide each sample image, whose size is the target size, into a plurality of grids according to the preset dividing mode and determine a feature vector corresponding to each grid, wherein, when a grid contains the center point of a target object, the value of the category parameter corresponding to the category of the target object in the feature vector of that grid is set to a preset maximum value, the value of the center point position parameter is determined according to the position of the center point in the grid, and the value of the outline dimension parameter is determined according to the size of the labeled rectangular frame of the target object, and, when a grid does not contain the center point of any target object, the value of each parameter in the feature vector of that grid is set to zero; and train the convolutional neural network from the sample images for which the feature vector of each grid has been determined.
11. The apparatus of claim 10, wherein the training module is further configured to determine, for each sample image, whether the size of the sample image is a target size; and if not, adjusting the size of the sample image to the target size.
12. The apparatus according to claim 11, wherein the training module is specifically configured to select sub-sample images from the sample image set, wherein the number of the selected sub-sample images is smaller than the number of sample images in the sample image set; and training the convolutional neural network by adopting each selected subsample image.
13. The apparatus according to claim 8, wherein the determining module is specifically configured to determine the position information of the central point in the grid according to the position parameter of the central point;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
14. The apparatus according to claim 13, wherein the determining module is specifically configured to take a set point of the grid as a reference point; and determine the position information of the central point in the grid according to the reference point and the central point position parameters.
CN201611249792.1A 2016-12-29 2016-12-29 Method and device for detecting object in image Active CN106803071B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201611249792.1A CN106803071B (en) 2016-12-29 2016-12-29 Method and device for detecting object in image
PCT/CN2017/107043 WO2018121013A1 (en) 2016-12-29 2017-10-20 Systems and methods for detecting objects in images
EP17886017.7A EP3545466A4 (en) 2016-12-29 2017-10-20 Systems and methods for detecting objects in images
US16/457,861 US11113840B2 (en) 2016-12-29 2019-06-28 Systems and methods for detecting objects in images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611249792.1A CN106803071B (en) 2016-12-29 2016-12-29 Method and device for detecting object in image

Publications (2)

Publication Number Publication Date
CN106803071A CN106803071A (en) 2017-06-06
CN106803071B true CN106803071B (en) 2020-02-14

Family

ID=58985345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611249792.1A Active CN106803071B (en) 2016-12-29 2016-12-29 Method and device for detecting object in image

Country Status (1)

Country Link
CN (1) CN106803071B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3545466A4 (en) 2016-12-29 2019-11-27 Zhejiang Dahua Technology Co., Ltd. Systems and methods for detecting objects in images
CN107392158A (en) * 2017-07-27 2017-11-24 济南浪潮高新科技投资发展有限公司 A kind of method and device of image recognition
CN108229307B (en) 2017-11-22 2022-01-04 北京市商汤科技开发有限公司 Method, device and equipment for object detection
CN108062547B (en) * 2017-12-13 2021-03-09 北京小米移动软件有限公司 Character detection method and device
CN110110189A (en) * 2018-02-01 2019-08-09 北京京东尚科信息技术有限公司 Method and apparatus for generating information
CN108460761A (en) * 2018-03-12 2018-08-28 北京百度网讯科技有限公司 Method and apparatus for generating information
CN108960232A (en) * 2018-06-08 2018-12-07 Oppo广东移动通信有限公司 Model training method and device, electronic equipment and computer readable storage medium
US11373411B1 (en) 2018-06-13 2022-06-28 Apple Inc. Three-dimensional object estimation using two-dimensional annotations
CN110610184B (en) * 2018-06-15 2023-05-12 阿里巴巴集团控股有限公司 Method, device and equipment for detecting salient targets of images
CN108968811A (en) * 2018-06-20 2018-12-11 四川斐讯信息技术有限公司 A kind of object identification method and system of sweeping robot
CN108921840A (en) * 2018-07-02 2018-11-30 北京百度网讯科技有限公司 Display screen peripheral circuit detection method, device, electronic equipment and storage medium
CN109272050B (en) * 2018-09-30 2019-11-22 北京字节跳动网络技术有限公司 Image processing method and device
CN109558791B (en) * 2018-10-11 2020-12-01 浙江大学宁波理工学院 Bamboo shoot searching device and method based on image recognition
CN109726741B (en) * 2018-12-06 2023-05-30 江苏科技大学 Method and device for detecting multiple target objects
CN109685069B (en) * 2018-12-27 2020-03-13 乐山师范学院 Image detection method, device and computer readable storage medium
US10460210B1 (en) * 2019-01-22 2019-10-29 StradVision, Inc. Method and device of neural network operations using a grid generator for converting modes according to classes of areas to satisfy level 4 of autonomous vehicles
CN111597845A (en) * 2019-02-20 2020-08-28 中科院微电子研究所昆山分所 Two-dimensional code detection method, device and equipment and readable storage medium
CN111639660B (en) * 2019-03-01 2024-01-12 中科微至科技股份有限公司 Image training method, device, equipment and medium based on convolution network
CN109961107B (en) * 2019-04-18 2022-07-19 北京迈格威科技有限公司 Training method and device for target detection model, electronic equipment and storage medium
CN111914850B (en) * 2019-05-07 2023-09-19 百度在线网络技术(北京)有限公司 Picture feature extraction method, device, server and medium
CN110338835B (en) * 2019-07-02 2023-04-18 深圳安科高技术股份有限公司 Intelligent scanning three-dimensional monitoring method and system
CN110930386B (en) * 2019-11-20 2024-02-20 重庆金山医疗技术研究院有限公司 Image processing method, device, equipment and storage medium
CN111353555A (en) * 2020-05-25 2020-06-30 腾讯科技(深圳)有限公司 Label detection method and device and computer readable storage medium
CN112084874B (en) * 2020-08-11 2023-12-29 深圳市优必选科技股份有限公司 Object detection method and device and terminal equipment
CN112446867B (en) * 2020-11-25 2023-05-30 上海联影医疗科技股份有限公司 Method, device, equipment and storage medium for determining blood flow parameters
CN112785564B (en) * 2021-01-15 2023-06-06 武汉纺织大学 Pedestrian detection tracking system and method based on mechanical arm
CN113935425B (en) * 2021-10-21 2024-08-16 中国船舶集团有限公司第七一一研究所 Object identification method, device, terminal and storage medium
CN114739388B (en) * 2022-04-20 2023-07-14 中国移动通信集团广东有限公司 Indoor positioning and navigation method and system based on UWB and laser radar

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104517113A (en) * 2013-09-29 2015-04-15 浙江大华技术股份有限公司 Image feature extraction method and device and image sorting method and device
CN105975931A (en) * 2016-05-04 2016-09-28 浙江大学 Convolutional neural network face recognition method based on multi-scale pooling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
End-to-end people detection in crowded scenes; Russell Stewart et al.; published online: https://arxiv.org/abs/1506.04878; 2015-07-08; Sections 2 and 3, Figs. 2 and 4 *

Also Published As

Publication number Publication date
CN106803071A (en) 2017-06-06

Similar Documents

Publication Publication Date Title
CN106803071B (en) Method and device for detecting object in image
US11878433B2 (en) Method for detecting grasping position of robot in grasping object
CN108805016B (en) Head and shoulder area detection method and device
Zhou et al. Exploring faster RCNN for fabric defect detection
WO2017059576A1 (en) Apparatus and method for pedestrian detection
CN110059558A (en) A kind of orchard barrier real-time detection method based on improvement SSD network
CN104424634A (en) Object tracking method and device
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN111292377B (en) Target detection method, device, computer equipment and storage medium
CN111382638B (en) Image detection method, device, equipment and storage medium
CN111738164B (en) Pedestrian detection method based on deep learning
CN114882423A (en) Truck warehousing goods identification method based on improved Yolov5m model and Deepsort
CN112784494B (en) Training method of false positive recognition model, target recognition method and device
CN111080697B (en) Method, apparatus, computer device and storage medium for detecting direction of target object
CN112580435B (en) Face positioning method, face model training and detecting method and device
CN116758631B (en) Big data driven behavior intelligent analysis method and system
CN117763350A (en) Labeling data cleaning method and device
CN112733741B (en) Traffic sign board identification method and device and electronic equipment
CN108475339B (en) Method and system for classifying objects in an image
CN112861689A (en) Searching method and device of coordinate recognition model based on NAS technology
CN117036966B (en) Learning method, device, equipment and storage medium for point feature in map
CN113936255B (en) Intersection multi-steering vehicle counting method and system
CN118097785B (en) Human body posture analysis method and system
US20230410477A1 (en) Method and device for segmenting objects in images using artificial intelligence
Zhang et al. Recognition of italian gesture language based on augmented yolov5 algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant