
CN106803071B - Method and device for detecting object in image - Google Patents

Method and device for detecting object in image

Info

Publication number
CN106803071B
Authority
CN
China
Prior art keywords
grid
image
central point
size
neural network
Prior art date
Legal status
Active
Application number
CN201611249792.1A
Other languages
Chinese (zh)
Other versions
CN106803071A (en)
Inventor
杨松林
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN201611249792.1A
Publication of CN106803071A
Priority to PCT/CN2017/107043 (WO2018121013A1)
Priority to EP17886017.7A (EP3545466A4)
Priority to US16/457,861 (US11113840B2)
Application granted
Publication of CN106803071B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a method and a device for detecting an object in an image, which are used for improving the real-time performance of target detection. According to the method, an image to be detected is divided into a plurality of grids according to a preset dividing mode, and the divided image is input into a convolutional neural network trained in advance. A feature vector corresponding to each grid of the image is obtained from the output of the convolutional neural network, and the maximum value of the category parameters in each feature vector is identified; when the maximum value is larger than a set threshold value, the position information of an object of the category corresponding to that category parameter is determined according to the central point position parameter and the outline dimension parameter in the feature vector. In the embodiment of the invention, the category and the position of the object in the image are determined through the pre-trained convolutional neural network, so the position and the category of the object can be detected simultaneously and no plurality of feature regions needs to be selected, which saves detection time, improves the real-time performance and efficiency of detection, and facilitates overall optimization.

Description

Method and device for detecting object in image
Technical Field
The invention relates to the technical field of machine learning, in particular to a method and a device for detecting an object in an image.
Background
With the development of video monitoring technology, intelligent video monitoring is applied in more and more scenes, such as traffic, shopping malls, hospitals, communities and parks. The application of intelligent video monitoring lays a foundation for target detection through images in these various scenes.
In the prior art, when target detection is performed on an image, a Region-based Convolutional Neural Network (R-CNN) and its extensions Fast R-CNN and Faster R-CNN are generally adopted. Fig. 1 is a schematic flow chart of object detection using R-CNN, and the detection process includes: receiving an input image, extracting candidate regions (region proposals) in the image, computing the Convolutional Neural Network (CNN) features of each candidate region, and determining the category and position of the object by classification and regression. In this process, 2000 candidate regions need to be extracted from the image, and the whole extraction takes 1-2 s; then, for each candidate region, its CNN features need to be computed, and since many candidate regions overlap, much of the CNN feature computation is repeated. The detection process further includes feature learning on the proposals, correction of the determined object position, false-alarm elimination and the like, so the whole detection process may take 2-40 s, which greatly affects the real-time performance of object detection.
In addition, in the process of detecting objects with R-CNN, the candidate regions are extracted by selective search, the CNN features are then computed by a convolutional neural network, and finally a Support Vector Machine (SVM) is used for classification to determine the position of the target. These three steps are mutually independent methods, so the whole detection process cannot be optimized as a whole.
Fig. 2 is a schematic diagram of the process of object detection using Faster RCNN, which uses a convolutional neural network in which each sliding window generates 256-dimensional data in an intermediate layer; the category of the object is detected in a classification layer (cls layer) and the position of the object is detected in a regression layer (reg layer). The detection of the object category and the object position is performed in two independent steps, each of which must separately process the 256-dimensional data, which increases the detection time and affects the real-time performance of object detection.
Disclosure of Invention
The embodiment of the invention discloses a method and a device for detecting an object in an image, which are used for improving the real-time performance of object detection and facilitating the overall optimization of object detection.
In order to achieve the above object, an embodiment of the present invention discloses a method for detecting an object in an image, which is applied to an electronic device, and the method includes:
dividing an image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size;
inputting the divided images into a convolutional neural network trained in advance, and acquiring a plurality of feature vectors of the images output by the convolutional neural network, wherein each grid corresponds to one feature vector;
and identifying, for the feature vector corresponding to each grid, the maximum value of the category parameters in the feature vector, and determining, when the maximum value is larger than a set threshold value, the position information of the object of the category corresponding to the category parameter according to the central point position parameter and the outline dimension parameter in the feature vector.
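By way of illustration only, the three steps above can be sketched as follows in Python with NumPy (the network object, the grid count S = 7, the 20 categories and the threshold value 0.4 are assumptions for illustration, not limitations of the method):

    import numpy as np

    S, NUM_CLASSES, THRESHOLD = 7, 20, 0.4   # assumed values for illustration

    def detect(image, network):
        # Step 1: the grid division is implicit -- the network outputs one
        # feature vector per grid, so no explicit cropping is needed.
        # Step 2: forward pass; assumed output shape (S, S, 25), laid out as
        # (confidence, cls1..cls20, x, y, w, h) per grid.
        features = network.forward(image)
        detections = []
        # Step 3: per-grid decision based on the maximum category parameter.
        for i in range(S):
            for j in range(S):
                vec = features[i, j]
                cls_scores = vec[1:1 + NUM_CLASSES]
                c = int(np.argmax(cls_scores))
                if cls_scores[c] > THRESHOLD:
                    x, y, w, h = vec[1 + NUM_CLASSES:]
                    detections.append((c, x, y, w, h))
        return detections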
Further, before dividing the image to be detected into a plurality of grids according to a preset dividing manner, the method further includes:
judging whether the size of the image is a target size;
and if not, adjusting the size of the image to the target size.
Further, the training process of the convolutional neural network comprises:
aiming at each sample image in the sample image set, adopting a rectangular frame to mark a target object;
dividing each sample image into a plurality of grids according to a preset dividing mode, determining a characteristic vector corresponding to each grid, wherein the size of each sample image is a target size, when the grid contains a central point of a target object, setting a value of a category parameter corresponding to the category in the characteristic vector corresponding to the grid to be a preset maximum value according to the category of the target object, determining a value of a central point position parameter in the characteristic vector according to the position of the central point in the grid, determining a value of an outline dimension parameter in the characteristic vector according to the size of a marked rectangular frame of the target object, and when the grid does not contain the central point of the target object, setting the value of each parameter in the characteristic vector corresponding to the grid to be zero;
the convolutional neural network is trained from each sample image for which a feature vector for each mesh is determined.
Further, before dividing each sample image into a plurality of grids according to a preset dividing manner, the method further includes:
judging whether the size of each sample image is a target size or not;
and if not, adjusting the size of the sample image to the target size.
Further, the training the convolutional neural network according to each sample image for which the feature vector of each mesh is determined includes:
selecting sub-sample images from the sample image set, wherein the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set;
and training the convolutional neural network by adopting each selected subsample image.
Further, the preset dividing manner includes:
dividing the image and the sample image into a plurality of grids with the same number of rows and columns; or,
dividing the image and the sample image into a plurality of grids with different numbers of rows and columns.
Further, the method further comprises:
determining the error of the convolutional neural network according to the prediction of the convolutional neural network on the position and the type of the object in the subsample image and the information of the target object marked in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:
$$
\begin{aligned}
loss ={} & \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\
& +\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_{i}(c)-\hat{P}_{i}(c)\right)^{2}
\end{aligned}
$$

wherein S is the number of rows or columns of the divided grids (the number of rows and the number of columns being the same); B is the preset number of rectangular frames predicted per grid, generally 1 or 2; $x_i$ is the abscissa of the labeled target object's center point in grid i and $\hat{x}_i$ is the abscissa of the predicted object's center point in grid i; $y_i$ is the ordinate of the labeled target object's center point in grid i and $\hat{y}_i$ is the ordinate of the predicted object's center point in grid i; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object, and $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i and $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object in grid i belongs to category c and $\hat{P}_i(c)$ is the predicted probability that the object in grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object is predicted to exist in grid i and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object is predicted to exist in grid i and 0 otherwise; wherein $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(\mathrm{Object})\cdot P_r(\mathrm{Class}_c\mid\mathrm{Object})$$

wherein $P_r(\mathrm{Object})$ is the predicted probability of whether an object currently exists in grid i, and $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ is the conditional probability that an object within the predicted grid i belongs to category c.
Further, the determining the position information of the object of the category corresponding to the category parameter according to the center point position parameter and the outline dimension parameter in the feature vector includes:
determining the position information of the central point in the grid according to the position parameter of the central point;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
Further, the determining the position information of the central point in the grid according to the position parameter of the central point includes:
using the set points of the grid as reference points; and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
The embodiment of the invention discloses a device for detecting an object in an image, which comprises:
the dividing module is used for dividing the image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size;
the detection module is used for inputting the divided images into a convolutional neural network which is trained in advance, and acquiring a plurality of feature vectors of the images output by the convolutional neural network, wherein each grid corresponds to one feature vector;
and the determining module is used for identifying the maximum value of the category parameters in the feature vector aiming at the feature vector corresponding to each grid, and determining the position information of the object of the category corresponding to the category parameters according to the central point position parameters and the outline dimension parameters in the feature vector when the maximum value is larger than a set threshold value.
Further, the apparatus further comprises:
the judging and adjusting module is used for judging whether the size of the image is a target size; and if not, adjusting the size of the image to the target size.
Further, the apparatus further comprises:
the training module is used for adopting a rectangular frame to mark a target object aiming at each sample image in the sample image set; dividing each sample image into a plurality of grids according to a preset dividing mode, determining a characteristic vector corresponding to each grid, wherein the size of each sample image is a target size, when the grid contains a central point of a target object, setting a value of a category parameter corresponding to the category in the characteristic vector corresponding to the grid to be a preset maximum value according to the category of the target object, determining a value of a central point position parameter in the characteristic vector according to the position of the central point in the grid, determining a value of an outline dimension parameter in the characteristic vector according to the size of a marked rectangular frame of the target object, and when the grid does not contain the central point of the target object, setting the value of each parameter in the characteristic vector corresponding to the grid to be zero; the convolutional neural network is trained from each sample image for which a feature vector for each mesh is determined.
Further, the training module is further configured to determine, for each sample image, whether the size of the sample image is a target size; and if not, adjusting the size of the sample image to the target size.
Further, the training module is specifically configured to select sub-sample images from the sample image set, where the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set; and training the convolutional neural network by adopting each selected subsample image.
Further, the apparatus further comprises:
the error calculation module is used for determining the error of the convolutional neural network according to the prediction of the position and the category of the object in the subsample image by the convolutional neural network and the information of the target object marked in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:
$$
\begin{aligned}
loss ={} & \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\
& +\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_{i}(c)-\hat{P}_{i}(c)\right)^{2}
\end{aligned}
$$

wherein S, B, $x_i$, $\hat{x}_i$, $y_i$, $\hat{y}_i$, $w_i$, $\hat{w}_i$, $h_i$, $\hat{h}_i$, $C_i$, $\hat{C}_i$, $P_i(c)$, $\hat{P}_i(c)$, $\lambda_{coord}$, $\lambda_{noobj}$, $\mathbb{1}_{ij}^{obj}$, $\mathbb{1}_{i}^{obj}$ and $\mathbb{1}_{i}^{noobj}$ have the same meanings as defined above for the method, and $\hat{P}_i(c)=P_r(\mathrm{Object})\cdot P_r(\mathrm{Class}_c\mid\mathrm{Object})$, wherein $P_r(\mathrm{Object})$ is the predicted probability of whether an object currently exists in grid i and $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ is the conditional probability that an object within the predicted grid i belongs to category c.
Further, the determining module is specifically configured to determine, according to the position parameter of the central point, the position information of the central point in the grid;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
Further, the determination module is specifically configured to use a set point of the grid as a reference point; and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
The embodiment of the invention provides a method and a device for detecting an object in an image. The method comprises: dividing an image to be detected, whose size is a target size, into a plurality of grids according to a preset dividing mode; inputting the divided image into a convolutional neural network trained in advance and obtaining a plurality of feature vectors of the image output by the convolutional neural network, wherein each grid corresponds to one feature vector; and identifying the maximum value of the category parameters in each feature vector, and determining, when the maximum value is larger than a set threshold value, the position information of the object of the category corresponding to the category parameter according to the central point position parameter and the outline dimension parameter in the feature vector. In the embodiment of the invention, each feature vector corresponding to the image is determined through the convolutional neural network trained in advance, and the category and the position of the object in the image are determined according to the category parameter and the position-related parameters in the feature vector, so the position and the category of the object can be detected simultaneously, which facilitates overall optimization. In addition, since the position and the category of the object are determined from the feature vector corresponding to each grid, no plurality of feature regions needs to be selected, which saves detection time and improves the real-time performance and efficiency of detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative efforts.
FIG. 1 is a schematic view of a process for object detection using R-CNN;
FIG. 2 is a schematic diagram of an object detection process using fast RCNN;
FIG. 3 is a schematic diagram of an object detection process in an image according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a detailed implementation process of object detection in an image according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process of a convolutional neural network according to an embodiment of the present invention;
FIGS. 6A-6D are schematic diagrams illustrating labeling results of a target object according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating a process for constructing the cube structure of FIG. 6D;
fig. 8 is a schematic structural diagram of an object detection apparatus in an image according to an embodiment of the present invention.
Detailed Description
In order to effectively improve the efficiency of object detection, improve the real-time performance of object detection and facilitate the overall optimization of object detection, the embodiment of the invention provides a method and a device for detecting an object in an image.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 3 is a schematic diagram of an object detection process in an image according to an embodiment of the present invention, where the process includes the following steps:
step S301: dividing an image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size.
The embodiment of the invention is applied to the electronic equipment, and the electronic equipment can be a desktop computer, a notebook computer, other intelligent equipment with processing capacity and the like.
After an image to be detected of the target size is obtained, the image to be detected is divided into a plurality of grids according to a preset dividing mode, wherein the preset dividing mode is the same as the dividing mode used on the images when training the convolutional neural network. For example, the image may be divided into a plurality of rows and columns, and the intervals between rows and between columns may be equal or unequal. Of course, the image may also be divided into a plurality of irregular grids, as long as the image to be detected and the images used for convolutional neural network training adopt the same grid division mode.
When the image is divided into a plurality of rows and a plurality of columns, the image may be divided into a plurality of grids having the same number of rows and columns, or may be divided into a plurality of grids having different numbers of rows and columns, and the aspect ratio of each grid after division may be the same or different.
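As a concrete illustration of mapping a point to its grid under an equally spaced division (a sketch; the helper function below is hypothetical):

    def grid_of_point(px, py, img_w, img_h, s_rows, s_cols):
        # Returns the (row, column) index of the grid that contains the
        # point (px, py), assuming an equally spaced division in pixels.
        row = min(int(py / (img_h / s_rows)), s_rows - 1)
        col = min(int(px / (img_w / s_cols)), s_cols - 1)
        return row, col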
Step S302: and inputting the divided image into a convolutional neural network trained in advance, and acquiring a plurality of feature vectors of the image output by the convolutional neural network, wherein each grid corresponds to one feature vector.
In order to detect the category and the position of an object in an image, in the embodiment of the present invention a convolutional neural network is trained, and the feature vector corresponding to each grid is obtained through the trained convolutional neural network. For example, the image may be divided into 49 grids of 7 × 7; after the divided image is input into the trained convolutional neural network, 49 feature vectors are output, each corresponding to one grid.
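For example, for a 7 × 7 division the network output can be viewed as a 7 × 7 × 25 array holding one 25-dimensional feature vector per grid (a sketch; the flat output layout and the reshape are assumptions for illustration):

    # raw_output: flat network output of length 7 * 7 * 25 = 1225 (assumed layout)
    feature_map = raw_output.reshape(7, 7, 25)
    vec = feature_map[3, 4]   # the feature vector of the grid in row 3, column 4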
Step S303: and identifying the maximum value of the category parameters in the feature vector aiming at the feature vector corresponding to each grid, and determining the position information of the object of the category corresponding to the category parameters according to the central point position parameters and the outline dimension parameters in the feature vector when the maximum value is larger than a set threshold value.
Specifically, the feature vector obtained in the embodiment of the present invention is a multidimensional vector that at least includes a category parameter and a position parameter, wherein the category parameter comprises a plurality of parameters and the position parameter comprises a central point position parameter and an outline dimension parameter. After the feature vector corresponding to each grid is obtained, whether the grid detects an object is judged according to that feature vector: if the maximum value of the plurality of category parameters in the feature vector corresponding to the grid is larger than the set threshold value, the grid detects an object, the category corresponding to that category parameter is the category of the object, and the position of the object can be determined according to the feature vector corresponding to the grid.
Since the position parameters in the feature vectors employed in the convolutional neural network training are determined according to a set method, the position of the object can be determined according to the set method.
In the embodiment of the invention, each feature vector corresponding to the image is determined through the convolutional neural network trained in advance, and the category and the position of the object in the image are determined according to the category parameter and the position-related parameters in the feature vector, so the position and the category of the object can be predicted simultaneously, which facilitates overall optimization. In addition, since the position and the category of the object are determined from the feature vector corresponding to each grid, no plurality of feature regions needs to be selected, which saves detection time and improves the real-time performance and efficiency of detection.
The object detection in the embodiment of the present invention is performed on an image of the target size, where the target size is the uniform size of the images used in training the convolutional neural network. The target size may be any size, as long as the image used in object detection has the same size as the images used in training the convolutional neural network. The target size may be, for example, 1024 × 1024 or 256 × 512, etc.
Therefore, in an embodiment of the present invention, in order to ensure that the images input into the convolutional neural network are all images of a target size, before dividing the image to be detected into a plurality of grids according to a preset dividing manner, the method further includes:
judging whether the size of the image is a target size;
and if not, adjusting the size of the image to the target size.
When the image to be detected is of the target size, subsequent processing is performed on the image directly; when it is of a non-target size, the image to be detected is adjusted to the target size. Adjusting image size belongs to the prior art, and the process is not described in detail in the embodiment of the present invention.
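A minimal sketch of this check-and-resize step, assuming OpenCV is available and a hypothetical 448 × 448 target size (any resampling routine and target size may be used):

    import cv2

    def ensure_target_size(image, target_w=448, target_h=448):
        # Resize only when the image is not already at the target size.
        # The 448 x 448 default is an assumption for illustration.
        h, w = image.shape[:2]
        if (w, h) != (target_w, target_h):
            image = cv2.resize(image, (target_w, target_h))
        return image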
Specifically, in the embodiment of the present invention, determining the position information of the object of the category corresponding to the category parameter according to the center point position parameter and the outline dimension parameter in the feature vector includes:
determining the position information of the central point in the grid according to the position parameter of the central point;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
Wherein the determining the position information of the central point in the grid according to the position parameter of the central point comprises:
using the set points of the grid as reference points; and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
Fig. 4 is a schematic diagram of a detailed implementation process of object detection in an image according to an embodiment of the present invention, where the process includes the following steps:
step S401: an image to be detected is received.
Step S402: and judging whether the size of the image is the target size, if so, performing step S404, and otherwise, performing step S403.
Step S403: and adjusting the size of the image to a target size.
Step S404: dividing an image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size.
Step S405: and inputting the divided image into a convolutional neural network trained in advance, and acquiring a plurality of feature vectors of the image output by the convolutional neural network, wherein each grid corresponds to one feature vector.
Step S406: and identifying the maximum value of the category parameter in the feature vector aiming at the feature vector corresponding to each grid.
Step S407: and when the maximum value is larger than a set threshold value, taking the set point of the grid as a reference point, and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
Step S408: and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
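Steps S407 and S408 can be sketched as follows (a sketch under the assumptions that the reference point is the grid's upper-left corner and that the position parameters are expressed in pixels; the function name is hypothetical):

    def decode_box(row, col, x, y, w, h, cell_w, cell_h):
        # (x, y): offset of the central point from the grid's upper-left
        # corner (the reference point); (w, h): outline dimensions of the
        # rectangular frame. Pixel units are assumed for illustration.
        cx = col * cell_w + x
        cy = row * cell_h + y
        left, top = cx - w / 2.0, cy - h / 2.0
        return left, top, w, h   # position information of the rectangular frame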
The target detection is performed based on the trained convolutional neural network, and in order to detect the object, the convolutional neural network needs to be trained. In the embodiment of the invention, when the convolutional neural network is trained, a sample image with a target size is divided into a plurality of grids, and if the central point of a certain target object is located in a certain grid, the grid is responsible for detecting the target object, including detecting the type and the corresponding position (bounding box) of the target object.
Fig. 5 is a schematic diagram of a training process of a convolutional neural network according to an embodiment of the present invention, where the training process includes the following steps:
step S501: and marking the target object by adopting a rectangular frame aiming at each sample image in the sample image set.
In the embodiment of the invention, a large number of sample images are adopted to train the convolutional neural network, and then the large number of sample images form a sample image set. A rectangular frame is used to mark the target object in each sample image.
Specifically, as shown in fig. 6A to fig. 6D, the labeling result of the target object is illustrated schematically, and 3 target objects, namely, a dog, a bicycle, and a car, exist in the sample image in fig. 6A. When labeling each target object, the vertices of each target object in four directions, i.e., up, down, left, and right (with respect to the up, down, left, and right directions shown in fig. 6A) are identified in the sample image, and if the vertices are the up and down vertices, two lines parallel to the upper and lower bottom sides of the sample image passing through the up and down vertices are defined as two sides of the rectangular frame, and if the vertices are the left and right vertices, two lines parallel to the left and right sides of the sample image passing through the left and right vertices are defined as the other two sides of the rectangular frame. Such as the rectangular boxes of dogs, bicycles and cars marked with dashed lines in fig. 6A.
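A minimal sketch of deriving the labeled rectangular frame from the four extreme vertices, assuming the target object's outline is given as a list of points (the helper below is hypothetical):

    import numpy as np

    def label_rectangle(contour_points):
        # contour_points: (N, 2) array of (x, y) points on the target object.
        # The rectangle's sides pass through the leftmost/rightmost and
        # topmost/bottommost vertices, parallel to the image borders.
        xs, ys = contour_points[:, 0], contour_points[:, 1]
        left, right = xs.min(), xs.max()
        top, bottom = ys.min(), ys.max()
        return left, top, right - left, bottom - top   # (x, y, w, h)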
Step S502: dividing each sample image into a plurality of grids according to a preset dividing mode, determining a characteristic vector corresponding to each grid, wherein the size of each sample image is a target size, when the grid contains a central point of a target object, setting a value of a category parameter corresponding to the category in the characteristic vector corresponding to the grid to be a preset maximum value according to the category of the target object, determining a value of a central point position parameter in the characteristic vector according to the position of the central point in the grid, determining a value of an outline dimension parameter in the characteristic vector according to the size of a marked rectangular frame of the target object, and when the grid does not contain the central point of the target object, setting the value of each parameter in the characteristic vector corresponding to the grid to be zero.
In the embodiment of the present invention, the sample image may be divided into a plurality of grids according to a preset division manner, where the division manner of the sample image is the same as the division manner of the image to be detected in the detection process.
For example, the image may be divided into a plurality of rows and columns, and the intervals between rows and between columns may be equal or unequal. Of course, the image may also be divided into a plurality of irregular grids, as long as the image to be detected and the images used for convolutional neural network training adopt the same grid division mode.
When the image is divided into a plurality of rows and a plurality of columns, it may be divided into grids having the same number of rows and columns, or into grids having different numbers of rows and columns, and the aspect ratios of the divided grids may be the same or different. For example, the sample image may be divided into 12 × 10, 15 × 15 or 6 × 6 grids, etc. When the grids are of equal size, the grid size may be normalized. As shown in fig. 6B, in the embodiment of the present invention, the sample image is divided into 7 rows and 7 columns of grids, and each grid is square, so the size of each grid after normalization can be regarded as 1 × 1.
Each grid in the sample image corresponds to a feature vector. The feature vector is a multidimensional vector that at least includes a category parameter and a position parameter, wherein the category parameter comprises a plurality of parameters and the position parameter comprises a central point position parameter and an outline dimension parameter.
Step S503: the convolutional neural network is trained from each sample image for which a feature vector for each mesh is determined.
Specifically, in the embodiment of the present invention, the convolutional neural network may be trained by using all sample images in the sample image set. However, because the sample image set includes a large number of sample images, in order to improve the training efficiency, in the embodiment of the present invention, training the convolutional neural network according to each sample image for which the feature vector of each mesh is determined includes:
selecting sub-sample images from the sample image set, wherein the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set;
and training the convolutional neural network by adopting each selected subsample image.
By randomly selecting a number of subsample images far smaller than the total number of sample images, the convolutional neural network is trained and its parameters are continuously updated until the error between the information of the object predicted by each grid and the information of the labeled target object converges.
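A minimal sketch of this random subsample selection, assuming NumPy and an illustrative batch size:

    import numpy as np

    def sample_batch(sample_images, labels, batch_size=64):
        # Randomly select far fewer images than the whole set for one
        # parameter-update step; batch_size = 64 is an assumed value.
        idx = np.random.choice(len(sample_images), size=batch_size, replace=False)
        return [sample_images[k] for k in idx], [labels[k] for k in idx]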
Similarly, in the embodiment of the present invention, when training the convolutional neural network, a sample image of a target size is used, and therefore, in the embodiment of the present invention, in order to ensure that the sample images input into the convolutional neural network are all of the target size, before dividing each sample image into a plurality of grids according to a preset dividing manner, the method further includes:
judging whether the size of each sample image is a target size or not;
and if not, adjusting the size of the sample image to the target size.
When the sample image is of the target size, subsequent processing is performed on the sample image directly; when it is not, the sample image is adjusted to the target size. Adjusting image size belongs to the prior art, and the process is not described in detail in the embodiment of the present invention.
In the above process, either the sample image may be adjusted to the target size first, or the rectangular frame may be labeled in the sample image first. Labeling the rectangular frame first ensures that the target object can be accurately labeled when the sample image is large; adjusting to the target size first ensures that the target object can be accurately labeled when the sample image is small.
In the above labeling process, a feature vector corresponding to each grid in the sample image is determined. In an embodiment of the present invention, the feature vector corresponding to each grid may be represented as (confidence, cls1, cls2, cls3, …, cls20, x, y, w, h), where confidence is a probability parameter, cls1, cls2, cls3, …, cls20 are category parameters, and x, y, w and h are position parameters, among which x and y are central point position parameters and w and h are outline dimension parameters. When the grid contains the central point of a target object, the value of each parameter in the feature vector corresponding to the grid is determined as described below; when the grid does not contain the central point of a target object, the value of each parameter in the feature vector corresponding to the grid is 0.
Specifically, since each target object is labeled by using a rectangular frame in the sample image, the center point of the rectangular frame may be considered as the center point of the target object, such as the center points of the three rectangular frames shown in fig. 6C. When the grid includes the center point of the target object, then during labeling, the probability parameter in the feature vector corresponding to the grid may be considered to be 1, that is, the probability that the target object exists in the grid is 1 at present.
Since the sample images contain target objects of a plurality of categories, the category parameter cls is used in the embodiment of the present invention to represent target objects of different categories, namely cls1, cls2, …, clsn. For example, n may be 20, i.e., there are 20 categories of objects in total; the category represented by cls1 is car, the category represented by cls2 is dog, and the category represented by cls3 is bicycle. When the grid contains the central point of a target object, the category parameter value corresponding to that target object's category is set to a maximum value, where the maximum value is greater than the set threshold; for example, the maximum value may be 1 and the threshold may be 0.4, etc.
For example, as shown in fig. 6C, from bottom to top (top and bottom shown in fig. 6C), in the feature vector corresponding to the grid where each central point is located, cls2 in the class parameter in the feature vector corresponding to the first central point is 1, the other class parameters are 0, cls3 in the class parameter in the feature vector corresponding to the second central point is 1, the other class parameters are 0, cls1 in the class parameter in the feature vector corresponding to the third central point is 1, and the other class parameters are 0.
The feature vector further includes the position parameters x, y, w and h of the target object. Among them, x and y are the central point position parameters, whose values are the horizontal and vertical coordinates of the central point of the target object relative to a set point. The set points corresponding to the grids may be the same or different: for example, the upper left corner of the sample image may be taken as the set point, i.e., the origin of coordinates; since each grid is normalized, the coordinates of each position in each grid are then uniquely determined. Of course, to simplify the process and reduce the amount of calculation, each grid may also have its own set point, with each grid regarded as an independent unit whose upper left corner is the set point, i.e., the origin of coordinates. In that case, when labeling, the values of x and y in the feature vector corresponding to the grid can be determined according to the offset of the central point relative to the upper left corner of the grid where it is located. Determining the x and y values from a relative-position offset belongs to the prior art and is not described in detail in the embodiment of the present invention. Among the position parameters, w and h are the outline dimension parameters, whose values are the width and height of the rectangular frame where the target object is located.
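Putting the above together, the following sketch encodes one labeled sample into per-grid target vectors (assumptions for illustration: a 7 × 7 division, 20 categories, a 448 × 448 image, the grid's upper-left corner as set point, and pixel units):

    import numpy as np

    def encode_target(boxes, s=7, num_classes=20, img_w=448, img_h=448):
        # boxes: list of (class_index, cx, cy, w, h) in pixels (assumed units).
        # Layout per grid: (confidence, cls1..cls20, x, y, w, h).
        target = np.zeros((s, s, 5 + num_classes), dtype=np.float32)
        cell_w, cell_h = img_w / s, img_h / s
        for cls, cx, cy, w, h in boxes:
            col = min(int(cx / cell_w), s - 1)
            row = min(int(cy / cell_h), s - 1)
            target[row, col, 0] = 1.0           # central point present in grid
            target[row, col, 1 + cls] = 1.0     # category parameter set to max
            # central point offsets relative to the grid's upper-left corner
            target[row, col, 21] = cx - col * cell_w
            target[row, col, 22] = cy - row * cell_h
            target[row, col, 23] = w            # outline dimension parameters
            target[row, col, 24] = h
        return target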
Because the feature vector is a multidimensional vector, in order to accurately represent the feature vector corresponding to each mesh, in the embodiment of the present invention, the cubic structure shown in fig. 6D is constructed according to the construction method shown in fig. 7, and the mesh is correspondingly processed in the convolutional layer, the max pooling layer, the full connection layer, and the output layer, so as to generate a cubic mesh structure, where the depth of the cubic mesh in the Z-axis direction is determined according to the dimension of the feature vector. In the present embodiment, the depth of the cubic grid in the Z-axis direction is 25. The above process of performing corresponding processing in each layer of the convolutional neural network to generate the cubic grid structure belongs to the prior art, and is not described in detail in the embodiment of the present invention.
After a large number of sample images are labeled in the above manner, the labeled sample images are used to train the convolutional neural network. Specifically, in the embodiment of the present invention, a plurality of subsample images are used to train the convolutional neural network. In the training process, for each subsample image, a convolution feature map of the subsample image is obtained through the convolutional neural network; the convolution feature map comprises the feature vector (confidence, cls1, cls2, cls3, …, cls20, x, y, w, h) corresponding to each grid, which contains the predicted position parameters and category parameters of the object in the grid as well as the probability parameter confidence, where confidence represents the degree of overlap between the rectangular frame in which the grid predicts the object to be located and the labeled rectangular frame of the target object.
In the training process, for each subsample image, the network parameters of the convolutional neural network are adjusted by calculating the error between the prediction information and the labeling information; each time, a number (batch) of subsample images far smaller than the total number of sample images is randomly selected to train the convolutional neural network, and the network parameters are updated until the error between the prediction information and the labeling information of each grid converges. Training the convolutional neural network on the subsample images and adjusting its network parameters until training is complete belongs to the prior art and is not described in detail in the embodiment of the invention.
In the training process of the convolutional neural network, in order to accurately predict the position and category information of the object, in the embodiment of the present invention the last fully-connected layer of the convolutional neural network uses a logistic activation function, while the convolutional layers and the other fully-connected layers use a Leaky ReLU function, where the Leaky ReLU function is:

$$\phi(x)=\begin{cases}x, & x>0\\ 0.1x, & \text{otherwise}\end{cases}$$
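A minimal sketch of the two activation functions described above (the 0.1 slope of the Leaky ReLU is an assumption for illustration):

    import numpy as np

    def leaky_relu(x, slope=0.1):
        # Used in the convolutional and hidden fully-connected layers;
        # the 0.1 slope value is an assumption for illustration.
        return np.where(x > 0, x, slope * x)

    def logistic(x):
        # Logistic activation used by the last fully-connected layer.
        return 1.0 / (1.0 + np.exp(-x))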
in order to complete the training of the convolutional neural network and make it converge in the embodiment of the present invention, when training the convolutional neural network, the method further includes:
determining the error of the convolutional neural network according to the prediction of the position and the type of the target object in the subsample image by the convolutional neural network and the information of the target object marked in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:
$$
\begin{aligned}
loss ={} & \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\
& +\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
& +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_{i}(c)-\hat{P}_{i}(c)\right)^{2}
\end{aligned}
$$

wherein S is the number of rows or columns of the divided grids (the number of rows and the number of columns being the same); B is the preset number of rectangular frames predicted per grid, generally 1 or 2; $x_i$ is the abscissa of the labeled target object's center point in grid i and $\hat{x}_i$ is the abscissa of the predicted object's center point in grid i; $y_i$ is the ordinate of the labeled target object's center point in grid i and $\hat{y}_i$ is the ordinate of the predicted object's center point in grid i; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object, and $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i and $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object in grid i belongs to category c and $\hat{P}_i(c)$ is the predicted probability that the object in grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object is predicted to exist in grid i and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object is predicted to exist in grid i and 0 otherwise; wherein $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(\mathrm{Object})\cdot P_r(\mathrm{Class}_c\mid\mathrm{Object})$$

wherein $P_r(\mathrm{Object})$ is the predicted probability of whether an object currently exists in grid i, and $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ is the conditional probability that an object within the predicted grid i belongs to category c.
The square roots of the width and the height are used in the above loss function so that the same prediction error contributes less to the loss for a large rectangular frame than for a small one; for this reason, the above loss function is adopted in the embodiment of the present invention.
As shown in fig. 6B, in an embodiment of the present invention, each sample image is divided into 49 grids of 7 × 7, and each grid can detect 20 categories, so one sample image generates 980 detection probabilities, most of which are 0, which would make the training unstable. A variable is therefore introduced to solve this problem: the probability of whether an object is present in a given grid. Thus, in addition to the 20 category parameters, the network also predicts the probability $P_r(\mathrm{Object})$ that an object is currently present in the grid; the probability that a target object within a certain grid belongs to category c is the product of $P_r(\mathrm{Object})$ and the conditional probability $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ that an object in the predicted grid belongs to category c. At each grid, $P_r(\mathrm{Object})$ is updated, while $P_r(\mathrm{Class}_c\mid\mathrm{Object})$ is updated only when objects exist in the grid.
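For B = 1 predicted rectangular frame per grid, the above loss can be sketched as follows (a sketch, not the patented implementation; the (S, S, 25) tensor layout and the weight values λ_coord = 5 and λ_noobj = 0.5 are assumptions for illustration):

    import numpy as np

    def grid_loss(pred, truth, lam_coord=5.0, lam_noobj=0.5):
        # pred, truth: (S, S, 25) arrays laid out as
        # (confidence, cls1..cls20, x, y, w, h); B = 1 frame per grid assumed.
        obj = truth[..., 0]          # 1 where a labeled center point exists
        noobj = 1.0 - obj
        coord = lam_coord * np.sum(obj * ((truth[..., 21] - pred[..., 21]) ** 2 +
                                          (truth[..., 22] - pred[..., 22]) ** 2))
        # square roots damp the contribution of large rectangular frames;
        # np.abs guards against negative raw width/height predictions
        size = lam_coord * np.sum(obj * (
            (np.sqrt(truth[..., 23]) - np.sqrt(np.abs(pred[..., 23]))) ** 2 +
            (np.sqrt(truth[..., 24]) - np.sqrt(np.abs(pred[..., 24]))) ** 2))
        conf_err = (truth[..., 0] - pred[..., 0]) ** 2
        conf = np.sum(obj * conf_err) + lam_noobj * np.sum(noobj * conf_err)
        cls = np.sum(obj[..., None] * (truth[..., 1:21] - pred[..., 1:21]) ** 2)
        return coord + size + conf + cls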
Fig. 8 is a schematic structural diagram of an apparatus for detecting an object in an image according to an embodiment of the present invention, where the apparatus is located in an electronic device, and the apparatus includes:
the dividing module 81 is configured to divide an image to be detected into a plurality of grids according to a preset dividing manner, where the size of the image to be detected is a target size;
the detection module 82 is configured to input the divided image into a convolutional neural network trained in advance, and obtain a plurality of feature vectors of the image output by the convolutional neural network, where each grid corresponds to one feature vector;
the determining module 83 is configured to identify, for the feature vector corresponding to each grid, the maximum value of the category parameters in the feature vector, and determine, when the maximum value is larger than a set threshold value, the position information of the object of the category corresponding to the category parameter according to the central point position parameter and the outline dimension parameter in the feature vector.
The device further comprises:
a judgment adjustment module 84, configured to judge whether the size of the image is a target size; and if not, adjusting the size of the image to the target size.
The device further comprises:
a training module 85, configured to label a target object with a rectangular frame for each sample image in the sample image set; dividing each sample image into a plurality of grids according to a preset dividing mode, determining a characteristic vector corresponding to each grid, wherein the size of each sample image is a target size, when the grid contains a central point of a target object, setting a value of a category parameter corresponding to the category in the characteristic vector corresponding to the grid to be a preset maximum value according to the category of the target object, determining a value of a central point position parameter in the characteristic vector according to the position of the central point in the grid, determining a value of an outline dimension parameter in the characteristic vector according to the size of a marked rectangular frame of the target object, and when the grid does not contain the central point of the target object, setting the value of each parameter in the characteristic vector corresponding to the grid to be zero; the convolutional neural network is trained from each sample image for which a feature vector for each grid is determined.
The training module 85 is further configured to determine, for each sample image, whether the size of the sample image is a target size; and if not, adjusting the size of the sample image to the target size.
The training module 85 is specifically configured to select sub-sample images from the sample image set, where the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set; and training the convolutional neural network by adopting each selected subsample image.
The device further comprises:
an error calculation module 86, configured to determine an error of the convolutional neural network according to the prediction of the position and the category of the object in the subsample image by the convolutional neural network and information of the target object labeled in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 + \lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_i(c)-\hat{P}_i(c)\right)^2
\end{aligned}
$$

wherein S is the number of rows of the divided grids (equal to the number of columns); B is the preset number of predicted rectangular frames per grid, generally taking 1 or 2; $x_i$ and $y_i$ are the abscissa and ordinate, within grid i, of the labeled center point of the target object; $\hat{x}_i$ and $\hat{y}_i$ are the abscissa and ordinate, within grid i, of the predicted center point of the object; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object; $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i; $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object within grid i belongs to category c; $\hat{P}_i(c)$ is the predicted probability that the object within grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i, and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object exists in grid i, and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object exists in grid i, and 0 otherwise. $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(Class\mid Object)\times P_r(Object)$$

wherein $P_r(Object)$ is the predicted probability of whether an object currently exists in grid i, and $P_r(Class\mid Object)$ is the conditional probability that an object within grid i belongs to class c.
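For illustration, the loss above can be transcribed directly into NumPy for the case B = 1; the (S, S, 25) tensor layout ([20 class probabilities, x, y, w, h, C] per grid) is an assumption carried over from the sketches above, not part of the patent:

```python
import numpy as np

def detection_loss(truth, pred, lam_coord=5.0, lam_noobj=0.5):
    """Squared-error loss over (S, S, 25) tensors laid out as
    [20 class probs, x, y, w, h, C] per grid -- an assumed layout
    with B = 1 predicted rectangular frame per grid.
    """
    obj = truth[..., 24] > 0          # 1_i^obj: grid holds a labeled center point
    noobj = ~obj                      # 1_i^noobj

    d = truth - pred
    coord = lam_coord * np.sum(obj * (d[..., 20] ** 2 + d[..., 21] ** 2))
    # Square roots damp the influence of size errors on large rectangles;
    # predicted w and h are assumed nonnegative here.
    size = lam_coord * np.sum(obj * (
        (np.sqrt(truth[..., 22]) - np.sqrt(pred[..., 22])) ** 2
        + (np.sqrt(truth[..., 23]) - np.sqrt(pred[..., 23])) ** 2))
    conf = (np.sum(obj * d[..., 24] ** 2)
            + lam_noobj * np.sum(noobj * d[..., 24] ** 2))
    cls = np.sum(obj[..., None] * d[..., :20] ** 2)
    return coord + size + conf + cls
```

Per the error calculation module, training would be judged complete once successive values of this loss converge.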
The determining module 83 is specifically configured to: determine the position information of the central point in the grid according to the central point position parameter; and determine the central point according to the position information, take the central point as the center of a rectangular frame, determine the position information of the rectangular frame according to the outline dimension parameter, take the position information of the rectangular frame as the position information of the object, and take the object category corresponding to the category parameter as the category of the object.
The determining module 83 is further specifically configured to take a set point of the grid as a reference point, and to determine the position information of the central point in the grid according to the reference point and the central point position parameters.
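A sketch of the decoding performed by the determining module, assuming the grid's top-left corner as the "set point" reference and normalized offsets; the patent does not fix these conventions:

```python
def decode_box(grid_i, grid_j, vec, S=7, C=20):
    """Turn one grid's feature vector into absolute rectangle coordinates.

    The grid's top-left corner serves as the reference (set) point; the
    center point is reference + predicted offset, and the rectangle is
    centered there with the predicted outline size. All values normalized.
    """
    x_off, y_off, w, h = vec[C:C + 4]
    cx = (grid_j + x_off) / S          # center point abscissa
    cy = (grid_i + y_off) / S          # center point ordinate
    # Rectangle position information: corners of the box centered on (cx, cy).
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```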
The embodiment of the invention provides a method and a device for detecting an object in an image. The method comprises: dividing an image to be detected, whose size is a target size, into a plurality of grids according to a preset dividing mode; inputting the divided image into a convolutional neural network trained in advance and obtaining a plurality of feature vectors of the image output by the convolutional neural network, each grid corresponding to one feature vector; and, for each feature vector, identifying the maximum value of the category parameters and, when that maximum value is larger than a set threshold, determining the position information of the object of the corresponding category according to the central point position parameter and the outline size parameter in the feature vector. In the embodiment of the invention, the feature vectors of the image are determined by the convolutional neural network trained in advance, and the category and the position of the object are determined from the category parameter and the position-related parameters in each feature vector, so the position and the category of the object are detected simultaneously, which facilitates overall optimization. Moreover, since the position and the category are determined from the feature vector of each grid, there is no need to select a plurality of candidate feature regions, which saves detection time and improves detection real-time performance and detection efficiency.
For the system/apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A method for detecting an object in an image, applied to an electronic device, comprising:
dividing an image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size;
inputting the divided images into a convolutional neural network trained in advance, and acquiring a plurality of feature vectors of the images output by the convolutional neural network, wherein each grid corresponds to one feature vector;
identifying the maximum value of the category parameters in the feature vector aiming at the feature vector corresponding to each grid, and determining the position information of the object of the category corresponding to the category parameters according to the central point position parameters and the overall dimension parameters in the feature vector when the maximum value is larger than a set threshold value;
wherein, the preset dividing mode comprises:
dividing the image and the sample image into a plurality of grids with the same row number and column number;
the method further comprises the following steps:
determining the error of the convolutional neural network according to the prediction of the position and the type of the object in the subsample image by the convolutional neural network and the information of the target object marked in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 + \lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_i(c)-\hat{P}_i(c)\right)^2
\end{aligned}
$$

wherein S is the number of rows of the divided grids (equal to the number of columns); B is the preset number of predicted rectangular frames per grid, taking 1 or 2; $x_i$ and $y_i$ are the abscissa and ordinate, within grid i, of the labeled center point of the target object; $\hat{x}_i$ and $\hat{y}_i$ are the abscissa and ordinate, within grid i, of the predicted center point of the object; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object; $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i; $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object within grid i belongs to category c; $\hat{P}_i(c)$ is the predicted probability that the object within grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i, and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object exists in grid i, and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object exists in grid i, and 0 otherwise, wherein $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(Class\mid Object)\times P_r(Object)$$

wherein $P_r(Object)$ is the predicted probability of whether an object currently exists in grid i, and $P_r(Class\mid Object)$ is the conditional probability that an object within grid i belongs to class c.
2. The method according to claim 1, wherein before the dividing the image to be detected into a plurality of grids according to the preset dividing manner, the method further comprises:
judging whether the size of the image is a target size;
and if not, adjusting the size of the image to the target size.
3. The method of claim 1, wherein the training process of the convolutional neural network comprises:
aiming at each sample image in the sample image set, adopting a rectangular frame to mark a target object;
dividing each sample image into a plurality of grids according to the preset dividing mode and determining a feature vector corresponding to each grid, wherein the size of each sample image is the target size; when a grid contains the center point of a target object, setting the value of the category parameter corresponding to the category of the target object in the feature vector of that grid to a preset maximum value, determining the value of the center point position parameter in the feature vector according to the position of the center point in the grid, and determining the value of the outline dimension parameter in the feature vector according to the size of the labeled rectangular frame of the target object; and when a grid does not contain the center point of any target object, setting the value of each parameter in the feature vector of that grid to zero;
the convolutional neural network is trained from each sample image for which a feature vector for each mesh is determined.
4. The method of claim 3, wherein before the dividing each sample image into a plurality of grids according to the preset dividing manner, the method further comprises:
judging whether the size of each sample image is a target size or not;
and if not, adjusting the size of the sample image to the target size.
5. The method of claim 3, wherein training the convolutional neural network based on each sample image for which a feature vector for each mesh is determined comprises:
selecting sub-sample images from the sample image set, wherein the number of the selected sub-sample images is smaller than the number of the sample images in the sample image set;
and training the convolutional neural network by adopting each selected subsample image.
6. The method according to claim 1, wherein the determining, according to the center point position parameter and the outline size parameter in the feature vector, the position information of the object of the category corresponding to the category parameter comprises:
determining the position information of the central point in the grid according to the position parameter of the central point;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
7. The method of claim 6, wherein the determining the position information of the center point in the grid according to the position parameter of the center point comprises:
using the set points of the grid as reference points; and determining the position information of the central point in the grid according to the reference point and the position parameters of the central point.
8. An apparatus for detecting an object in an image, the apparatus comprising:
the dividing module is used for dividing the image to be detected into a plurality of grids according to a preset dividing mode, wherein the size of the image to be detected is a target size;
the detection module is used for inputting the divided images into a convolutional neural network which is trained in advance, and acquiring a plurality of feature vectors of the images output by the convolutional neural network, wherein each grid corresponds to one feature vector;
the determining module is used for identifying the maximum value of the category parameters in the feature vector aiming at the feature vector corresponding to each grid, and determining the position information of the object of the category corresponding to the category parameters according to the central point position parameters and the outline dimension parameters in the feature vector when the maximum value is larger than a set threshold;
wherein the apparatus further comprises:
the error calculation module is configured to determine the error of the convolutional neural network according to the prediction of the position and the category of the object in the subsample image by the convolutional neural network and the information of the target object labeled in the subsample image;
determining that the convolutional neural network training is complete when the error converges, wherein the error is determined using the following loss function:

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 + \lambda_{noobj}\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(P_i(c)-\hat{P}_i(c)\right)^2
\end{aligned}
$$

wherein S is the number of rows of the divided grids (equal to the number of columns); B is the preset number of predicted rectangular frames per grid, taking 1 or 2; $x_i$ and $y_i$ are the abscissa and ordinate, within grid i, of the labeled center point of the target object; $\hat{x}_i$ and $\hat{y}_i$ are the abscissa and ordinate, within grid i, of the predicted center point of the object; $h_i$ and $w_i$ are the height and width of the labeled rectangular frame of the target object; $\hat{h}_i$ and $\hat{w}_i$ are the height and width of the predicted rectangular frame of the object; $C_i$ is the labeled probability of whether the target object currently exists in grid i; $\hat{C}_i$ is the predicted probability of whether an object currently exists in grid i; $P_i(c)$ is the labeled probability that the target object within grid i belongs to category c; $\hat{P}_i(c)$ is the predicted probability that the object within grid i belongs to category c; $\lambda_{coord}$ and $\lambda_{noobj}$ are set weight values; $\mathbb{1}_{ij}^{obj}$ takes 1 when the center point of the object in the j-th predicted rectangular frame is located in grid i, and 0 otherwise; $\mathbb{1}_{i}^{obj}$ takes 1 when the center point of an object exists in grid i, and 0 otherwise; and $\mathbb{1}_{i}^{noobj}$ takes 1 when no center point of an object exists in grid i, and 0 otherwise, wherein $\hat{P}_i(c)$ is determined according to the following formula:

$$\hat{P}_i(c)=P_r(Class\mid Object)\times P_r(Object)$$

wherein $P_r(Object)$ is the predicted probability of whether an object currently exists in grid i, and $P_r(Class\mid Object)$ is the conditional probability that an object within grid i belongs to class c.
9. The apparatus of claim 8, further comprising:
the judging and adjusting module is used for judging whether the size of the image is a target size; and if not, adjusting the size of the image to the target size.
10. The apparatus of claim 8, further comprising:
the training module is configured to: label the target object with a rectangular frame for each sample image in the sample image set; divide each sample image, whose size is the target size, into a plurality of grids according to the preset dividing mode and determine a feature vector corresponding to each grid, wherein, when a grid contains the center point of a target object, the value of the category parameter corresponding to the category of the target object in the feature vector of that grid is set to a preset maximum value, the value of the center point position parameter is determined according to the position of the center point in the grid, and the value of the outline dimension parameter is determined according to the size of the labeled rectangular frame of the target object, and, when a grid does not contain the center point of any target object, the value of each parameter in the feature vector of that grid is set to zero; and train the convolutional neural network from the sample images for which the feature vector of each grid has been determined.
11. The apparatus of claim 10, wherein the training module is further configured to determine, for each sample image, whether the size of the sample image is a target size; and if not, adjusting the size of the sample image to the target size.
12. The apparatus according to claim 11, wherein the training module is specifically configured to select sub-sample images from the sample image set, wherein the number of the selected sub-sample images is smaller than the number of sample images in the sample image set; and training the convolutional neural network by adopting each selected subsample image.
13. The apparatus according to claim 8, wherein the determining module is specifically configured to determine the position information of the central point in the grid according to the position parameter of the central point;
and determining the central point according to the position information, taking the central point as the center of a rectangular frame, determining the position information of the rectangular frame according to the outline dimension parameter, taking the position information of the rectangular frame as the position information of the object, and taking the object type corresponding to the type parameter as the type of the object.
14. The apparatus according to claim 13, wherein the determining module is specifically configured to take a set point of the grid as a reference point; and determine the position information of the central point in the grid according to the reference point and the central point position parameters.
CN201611249792.1A 2016-12-29 2016-12-29 Method and device for detecting object in image Active CN106803071B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201611249792.1A CN106803071B (en) 2016-12-29 2016-12-29 Method and device for detecting object in image
PCT/CN2017/107043 WO2018121013A1 (en) 2016-12-29 2017-10-20 Systems and methods for detecting objects in images
EP17886017.7A EP3545466A4 (en) 2016-12-29 2017-10-20 Systems and methods for detecting objects in images
US16/457,861 US11113840B2 (en) 2016-12-29 2019-06-28 Systems and methods for detecting objects in images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611249792.1A CN106803071B (en) 2016-12-29 2016-12-29 Method and device for detecting object in image

Publications (2)

Publication Number Publication Date
CN106803071A CN106803071A (en) 2017-06-06
CN106803071B true CN106803071B (en) 2020-02-14

Family

ID=58985345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611249792.1A Active CN106803071B (en) 2016-12-29 2016-12-29 Method and device for detecting object in image

Country Status (1)

Country Link
CN (1) CN106803071B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3545466A4 (en) 2016-12-29 2019-11-27 Zhejiang Dahua Technology Co., Ltd. Systems and methods for detecting objects in images
CN107392158A (en) * 2017-07-27 2017-11-24 济南浪潮高新科技投资发展有限公司 A kind of method and device of image recognition
CN108229307B (en) 2017-11-22 2022-01-04 北京市商汤科技开发有限公司 Method, device and equipment for object detection
CN108062547B (en) * 2017-12-13 2021-03-09 北京小米移动软件有限公司 Character detection method and device
CN110110189A (en) * 2018-02-01 2019-08-09 北京京东尚科信息技术有限公司 Method and apparatus for generating information
CN108460761A (en) * 2018-03-12 2018-08-28 北京百度网讯科技有限公司 Method and apparatus for generating information
CN108960232A (en) * 2018-06-08 2018-12-07 Oppo广东移动通信有限公司 Model training method and device, electronic equipment and computer readable storage medium
US11373411B1 (en) 2018-06-13 2022-06-28 Apple Inc. Three-dimensional object estimation using two-dimensional annotations
CN110610184B (en) * 2018-06-15 2023-05-12 阿里巴巴集团控股有限公司 Method, device and equipment for detecting salient targets of images
CN108968811A (en) * 2018-06-20 2018-12-11 四川斐讯信息技术有限公司 A kind of object identification method and system of sweeping robot
CN108921840A (en) * 2018-07-02 2018-11-30 北京百度网讯科技有限公司 Display screen peripheral circuit detection method, device, electronic equipment and storage medium
CN109272050B (en) * 2018-09-30 2019-11-22 北京字节跳动网络技术有限公司 Image processing method and device
CN109558791B (en) * 2018-10-11 2020-12-01 浙江大学宁波理工学院 Bamboo shoot searching device and method based on image recognition
CN109726741B (en) * 2018-12-06 2023-05-30 江苏科技大学 Method and device for detecting multiple target objects
CN109685069B (en) * 2018-12-27 2020-03-13 乐山师范学院 Image detection method, device and computer readable storage medium
US10460210B1 (en) * 2019-01-22 2019-10-29 StradVision, Inc. Method and device of neural network operations using a grid generator for converting modes according to classes of areas to satisfy level 4 of autonomous vehicles
CN111597845A (en) * 2019-02-20 2020-08-28 中科院微电子研究所昆山分所 Two-dimensional code detection method, device and equipment and readable storage medium
CN111639660B (en) * 2019-03-01 2024-01-12 中科微至科技股份有限公司 Image training method, device, equipment and medium based on convolution network
CN109961107B (en) * 2019-04-18 2022-07-19 北京迈格威科技有限公司 Training method and device for target detection model, electronic equipment and storage medium
CN111914850B (en) * 2019-05-07 2023-09-19 百度在线网络技术(北京)有限公司 Picture feature extraction method, device, server and medium
CN110338835B (en) * 2019-07-02 2023-04-18 深圳安科高技术股份有限公司 Intelligent scanning three-dimensional monitoring method and system
CN110930386B (en) * 2019-11-20 2024-02-20 重庆金山医疗技术研究院有限公司 Image processing method, device, equipment and storage medium
CN111353555A (en) * 2020-05-25 2020-06-30 腾讯科技(深圳)有限公司 Label detection method and device and computer readable storage medium
CN112084874B (en) * 2020-08-11 2023-12-29 深圳市优必选科技股份有限公司 Object detection method and device and terminal equipment
CN112446867B (en) * 2020-11-25 2023-05-30 上海联影医疗科技股份有限公司 Method, device, equipment and storage medium for determining blood flow parameters
CN112785564B (en) * 2021-01-15 2023-06-06 武汉纺织大学 Pedestrian detection tracking system and method based on mechanical arm
CN113935425B (en) * 2021-10-21 2024-08-16 中国船舶集团有限公司第七一一研究所 Object identification method, device, terminal and storage medium
CN114739388B (en) * 2022-04-20 2023-07-14 中国移动通信集团广东有限公司 Indoor positioning and navigation method and system based on UWB and laser radar

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104517113A (en) * 2013-09-29 2015-04-15 浙江大华技术股份有限公司 Image feature extraction method and device and image sorting method and device
CN105975931A (en) * 2016-05-04 2016-09-28 浙江大学 Convolutional neural network face recognition method based on multi-scale pooling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
End-to-end people detection in crowded scenes; Russell Stewart et al.; published online: https://arxiv.org/abs/1506.04878; 2015-07-08; Sections 2 and 3, Figs. 2 and 4 *

Also Published As

Publication number Publication date
CN106803071A (en) 2017-06-06

Similar Documents

Publication Publication Date Title
CN106803071B (en) Method and device for detecting object in image
US11878433B2 (en) Method for detecting grasping position of robot in grasping object
CN108805016B (en) Head and shoulder area detection method and device
Zhou et al. Exploring faster RCNN for fabric defect detection
WO2017059576A1 (en) Apparatus and method for pedestrian detection
CN110059558A (en) A kind of orchard barrier real-time detection method based on improvement SSD network
CN104424634A (en) Object tracking method and device
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN111292377B (en) Target detection method, device, computer equipment and storage medium
CN111382638B (en) Image detection method, device, equipment and storage medium
CN111738164B (en) Pedestrian detection method based on deep learning
CN114882423A (en) Truck warehousing goods identification method based on improved Yolov5m model and Deepsort
CN112784494B (en) Training method of false positive recognition model, target recognition method and device
CN111080697B (en) Method, apparatus, computer device and storage medium for detecting direction of target object
CN112580435B (en) Face positioning method, face model training and detecting method and device
CN116758631B (en) Big data driven behavior intelligent analysis method and system
CN117763350A (en) Labeling data cleaning method and device
CN112733741B (en) Traffic sign board identification method and device and electronic equipment
CN108475339B (en) Method and system for classifying objects in an image
CN112861689A (en) Searching method and device of coordinate recognition model based on NAS technology
CN117036966B (en) Learning method, device, equipment and storage medium for point feature in map
CN113936255B (en) Intersection multi-steering vehicle counting method and system
CN118097785B (en) Human body posture analysis method and system
US20230410477A1 (en) Method and device for segmenting objects in images using artificial intelligence
Zhang et al. Recognition of italian gesture language based on augmented yolov5 algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant