CN116681884B - Object detection method and related device - Google Patents
- Publication number
- CN116681884B (application CN202310965922.5A)
- Authority
- CN
- China
- Prior art keywords
- detection frame
- detected
- position information
- component
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The application discloses an object detection method and a related device, which can be applied to fields such as digital twinning, automatic driving, assisted driving, intelligent traffic and traffic simulation. An image to be detected, which includes an object to be detected, is acquired, and the position information of a first detection frame corresponding to a first component and the position information of a second detection frame corresponding to a second component are determined in the image to be detected. Based on the position information of the first detection frame and the position information of the second detection frame, a first positional relationship between the two detection frames in the image coordinate system is determined. Based on the first positional relationship, the heading pose of the object to be detected in the image coordinate system is calculated using the position information of the first detection frame and the position information of the second detection frame. The heading pose more accurately reflects the actual condition of the object to be detected in the real scene, improving the accuracy of subsequent processing. In a digital twin scene, the method can reduce the difference between the result of projecting the object into 3D space and the situation in the real scene, improving the fidelity of the twin.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to an object detection method and a related device.
Background
Object sensing technology generally uses a sensor to collect related information of an object, and further uses the related information to perform subsequent processing. With the development of technology, object sensing technology is widely applied to various scenes, such as an automatic driving scene, an intelligent traffic scene, a digital twin scene, and the like.
When current object sensing technology is applied to such scenes, an image including the object is usually collected, and the image is then detected and recognized to obtain the position of the object in the image, after which the position information of the object is used for subsequent processing. Taking a digital twin scene as an example, a specific implementation of digital twinning may be to acquire an image from an image acquisition device, identify the position of an object in the image by using target detection technology, where the position is a 2D (Two Dimensional) position, and project the 2D position of the object in the image into 3D (Three Dimensional) space by using the internal and external parameters of the image acquisition device, thereby providing 3D position information of the object available for real-time twinning.
However, the acquired image is a 2D image, and the 2D position information of the object obtained through object sensing technology can hardly reflect the actual situation of the object accurately, which in turn affects subsequent processing. For example, in a digital twin scene, the result of projecting the object into 3D space may differ greatly from the situation of the object in the real scene.
Disclosure of Invention
In order to solve the above technical problems, the application provides an object detection method and a related device, which can determine the heading pose of an object to be detected from an image to be detected, thereby more accurately reflecting the actual condition of the object to be detected in the real scene and improving the accuracy of subsequent processing. For example, in a digital twin scene, the difference between the result of projecting the object to be detected into 3D space and the situation of the object to be detected in the real scene can be reduced, greatly improving the fidelity of the twin.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides an object detection method, including:
acquiring an image to be detected, wherein the image to be detected comprises an object to be detected;
determining position information of a first detection frame corresponding to a first component in the image to be detected, and determining position information of a second detection frame corresponding to a second component in the image to be detected, wherein the first component and the second component are different components included in the object to be detected, and the first component and the second component are obtained by carrying out structural division on the object to be detected along the movement direction of the object to be detected;
determining a first positional relationship between the first detection frame and the second detection frame in an image coordinate system based on the position information of the first detection frame and the position information of the second detection frame;
and calculating the heading pose of the object to be detected in the image coordinate system by using the position information of the first detection frame and the position information of the second detection frame based on the first positional relationship.
In one aspect, an embodiment of the present application provides an object detection apparatus, including an acquisition unit, a determination unit, and a calculation unit:
the acquisition unit is used for acquiring an image to be detected, wherein the image to be detected comprises an object to be detected;
the determining unit is configured to determine position information of a first detection frame corresponding to a first component in the image to be detected, and determine position information of a second detection frame corresponding to a second component in the image to be detected, where the first component and the second component are different components included in the object to be detected, and the first component and the second component are obtained by performing structural division on the object to be detected along a movement direction of the object to be detected;
the determining unit is further configured to determine a first positional relationship between the first detection frame and the second detection frame in an image coordinate system based on the position information of the first detection frame and the position information of the second detection frame;
the calculating unit is configured to calculate the heading pose of the object to be detected in the image coordinate system by using the position information of the first detection frame and the position information of the second detection frame based on the first positional relationship.
In one aspect, an embodiment of the present application provides a computer device including a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
the processor is configured to perform the method of any of the preceding aspects according to instructions in the computer program.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of any one of the preceding aspects.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the preceding aspects.
According to the above technical scheme, in order to utilize the related information of the object to be detected, an image to be detected including the object to be detected can be obtained, and then the position information of the first detection frame corresponding to the first component and the position information of the second detection frame corresponding to the second component in the image to be detected are determined. The first component and the second component are different components included in the object to be detected, obtained by structurally dividing the object to be detected along its movement direction, so that the heading pose of the object to be detected can be determined based on the position information of the two detection frames; in this way, further information of the object to be detected can be extracted from the image to be detected, accurately reflecting its actual condition. Since the positions of the first detection frame and the second detection frame may affect how the heading pose is determined, the first positional relationship between the two detection frames in the image coordinate system is determined first, based on their position information; the heading pose of the object to be detected in the image coordinate system is then calculated, based on the first positional relationship, using the position information of the first detection frame and the position information of the second detection frame.
The application can determine the heading pose of the object to be detected from the image to be detected, thereby more accurately reflecting the actual condition of the object to be detected in the real scene and improving the accuracy of subsequent processing. For example, in a digital twin scene, the difference between the result of projecting the object to be detected into 3D space and the situation of the object to be detected in the real scene can be reduced, greatly improving the fidelity of the twin.
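The core idea summarized above — deriving a heading pose from the detection frames of two structurally divided components — can be sketched in Python. This is an illustrative reconstruction, not code from the patent; the box format `(x, y, w, h)`, the use of box centers, and the `atan2`-based angle are all assumptions about one plausible realization.

```python
import math

def box_center(box):
    """Center point of a detection frame given as (x, y, w, h),
    where (x, y) is the top-left vertex in image coordinates."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def heading_pose(first_box, second_box):
    """Heading vector and angle of the object in the image coordinate
    system, assumed here to point from the second component's frame
    (e.g. vehicle tail) toward the first component's frame (e.g. head)."""
    fx, fy = box_center(first_box)
    sx, sy = box_center(second_box)
    dx, dy = fx - sx, fy - sy
    angle_deg = math.degrees(math.atan2(dy, dx))
    return (dx, dy), angle_deg

# Head frame directly to the right of the tail frame: heading points right.
vec, ang = heading_pose((200, 100, 50, 40), (100, 100, 50, 40))
```

In this sketch the heading is fully determined by the two frame centers; the first positional relationship described in the claims would select how the centers and frame edges are combined in less clear-cut configurations.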
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is an exemplary diagram of labeling based on 2D images provided in the related art;
fig. 2 is an application scenario architecture diagram of an object detection method according to an embodiment of the present application;
FIG. 3 is a flowchart of an object detection method according to an embodiment of the present application;
fig. 4 is an exemplary diagram of each detection frame in an image to be detected according to an embodiment of the present application;
FIG. 5 is a diagram illustrating a first positional relationship between a first detection frame and a second detection frame according to an embodiment of the present application;
FIG. 6 is a diagram illustrating an exemplary network structure of a target detection model according to an embodiment of the present application;
FIG. 7 is a flowchart of a training method of a target detection model according to an embodiment of the present application;
fig. 8a is an exemplary diagram of each detection frame in a sample detection image according to an embodiment of the present application;
FIG. 8b is an exemplary diagram of each detection frame after normalization according to an embodiment of the present application;
FIG. 8c is an exemplary diagram of a pseudo 3D cube provided by an embodiment of the present application;
fig. 9 is a block diagram of an object detection apparatus according to an embodiment of the present application;
fig. 10 is a block diagram of a terminal according to an embodiment of the present application;
fig. 11 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
The related information of an object is acquired through object perception technology for subsequent processing in various scenes. For example, in a digital twin scene, an image of an object is acquired through object sensing technology, the position of the object in the image is identified through target detection technology, where the position is a 2D (Two Dimensional) position, and the 2D position of the object in the image is projected into 3D space through the internal and external parameters of the image acquisition device, thereby providing 3D position information of the object available for real-time twinning.
However, because a 2D picture inherently lacks the ability to express 3D information, the 2D position information of an object obtained through object perception technology can hardly reflect the actual situation of the object accurately, which affects subsequent processing. Taking a vehicle as an example, referring to fig. 1, the bottom center point (shown by the black dot in fig. 1) of the 2D full frame (shown by the dotted rectangle in fig. 1) is usually taken directly as the vehicle grounding point for homography back-projection. When the true heading pose of the vehicle forms a heading angle with the optical axis of the lens, the bottom center point cannot accurately represent the grounding position of the vehicle; at the same time, it is impossible to know which part of the vehicle (head, tail or side) the grounding point belongs to. That is, only the position of the vehicle can be represented, while its heading pose cannot be estimated, which affects subsequent processing.
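The limitation described above is easy to see in code: the conventional grounding point is a single point of the 2D full frame, so it carries no heading information at all. The sketch below is illustrative only (the `(x, y, w, h)` box format is an assumption, not from the patent).

```python
def ground_point(full_box):
    """Bottom-center of a 2D full detection frame (x, y, w, h), the
    point conventionally used as the vehicle grounding point before
    homography back-projection. It is a single point: it cannot
    distinguish head, tail, or side, so the heading pose is lost."""
    x, y, w, h = full_box
    return (x + w / 2.0, y + h)

# Any vehicle orientation inside this frame yields the same point.
p = ground_point((100, 50, 80, 60))  # -> (140.0, 110.0)
```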
For example, in a digital twin scene, the result of projecting an object into 3D space may differ greatly from the situation of the object in the real scene. For instance, the real scene is the vehicle shown in fig. 1, but since only the position information of the vehicle is obtained and not its heading pose, the projection into 3D space may not match fig. 1: the bottom center point may be projected as the center point of the side of the vehicle, or as the center point of the vehicle head, and so on, which differs greatly from the actual situation of the vehicle. Because the heading pose can reflect the actual condition of the object, constructing an object heading pose expression method that is easy to operate and deploy, highly general and low in cost can promote the application of object perception technology in various scenes.
In order to solve the above technical problems, an embodiment of the present application provides an object detection method, which determines the position information of a first detection frame corresponding to a first component in an image to be detected and the position information of a second detection frame corresponding to a second component in the image to be detected. The first component and the second component are different components included in the object to be detected, obtained by structurally dividing the object to be detected along its movement direction, so the heading pose of the object to be detected can be determined based on the position information of the two detection frames, thereby more accurately reflecting the actual condition of the object to be detected in the real scene and improving the accuracy of subsequent processing. For example, in a digital twin scene, the difference between the result of projecting the object to be detected into 3D space and the situation of the object to be detected in the real scene can be reduced, greatly improving the fidelity of the twin.
It should be noted that the object detection method provided in the embodiment of the present application may be applied to various scenes that use object sensing technology, such as automatic driving, assisted driving, automatic driving simulation, intelligent traffic and digital twin scenes. The embodiment of the present application does not limit the application scene of the object detection method; the digital twin scene is mainly taken as an example below. The method provided by the embodiment of the application can determine the heading pose of the object to be detected and then carry out subsequent processing based on that pose, where the subsequent processing differs by application scene. For example, in an automatic driving scene, after the heading pose is acquired, a vehicle decision can be made based on it; in an assisted driving scene, collision risk can be predicted based on the heading pose to assist the driver; in an automatic driving simulation scene, an automatic driving algorithm can be determined based on the heading pose; in a traffic simulation scene, vehicle trajectory simulation analysis can be performed based on the heading pose, so as to interpret and analyze traffic conditions and find problems to optimize the traffic system; and in a digital twin scene, the object to be detected can be projected into 3D space for display based on the heading pose, and so on.
The object detection method provided by the embodiment of the application can be executed by computer equipment, and the computer equipment can be a server or a terminal, for example. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. Terminals include, but are not limited to, smart phones, computers, intelligent voice interaction devices, intelligent appliances, vehicle terminals, aircraft, and the like.
As shown in fig. 2, fig. 2 shows an application scenario architecture diagram of an object detection method, where an application scenario is described by taking a server to execute the object detection method provided by the embodiment of the present application as an example.
A server 200 may be included in the application scenario, and the server 200 may obtain relevant information of the object to be detected through an object sensing technology.
Specifically, the server 200 may acquire an image to be detected that includes the object to be detected. The object to be detected is an object to be detected in a real scene, usually a dynamic object capable of moving, such as a vehicle on a road or an unmanned aerial vehicle in the sky, which the embodiment of the present application does not limit. The image to be detected may be an image obtained by shooting the object in the real scene with an image acquisition device, in which case the server obtains the image to be detected by having the image acquisition device send the acquired image to the server. The image acquisition device may be any device capable of capturing images, and may differ between real scenes. For example, on a road the object to be detected may be a vehicle to be detected, and the image acquisition device may be a roadside camera; the roadside camera may be a newly built or an existing camera device, is not limited to a visible light camera, and is also applicable to an infrared night-vision camera. The method is likewise applicable to a non-pinhole camera model once distortion is removed. As another example, in the sky the object to be detected may be an unmanned aerial vehicle, and the image acquisition device may be another type of camera.
After obtaining the image to be detected, the server 200 may determine the position information of the first detection frame corresponding to the first component and the position information of the second detection frame corresponding to the second component in the image to be detected. The first component and the second component are different components included in the object to be detected, obtained by structurally dividing the object to be detected along its movement direction, so that the heading pose of the object to be detected can be determined based on the position information of the two detection frames, and further information of the object to be detected can be extracted from the image to be detected to accurately reflect its actual condition. The first detection frame indicates the position of the first component in the image to be detected, and the second detection frame indicates the position of the second component; the detection frames may take various shapes, such as rectangles or squares.
Taking the to-be-detected object as an example of the to-be-detected vehicle, the first component may be the head of the to-be-detected vehicle, the second component may be the tail of the to-be-detected vehicle, and at this time, the first detection frame and the second detection frame may be respectively shown in fig. 2; alternatively, the first component may be a tail of the vehicle to be detected, and the second component may be a head of the vehicle to be detected, which is not limited by the embodiment of the present application. Of course, the first component and the second component may be other components that are divided along the movement direction of the object to be detected, and taking the object to be detected as the vehicle to be detected as an example, the first component may be the head of the vehicle to be detected, and the second component may be the cabin head of the vehicle to be detected.
Since the positions of the first detection frame and the second detection frame may affect how the heading pose is determined, the server 200 may determine the first positional relationship between the first detection frame and the second detection frame in the image coordinate system based on their position information, and then calculate the heading pose of the object to be detected in the image coordinate system based on the first positional relationship, using the position information of the first detection frame and the position information of the second detection frame. The first positional relationship indicates the relative position of the first detection frame with respect to the second detection frame, e.g., whether the first detection frame is to the upper left, upper right, lower left or lower right of the second detection frame. The heading pose refers to the vector formed by the instantaneous motion direction of the object to be detected in a given coordinate system (for example, the image coordinate system) while the object is moving. Taking the object to be detected as a vehicle to be detected as an example, the heading pose, i.e., the vehicle pose, refers to the vector formed by the instantaneous movement direction of the vehicle in a given coordinate system (for example, the image coordinate system) while the vehicle is traveling on a road.
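The first positional relationship (upper left, upper right, lower left, lower right) can be derived by comparing the two frame centers. The sketch below is one plausible rule, not the patent's formula; the `(x, y, w, h)` box format and the center-comparison criterion are assumptions.

```python
def positional_relationship(first_box, second_box):
    """Classify where the first detection frame lies relative to the
    second one in image coordinates (y grows downward), each frame
    given as (x, y, w, h) with (x, y) the top-left vertex."""
    f = (first_box[0] + first_box[2] / 2.0, first_box[1] + first_box[3] / 2.0)
    s = (second_box[0] + second_box[2] / 2.0, second_box[1] + second_box[3] / 2.0)
    horizontal = "left" if f[0] < s[0] else "right"
    vertical = "upper" if f[1] < s[1] else "lower"
    return f"{vertical} {horizontal}"

rel = positional_relationship((0, 0, 10, 10), (20, 20, 10, 10))  # -> "upper left"
```

A production implementation would also need to handle the degenerate cases (centers aligned on one axis, overlapping frames), which this sketch folds into the `else` branches.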
By the above method, the server 200 can determine the heading pose of the object to be detected from the image to be detected, thereby more accurately reflecting the actual condition of the object to be detected in the real scene and improving the accuracy of subsequent processing. For example, in a digital twin scene, the difference between the result of projecting the object to be detected into 3D space and the situation of the object to be detected in the real scene can be reduced, greatly improving the fidelity of the twin.
It should be noted that the method provided by the embodiment of the application may relate to artificial intelligence; the embodiment of the application mainly performs object detection automatically based on artificial intelligence. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent traffic and other directions. For example, in the embodiment of the present application, object recognition may be performed by a target detection model to obtain the first detection frame, the second detection frame and so on, where the target detection model may be obtained through machine learning training.
The method provided by the embodiment of the application can relate to an artificial intelligence automatic driving technology, wherein the automatic driving technology generally comprises high-precision map, environment perception, behavior decision, path planning, motion control and other technologies, and has wide application prospect.
It should be noted that, in the specific embodiment of the present application, relevant data such as user information may be involved in the process of data processing, and when the above embodiment of the present application is applied to a specific product or technology, it is required to obtain individual consent or individual permission of the user, and the collection, use and processing of relevant data are required to comply with relevant laws and regulations and standards of relevant countries and regions.
Next, the object detection method provided in the embodiment of the present application will be described in detail with reference to the accompanying drawings by taking a server executing the object detection method as an example. Referring to fig. 3, fig. 3 shows a flow chart of an object detection method, the method comprising:
s301, acquiring an image to be detected.
In order to utilize the related information of the object to be detected, and to ensure that subsequent processing using this information yields accurate results, the embodiment of the application acquires related information that can reflect the actual condition of the object to be detected in the real scene. It is understood that the object to be detected is an object to be detected in a real scene, usually a dynamic object capable of moving, such as a vehicle on a road or an unmanned aerial vehicle in the sky. The heading pose of the object to be detected can generally reflect its actual situation in the real scene, so the related information to be acquired in the embodiment of the application may be the heading pose.
To obtain the heading pose, the server may obtain a to-be-detected image including the to-be-detected object. The image to be detected may be an image obtained by shooting the object to be detected in the real scene through the image acquisition device, and the mode of acquiring the image to be detected by the server at this time may be that the image acquisition device sends the acquired image to be detected to the server. The image capturing device may be a device capable of capturing an image, and may be different according to a real scene, for example, on a road, the object to be detected may be a vehicle to be detected, and the image capturing device may be a road side camera.
S302, determining the position information of a first detection frame corresponding to a first component in the image to be detected, and determining the position information of a second detection frame corresponding to a second component in the image to be detected.
After the image to be detected is obtained, the server can determine the position information of the first detection frame corresponding to the first component in the image to be detected and the position information of the second detection frame corresponding to the second component in the image to be detected. The first component part and the second component part are different component parts included in the object to be detected, and the first component part and the second component part are obtained by carrying out structural division on the object to be detected along the moving direction of the object to be detected, so that the heading gesture of the object to be detected can be determined based on the position information of the first detection frame and the position information of the second detection frame, and further information of the object to be detected can be extracted from the image to be detected, so that the actual condition of the object to be detected can be accurately reflected.
The first detection frame is a detection frame for indicating the position of the first component in the image to be detected, and the second detection frame is a detection frame for indicating the position of the second component in the image to be detected, where the detection frames can be of various shapes, such as rectangles, squares and the like. The position information of the first detection frame and the position information of the second detection frame may be represented by coordinates; if the detection frame is rectangular, the coordinates may record one vertex of the detection frame together with the width and height of the detection frame, thereby representing the position information of the detection frame. For example, the position information of the first detection frame may be represented as (x1, y1, w1, h1), where (x1, y1) is the coordinate of one vertex of the first detection frame (such as the vertex of the upper left corner), w1 is the width of the first detection frame (i.e., the actual pixel width of the first detection frame on the image to be detected), reflecting the actual pixel width of the first component on the image to be detected, and h1 is the height of the first detection frame (i.e., the actual pixel height of the first detection frame on the image to be detected), reflecting the actual pixel height of the first component on the image to be detected. Likewise, the position information of the second detection frame may be represented as (x2, y2, w2, h2), where (x2, y2) is the coordinate of one vertex of the second detection frame (e.g., the vertex of the upper left corner), w2 is the width of the second detection frame (i.e., the actual pixel width of the second detection frame on the image to be detected), reflecting the actual pixel width of the second component on the image to be detected, and h2 is the height of the second detection frame (i.e., the actual pixel height of the second detection frame on the image to be detected), reflecting the actual pixel height of the second component on the image to be detected.
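As an illustrative sketch (the `Box` class and its corner helpers are names introduced here for exposition, not part of the patent), the (x, y, w, h) representation and the corner points used in later steps can be modeled as:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned detection frame in image coordinates:
    top-left vertex (x, y) plus pixel width w and height h."""
    x: float
    y: float
    w: float
    h: float

    def bottom_left(self):
        # bottom-left corner, used when the port side chord is visible
        return (self.x, self.y + self.h)

    def bottom_right(self):
        # bottom-right corner, used when the starboard side chord is visible
        return (self.x + self.w, self.y + self.h)

# e.g. a head detection frame with position information (x1, y1, w1, h1)
head = Box(10, 20, 40, 30)
print(head.bottom_left())   # (10, 50)
```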
In the embodiment of the application, since the objects to be detected can be different objects, the first component part and the second component part can be different from each other, and even if the objects to be detected are the same, the first component part and the second component part can be different component parts included in the objects to be detected. In general, it is sufficient to ensure that the first component and the second component are aligned along the movement direction of the object to be detected, and can be used to represent the heading gesture of the object to be detected. In one possible implementation manner, the object to be detected is a vehicle to be detected, the first component may be a head of the vehicle to be detected, the first detection frame may be a head detection frame at this time, the second component may be a tail of the vehicle to be detected, and the second detection frame may be a tail detection frame at this time; or, the first component may be a tail of the vehicle to be detected, at this time, the first detection frame may be a tail detection frame, the second component may be a head of the vehicle to be detected, at this time, the second detection frame may be a head detection frame.
Referring to fig. 4, in fig. 4, the vehicle to be detected is taken as an example of the object to be detected, where the first component may be a vehicle head, the second component may be a vehicle tail, where the first detection frame may be a vehicle head detection frame (see 401 in fig. 4), and the second detection frame may be a vehicle tail detection frame (see 402 in fig. 4).
According to the method, the head and the tail are used as the first component and the second component, and since the head and the tail are components that are easy to identify on the vehicle, the accuracy of identification is improved, and the accuracy of the subsequent heading gesture calculation is improved as well.
It should be noted that after the first detection frame and the second detection frame are obtained, if the first detection frame and the second detection frame respectively correspond to the head and the tail of the object to be detected, for example, the first detection frame is a vehicle head detection frame, the second detection frame is a vehicle tail detection frame, or the first detection frame is a vehicle tail detection frame, and the second detection frame is a vehicle head detection frame, the first detection frame and the second detection frame may form a cube with a pseudo 3D structure, so as to frame the whole object to be detected, which is shown by a cube formed by a dotted line in fig. 4. The cube can express 3D structure information of an object to be detected to a certain extent, so that approximate expression of the 3D structure information on a 2D image to be detected is realized.
It should be noted that, in the embodiment of the present application, the object to be detected and the component parts of the object to be detected may be identified by using the target detection technology, so as to obtain the position information of the first detection frame and the second detection frame. When the object to be detected is photographed to obtain the image to be detected, the components of the object to be detected in the image to be detected are not necessarily all visible due to the angle, the azimuth and other reasons between the object to be detected and the image acquisition device, and in this case, in order to ensure the accuracy of recognition, the visible components of the object to be detected can be recognized.
Based on the above, the manner of determining the position information of the first detection frame corresponding to the first component in the image to be detected and the manner of determining the position information of the second detection frame corresponding to the second component in the image to be detected may be to identify the object to be detected in the image to be detected, obtain the position information of the third detection frame corresponding to the object to be detected, and identify the component to be detected of the object to be detected, obtain the position information of the target detection frame corresponding to the component to be detected, where the component to be detected is a component of the first component and the second component visible in the image to be detected, and the target detection frame corresponding to the component to be detected is located in the range of the third detection frame, and by limiting the whole detection frame (i.e. the third detection frame) of the object to be detected, the accuracy of identifying the target detection frame may be improved. Then, position information of the first detection frame and position information of the second detection frame are determined based on the position information of the target detection frame.
The position information of the third detection frame may also be represented by coordinates; if the third detection frame is rectangular, the coordinates may record one vertex of the third detection frame together with its width and height, thereby representing the position information of the third detection frame. In this case, a vertex of the third detection frame may be used as the origin of the coordinate system, and the two sides passing through that vertex may be used as the x-axis and y-axis of the coordinate system, respectively. The position information of the third detection frame may then be represented as (0, 0, w, h), where (0, 0) is the coordinate of the chosen vertex (for example, the vertex of the upper left corner), w is the width of the third detection frame (i.e., the actual pixel width of the third detection frame on the image to be detected), reflecting the actual pixel width of the object to be detected on the image to be detected, and h is the height of the third detection frame (i.e., the actual pixel height of the third detection frame on the image to be detected), reflecting the actual pixel height of the object to be detected on the image to be detected.
Referring to fig. 4, a first detection frame 401 may be identified in fig. 4, a second detection frame 402 may be identified, a third detection frame 403 may be identified, a vertex of an upper left corner of the third detection frame is taken as an origin, two sides passing through the vertex are respectively taken as an x axis and a y axis to construct a coordinate system (see fig. 4), so that the first detection frame, the second detection frame and the third detection frame are represented by coordinates under the coordinate system, and detailed representation methods are referred to the above description and are not repeated herein.
According to the embodiment of the application, the object to be detected and the visible component parts of the object to be detected are identified, so that the identification accuracy of the target detection frame is improved, and the accuracy of the position information of the first detection frame and the position information of the second detection frame is further ensured.
It can be understood that, due to the angle, the direction and the like between the object to be detected and the image capturing device, the visible component parts of the object to be detected may be different according to the angle and the direction, that is, the component parts to be detected may be different, so that the manner of determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame may be different. Next, a description will be given of a manner of determining the position information of the first detection frame and the position information of the second detection frame when the components to be detected are different.
If the component to be detected is a first component, i.e. the first component is visible in the image to be detected, the object to be detected and the first component can be identified by means of a target detection technique. In this case, the manner of determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame may be to determine the position information of the target detection frame as the position information of the first detection frame, and determine the position information of the second detection frame according to the principle of rigid body symmetry based on the position information of the third detection frame and the position information of the first detection frame. The rigid body refers to an object whose shape and size are not changed under the action of external force, and the principle of symmetry of the rigid body may refer to the property that the rigid body remains unchanged after some operation is performed on the rigid body, such as axisymmetry, rotational symmetry, and the like.
Referring to fig. 4, in fig. 4, the object to be detected is taken as a vehicle to be detected, where the first component may be the vehicle head and the second component may be the vehicle tail. If the component to be detected is the first component, that is, the third detection frame (i.e., the whole detection frame, which may also be referred to as the full-vehicle frame) and the first detection frame (i.e., the vehicle head detection frame) are identified, the position information of the third detection frame may be represented as (0, 0, w, h) and the position information of the first detection frame as (x1, y1, w1, h1). Since the third detection frame and the first detection frame are identified, (0, 0, w, h) and (x1, y1, w1, h1) are known, while the position information (x2, y2, w2, h2) of the second detection frame is unknown, so the position information of the second detection frame may be determined from the third detection frame and the first detection frame according to the principle of rigid body symmetry. Based on the principle of rigid body symmetry, referring to fig. 4, it can be determined that x2 = w - (x1 + w1), y2 = h - (y1 + h1), w2 = w1, h2 = h1, so the position information of the second detection frame may be expressed as (w - (x1 + w1), h - (y1 + h1), w1, h1).
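The rigid-symmetry inference above (and, by the same formulas, its mirror case where the first component is the occluded one) can be sketched as follows; `mirror_frame` is a hypothetical helper name, while the formulas are those given in the text:

```python
def mirror_frame(full, known):
    """Infer the occluded detection frame by point symmetry about the centre
    of the full (third) detection frame, per the rigid-symmetry rule.
    Frames are (x, y, w, h) tuples; the full frame is (0, 0, w, h) because
    its top-left vertex is the coordinate-system origin."""
    _, _, w, h = full
    x1, y1, w1, h1 = known
    # x2 = w - (x1 + w1), y2 = h - (y1 + h1), w2 = w1, h2 = h1
    return (w - (x1 + w1), h - (y1 + h1), w1, h1)

# full-vehicle frame (0, 0, 100, 60); visible head frame (5, 4, 30, 20)
second = mirror_frame((0, 0, 100, 60), (5, 4, 30, 20))
print(second)  # (65, 36, 30, 20)
```

Because the mapping is a point reflection about the centre of the third frame, applying it twice recovers the original frame, which also covers the case where the tail is the visible component and the head frame must be inferred.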
If the component to be detected is a second component, i.e. the second component is visible in the image to be detected, the object to be detected and the second component can be identified by means of a target detection technique. In this case, the manner of determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame may be to determine the position information of the target detection frame as the position information of the second detection frame, and the position information of the first detection frame may be determined in accordance with the principle of rigid body symmetry based on the position information of the third detection frame and the position information of the second detection frame.
Referring to fig. 4, in the example where the object to be detected is a vehicle to be detected, the first component may be the vehicle head and the second component may be the vehicle tail. If the component to be detected is the second component, that is, the third detection frame (i.e., the whole detection frame, which may also be referred to as the full-vehicle frame) and the second detection frame (i.e., the vehicle tail detection frame) are identified, the position information of the third detection frame may be represented as (0, 0, w, h) and the position information of the second detection frame as (x2, y2, w2, h2). Since the third detection frame and the second detection frame are identified, (0, 0, w, h) and (x2, y2, w2, h2) are known, while the position information (x1, y1, w1, h1) of the first detection frame is unknown, so the position information of the first detection frame may be determined according to the principle of rigid body symmetry. Based on the principle of rigid body symmetry, referring to fig. 4, it can be determined that x1 = w - (x2 + w2), y1 = h - (y2 + h2), w1 = w2, h1 = h2, so the position information of the first detection frame may be expressed as (w - (x2 + w2), h - (y2 + h2), w2, h2).
If the component to be detected includes a first component and a second component, that is, the first component and the second component are visible in the image to be detected, the object to be detected and the first component and the second component may be identified by a target detection technique, and the method of determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame may be to determine the position information of the target detection frame corresponding to the first component as the position information of the first detection frame and determine the position information of the target detection frame corresponding to the second component as the position information of the second detection frame. That is, the position information of the first detection frame and the position information of the second detection frame are both detected and known, and are not obtained by solving.
Referring to fig. 4, in fig. 4, the object to be detected is taken as a vehicle to be detected, where the first component may be the vehicle head and the second component may be the vehicle tail. If the component to be detected includes the first component and the second component, that is, the third detection frame (i.e., the whole detection frame, which may also be referred to as the full-vehicle frame), the first detection frame (i.e., the vehicle head detection frame), and the second detection frame (i.e., the vehicle tail detection frame) are all identified, then the position information of the third detection frame may be represented as (0, 0, w, h), the position information of the first detection frame as (x1, y1, w1, h1), and the position information of the second detection frame as (x2, y2, w2, h2), all of which are known, so no solving is needed.
S303, determining a first position relation between the first detection frame and the second detection frame under an image coordinate system based on the position information of the first detection frame and the position information of the second detection frame.
The positions of the first detection frame and the second detection frame may affect the determination mode of the heading gesture, so the server may determine the first positional relationship of the first detection frame and the second detection frame under the image coordinate system based on the position information of the first detection frame and the position information of the second detection frame. The first positional relationship may indicate a relative position between the first detection frame and the second detection frame, e.g., the first detection frame is in an upper left, upper right, lower left, lower right, etc., relationship with the second detection frame, and the first positional relationship may be different, and the resulting heading gesture may be different.
When the position information is represented by the vertex of the detection frame and the width and height of the detection frame, the first positional relationship may be determined based on the magnitude relationship between the vertex coordinates of the first detection frame and the vertex coordinates of the second detection frame.
Taking the position information of the first detection frame as (x1, y1, w1, h1) and the position information of the second detection frame as (x2, y2, w2, h2) as an example: if x1 <= x2 and y1 <= y2, it may be determined that the first positional relationship is that the first detection frame is at the upper left of the second detection frame, see 501 and 502 in fig. 5, where 501 may identify the second detection frame and 502 may identify the first detection frame; if x1 > x2 and y1 <= y2, it may be determined that the first positional relationship is that the first detection frame is at the upper right of the second detection frame, see 501 and 503 in fig. 5, where 501 may identify the second detection frame and 503 may identify the first detection frame; if x1 <= x2 and y1 > y2, it may be determined that the first positional relationship is that the first detection frame is at the lower left of the second detection frame, see 501 and 504 in fig. 5, where 501 may identify the second detection frame and 504 may identify the first detection frame; if x1 > x2 and y1 > y2, it may be determined that the first positional relationship is that the first detection frame is at the lower right of the second detection frame, see 501 and 505 in fig. 5, where 501 may identify the second detection frame and 505 may identify the first detection frame.
It should be noted that, if the object to be detected is a vehicle to be detected, the first component is the vehicle head, and the second component is the vehicle tail, then: when x1 <= x2 and y1 <= y2, the vehicle side chord on the visible side in the image to be detected is the vehicle port (i.e., the left side of the vehicle head when viewed from the vehicle tail toward the vehicle head); when x1 > x2 and y1 <= y2, the vehicle side chord on the visible side in the image to be detected is the vehicle starboard (i.e., the right side of the vehicle head when viewed from the vehicle tail toward the vehicle head); when x1 <= x2 and y1 > y2, the vehicle side chord on the visible side in the image to be detected is the vehicle port; when x1 > x2 and y1 > y2, the vehicle side chord on the visible side in the image to be detected is the vehicle starboard.
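The four-way classification above can be sketched as follows (the function name and the quadrant labels are assumptions introduced for illustration):

```python
def first_position_relation(first, second):
    """Classify the relative position of the first detection frame with
    respect to the second, by comparing top-left vertices as in the text.
    Frames are (x, y, w, h) tuples."""
    x1, y1, _, _ = first
    x2, y2, _, _ = second
    if x1 <= x2 and y1 <= y2:
        return "upper-left"    # visible side chord: port
    if x1 > x2 and y1 <= y2:
        return "upper-right"   # visible side chord: starboard
    if x1 <= x2 and y1 > y2:
        return "lower-left"    # visible side chord: port
    return "lower-right"       # visible side chord: starboard

print(first_position_relation((10, 10, 30, 20), (50, 40, 30, 20)))  # upper-left
```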
S304, calculating the heading gesture of the object to be detected under the image coordinate system by using the position information of the first detection frame and the position information of the second detection frame based on the first position relation.
After the first position relation is obtained, the server can calculate the heading gesture of the object to be detected under the image coordinate system by using the position information of the first detection frame and the position information of the second detection frame based on the first position relation. According to the difference of the first position relation, the heading gesture calculated by using the position information of the first detection frame and the position information of the second detection frame is different. In general, the heading attitude can be represented by the direction of a side chord on the visible side of the object to be detected in the image to be detected.
Continuing with the example in which the object to be detected is a vehicle to be detected, the first component is the vehicle head, and the second component is the vehicle tail: when x1 <= x2 and y1 <= y2, the vehicle side chord on the visible side in the image to be detected is the vehicle port, so the heading gesture may be represented by the direction of the vehicle port, for example, as indicated by the dashed arrow between 501 and 502 in fig. 5; at this time, heading gesture = [(x1, y1 + h1), (x2, y2 + h2)]. When x1 > x2 and y1 <= y2, the vehicle side chord on the visible side is the vehicle starboard, so the heading gesture may be represented by the direction of the vehicle starboard, for example, as indicated by the dashed arrow between 501 and 503 in fig. 5; at this time, heading gesture = [(x1 + w1, y1 + h1), (x2 + w2, y2 + h2)]. When x1 <= x2 and y1 > y2, the vehicle side chord on the visible side is the vehicle port, so the heading gesture may be represented by the direction of the vehicle port, for example, as indicated by the dashed arrow between 501 and 504 in fig. 5; at this time, heading gesture = [(x1 + w1, y1 + h1), (x2 + w2, y2 + h2)]. When x1 > x2 and y1 > y2, the vehicle side chord on the visible side is the vehicle starboard, so the heading gesture may be represented by the direction of the vehicle starboard, for example, as indicated by the dashed arrow between 501 and 505 in fig. 5; at this time, heading gesture = [(x1, y1 + h1), (x2, y2 + h2)].
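The four case formulas can be condensed into a sketch: the upper-left and lower-right cases share the bottom-left-corner formula, while the other two cases share the bottom-right-corner formula (`heading_pose` is an illustrative name, not from the patent):

```python
def heading_pose(first, second):
    """Return the visible side chord as a pair of image points, following the
    four cases in the text (first = head frame, second = tail frame).
    Frames are (x, y, w, h) tuples."""
    x1, y1, w1, h1 = first
    x2, y2, w2, h2 = second
    if (x1 <= x2 and y1 <= y2) or (x1 > x2 and y1 > y2):
        # heading gesture = [(x1, y1+h1), (x2, y2+h2)]: bottom-left corners
        return [(x1, y1 + h1), (x2, y2 + h2)]
    # heading gesture = [(x1+w1, y1+h1), (x2+w2, y2+h2)]: bottom-right corners
    return [(x1 + w1, y1 + h1), (x2 + w2, y2 + h2)]

print(heading_pose((10, 10, 30, 20), (50, 40, 30, 20)))  # [(10, 30), (50, 60)]
```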
According to the method and the device, the heading gesture of the object to be detected is calculated through the side chord on the visible side, so that errors can be reduced and the accuracy of determining the heading gesture is improved.
According to the technical scheme, in order to utilize the related information of the object to be detected, the image to be detected comprising the object to be detected can be obtained, then the position information of the first detection frame corresponding to the first component in the image to be detected is determined, and the position information of the second detection frame corresponding to the second component in the image to be detected is determined. The first component part and the second component part are different component parts included in the object to be detected, and the first component part and the second component part are obtained by carrying out structural division on the object to be detected along the moving direction of the object to be detected, so that the heading gesture of the object to be detected can be determined based on the position information of the first detection frame and the position information of the second detection frame, and further information of the object to be detected can be extracted from the image to be detected, so that the actual condition of the object to be detected can be accurately reflected. The positions of the first detection frame and the second detection frame may affect the determination mode of the heading gesture, so the first position relationship between the first detection frame and the second detection frame under the image coordinate system may be determined based on the position information of the first detection frame and the position information of the second detection frame, and then the heading gesture of the object to be detected under the image coordinate system may be calculated based on the first position relationship and by using the position information of the first detection frame and the position information of the second detection frame. 
The application can determine the heading gesture of the object to be detected from the image to be detected, thereby more accurately reflecting the actual condition of the object to be detected in the real scene and improving the accuracy of subsequent processing. For example, in a digital twin scene, the difference between the result of projecting the object to be detected into the 3D space and the situation of the object to be detected in the real scene can be reduced, and the fidelity of the twin restoration is greatly improved.
After the course gesture of the object to be detected is obtained by the method provided by the embodiment of the application, corresponding processing can be carried out by utilizing the course information under different application scenes. For example, in a digital twin scene, a dynamic obstacle such as an object to be detected and the like shot by an image acquisition device is subjected to sensing extraction, and the position, the heading gesture, the category and the like of the object to be detected are provided for a real-time twin product platform, and the object to be detected is projected into a 3D space based on the heading gesture, so that the actual condition (such as the actual condition, the track and the like of a vehicle to be detected) of the object to be detected is subjected to twin reduction display in a virtual 3D environment.
Specifically, the server may acquire an internal parameter and an external parameter of an image acquisition device, where the image acquisition device is configured to acquire the image to be detected, and further convert the heading gesture of the object to be detected under the image coordinate system to the three-dimensional coordinate system based on the internal parameter and the external parameter of the image acquisition device. The internal parameters can be parameter matrixes determined by focal length and distortion parameters of the image acquisition equipment when leaving the factory; the external parameters may refer to position coordinates of the image capturing device and a rotation matrix determined by the mounting position of the image capturing device and a defined coordinate system.
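As a hedged sketch of such a conversion (the patent only states that the intrinsic and extrinsic parameters drive it; the ground-plane assumption and the homography construction below are one standard way to realize a 2D-to-3D lift for road objects):

```python
import numpy as np

def pixel_to_ground(uv, K, R, t):
    """Back-project an image point onto the world ground plane Z = 0, given
    camera intrinsics K (3x3) and extrinsics R (3x3 rotation), t (3-vector).
    The homography from the ground plane to the image is K @ [r1 r2 t]."""
    H = K @ np.column_stack((R[:, 0], R[:, 1], t))
    xyw = np.linalg.inv(H) @ np.array([uv[0], uv[1], 1.0])
    return xyw[:2] / xyw[2]  # (X, Y) on the ground plane

# toy camera: focal length 100 px, principal point (50, 50), mounted 10 units
# along the optical axis above the world origin, no rotation
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 10.0])
print(pixel_to_ground((60.0, 70.0), K, R, t))  # ≈ [1. 2.]
```

In this sketch only K (intrinsics) and (R, t) (extrinsics) need to be supplied at projection time, matching the decoupling described in the text.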
Compared with the related-art approach of acquiring the 3D coordinate information of the object to be detected based on 3D equipment such as a laser radar, the method and the device can analyze the 3D structure of the object to be detected by utilizing only the 2D image to be detected, without using 3D equipment such as a laser radar. This unifies the hardware requirements of product deployment and sales with those of internal algorithm-model iteration, and can effectively reduce internal research and development cost. Meanwhile, the 3D structure analysis of the object to be detected is decoupled from the internal and external parameters of the image acquisition equipment: the heading gesture of the object to be detected is estimated in the image coordinate system, and when projection into 3D space is needed, only the internal and external parameters of the image acquisition equipment need to be supplied to convert the heading gesture from the 2D image coordinate system into 3D space. This improves the generalization and expansion capability of the business algorithm while reducing the equipment cost required for algorithm iteration (the 3D structure analysis can be carried out without a laser radar).
When the object to be detected is a vehicle to be detected, the analysis of the vehicle structure of the vehicle to be detected is realized. The vehicle structure can refer to the identification and analysis of structural components of a vehicle to be detected, and the obtained key semantic components comprise a whole vehicle body, a vehicle head, a vehicle tail and the like.
It may be appreciated that, in the embodiment of the present application, a target detection technique may be used when identifying the object to be detected and the component to be detected, and in a possible implementation manner, the target detection technique may be implemented by a target detection model. Based on this, in one possible implementation manner, the object to be detected in the image to be detected may be identified through the target detection model to obtain the position information of the third detection frame corresponding to the object to be detected, and the component to be detected of the object to be detected may be identified through the target detection model to obtain the position information of the target detection frame corresponding to the component to be detected.
The object detection model may be a machine learning model, which is used for object detection. The embodiment of the application does not limit the network structure of the target detection model, so long as the target detection can be realized. In one possible implementation, the network structure of the object detection model may be shown in fig. 6, and mainly includes two parts identified by 601 and 602, where the part identified by 601 may be composed of a convolution (Conv) layer, a batch normalization (Batch Normalization, BN) layer, and an activation function (ReLU) layer. The portion identified at 602 may be composed of Deconvolution (DeConv), BN layer, and ReLU layer.
When the image to be detected is obtained, it can be resized so that the size of the image input into the target detection model meets the processing requirement of the model. The resized image to be detected is input into the target detection model; features are extracted through the convolution layers to obtain a feature map of a certain size, and the feature map is then processed through the BN layers and ReLU layers to obtain a processed feature map. The processed feature map is further processed by the deconvolution layers, BN layers and ReLU layers to obtain a target feature map. Relative to the processed feature map, the size of the target feature map is increased and its information content is enriched, so that the category, the position information of the third detection frame and the position information of the target detection frame can be accurately output based on the target feature map. It will be appreciated that the target detection model has the ability to output the position information of both the first detection frame and the second detection frame, but the target detection frame may be the first detection frame and/or the second detection frame, depending on which of the first component and the second component is visible.
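The size changes described above can be illustrated with the standard output-size formulas for convolution and transposed convolution (a hedged sketch; the kernel sizes, strides and the 512-pixel input below are illustrative assumptions, not values given by the embodiment):

```python
def conv_out_size(n, kernel, stride, pad):
    # Standard convolution output size: floor((n + 2*pad - kernel) / stride) + 1
    return (n + 2 * pad - kernel) // stride + 1

def deconv_out_size(n, kernel, stride, pad):
    # Transposed convolution (deconvolution) output size:
    # (n - 1) * stride - 2*pad + kernel
    return (n - 1) * stride - 2 * pad + kernel

# Example: a resized 512x512 input passed through a stride-2 3x3 convolution
# shrinks, and a stride-2 4x4 deconvolution restores the spatial size.
h = conv_out_size(512, kernel=3, stride=2, pad=1)    # 256
h2 = deconv_out_size(h, kernel=4, stride=2, pad=1)   # 512
```

This matches the statement above: the deconvolution part (602) enlarges the feature map produced by the convolution part (601).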
It will be appreciated that when recognition is performed by the target detection model, the recognition accuracy of the target detection model will affect the accuracy of the position information of the third detection frame and the target detection frame, and thus the accuracy of the subsequently calculated heading pose; therefore, the accuracy of the target detection model is very important. The accuracy of the target detection model is determined by its training process. On this basis, in order to ensure the accuracy of the target detection model, the embodiment of the application further provides a training method of the target detection model. Referring to fig. 7, the method includes:
S701, acquiring a sample image, wherein the sample image is marked with a first standard detection frame corresponding to a first sample component part, a second standard detection frame corresponding to a second sample component part, a third standard detection frame corresponding to a sample detection object, a visible mark of the first sample component part and a visible mark of the second sample component part, the first sample component part and the second sample component part are different component parts included in the sample detection object, the first sample component part and the second sample component part are obtained by carrying out structural division on the sample detection object along the movement direction of the sample detection object, and the first standard detection frame and the second standard detection frame are positioned in the range of the third standard detection frame.
In the embodiment of the application, in order to train a model capable of outputting the position information of the first detection frame, the second detection frame and the third detection frame, images can be labeled first so as to construct sample images. During labeling, the first standard detection frame corresponding to the first sample component, the second standard detection frame corresponding to the second sample component, the third standard detection frame corresponding to the sample detection object, the visible mark of the first sample component and the visible mark of the second sample component are mainly labeled. The visible mark can be represented by numbers, symbols and the like; taking numbers as an example, a visible mark of 1 represents visible and a visible mark of 0 represents invisible. The specific form of the visible mark is not limited in the embodiment of the application.
Taking a vehicle as an example of the sample detection object, after an image of the vehicle is obtained, the full frame of the vehicle can be labeled first. Within the labeled full frame, a vehicle head detection frame/vehicle tail detection frame is labeled: its width reflects the actual pixel width of the vehicle on the image, its bottom edge is drawn approximately to where the vehicle chassis meets the ground without exceeding the range of the full frame, and its height is drawn according to the actual height of the vehicle head/vehicle tail, thereby obtaining the sample image. The result of the above labeling is shown in fig. 8 a. The detection frame shown at 801 is the full frame (i.e., the third standard detection frame), the detection frame shown at 802 is the vehicle head detection frame (i.e., the first standard detection frame), and the detection frame shown at 803 is the vehicle tail detection frame (i.e., the second standard detection frame). Whether the vehicle head/vehicle tail is visible (i.e., not self-occluded) is indicated by a visible mark, with 0 representing invisible and 1 representing visible.
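As an illustration of this labeling schema, a hypothetical annotation record might look as follows (all field names and coordinates are invented for illustration; boxes are (x1, y1, x2, y2) in pixels, and visible marks use 1 for visible, 0 for invisible):

```python
# Hypothetical annotation record for one sample image (names are illustrative).
sample_label = {
    "full_frame":   (100, 50, 420, 260),   # third standard detection frame
    "head_box":     (100, 120, 220, 260),  # first standard detection frame
    "tail_box":     (300, 100, 420, 260),  # second standard detection frame
    "head_visible": 1,
    "tail_visible": 1,
}

def inside(inner, outer):
    # A labeled component frame must not exceed the range of the full frame.
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])
```

The `inside` check encodes the constraint stated above that the head/tail frames lie within the full frame.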
In the embodiment of the application, when the sample detection object is marked, not only the third standard detection frame (namely the whole detection frame) corresponding to the sample detection object, but also the first standard detection frame corresponding to the first sample component part and the second standard detection frame corresponding to the second sample component part are marked, so that the target detection model obtained by subsequent training can identify the detection frames of two different component parts included in one object, and the two component parts are obtained by carrying out structural division on the sample detection object along the movement direction of the sample detection object. Compared with chassis key point labeling provided in the related art, the labeling mode provided by the embodiment of the application is easier.
S702, outputting, based on the sample image, a first prediction detection frame corresponding to a target sample component through a model to be trained, and outputting a second prediction detection frame corresponding to the sample detection object through the model to be trained, wherein the target sample component is the component, among the first sample component and the second sample component, whose visible mark indicates that it is visible.
The first prediction detection frame may be a detection frame obtained by predicting the visible component, and generally refers to the prediction detection frame of the first sample component and/or the second sample component. The second prediction detection frame is an overall detection frame, obtained by predicting the sample detection object, that reflects the sample detection object as a whole. Taking the example in which the sample detection object is a vehicle, the first sample component is the vehicle head and the second sample component is the vehicle tail, the first prediction detection frame can be the detection frame obtained by predicting the visible one of the vehicle head and the vehicle tail, and the second prediction detection frame can be the predicted full frame.
S703, determining, based on the position information of the first prediction detection frame, the position information of the third prediction detection frame corresponding to the first sample component and the position information of the fourth prediction detection frame corresponding to the second sample component.
The third prediction detection frame is a detection frame corresponding to the first sample component and is used for indicating the first sample component; the fourth predictive detection frame is a detection frame corresponding to the second sample component and is used for indicating the second sample component. Taking the example that the sample detection object is a vehicle, the first sample component is a vehicle head, the second sample component is a vehicle tail, the third prediction detection frame is a vehicle head detection frame, and the fourth prediction detection frame is a vehicle tail detection frame.
S704, determining a second position relation of the third prediction detection frame and the fourth prediction detection frame under the image coordinate system based on the position information of the third prediction detection frame and the position information of the fourth prediction detection frame.
S705, calculating the predicted heading pose of the sample detection object under the image coordinate system by using the position information of the third prediction detection frame and the position information of the fourth prediction detection frame based on the second position relation.
The specific implementation of S702-S703 is similar to the implementation introduced in S302, and the specific implementation of S704-S705 is similar to the specific implementation of S303-S304, and will not be repeated here.
S706, constructing heading loss according to the predicted heading posture and the standard heading posture, constructing first predicted loss according to the third predicted detection frame and the first standard detection frame, constructing second predicted loss according to the fourth predicted detection frame and the second standard detection frame, and constructing third predicted loss according to the second predicted detection frame and the third standard detection frame.
In the embodiment of the application, not only can the prediction loss corresponding to each detection frame be constructed, but the heading loss can also be constructed based on the obtained predicted heading pose, so that the model to be trained is trained in combination with the heading loss.
In addition, the target detection model can also identify the class of the sample detection object, so that the standard classification result of the sample detection object can be marked in the sample image, and class loss can be constructed according to the prediction classification result and the standard classification result after the prediction classification result of the sample detection object is output through the model to be trained.
In one possible implementation, the heading loss may be represented by the L1 loss, which is calculated as follows:

L1 loss = abs(Predict - GT)

where abs denotes the absolute value, Predict denotes the predicted heading pose, and GT denotes the standard heading pose.
Taking the predicted heading pose as an example, if (m, n) represents the head and tail coordinate points of the vector of the predicted heading pose, the predicted heading pose can be expressed as Predict = atan(m, n); similarly, if (M, N) represents the head and tail coordinate points of the vector of the standard heading pose, the standard heading pose can be expressed as GT = atan(M, N).
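A minimal sketch of this heading loss, under the assumption that atan(m, n) in the text denotes the two-argument arctangent (atan2) applied to the vector components:

```python
import math

def heading_pose(m, n):
    # Heading pose as the angle of the vector through the head/tail points,
    # assuming atan(m, n) above denotes the two-argument arctangent.
    return math.atan2(m, n)

def heading_l1_loss(m, n, M, N):
    # L1 loss = abs(Predict - GT)
    return abs(heading_pose(m, n) - heading_pose(M, N))
```

For example, a predicted vector (1, 0) against a standard vector (0, 1) yields a loss of pi/2.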
In one possible implementation, the first prediction loss, the second prediction loss, and the third prediction loss are all losses corresponding to prediction detection frames, but are obtained from different prediction detection frames. Typically, the loss corresponding to a prediction detection frame may be represented by the intersection over union (Intersection Over Union, IOU), which measures the overlap of the prediction detection frame and the standard detection frame, i.e. the ratio of their intersection to their union. Specifically, in the embodiment of the present application, the intersection over union used may be the generalized intersection over union (Generalized Intersection Over Union, GIOU), which may be calculated as follows:

GIOU = IOU - |C \ (A ∪ B)| / |C|

where A represents the prediction detection frame, B represents the standard detection frame, the intersection over union IOU of A and B is calculated first, C is the minimum enclosing frame of A and B, and C \ (A ∪ B) is the region of C minus A ∪ B.
For example, when A is the third prediction detection frame and B is the first standard detection frame, the first prediction loss is constructed by the above formula; when A is the fourth prediction detection frame and B is the second standard detection frame, the second prediction loss is constructed by the above formula; and when A is the second prediction detection frame and B is the third standard detection frame, the third prediction loss is constructed by the above formula.
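The GIOU computation can be sketched as follows for axis-aligned frames given as (x1, y1, x2, y2) (a hedged illustration; the function and variable names are ours):

```python
def giou(a, b):
    # a, b: boxes as (x1, y1, x2, y2). Returns the generalized IoU.
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection of A and B
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union
    # C: minimum enclosing frame of A and B
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    area_c = cw * ch
    # GIOU = IOU - |C \ (A ∪ B)| / |C|
    return iou - (area_c - union) / area_c
```

Identical frames give a GIOU of 1; disjoint frames give a negative value, which is what lets GIOU penalize non-overlapping predictions, unlike plain IOU.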
It will be appreciated that the classification loss may be constructed based on the difference between the predicted classification result and the standard classification result. In one possible implementation, the classification loss may be represented by the classification cross entropy, whose calculation formula may be as follows:

Entropy = -∑ P_i log(P_i)

where Entropy represents the classification cross entropy, and P_i represents the prediction classification result, which may specifically refer to the probability that the sample detection object belongs to each category.
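A minimal sketch of the classification loss, under the assumption (ours) that the standard classification result is a one-hot label, in which case the cross entropy between prediction and standard reduces to the negative log-probability of the true category:

```python
import math

def classification_cross_entropy(pred_probs, standard_index):
    # pred_probs: predicted probability per category (the prediction
    # classification result); standard_index: index of the true category in
    # the standard classification result. With a one-hot standard label the
    # cross-entropy sum reduces to -log of the true category's probability.
    return -math.log(pred_probs[standard_index])
```

For example, a prediction of [0.7, 0.2, 0.1] with true category 0 gives a loss of -log(0.7), which shrinks toward 0 as the predicted probability of the true category approaches 1.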
In some cases, errors may exist in manual labeling. In order to ensure the accuracy of the standard detection frames in subsequent use, the labeling result may be normalized by utilizing the principle of rigid symmetry, so as to correct errors possibly introduced by manual labeling. Specifically, if the first standard detection frame and the second standard detection frame are rectangular detection frames, at least one of the first standard detection frame and the second standard detection frame is normalized to obtain a processed first standard detection frame and a processed second standard detection frame. Then, instead of constructing the first prediction loss according to the third prediction detection frame and the original first standard detection frame, the first prediction loss may be constructed according to the third prediction detection frame and the processed first standard detection frame, and the second prediction loss may be constructed according to the fourth prediction detection frame and the processed second standard detection frame. In this way, the target detection model obtained through training produces normalized data when performing recognition.
In the normalization processing, one of the first standard detection frame and the second standard detection frame may be used as a reference, and the width and the height of the other standard detection frame may be made to coincide with the standard detection frame used as the reference, respectively. In general, a standard detection frame having a relatively large height and width may be used as a reference.
Taking a vehicle as an example of the sample detection object, the first sample component may be the vehicle head, the second sample component may be the vehicle tail, and the labeled sample image may be as shown in fig. 8 a. On the basis of fig. 8a, the vehicle head detection frame/vehicle tail detection frame is normalized by utilizing the range constraint between it and the full frame. When the sample detection object is a vehicle, the widths of the vehicle head detection frame and the vehicle tail detection frame are basically consistent, while their heights may differ, so the width can be left unprocessed and only the height normalized. The specific method is as follows: when the heights of the vehicle head detection frame and the vehicle tail detection frame are inconsistent, the larger of the two heights is taken as the height of both frames. In fig. 8a, the height of the vehicle tail detection frame is greater than that of the vehicle head detection frame, so the height of the vehicle tail detection frame can be kept unchanged and used as the height of the vehicle head detection frame, thereby completing the normalization. After normalization, the vehicle head detection frame and vehicle tail detection frame are as shown in fig. 8 b. This forms a pseudo-3D cube framing the entire vehicle, as shown by the cuboid enclosed by the black dashed line in fig. 8 c.
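The height normalization can be sketched as follows (illustrative assumptions of ours: image y grows downward, the bottom edge of each frame stays at the chassis, and the larger height is applied to both frames by extending the shorter frame upward):

```python
def normalize_heights(head_box, tail_box):
    # Boxes as (x1, y1, x2, y2). Widths are assumed consistent, so only the
    # height is normalized: the larger of the two heights is applied to both
    # frames while keeping each bottom edge (y2, at the chassis) fixed.
    h_head = head_box[3] - head_box[1]
    h_tail = tail_box[3] - tail_box[1]
    h = max(h_head, h_tail)
    head = (head_box[0], head_box[3] - h, head_box[2], head_box[3])
    tail = (tail_box[0], tail_box[3] - h, tail_box[2], tail_box[3])
    return head, tail
```

In the fig. 8a example the tail frame is taller, so it is returned unchanged while the head frame is stretched to the same height.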
And S707, training the model to be trained based on the heading loss, the first prediction loss, the second prediction loss and the third prediction loss to obtain a target detection model.
In the embodiment of the application, after the heading loss, the first prediction loss, the second prediction loss and the third prediction loss are obtained, the model to be trained can be trained based on these losses to obtain the target detection model. In addition to the constraints of the first prediction loss, the second prediction loss and the third prediction loss, the positions of the third prediction detection frame and the fourth prediction detection frame within the second prediction detection frame are further constrained by the heading loss, so that the third prediction detection frame, the fourth prediction detection frame and the second prediction detection frame regressed by the target detection model are guaranteed to have a reasonable topological relation, improving the recognition accuracy of the target detection model.
If the first prediction loss, the second prediction loss and the third prediction loss are GIOU losses, and a classification loss is obtained in addition to these losses, then after the heading loss, the GIOU losses and the classification loss are obtained, these losses can be combined: besides the constraints of the GIOU losses and the classification loss, the positions of the third prediction detection frame and the fourth prediction detection frame within the second prediction detection frame can be constrained through the heading loss, improving the recognition accuracy of the target detection model.
Next, the object detection method provided by the embodiment of the present application will be described in connection with an actual application scenario. The application scenario is a digital twin scenario; specifically, dynamic obstacles (such as vehicles to be detected) in the scene are perceived in real time, so that the traffic flow and trajectories on the road are restored and displayed, as twins, in a virtual 3D environment.
Based on the above, in order to realize accurate restoration and display of the vehicle to be detected, the vehicle to be detected can be photographed by a roadside camera to obtain the image to be detected. The vehicle to be detected is identified in the image to be detected to obtain the position information of the full frame, and the visible vehicle head or vehicle tail is identified in the image to be detected to obtain the position information of the corresponding detection frame. If the vehicle head is visible, the position information of the vehicle head detection frame is obtained through recognition, and the position information of the vehicle tail detection frame can be calculated based on the position information of the full frame and the position information of the vehicle head detection frame. If the vehicle tail is visible, the position information of the vehicle tail detection frame is obtained through recognition, and the position information of the vehicle head detection frame can be calculated based on the position information of the full frame and the position information of the vehicle tail detection frame. If both the vehicle head and the vehicle tail are visible, the position information of the vehicle head detection frame and the position information of the vehicle tail detection frame are directly obtained through recognition.
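The calculation of the occluded frame from the full frame and the visible frame can be sketched under a simple rigid-symmetry assumption (ours, for illustration only): the hidden frame is the visible frame mirrored about the vertical center line of the full frame, with the same width and height.

```python
def mirror_box(full, visible):
    # full, visible: boxes as (x1, y1, x2, y2) in image coordinates.
    # Returns the occluded component's frame, obtained by mirroring the
    # visible frame about the vertical center line of the full frame.
    # This is a simplifying rigid-symmetry assumption for illustration.
    X1, _, X2, _ = full
    x1, y1, x2, y2 = visible
    return (X1 + X2 - x2, y1, X1 + X2 - x1, y2)
```

For a full frame spanning x = 0..100 and a visible head frame at x = 0..30, the derived tail frame spans x = 70..100 on the opposite side.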
After the position information of the full frame, the vehicle head detection frame and the vehicle tail detection frame is obtained, the first positional relationship between the vehicle head detection frame and the vehicle tail detection frame under the image coordinate system can be determined, and the heading pose of the object to be detected under the image coordinate system can then be calculated, based on the first positional relationship, by using the position information of the vehicle head detection frame and the position information of the vehicle tail detection frame, so that the heading pose is obtained based on a low-cost roadside camera.
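A minimal sketch of this heading pose calculation, under the illustrative assumption (ours) that the heading is the angle of the vector from the tail frame center to the head frame center in image coordinates:

```python
import math

def box_center(box):
    # box: (x1, y1, x2, y2) in image coordinates.
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def heading_from_boxes(tail_box, head_box):
    # Heading pose in the image coordinate system as the angle of the vector
    # from the tail frame center to the head frame center.
    (tx, ty), (hx, hy) = box_center(tail_box), box_center(head_box)
    return math.atan2(hy - ty, hx - tx)
```

A head frame directly to the right of the tail frame yields a heading of 0, and one directly above or below differs by pi/2, matching the head-to-tail vector formulation used for the heading loss.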
The position, heading pose, category and the like of the vehicle to be detected are then provided to a real-time twin product platform, and the traffic flow and trajectories on the road are restored and displayed, as twins, in a virtual 3D environment.
It should be noted that, based on the implementation manner provided in the above aspects, further combinations may be further performed to provide further implementation manners.
Based on the object detection method provided in the corresponding embodiment of fig. 3, the embodiment of the application further provides an object detection device 900. Referring to fig. 9, the object detection apparatus 900 includes an acquisition unit 901, a determination unit 902, and a calculation unit 903:
the acquiring unit 901 is configured to acquire an image to be detected, where the image to be detected includes an object to be detected;
The determining unit 902 is configured to determine location information of a first detection frame corresponding to a first component in the image to be detected, and determine location information of a second detection frame corresponding to a second component in the image to be detected, where the first component and the second component are different components included in the object to be detected, and the first component and the second component are obtained by performing structural division on the object to be detected along a movement direction of the object to be detected;
the determining unit 902 is further configured to determine a first positional relationship between the first detection frame and the second detection frame under an image coordinate system based on the positional information of the first detection frame and the positional information of the second detection frame;
the calculating unit 903 is configured to calculate, based on the first positional relationship, the heading pose of the object to be detected in the image coordinate system using the positional information of the first detection frame and the positional information of the second detection frame.
In a possible implementation manner, the determining unit 902 is specifically configured to:
identifying the object to be detected in the image to be detected to obtain the position information of a third detection frame corresponding to the object to be detected, and identifying a component to be detected of the object to be detected to obtain the position information of a target detection frame corresponding to the component to be detected, wherein the component to be detected is the component, among the first component and the second component, that is visible in the image to be detected, and the target detection frame corresponding to the component to be detected is located within the range of the third detection frame;
And determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame.
In a possible implementation manner, the determining unit 902 is specifically configured to:
if the component to be detected is a first component, determining the position information of the target detection frame as the position information of the first detection frame;
and determining the position information of the second detection frame according to the principle of rigid symmetry based on the position information of the third detection frame and the position information of the first detection frame.
In a possible implementation manner, the determining unit 902 is specifically configured to:
if the component to be detected is a second component, determining the position information of the target detection frame as the position information of the second detection frame;
and determining the position information of the first detection frame according to the principle of rigid symmetry based on the position information of the third detection frame and the position information of the second detection frame.
In a possible implementation manner, the component to be detected includes the first component and the second component, and the determining unit 902 is specifically configured to:
And determining the position information of the target detection frame corresponding to the first component as the position information of the first detection frame, and determining the position information of the target detection frame corresponding to the second component as the position information of the second detection frame.
In a possible implementation manner, the determining unit 902 is specifically configured to:
identifying an object to be detected in the image to be detected through a target detection model to obtain the position information of a third detection frame corresponding to the object to be detected, and identifying a component part to be detected of the object to be detected through the target detection model to obtain the position information of a target detection frame corresponding to the component part to be detected.
In a possible implementation manner, the apparatus further includes a training unit, where the training unit is configured to:
the method comprises the steps of obtaining a sample image, wherein a first standard detection frame corresponding to a first sample component part, a second standard detection frame corresponding to a second sample component part, a third standard detection frame corresponding to a sample detection object, a visible mark of the first sample component part and a visible mark of the second sample component part are marked in the sample image, the first sample component part and the second sample component part are different component parts included in the sample detection object, the first sample component part and the second sample component part are obtained by carrying out structural division on the sample detection object along the movement direction of the sample detection object, and the first standard detection frame and the second standard detection frame are positioned in the range of the third standard detection frame;
Outputting, based on the sample image, a first prediction detection frame corresponding to a target sample component through a model to be trained, and outputting a second prediction detection frame corresponding to the sample detection object through the model to be trained, wherein the target sample component is the component, among the first sample component and the second sample component, whose visible mark indicates that it is visible;
determining the position information of a third prediction detection frame corresponding to the first sample component and the position information of a fourth prediction detection frame corresponding to the second sample component based on the position information of the first prediction detection frame;
determining a second position relation of the third prediction detection frame and the fourth prediction detection frame under an image coordinate system based on the position information of the third prediction detection frame and the position information of the fourth prediction detection frame;
calculating a predicted heading pose of the sample detection object under the image coordinate system by using the position information of the third prediction detection frame and the position information of the fourth prediction detection frame based on the second position relation;
constructing heading loss according to the predicted heading pose and the standard heading pose, constructing a first predicted loss according to the third predicted detection frame and the first standard detection frame, constructing a second predicted loss according to the fourth predicted detection frame and the second standard detection frame, and constructing a third predicted loss according to the second predicted detection frame and the third standard detection frame;
And training the model to be trained based on the heading loss, the first prediction loss, the second prediction loss and the third prediction loss to obtain the target detection model.
In a possible implementation manner, the first standard detection frame and the second standard detection frame are rectangular detection frames, and the apparatus further includes a processing unit:
the processing unit is used for carrying out normalization processing on at least one detection frame in the first standard detection frame and the second standard detection frame to obtain a processed first standard detection frame and a processed second standard detection frame;
the training unit is specifically configured to:
constructing the first prediction loss according to the third prediction detection frame and the processed first standard detection frame;
and constructing the second prediction loss according to the fourth prediction detection frame and the processed second standard detection frame.
In one possible implementation manner, the object to be detected is a vehicle to be detected, the first component is a vehicle head of the vehicle to be detected, the first detection frame is a vehicle head detection frame, the second component is a vehicle tail of the vehicle to be detected, and the second detection frame is a vehicle tail detection frame; or the first component is the tail of the vehicle to be detected, the first detection frame is a tail detection frame, the second component is the head of the vehicle to be detected, and the second detection frame is a head detection frame.
In a possible implementation manner, the apparatus further includes a conversion unit:
the acquiring unit 901 is further configured to acquire, after the heading pose of the object to be detected in the image coordinate system is calculated based on the first positional relationship using the positional information of the first detection frame and the positional information of the second detection frame, the internal parameters and external parameters of the image acquisition device, where the image acquisition device is configured to acquire the image to be detected;
the conversion unit is configured to convert the heading pose of the object to be detected under the image coordinate system into a three-dimensional coordinate system based on the internal parameters and external parameters of the image acquisition device.
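The conversion can be sketched with a pinhole camera model (a hedged illustration; the ground-plane assumption, the parameter layout and all names are ours, not the embodiment's): back-project the tail and head image points onto the ground plane z = 0 using the internal and external parameters, then take the angle of the resulting 3D vector.

```python
import math

def matvec(R, v):
    # 3x3 row-major matrix times 3-vector.
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def image_point_to_ground(u, v, intrinsics, R, C):
    # intrinsics = (fx, fy, cx, cy) of a pinhole camera (internal parameters);
    # R: camera-to-world rotation (3x3, row-major) and C: camera center in
    # world coordinates (external parameters). Intersects the back-projected
    # ray with the ground plane z = 0. Illustrative sketch only.
    fx, fy, cx, cy = intrinsics
    ray_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    ray = matvec(R, ray_cam)
    s = -C[2] / ray[2]                      # scale at which the ray hits z = 0
    return tuple(C[i] + s * ray[i] for i in range(3))

def heading_in_world(tail_uv, head_uv, intrinsics, R, C):
    # Heading pose in the three-dimensional (world) coordinate system from the
    # tail and head image points.
    t3 = image_point_to_ground(*tail_uv, intrinsics, R, C)
    h3 = image_point_to_ground(*head_uv, intrinsics, R, C)
    return math.atan2(h3[1] - t3[1], h3[0] - t3[0])
```

With a camera looking straight down from above the origin, an image-space head-to-tail offset along u maps to a world heading of 0 along the x axis.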
According to the technical scheme, in order to utilize the related information of the object to be detected, the image to be detected comprising the object to be detected can be obtained, then the position information of the first detection frame corresponding to the first component in the image to be detected is determined, and the position information of the second detection frame corresponding to the second component in the image to be detected is determined. The first component part and the second component part are different component parts included in the object to be detected, and the first component part and the second component part are obtained by carrying out structural division on the object to be detected along the moving direction of the object to be detected, so that the heading gesture of the object to be detected can be determined based on the position information of the first detection frame and the position information of the second detection frame, and further information of the object to be detected can be extracted from the image to be detected, so that the actual condition of the object to be detected can be accurately reflected. The positions of the first detection frame and the second detection frame may affect the determination mode of the heading gesture, so the first position relationship between the first detection frame and the second detection frame under the image coordinate system may be determined based on the position information of the first detection frame and the position information of the second detection frame, and then the heading gesture of the object to be detected under the image coordinate system may be calculated based on the first position relationship and by using the position information of the first detection frame and the position information of the second detection frame. 
The application can determine the heading pose of the object to be detected from the image to be detected, thereby more accurately reflecting the actual situation of the object in the real scene and improving the accuracy of subsequent processing. For example, in a digital twin scene, the difference between the projection of the object to be detected into 3D space and the situation of the object in the real scene can be reduced, greatly improving the fidelity of the reconstruction.
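As an illustrative sketch only (not limiting the scheme above), with axis-aligned detection frames given as (x_min, y_min, x_max, y_max) in image coordinates, the heading pose in the image coordinate system can be taken as the angle of the vector joining the centre of the second detection frame (e.g. the vehicle tail) to the centre of the first detection frame (e.g. the vehicle head). The function names are assumptions for illustration:

```python
import math

def box_center(box):
    # box given as (x_min, y_min, x_max, y_max) in image coordinates
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def heading_pose(first_box, second_box):
    """Heading angle (radians) of the vector from the second component part
    (e.g. vehicle tail) to the first component part (e.g. vehicle head),
    measured in the image coordinate system (x right, y down)."""
    cx1, cy1 = box_center(first_box)
    cx2, cy2 = box_center(second_box)
    return math.atan2(cy1 - cy2, cx1 - cx2)

# Tail frame to the left of the head frame: the object heads toward +x.
angle = heading_pose((100, 40, 140, 80), (20, 40, 60, 80))  # -> 0.0
```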
The embodiment of the application also provides a computer device that can execute the object detection method. The computer device may be, for example, a terminal; taking a smartphone as an example:
fig. 10 is a block diagram illustrating part of the structure of a smartphone according to an embodiment of the present application. Referring to fig. 10, the smartphone includes: a Radio Frequency (RF) circuit 1010, a memory 1020, an input unit 1030, a display unit 1040, a sensor 1050, an audio circuit 1060, a wireless fidelity (WiFi) module 1070, a processor 1080, and a power supply 1090. The input unit 1030 may include a touch panel 1031 and other input devices 1032, the display unit 1040 may include a display panel 1041, and the audio circuit 1060 may include a speaker 1061 and a microphone 1062. It will be appreciated that the smartphone structure shown in fig. 10 does not limit the smartphone, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The memory 1020 may be used to store software programs and modules; the processor 1080 performs the various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the smartphone (such as audio data, phonebooks, etc.). In addition, the memory 1020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state memory device.
The processor 1080 is the control center of the smartphone. It connects the various parts of the entire smartphone through various interfaces and lines, and performs the various functions of the smartphone and processes data by running or executing the software programs and/or modules stored in the memory 1020 and invoking the data stored in the memory 1020. Optionally, the processor 1080 may include one or more processing units; preferably, the processor 1080 may integrate an application processor, which mainly handles the operating system, user interface, applications, and so on, with a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor need not be integrated into the processor 1080.
In this embodiment, processor 1080 in the smartphone may perform the methods provided in any of the embodiments described above.
The computer device provided in the embodiment of the present application may also be a server. As shown in fig. 11, fig. 11 is a block diagram of a server 1100 provided in an embodiment of the present application. The server 1100 may vary considerably in configuration or performance, and may include one or more processors such as central processing units (CPU) 1122, a memory 1132, and one or more storage media 1130 (such as one or more mass storage devices) storing application programs 1142 or data 1144. The memory 1132 and the storage medium 1130 may be transitory or persistent. The program stored on the storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processor 1122 may be arranged to communicate with the storage medium 1130 and to execute, on the server 1100, the series of instruction operations in the storage medium 1130.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment, the central processor 1122 in the server 1100 may perform the method provided in any of the embodiments described above.
According to an aspect of the present application, there is provided a computer-readable storage medium for storing a computer program for executing the object detection method according to the foregoing embodiments.
According to one aspect of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.
The description of each process or structure corresponding to the drawings has its own emphasis; for any part of a process or structure that is not described in detail, reference may be made to the descriptions of the other processes or structures.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a computer program.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (20)
1. An object detection method, the method comprising:
acquiring an image to be detected, wherein the image to be detected comprises an object to be detected;
determining position information of a first detection frame corresponding to a first component in the image to be detected, and determining position information of a second detection frame corresponding to a second component in the image to be detected, wherein the first component and the second component are different components included in the object to be detected, the first component and the second component are obtained by carrying out structural division on the object to be detected along the moving direction of the object to be detected, and the position information of the first detection frame and the position information of the second detection frame are obtained by adopting a target detection model;
determining a first position relation of the first detection frame and the second detection frame in an image coordinate system based on the position information of the first detection frame and the position information of the second detection frame;
calculating the heading pose of the object to be detected in the image coordinate system by using the position information of the first detection frame and the position information of the second detection frame based on the first position relation;
the training mode of the target detection model comprises the following steps:
acquiring a sample image, wherein a first standard detection frame corresponding to a first sample component part, a second standard detection frame corresponding to a second sample component part, a third standard detection frame corresponding to a sample detection object, a visible mark of the first sample component part and a visible mark of the second sample component part are annotated in the sample image, the first sample component part and the second sample component part are different component parts included in the sample detection object, the first sample component part and the second sample component part are obtained by structurally dividing the sample detection object along the movement direction of the sample detection object, and the first standard detection frame and the second standard detection frame are located within the range of the third standard detection frame;
outputting, based on the sample image, a first prediction detection frame corresponding to a target sample component part through a model to be trained, and outputting a second prediction detection frame corresponding to the sample detection object through the model to be trained, wherein the target sample component part is the component part, of the first sample component part and the second sample component part, that is indicated as visible;
determining the position information of a third prediction detection frame corresponding to the first sample component and the position information of a fourth prediction detection frame corresponding to the second sample component based on the position information of the first prediction detection frame;
determining a second position relation of the third prediction detection frame and the fourth prediction detection frame in an image coordinate system based on the position information of the third prediction detection frame and the position information of the fourth prediction detection frame;
calculating a predicted heading pose of the sample detection object in the image coordinate system by using the position information of the third prediction detection frame and the position information of the fourth prediction detection frame based on the second position relation;
constructing heading loss according to the predicted heading pose and the standard heading pose, constructing a first predicted loss according to the third predicted detection frame and the first standard detection frame, constructing a second predicted loss according to the fourth predicted detection frame and the second standard detection frame, and constructing a third predicted loss according to the second predicted detection frame and the third standard detection frame;
and training the model to be trained based on the heading loss, the first prediction loss, the second prediction loss and the third prediction loss to obtain the target detection model.
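The combination of the heading loss and the three box losses in claim 1 can be sketched as follows. This is only an illustrative plain-Python example, not part of the claimed method: it assumes smooth-L1 box losses, a wrapped angular heading loss, and tunable weights, all of which are design choices the claim leaves open.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    # Standard smooth-L1 (Huber) loss on a single scalar.
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def box_loss(pred_box, gt_box):
    # Average smooth-L1 over the four box coordinates (x_min, y_min, x_max, y_max).
    return sum(smooth_l1(p, g) for p, g in zip(pred_box, gt_box)) / 4.0

def heading_loss(pred_angle, gt_angle):
    # Wrap the angular difference into (-pi, pi] before penalising it.
    d = (pred_angle - gt_angle + math.pi) % (2 * math.pi) - math.pi
    return smooth_l1(d, 0.0)

def total_loss(pred, gt, w=(1.0, 1.0, 1.0, 1.0)):
    # pred/gt: dicts holding the predicted/standard boxes and heading pose.
    return (w[0] * heading_loss(pred["heading"], gt["heading"])
            + w[1] * box_loss(pred["first_box"], gt["first_box"])
            + w[2] * box_loss(pred["second_box"], gt["second_box"])
            + w[3] * box_loss(pred["object_box"], gt["object_box"]))
```

A perfect prediction gives a total loss of zero; the relative weights control the trade-off between heading accuracy and box accuracy during training.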
2. The method according to claim 1, wherein determining the position information of the first detection frame corresponding to the first component in the image to be detected and determining the position information of the second detection frame corresponding to the second component in the image to be detected includes:
identifying an object to be detected in the image to be detected to obtain position information of a third detection frame corresponding to the object to be detected, and identifying a component part to be detected of the object to be detected to obtain position information of a target detection frame corresponding to the component part to be detected, wherein the component part to be detected is the component part, of the first component part and the second component part, that is visible in the image to be detected, and the target detection frame corresponding to the component part to be detected is located within the range of the third detection frame;
and determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame.
3. The method of claim 2, wherein the determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame comprises:
if the component to be detected is a first component, determining the position information of the target detection frame as the position information of the first detection frame;
and determining the position information of the second detection frame according to the principle of rigid symmetry based on the position information of the third detection frame and the position information of the first detection frame.
4. The method of claim 2, wherein the determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame comprises:
if the component to be detected is a second component, determining the position information of the target detection frame as the position information of the second detection frame;
and determining the position information of the first detection frame according to the principle of rigid symmetry based on the position information of the third detection frame and the position information of the second detection frame.
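The rigid-symmetry completion in claims 3 and 4 can be sketched as reflecting the visible component part's detection frame through the centre of the whole-object detection frame. This is an illustrative sketch only, assuming axis-aligned frames given as (x_min, y_min, x_max, y_max); the function name is an assumption, not part of the claims:

```python
def mirror_box(object_box, known_box):
    """Infer the occluded component part's detection frame by reflecting the
    visible component part's frame through the centre of the whole-object
    frame, following the rigid-symmetry principle described above."""
    ox = (object_box[0] + object_box[2]) / 2.0
    oy = (object_box[1] + object_box[3]) / 2.0
    x0, y0, x1, y1 = known_box
    # Reflect each corner through (ox, oy); reorder so x0 <= x1 and y0 <= y1.
    return (2 * ox - x1, 2 * oy - y1, 2 * ox - x0, 2 * oy - y0)

# Head frame in the right half of a vehicle frame -> inferred tail frame on the left.
tail = mirror_box((0, 0, 100, 40), (60, 0, 100, 40))  # -> (0, 0, 40, 40)
```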
5. The method of claim 2, wherein the component to be detected comprises the first component and the second component, wherein the determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame comprises:
and determining the position information of the target detection frame corresponding to the first component as the position information of the first detection frame, and determining the position information of the target detection frame corresponding to the second component as the position information of the second detection frame.
6. The method according to claim 2, wherein the identifying the object to be detected in the image to be detected to obtain the position information of the third detection frame corresponding to the object to be detected, and identifying the component to be detected of the object to be detected to obtain the position information of the target detection frame corresponding to the component to be detected, includes:
identifying an object to be detected in the image to be detected through a target detection model to obtain the position information of a third detection frame corresponding to the object to be detected, and identifying a component part to be detected of the object to be detected through the target detection model to obtain the position information of a target detection frame corresponding to the component part to be detected.
7. The method of claim 1, wherein the first standard detection frame and the second standard detection frame are rectangular detection frames, the method further comprising:
normalizing at least one of the first standard detection frame and the second standard detection frame to obtain a processed first standard detection frame and a processed second standard detection frame;
said constructing a first prediction loss from said third prediction detection box and said first standard detection box, comprising:
constructing the first prediction loss according to the third prediction detection frame and the processed first standard detection frame;
said constructing a second prediction loss from said fourth prediction detection box and said second standard detection box, comprising:
and constructing the second prediction loss according to the fourth prediction detection frame and the processed second standard detection frame.
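One plausible reading of the normalization in claim 7 is scaling the detection-frame coordinates by the image dimensions so that box losses are comparable across resolutions; the claim does not fix the exact normalization, so the sketch below is an assumption for illustration:

```python
def normalize_box(box, image_width, image_height):
    # Scale pixel coordinates of an (x_min, y_min, x_max, y_max) frame
    # into [0, 1] relative to the image size.
    x0, y0, x1, y1 = box
    return (x0 / image_width, y0 / image_height,
            x1 / image_width, y1 / image_height)

norm = normalize_box((160, 60, 480, 180), 640, 360)
```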
8. The method of any one of claims 1-7, wherein the object to be detected is a vehicle to be detected, the first component is a head of the vehicle to be detected, the first detection frame is a head detection frame, the second component is a tail of the vehicle to be detected, and the second detection frame is a tail detection frame; or the first component is the tail of the vehicle to be detected, the first detection frame is a tail detection frame, the second component is the head of the vehicle to be detected, and the second detection frame is a head detection frame.
9. The method according to any one of claims 1 to 7, wherein after calculating the heading pose of the object to be detected in the image coordinate system using the position information of the first detection frame and the position information of the second detection frame based on the first position relation, the method further comprises:
acquiring internal parameters and external parameters of an image acquisition device, wherein the image acquisition device is used for acquiring the image to be detected;
and converting the heading pose of the object to be detected in the image coordinate system into a three-dimensional coordinate system based on the internal parameters and the external parameters of the image acquisition device.
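The conversion in claim 9 from the image coordinate system to a three-dimensional coordinate system can be illustrated with a minimal sketch. It assumes a flat ground plane (z = 0 in world coordinates), an undistorted pinhole camera with focal length f and principal point (cx, cy) as the internal parameters, and a rotation R and translation t (world to camera) as the external parameters; the function names are illustrative and not part of the claims:

```python
import math

def matvec(M, v):
    # 3x3 matrix-vector product.
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(M):
    return [[M[j][i] for j in range(3)] for i in range(3)]

def pixel_to_ground(u, v, f, cx, cy, R, t):
    """Back-project pixel (u, v) onto the world ground plane z = 0,
    given pinhole intrinsics f, cx, cy and extrinsics R, t (world -> camera)."""
    ray_cam = [(u - cx) / f, (v - cy) / f, 1.0]   # K^{-1} [u, v, 1]
    Rt = transpose(R)
    d = matvec(Rt, ray_cam)                       # ray direction in world frame
    C = [-x for x in matvec(Rt, t)]               # camera centre in world frame
    s = -C[2] / d[2]                              # intersection with z = 0
    return [C[i] + s * d[i] for i in range(3)]

def heading_in_world(head_pixel, tail_pixel, f, cx, cy, R, t):
    # Heading angle of the tail-to-head direction on the ground plane.
    h = pixel_to_ground(*head_pixel, f, cx, cy, R, t)
    q = pixel_to_ground(*tail_pixel, f, cx, cy, R, t)
    return math.atan2(h[1] - q[1], h[0] - q[0])
```

Back-projecting the two detection-frame centres and measuring the angle between the resulting ground points converts the image-plane heading pose into a world-frame heading under the stated assumptions.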
10. An object detection device, characterized in that the device comprises an acquisition unit, a determination unit, a calculation unit and a training unit:
the acquisition unit is used for acquiring an image to be detected, wherein the image to be detected comprises an object to be detected;
the determining unit is configured to determine position information of a first detection frame corresponding to a first component in the image to be detected, and determine position information of a second detection frame corresponding to a second component in the image to be detected, where the first component and the second component are different components included in the object to be detected, the first component and the second component are obtained by structurally dividing the object to be detected along its movement direction, and the position information of the first detection frame and the position information of the second detection frame are both obtained by detection using a target detection model;
The determining unit is further used for determining a first position relation between the first detection frame and the second detection frame under an image coordinate system based on the position information of the first detection frame and the position information of the second detection frame;
the calculating unit is used for calculating the heading pose of the object to be detected in the image coordinate system by using the position information of the first detection frame and the position information of the second detection frame based on the first position relation;
the training unit is used for:
acquiring a sample image, wherein a first standard detection frame corresponding to a first sample component part, a second standard detection frame corresponding to a second sample component part, a third standard detection frame corresponding to a sample detection object, a visible mark of the first sample component part and a visible mark of the second sample component part are annotated in the sample image, the first sample component part and the second sample component part are different component parts included in the sample detection object, the first sample component part and the second sample component part are obtained by structurally dividing the sample detection object along the movement direction of the sample detection object, and the first standard detection frame and the second standard detection frame are located within the range of the third standard detection frame;
outputting, based on the sample image, a first prediction detection frame corresponding to a target sample component part through a model to be trained, and outputting a second prediction detection frame corresponding to the sample detection object through the model to be trained, wherein the target sample component part is the component part, of the first sample component part and the second sample component part, that is indicated as visible;
determining the position information of a third prediction detection frame corresponding to the first sample component and the position information of a fourth prediction detection frame corresponding to the second sample component based on the position information of the first prediction detection frame;
determining a second position relation of the third prediction detection frame and the fourth prediction detection frame under an image coordinate system based on the position information of the third prediction detection frame and the position information of the fourth prediction detection frame;
calculating a predicted heading pose of the sample detection object in the image coordinate system by using the position information of the third prediction detection frame and the position information of the fourth prediction detection frame based on the second position relation;
constructing heading loss according to the predicted heading pose and the standard heading pose, constructing a first predicted loss according to the third predicted detection frame and the first standard detection frame, constructing a second predicted loss according to the fourth predicted detection frame and the second standard detection frame, and constructing a third predicted loss according to the second predicted detection frame and the third standard detection frame;
and training the model to be trained based on the heading loss, the first prediction loss, the second prediction loss and the third prediction loss to obtain the target detection model.
11. The apparatus according to claim 10, wherein the determining unit is specifically configured to:
identifying an object to be detected in the image to be detected to obtain position information of a third detection frame corresponding to the object to be detected, and identifying a component part to be detected of the object to be detected to obtain position information of a target detection frame corresponding to the component part to be detected, wherein the component part to be detected is a component part of the first component part and the second component part which are visible in the image to be detected, and the target detection frame corresponding to the component part to be detected is located in the range of the third detection frame;
and determining the position information of the first detection frame and the position information of the second detection frame based on the position information of the target detection frame.
12. The apparatus according to claim 11, wherein the determining unit is specifically configured to:
if the component to be detected is a first component, determining the position information of the target detection frame as the position information of the first detection frame;
and determining the position information of the second detection frame according to the principle of rigid symmetry based on the position information of the third detection frame and the position information of the first detection frame.
13. The apparatus according to claim 11, wherein the determining unit is specifically configured to:
if the component to be detected is a second component, determining the position information of the target detection frame as the position information of the second detection frame;
and determining the position information of the first detection frame according to the principle of rigid symmetry based on the position information of the third detection frame and the position information of the second detection frame.
14. The apparatus according to claim 11, wherein the component to be detected comprises the first component and the second component, the determining unit being specifically configured to:
and determining the position information of the target detection frame corresponding to the first component as the position information of the first detection frame, and determining the position information of the target detection frame corresponding to the second component as the position information of the second detection frame.
15. The apparatus according to claim 11, wherein the determining unit is specifically configured to:
identifying an object to be detected in the image to be detected through a target detection model to obtain the position information of a third detection frame corresponding to the object to be detected, and identifying a component part to be detected of the object to be detected through the target detection model to obtain the position information of a target detection frame corresponding to the component part to be detected.
16. The apparatus of claim 10, wherein the first standard detection frame and the second standard detection frame are rectangular detection frames, the apparatus further comprising a processing unit:
the processing unit is used for carrying out normalization processing on at least one detection frame in the first standard detection frame and the second standard detection frame to obtain a processed first standard detection frame and a processed second standard detection frame;
the training unit is specifically configured to:
constructing the first prediction loss according to the third prediction detection frame and the processed first standard detection frame;
and constructing the second prediction loss according to the fourth prediction detection frame and the processed second standard detection frame.
17. The apparatus of any one of claims 10-16, wherein the object to be detected is a vehicle to be detected, the first component is a head of the vehicle to be detected, the first detection frame is a head detection frame, the second component is a tail of the vehicle to be detected, and the second detection frame is a tail detection frame; or the first component is the tail of the vehicle to be detected, the first detection frame is a tail detection frame, the second component is the head of the vehicle to be detected, and the second detection frame is a head detection frame.
18. The apparatus according to any one of claims 10-16, further comprising a conversion unit:
the acquiring unit is further configured to acquire an internal parameter and an external parameter of an image acquisition device after the heading pose of the object to be detected in the image coordinate system is calculated based on the first position relation by using the position information of the first detection frame and the position information of the second detection frame, where the image acquisition device is configured to acquire the image to be detected;
the conversion unit is used for converting the heading pose of the object to be detected in the image coordinate system into a three-dimensional coordinate system based on the internal parameters and the external parameters of the image acquisition device.
19. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
the processor is configured to perform the method of any of claims 1-9 according to instructions in the computer program.
20. A computer readable storage medium for storing a computer program which, when executed by a processor, causes the processor to perform the method of any one of claims 1-9.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310965922.5A CN116681884B (en) | 2023-08-02 | 2023-08-02 | Object detection method and related device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116681884A CN116681884A (en) | 2023-09-01 |
| CN116681884B true CN116681884B (en) | 2023-12-08 |
Family
ID=87787671
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310965922.5A Active CN116681884B (en) | 2023-08-02 | 2023-08-02 | Object detection method and related device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116681884B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119360659B (en) * | 2024-10-23 | 2025-10-17 | 北京卓视智通科技有限责任公司 | Vehicle chassis position positioning method and device, electronic equipment and storage medium |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112926395A (en) * | 2021-01-27 | 2021-06-08 | 上海商汤临港智能科技有限公司 | Target detection method and device, computer equipment and storage medium |
| CN114091521A (en) * | 2021-12-09 | 2022-02-25 | 深圳佑驾创新科技有限公司 | Method, device and equipment for detecting vehicle course angle and storage medium |
| CN114549645A (en) * | 2022-02-27 | 2022-05-27 | 重庆长安汽车股份有限公司 | A method and device for calculating the heading angle of a target vehicle based on visual information |
| WO2022161139A1 (en) * | 2021-01-29 | 2022-08-04 | 上海商汤智能科技有限公司 | Driving direction test method and apparatus, computer device, and storage medium |
| CN115170660A (en) * | 2022-07-01 | 2022-10-11 | 上海宏景智驾信息科技有限公司 | Method and device for acquiring vehicle orientation information and storage medium |
Worldwide Applications (1)
- 2023-08-02 CN CN202310965922.5A patent/CN116681884B/en active Active
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112926395A (en) * | 2021-01-27 | 2021-06-08 | 上海商汤临港智能科技有限公司 | Target detection method and device, computer equipment and storage medium |
| WO2022161139A1 (en) * | 2021-01-29 | 2022-08-04 | 上海商汤智能科技有限公司 | Driving direction test method and apparatus, computer device, and storage medium |
| CN114091521A (en) * | 2021-12-09 | 2022-02-25 | 深圳佑驾创新科技有限公司 | Method, device and equipment for detecting vehicle course angle and storage medium |
| CN114549645A (en) * | 2022-02-27 | 2022-05-27 | 重庆长安汽车股份有限公司 | A method and device for calculating the heading angle of a target vehicle based on visual information |
| CN115170660A (en) * | 2022-07-01 | 2022-10-11 | 上海宏景智驾信息科技有限公司 | Method and device for acquiring vehicle orientation information and storage medium |
Non-Patent Citations (1)
| Title |
|---|
| Vehicle recognition system based on a calibrated camera (基于已标定摄像机的车辆识别系统); Zhao Fei; Microcomputer & Its Applications (《微型机与应用》); 34(24); pp. 56-59, 62 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116681884A (en) | 2023-09-01 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP3627180B1 (en) | Sensor calibration method and device, computer device, medium, and vehicle | |
| CN110163904B (en) | Object labeling method, movement control method, device, equipment and storage medium | |
| US20250285308A1 (en) | Advanced driver assist system, method of calibrating the same, and method of detecting object in the same | |
| CN114792416A (en) | Target detection method and device | |
| CN111144315A (en) | Target detection method and device, electronic equipment and readable storage medium | |
| CN111950428A (en) | Target obstacle identification method, device and vehicle | |
| CN114556425A (en) | Positioning method, positioning device, unmanned aerial vehicle and storage medium | |
| US20230401837A1 (en) | Method for training neural network model and method for generating image | |
| CN118397588B (en) | Camera scene analysis method, system, equipment and medium for intelligent driving automobile | |
| CN114556419B (en) | Three-dimensional point cloud segmentation method and device, and movable platform | |
| CN112926461B (en) | Neural network training, driving control method and device | |
| US20240282002A1 (en) | Vision positioning method and related apparatus | |
| CN114384486A (en) | Data processing method and device | |
| CN116681884B (en) | Object detection method and related device | |
| CN110705134A (en) | Driving test method, device, equipment and computer readable storage medium | |
| CN117830999A (en) | A method and device for simulating automobile panoramic vision for scene perception | |
| CN118037790A (en) | Point cloud processing method, device, computer equipment and storage medium | |
| CN116912788A (en) | Attack detection methods, devices, equipment and storage media for autonomous driving systems | |
| CN120032339B (en) | Object detection methods for autonomous driving scenarios based on BEVs and fully sparse architectures | |
| CN117237609B (en) | Multi-mode fusion three-dimensional target detection method and system | |
| CN109711363B (en) | Vehicle positioning method, device, equipment and storage medium | |
| CN117789193A (en) | Multimode data fusion 3D target detection method based on secondary enhancement | |
| CN117011481A (en) | Method and device for constructing three-dimensional map, electronic equipment and storage medium | |
| CN117994614A (en) | Target detection method and device | |
| CN116595064B (en) | Data mining system, method and device based on graphic and text information combination |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |