CN113642425A - Multi-modality-based image detection method and apparatus, electronic device, and storage medium - Google Patents
- Publication number: CN113642425A
- Application number: CN202110859555.1A
- Authority: CN (China)
- Legal status: Pending (assumed by Google Patents; not a legal conclusion)
Abstract
The present disclosure provides a multi-modality-based image detection method and apparatus, an electronic device, and a storage medium, relating to the field of artificial intelligence, in particular to computer vision and deep learning, and applicable to face recognition scenarios. The specific implementation scheme is as follows: acquire a reference frame image and a frame image to be processed, the two having different modalities; identify a reference detection frame in the reference frame image; and map the reference detection frame into the frame image to be processed to obtain a target detection frame used for image detection, so that detection frames can be accurately mapped between images of different modalities.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to computer vision and deep learning, applicable to face recognition scenarios, and specifically to a multi-modality-based image detection method and apparatus, an electronic device, and a storage medium.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning, deep learning, big data processing, and knowledge graph technologies.
Face recognition is an important component of computer vision and image processing technology and is widely applied in fields such as transportation and finance. Image detection is a key step in the face recognition process; in the related art, it is mainly performed on video frames captured by visible light cameras and near-infrared cameras.
Disclosure of Invention
Provided are a multi-modality based image detection method, apparatus, electronic device, storage medium, and computer program product.
According to a first aspect, there is provided a multi-modality-based image detection method, comprising: acquiring a reference frame image and a frame image to be processed, wherein the reference frame image and the frame image to be processed have different modalities; identifying a reference detection frame from the reference frame image; and mapping the reference detection frame into the frame image to be processed to obtain a target detection frame, wherein the target detection frame is used for image detection.
According to a second aspect, there is provided a multi-modality-based image detection apparatus, comprising: a first acquisition module configured to acquire a reference frame image and a frame image to be processed, wherein the reference frame image and the frame image to be processed have different modalities; an identification module configured to identify a reference detection frame from the reference frame image; and a mapping module configured to map the reference detection frame into the frame image to be processed to obtain a target detection frame, wherein the target detection frame is used for image detection.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a multi-modality based image detection method as set forth in embodiments of the present disclosure.
According to a fourth aspect, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the multi-modality-based image detection method set forth in embodiments of the present disclosure.
According to a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the multi-modality-based image detection method set forth in embodiments of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a convolutional neural network provided in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an image inspection system provided in accordance with an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an image detection process provided according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to a fifth embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device for implementing the multi-modality based image detection method according to the embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure.
It should be noted that the main execution body of the multi-modal-based image detection method of this embodiment is a multi-modal-based image detection apparatus, which may be implemented by software and/or hardware, and the apparatus may be configured in an electronic device, and the electronic device may include, but is not limited to, a terminal, a server, and the like.
The embodiments of the present disclosure relate to the field of artificial intelligence, in particular to computer vision and deep learning, and can be applied to face recognition scenarios. They can accurately map detection frames between images of different modalities; when the mapped target detection frame is used for image detection, the image detection accuracy and recall rate in complex scenes are effectively improved, as is the effectiveness of the image recognition system.
Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.
Deep learning learns the intrinsic laws and representation levels of sample data; the information obtained during learning greatly helps the interpretation of data such as text, images, and sound. Its ultimate goal is to enable machines to analyze and learn like humans and to recognize data such as text, images, and sound.
Computer vision uses cameras and computers in place of human eyes to identify, track, and measure targets, and further processes the resulting images so that they become more suitable for human observation or for transmission to instruments for detection.
Face recognition is a biometric technology that identifies a person based on facial feature information. A camera collects images or video streams containing faces, the faces in the images are automatically detected and tracked, and the detected faces are then recognized through a series of related techniques; this process is generally called face recognition. The face recognition described here is authorized by the users to whom the faces belong, is carried out through public, lawful, and compliant channels, and conforms to relevant laws and regulations.
As shown in fig. 1, the multi-modality based image detection method includes:
s101: and acquiring a reference frame image and a frame image to be processed, wherein the reference frame image and the frame image to be processed have different modes.
In the embodiment of the disclosure, a reference frame image and a frame image to be processed are obtained first.
The image currently being processed may be referred to as the frame image to be processed. It may be obtained by capturing an image of any target object, and may be one or more frames of a captured video stream or a standalone image, which is not limited here.
The image used as a reference when detecting the frame image to be processed may be referred to as the reference frame image. The reference frame image may, for example, contain the same target object as the frame image to be processed, or carry information associated with it, so as to assist in detecting the target object in the frame image to be processed, which is not limited here.
And, the modality of the reference frame image and the modality of the frame image to be processed are different, for example: the modality of the reference frame image may be a Near Infrared (NIR) modality, and the modality of the frame image to be processed may be a Red Green Blue (RGB) modality; alternatively, the reference frame image may be an RGB modality, and the frame image to be processed may be an NIR modality; alternatively, the reference frame image and the frame image to be processed may be any other possible different modalities, which is not limited in this respect.
In some embodiments, the reference frame image and the frame image to be processed may be images captured at the same time, and the reference frame image may be captured by a reference camera and the frame image to be processed may be captured by a target camera, wherein the reference camera and the target camera may be one of a visible light camera (RGB camera), a near infrared camera (NIR camera), and any other possible cameras for capturing images of different modalities, without limitation. Moreover, the RGB camera may further support an Auto Exposure (AE) module, hereinafter referred to as "FaceAE".
That is, the reference camera and the target camera may be used to capture images for the same target object at the same time, and the image captured by the reference camera and the image captured by the target camera are respectively used as the reference frame image and the frame image to be processed. Therefore, the reference frame image captured at the same time is adopted to assist in processing the frame image to be processed, the correlation between the reference frame image and the frame image to be processed is stronger, and therefore the accuracy of cross-mode image detection can be improved.
In a specific example, the multi-modal-based image detection method provided by the present disclosure may be applied to a face recognition scene, and accordingly, the reference frame image and the frame image to be processed may be face images of the same target object at the same time and different modalities. For example, at a time (for example, daytime) with clear light, a visible light camera (RGB camera) may be used as a reference camera, a near infrared camera (NIR camera) may be used as a target camera, and a face image of a target object is collected, where the reference frame image is a face image in an RGB modality, and the frame image to be processed is a face image in an NIR modality; or, at a time with complex light (for example, at night), the RGB camera may be used as a target camera, the NIR camera may be used as a reference camera, and a face image of the target object is acquired, where the reference frame image is a face image in an NIR modality, and the frame image to be processed is a face image in an RGB modality.
The face images here are not face images of any specific user and cannot reflect the personal information of any specific user; they are obtained with the authorization of the corresponding users through public, lawful, and compliant channels, and their acquisition conforms to relevant laws and regulations.
S102: a reference detection frame is identified from the reference frame image.
After the reference frame image and the frame image to be processed are obtained, further, the embodiment of the disclosure identifies the reference detection frame from the reference frame image.
A bounding box at a specific position in the reference frame image may be referred to as a reference detection frame. For example, in a face recognition scenario, the face detection frame of the target object in the reference frame image is the reference detection frame; alternatively, the reference detection frame may be obtained by detecting another kind of target, which is not limited here.
For example, a pre-trained image detection algorithm may be used to identify the reference detection frame in the reference frame image; the algorithm may be a deep-learning-based convolutional neural network. Fig. 2 is a schematic structural diagram of a convolutional neural network provided according to an embodiment of the present disclosure. As shown in fig. 2, in a face recognition scenario, the reference frame image is input into the convolutional neural network, whose computation layers (E1-E6) perform face/non-face binary classification and frame coordinate regression on the reference frame image. The final face detection result is determined by sorting the binary classification scores, and the face frame coordinates predicted by the model are returned; this face frame is the reference detection frame. As shown in fig. 2, the computation layers E4 to E6 can each perform face category regression and face position regression; during detection, the layers are traversed in order until category regression and coordinate regression are achieved, for example at E4 or E5, which is not limited here.
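As a sketch of the score-sorting step described above (the function name, box format, and threshold are illustrative assumptions, not taken from the patent), the post-processing that picks the final face detection result could look like this:

```python
import numpy as np

def pick_face_box(boxes, scores, score_threshold=0.5):
    """Hypothetical post-processing: sort the candidate boxes by their
    face/non-face classification score and return the highest-scoring box
    above the threshold. `boxes` is an (N, 4) array of (x1, y1, x2, y2)
    coordinates regressed by the network; `scores` is an (N,) array of
    face-class probabilities."""
    order = np.argsort(scores)[::-1]   # indices sorted by score, highest first
    best = order[0]
    if scores[best] < score_threshold:
        return None                    # no face confidently detected
    return boxes[best]

# Example: three candidate boxes produced by the regression head
boxes = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [80, 80, 120, 130]])
scores = np.array([0.91, 0.40, 0.15])
print(pick_face_box(boxes, scores))    # the box with the top face score
```

A real detector would also apply non-maximum suppression before this step; the sketch keeps only the score-sorting aspect the patent mentions.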
In some embodiments, when the reference frame image is an RGB-modality image, the face detection algorithm may be any face detection algorithm suitable for RGB images; when the reference frame image is an NIR-modality image, it may be any face detection algorithm suitable for NIR images; alternatively, an RGB-NIR hybrid face detection algorithm that supports multi-modal face data input may be used, which is not limited here.
S103: and mapping the reference detection frame into the frame image to be processed to obtain a target detection frame, wherein the target detection frame is used for image detection.
After the reference detection frame is identified in the reference frame image, embodiments of the present disclosure map it to the corresponding position in the frame image to be processed. The resulting bounding box may be referred to as the target detection frame; its position in the frame image to be processed corresponds to the position of the reference detection frame in the reference frame image, so that the object inside the target detection frame can be detected and recognized.
For example, in the case that the reference detection frame is an RGB modal face detection frame, the target detection frame is an NIR modal face detection frame, so that a face in an NIR modal image can be detected; in other embodiments, when the reference detection frame is an NIR modality face detection frame, the target detection frame is an RGB modality face detection frame, so that a face in an RGB modality image can be detected.
In some embodiments, when mapping the reference detection frame into the frame image to be processed to obtain the target detection frame, the reference imaging parameters of the reference camera and the target imaging parameters of the target camera may be obtained first. The reference and target imaging parameters may respectively include intrinsic parameters (e.g., an intrinsic parameter matrix and a distortion parameter matrix), extrinsic parameters (e.g., a rotation matrix and a translation matrix), and any other possible parameters of the reference camera and the target camera, which is not limited here.
Further, a parameter mapping relationship between the reference imaging parameters and the target imaging parameters is determined, for example a mapping between the intrinsic parameters and/or between the extrinsic parameters, which is not limited here.
Further, the reference detection frame is mapped into the frame image to be processed according to the parameter mapping relationship to obtain a target detection frame, that is, the target detection frame is determined according to the parameter mapping relationship between the reference camera and the target camera. In this embodiment, because the camera parameters are fixed, the interference of external factors on determining the target detection frame can be reduced, and the accuracy of the target detection frame is improved. In addition, the camera parameters are easy to acquire and calculate, so that the calculation amount can be reduced, and the multi-modal image detection efficiency can be improved.
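A minimal sketch of such a parameter-based mapping, assuming a pinhole camera model, known intrinsic matrices, a known rotation/translation between the two cameras, and a single scene depth (all names and the corner-only approximation are illustrative assumptions, not the patent's definitive implementation):

```python
import numpy as np

def map_box(box, K_ref, K_tgt, R, t, depth):
    """Map a detection box from the reference camera's image into the target
    camera's image. K_ref/K_tgt are 3x3 intrinsic matrices, (R, t) the
    rotation and translation from the reference to the target camera, and
    `depth` an assumed scene depth. Each corner pixel is back-projected to a
    3D point at that depth, transformed into the target camera frame, and
    re-projected into target pixel coordinates."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1, 1.0], [x2, y2, 1.0]]).T   # 3x2 homogeneous pixels
    pts_ref = depth * (np.linalg.inv(K_ref) @ corners)     # 3D points, reference frame
    pts_tgt = R @ pts_ref + t.reshape(3, 1)                # 3D points, target frame
    proj = K_tgt @ pts_tgt
    proj = proj[:2] / proj[2]                              # perspective division by depth
    (u1, u2), (v1, v2) = proj
    return (float(u1), float(v1), float(u2), float(v2))

# Identity sanity check: same intrinsics, no relative motion -> box unchanged
print(map_box((10, 20, 30, 40), np.eye(3), np.eye(3), np.eye(3), np.zeros(3), 2.0))
# -> (10.0, 20.0, 30.0, 40.0)
```

In practice the depth would come from the depth information discussed in the second embodiment below the single-depth corner mapping here is only the simplest possible instance of the parameter mapping relationship.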
It should be noted that the face image in this embodiment is not a face image for a specific user, and cannot reflect personal information of a specific user, and the acquisition of the face image is authorized by the user corresponding to the face image, and is a face image acquired in various public and legal compliance manners, and the acquisition process of the face image conforms to relevant laws and regulations.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
It is to be understood that the above examples are only illustrative for determining the target detection frame, and in practical applications, the target detection frame may be determined in any other possible manner, which is not limited thereto.
In this embodiment, a reference frame image and a frame image to be processed, which have different modalities, are acquired; a reference detection frame is identified in the reference frame image and mapped into the frame image to be processed to obtain a target detection frame used for image detection. Detection frames can thus be accurately mapped between images of different modalities, and when the mapped target detection frame is used for image detection, the image detection accuracy and recall rate in complex scenes are effectively improved, as is the effectiveness of the image recognition system.
Fig. 3 is a schematic diagram according to a second embodiment of the present disclosure.
As shown in fig. 3, the multi-modality based image detection method includes:
s301: and acquiring a reference frame image and a frame image to be processed, wherein the reference frame image and the frame image to be processed have different modes.
S302: a reference detection frame is identified from among the reference frame images.
S303: and acquiring reference camera shooting parameters of the reference camera and acquiring target camera shooting parameters of the target camera.
For the description of S301 to S303, reference may be made to the above embodiments, which are not described herein again.
S304: and acquiring reference depth information of the reference camera and acquiring target depth information of the target camera.
In the embodiment of the present disclosure, the reference camera and the target camera may be cameras that support a three-dimensional model, for example, and in the operation of determining the parameter mapping relationship between the reference imaging parameter and the target imaging parameter, reference depth information of the reference camera and target depth information of the target camera may also be acquired.
The reference depth information and the target depth information may be determined according to parameters of the reference camera and the target camera, which is not limited thereto.
S305: and determining the coordinates of the reference pixels in the reference frame image according to the reference camera shooting parameters and the reference depth information in combination with a world coordinate system.
The coordinates of each pixel point in the reference frame image may be referred to as the reference pixel coordinates. Modeling the camera as a three-dimensional (pinhole) model, the following relationship holds:

depth × pixel coordinates = intrinsic parameter matrix × world coordinates

where the world coordinates are the coordinates of the imaged point in the camera's world coordinate system, also called physical coordinates. Therefore, given the reference imaging parameters (intrinsic and extrinsic parameters), the reference depth information, and the world coordinate system, the reference pixel coordinates in the reference frame image can be determined.
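The pinhole relationship above can be illustrated numerically. The intrinsic values below are made-up assumptions, and for simplicity the camera frame is taken to coincide with the world frame (identity rotation, zero translation):

```python
import numpy as np

# Illustrative pinhole projection: depth * [u, v, 1]^T = K * X_camera,
# with X_camera = R * X_world + t. All parameter values are hypothetical.
K = np.array([[800.0,   0.0, 320.0],   # fx,  0, cx
              [  0.0, 800.0, 240.0],   #  0, fy, cy
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)          # camera frame == world frame

def world_to_pixel(X_world):
    X_cam = R @ X_world + t            # world -> camera coordinates
    uvw = K @ X_cam                    # equals depth * [u, v, 1]
    return uvw[:2] / uvw[2]            # divide out the depth Z

# A point 2 m in front of the camera, offset 0.1 m to the right:
print(world_to_pixel(np.array([0.1, 0.0, 2.0])))   # -> [360. 240.]
```

Reversing the same equation (multiplying the pixel coordinates by the depth and by the inverse intrinsic matrix) recovers the 3D point, which is how the depth information makes the cross-camera mapping possible.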
S306: and determining the coordinates of target pixels in the frame image to be processed according to the target shooting parameters and the target depth information in combination with a world coordinate system.
The coordinates of each pixel point in the frame image to be processed may be referred to as target pixel coordinates.
In the case of determining the target imaging parameter, the target depth information, and the world coordinate system, the target pixel coordinate in the frame image to be processed may be determined, and the calculation method of the target pixel coordinate is similar to the calculation method of the reference pixel coordinate, which is not described herein again.
S307: and determining the mapping relation between the reference pixel coordinate and the target pixel coordinate and using the mapping relation as a parameter mapping relation.
Further, mapping is carried out between the reference pixel coordinate and the target pixel coordinate, a mapping relation is determined, and the mapping relation is used as a parameter mapping relation.
In this embodiment, a reference camera and a target camera of a three-dimensional model may be used, so that a target pixel coordinate and a reference pixel coordinate may be determined, and a mapping relationship between the target pixel coordinate and the reference pixel coordinate is used as a parameter mapping relationship, so that the parameter mapping relationship may be combined with depth information of an image, thereby improving accuracy of mapping processing, and facilitating improvement of accuracy of a target detection frame.
S308: and determining a first pixel coordinate corresponding to the reference detection frame, wherein the first pixel coordinate is the coordinate of a reference pixel point contained in the reference detection frame.
In the operation of mapping the reference detection frame to the frame image to be processed, the present embodiment may determine the first pixel coordinate corresponding to the reference detection frame.
The coordinates of the reference pixel points included in the reference detection frame may be referred to as first pixel coordinates, that is, the pixel coordinates of the pixel points within the range of the reference detection frame are determined.
S309: and determining a second pixel coordinate according to the first pixel coordinate and the parameter mapping relation, wherein the second pixel coordinate is the coordinate of a target pixel point in the frame image to be processed.
Further, in the mapping process, the first pixel coordinates of the reference pixel points are mapped into the frame image to be processed according to the parameter mapping relationship, yielding the coordinates of the corresponding pixel points in the frame image to be processed. These corresponding pixel points may be called target pixel points, and their coordinates may be called second pixel coordinates. The second pixel coordinates in the frame image to be processed can therefore be determined from the first pixel coordinates and the parameter mapping relationship, and the target pixel points can be located from the second pixel coordinates.
S310: and taking a detection frame containing the coordinates of the target pixel points in the frame image to be processed as a target detection frame.
That is, the bounding box determined from the coordinates of the target pixel points (the second pixel coordinates) may be referred to as the target detection frame. This embodiment can therefore determine the coordinates of the target pixel points in the frame image to be processed from the coordinates of the pixel points in the reference detection frame and the parameter mapping relationship, and then determine the target detection frame from those coordinates, improving the accuracy of the target detection frame.
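Steps S308 to S310 can be sketched as taking the axis-aligned bounding box of the mapped (second) pixel coordinates; the function name and sample coordinates below are illustrative assumptions:

```python
import numpy as np

def box_from_points(points):
    """Given the second pixel coordinates (the mapped positions of the
    reference detection frame's pixel points), take their axis-aligned
    bounding box as the target detection frame."""
    pts = np.asarray(points, dtype=float)
    x1, y1 = pts.min(axis=0)   # top-left corner
    x2, y2 = pts.max(axis=0)   # bottom-right corner
    return (float(x1), float(y1), float(x2), float(y2))

# Four mapped pixel points (hypothetical values):
mapped = [(101.5, 53.2), (160.8, 53.9), (101.9, 120.4), (161.2, 121.0)]
print(box_from_points(mapped))   # -> (101.5, 53.2, 161.2, 121.0)
```

Taking the min/max over all mapped points keeps the target detection frame axis-aligned even when the cross-camera mapping slightly rotates or skews the original rectangle.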
In this embodiment, a reference frame image and a frame image to be processed, which have different modalities, are acquired; a reference detection frame is identified in the reference frame image and mapped into the frame image to be processed to obtain a target detection frame used for image detection. Detection frames can thus be accurately mapped between images of different modalities, and when the mapped target detection frame is used for image detection, the image detection accuracy and recall rate in complex scenes are effectively improved, as is the effectiveness of the image recognition system. In addition, a reference camera and a target camera supporting a three-dimensional model may be used, so that the target pixel coordinates and the reference pixel coordinates can be determined and the mapping relationship between them used as the parameter mapping relationship; the parameter mapping relationship thereby incorporates the depth information of the images, improving the accuracy of the mapping process. Furthermore, the coordinates of the target pixel points in the frame image to be processed can be determined from the coordinates of the pixel points in the reference detection frame and the parameter mapping relationship, and the target detection frame can then be determined from those coordinates, improving the accuracy of the target detection frame.
Fig. 4 is a schematic diagram according to a third embodiment of the present disclosure.
As shown in fig. 4, the multi-modality based image detection method includes:
s401: and acquiring a reference frame image and a frame image to be processed, wherein the reference frame image and the frame image to be processed have different modes.
S402: a reference detection frame is identified from among the reference frame images.
S403: and acquiring reference camera shooting parameters of the reference camera and acquiring target camera shooting parameters of the target camera.
S404: and determining a parameter mapping relation between the reference imaging parameter and the target imaging parameter.
S405: and mapping the reference detection frame into the frame image to be processed according to the parameter mapping relation so as to obtain a target detection frame.
For the description of S401 to S405, reference may be made to the above embodiments, which are not described herein again.
S406: and acquiring the detection frame coordinates of the target detection frame.
After the target detection frame is obtained, further, the detection frame coordinates of the target detection frame can be obtained.
For example, the coordinates of the detection frame may be determined according to the coordinates of the pixel points in the target detection frame, or the coordinates of the detection frame may also be determined in any other possible manner, which is not limited to this.
S407: and generating target shooting parameters according to the coordinates of the detection frame.
The target imaging parameter may be a parameter of the camera, for example, an Auto Exposure parameter (AE) of a visible light camera (RGB camera), or may also be a parameter of a near infrared camera (NIR camera), or may also be any other possible imaging parameter, which is not limited in this respect.
S408: and controlling the target camera to capture the next frame image of the frame image to be processed based on the target camera shooting parameters.
Further, the target camera is controlled to capture the next frame image of the frame image to be processed based on the target imaging parameters. That is, the parameters of the target camera are automatically adjusted to the target imaging parameters, and the next frame image of the frame image to be processed is then captured; more specifically, the detection frame region of the next frame image can be captured.
For example, when the target camera is an RGB camera provided with a FaceAE module, the detection frame coordinates can be input to the FaceAE module, and the FaceAE module can then adjust the auto exposure parameters of the RGB camera and capture the next frame image of the frame image to be processed.
Therefore, in the present embodiment, the parameters of the RGB camera can be adjusted for the detection frame region, that is, automatic exposure is realized, so that the imaging quality of the detection frame region in the image is improved. In addition, under complex lighting, the target detection frame of the RGB camera can be determined through detection frame mapping, and the auto exposure parameters can then be adjusted according to the coordinates of the target detection frame, which effectively solves the problem that the RGB camera cannot adjust itself from its initial state.
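As an illustration of this idea, a FaceAE-style adjustment can be sketched as a proportional exposure update driven by the brightness of the detection frame region (the target brightness, function names, and update rule are assumptions for illustration; an actual FaceAE module would be more elaborate):

```python
import numpy as np

TARGET_BRIGHTNESS = 128.0  # assumed mid-gray target for the face region


def adjust_exposure(gray_frame, box, current_exposure,
                    min_exp=1.0, max_exp=1000.0):
    """Scale the exposure so the detection frame region approaches the
    target brightness. `box` is (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    roi = gray_frame[y0:y1, x0:x1]
    mean = float(roi.mean()) if roi.size else TARGET_BRIGHTNESS
    # Proportional update: a dark face region raises exposure,
    # a bright one lowers it.
    new_exposure = current_exposure * (TARGET_BRIGHTNESS / max(mean, 1.0))
    return min(max(new_exposure, min_exp), max_exp)


frame = np.full((480, 640), 64, dtype=np.uint8)   # underexposed frame
print(adjust_exposure(frame, (100, 100, 300, 300), current_exposure=10.0))
# 20.0  (exposure doubled because the region is half the target brightness)
```

The key point mirrored from the text is that the update is computed only over the detection frame region, not the whole image, so a face against a bright or dark background is still exposed correctly.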
In this embodiment, the reference frame image and the frame image to be processed are obtained, wherein the two have different modalities; the reference detection frame is identified from the reference frame image and mapped into the frame image to be processed to obtain the target detection frame, which is used for image detection. Accurate mapping of detection frames between images of different modalities can thus be achieved, and when the target detection frame obtained through mapping is used for image detection, the accuracy and recall of image detection in complex scenes can be effectively improved, improving the effectiveness of the image recognition system. In addition, the parameters of the RGB camera can be adjusted for the detection frame region, that is, automatic exposure is realized, so that the imaging quality of the detection frame region in the image is improved. Moreover, under complex lighting, the target detection frame of the RGB camera can be determined through detection frame mapping, and the auto exposure parameters can then be adjusted according to the coordinates of the target detection frame, effectively solving the problem that the RGB camera cannot adjust itself from its initial state.
In addition, fig. 5 is a schematic structural diagram of an image detection system provided according to an embodiment of the present disclosure. As shown in fig. 5, the image detection system mainly includes a visible light camera (RGB), a near infrared camera (NIR), a computing processing board, and basic peripherals. The basic peripherals may include a data hard disk, a near-infrared fill light, a touch screen, a buzzer, an external keyboard, and the like. The computing processing board is provided with a Central Processing Unit (CPU), a Random Access Memory (RAM), a Read-Only Memory (ROM), a power interface, a Wireless Fidelity (Wi-Fi) module, and peripheral interfaces such as a General-Purpose Input/Output interface (GPIO), an SD memory card (SD-Card) interface, a serial communication interface (RS232 serial port), a Universal Serial Bus interface (USB), and a network interface (RJ45 network port). The software of the image detection system includes Linux embedded operating system software, sensor driver software, and face detection system software. The software architecture of the face detection system includes, but is not limited to, an RGB face detection algorithm, an NIR face detection algorithm, a multi-modality face detection frame mapping module, and an RGB camera face auto exposure (FaceAE) module.
Fig. 6 is a schematic diagram of an image detection process according to an embodiment of the present disclosure. As shown in fig. 6, the multi-modality face detection frame mapping module may map the reference detection frame into the frame image to be processed. In a normal scene (for example, daytime), when an RGB camera supporting FaceAE is used, the RGB camera captures video data and transmits it to the RGB face detection algorithm; the algorithm detects the coordinates of the face position frame and feeds them back to FaceAE for automatic lens parameter adjustment. Meanwhile, the RGB face position frame coordinates are mapped onto the NIR camera video image at the same moment, so that face detection frame coordinates are obtained in both modalities. In a complex light scene (for example, at night), the RGB camera captures video data and transmits it to the RGB face detection algorithm, but with initialized lens parameters the algorithm cannot detect a face. In this case, the NIR camera captures video data and transmits it to the NIR face detection algorithm; compared with the RGB camera, the NIR camera is less affected by lighting and can still detect the face. The NIR face position frame coordinates are then mapped onto the RGB camera video image at the same moment, so that face detection frame coordinates are obtained in both modalities, and the RGB face frame coordinates are input to the RGB camera FaceAE for parameter optimization.
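The day/night flow described above can be sketched as a single control loop (all detector, mapping, and FaceAE callables here are hypothetical stubs; only the branching logic mirrors the description):

```python
def detect_faces_multimodal(rgb_frame, nir_frame, rgb_detect, nir_detect,
                            map_rgb_to_nir, map_nir_to_rgb, face_ae):
    """One iteration of the dual-camera flow: prefer RGB detection;
    fall back to NIR under complex light and map the boxes back to RGB."""
    rgb_boxes = rgb_detect(rgb_frame)
    if rgb_boxes:
        # Normal scene (e.g. daytime): feed FaceAE, then map RGB boxes to NIR.
        face_ae(rgb_boxes)
        nir_boxes = [map_rgb_to_nir(b) for b in rgb_boxes]
    else:
        # Complex light (e.g. night): NIR still detects; map boxes back to
        # RGB and use them to tune the RGB camera's FaceAE.
        nir_boxes = nir_detect(nir_frame)
        rgb_boxes = [map_nir_to_rgb(b) for b in nir_boxes]
        face_ae(rgb_boxes)
    return rgb_boxes, nir_boxes


# Night-time example with stub detectors and a pure-translation mapping:
boxes = detect_faces_multimodal(
    rgb_frame=None, nir_frame=None,
    rgb_detect=lambda f: [],                   # RGB fails in the dark
    nir_detect=lambda f: [(10, 10, 50, 50)],   # NIR still sees the face
    map_rgb_to_nir=lambda b: b,
    map_nir_to_rgb=lambda b: tuple(c + 5 for c in b),
    face_ae=lambda bs: None,
)
print(boxes)  # ([(15, 15, 55, 55)], [(10, 10, 50, 50)])
```

Either branch ends with detection frame coordinates available in both modalities, which is the invariant the mapping module is meant to maintain.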
Fig. 7 is a schematic diagram according to a fourth embodiment of the present disclosure.
As shown in fig. 7, the multi-modality based image detection apparatus 70 includes:
a first obtaining module 701, configured to obtain a reference frame image and a frame image to be processed, where modalities of the reference frame image and the frame image to be processed are different;
an identifying module 702, configured to identify a reference detection frame from the reference frame image; and
a mapping module 703, configured to map the reference detection frame into the frame image to be processed to obtain a target detection frame, where the target detection frame is used for image detection.
Optionally, in some embodiments of the present disclosure, the reference frame image is captured by a reference camera and the frame image to be processed is captured by a target camera. As shown in fig. 8, which is a schematic diagram of a fifth embodiment according to the present disclosure, the multi-modality based image detection apparatus 80 includes: a first obtaining module 801, an identifying module 802, and a mapping module 803, wherein the mapping module 803 includes:
the obtaining submodule 8031 is configured to obtain reference imaging parameters of the reference camera and target imaging parameters of the target camera;
a determining submodule 8032, configured to determine a parameter mapping relationship between the reference imaging parameter and the target imaging parameter; and
the mapping submodule 8033 is configured to map the reference detection frame into the frame image to be processed according to the parameter mapping relationship, so as to obtain a target detection frame.
Optionally, in some embodiments of the present disclosure, the determining submodule 8032 is specifically configured to:
acquire reference depth information of the reference camera and target depth information of the target camera;
determine reference pixel coordinates in the reference frame image according to the reference imaging parameters and the reference depth information in combination with a world coordinate system;
determine target pixel coordinates in the frame image to be processed according to the target imaging parameters and the target depth information in combination with the world coordinate system; and
determine a mapping relation between the reference pixel coordinates and the target pixel coordinates as the parameter mapping relation.
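A minimal sketch of this pixel mapping via the world coordinate system, assuming a pinhole camera model with known 3x3 intrinsic matrices and 4x4 extrinsic matrices (the matrix shapes, function names, and numbers are illustrative assumptions, not the disclosure's notation):

```python
import numpy as np


def pixel_to_world(pixel, depth, K, cam_from_world):
    """Back-project a pixel with known depth into world coordinates.
    K: 3x3 intrinsic matrix; cam_from_world: 4x4 extrinsic matrix."""
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    p_cam = depth * ray                                # point in camera frame
    return (np.linalg.inv(cam_from_world) @ np.append(p_cam, 1.0))[:3]


def world_to_pixel(p_world, K, cam_from_world):
    """Project a world point into a camera's image plane."""
    p_cam = (cam_from_world @ np.append(p_world, 1.0))[:3]
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]


def map_pixel(pixel, depth, K_ref, ext_ref, K_tgt, ext_tgt):
    """Reference pixel -> world coordinate system -> target pixel,
    i.e. the parameter mapping relation between the two cameras."""
    return world_to_pixel(pixel_to_world(pixel, depth, K_ref, ext_ref),
                          K_tgt, ext_tgt)


# Example: identical intrinsics, target camera offset 10 cm along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
ext_ref = np.eye(4)
ext_tgt = np.eye(4)
ext_tgt[0, 3] = -0.1                # x_cam = x_world - 0.1
print(map_pixel((320.0, 240.0), 2.0, K, ext_ref, K, ext_tgt))  # [295. 240.]
```

The depth information is what makes the mapping well defined: without it, a reference pixel corresponds to a whole ray in the world coordinate system rather than a single point.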
Optionally, in some embodiments of the present disclosure, the mapping sub-module 8033 is specifically configured to:
determine first pixel coordinates corresponding to the reference detection frame, wherein the first pixel coordinates are coordinates of reference pixel points contained in the reference detection frame;
determine second pixel coordinates according to the first pixel coordinates and the parameter mapping relation, wherein the second pixel coordinates are coordinates of target pixel points in the frame image to be processed; and
take a detection frame containing the coordinates of the target pixel points in the frame image to be processed as the target detection frame.
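The sub-module's logic can be sketched by mapping the corners of the reference detection frame with a per-pixel mapping function and taking the box enclosing the mapped target pixel points (the function names and the pure-translation example mapping are illustrative assumptions):

```python
def map_detection_frame(box, map_pixel_fn):
    """Map a reference detection frame into the frame image to be
    processed: map each corner through the parameter mapping relation,
    then take the axis-aligned box enclosing the mapped target pixels."""
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    mapped = [map_pixel_fn(c) for c in corners]
    xs = [p[0] for p in mapped]
    ys = [p[1] for p in mapped]
    return min(xs), min(ys), max(xs), max(ys)


# Example with an assumed pure-translation mapping of +40 px horizontally:
shift = lambda p: (p[0] + 40, p[1])
print(map_detection_frame((100, 60, 220, 200), shift))
# (140, 60, 260, 200)
```

Mapping only the corners is sufficient for an axis-aligned result; mapping every contained pixel point, as the text allows, gives the same enclosing box at higher cost.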
Optionally, in some embodiments of the present disclosure, as shown in fig. 8, the apparatus 80 further comprises:
a second obtaining module 804, configured to obtain the detection frame coordinates of the target detection frame;
a generating module 805, configured to generate target imaging parameters according to the detection frame coordinates;
and a capturing module 806, configured to control the target camera to capture the next frame image of the frame image to be processed based on the target imaging parameters.
It can be understood that, in the multi-modality based image detection apparatus 80 of fig. 8, the first obtaining module 801, the identifying module 802, and the mapping module 803 may have the same functions and structures as the first obtaining module 701, the identifying module 702, and the mapping module 703 of the multi-modality based image detection apparatus 70 in the above embodiment.
It should be noted that the foregoing explanation of the multi-modality based image detection method is also applicable to the multi-modality based image detection apparatus of the present embodiment, and is not repeated herein.
In this embodiment, the reference frame image and the frame image to be processed are obtained, wherein the two have different modalities; the reference detection frame is identified from the reference frame image and mapped into the frame image to be processed to obtain the target detection frame, which is used for image detection. Accurate mapping of detection frames between images of different modalities can thus be achieved, and when the target detection frame obtained through mapping is used for image detection, the accuracy and recall of image detection in complex scenes can be effectively improved, improving the effectiveness of the image recognition system.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 9 is a block diagram of an electronic device for implementing the multi-modality based image detection method according to the embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, for example, the multi-modality based image detection method.
For example, in some embodiments, the multi-modality based image detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the multi-modality based image detection method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the multi-modality based image detection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the multi-modality based image detection method of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the defects of high management difficulty and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (15)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110859555.1A CN113642425A (en) | 2021-07-28 | 2021-07-28 | Image detection method, device, electronic device and storage medium based on multimodality |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN113642425A true CN113642425A (en) | 2021-11-12 |
Family
ID=78418792
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110859555.1A Pending CN113642425A (en) | 2021-07-28 | 2021-07-28 | Image detection method, device, electronic device and storage medium based on multimodality |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113642425A (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110334708A (en) * | 2019-07-03 | 2019-10-15 | 中国科学院自动化研究所 | Method, system and device for automatic calibration of differences in cross-modal target detection |
| CN110677585A (en) * | 2019-09-30 | 2020-01-10 | Oppo广东移动通信有限公司 | Target detection frame output method and device, terminal and storage medium |
| WO2020143179A1 (en) * | 2019-01-08 | 2020-07-16 | 虹软科技股份有限公司 | Article identification method and system, and electronic device |
| CN111767868A (en) * | 2020-06-30 | 2020-10-13 | 创新奇智(北京)科技有限公司 | Face detection method and device, electronic equipment and storage medium |
| CN112132874A (en) * | 2020-09-23 | 2020-12-25 | 西安邮电大学 | Calibration-board-free different-source image registration method and device, electronic equipment and storage medium |
- 2021-07-28: CN202110859555.1A, patent CN113642425A/en, status Pending
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114120121A (en) * | 2021-11-29 | 2022-03-01 | 北京百度网讯科技有限公司 | Target processing method, device, equipment and storage medium |
| CN116977672A (en) * | 2022-04-19 | 2023-10-31 | 追觅创新科技(苏州)有限公司 | Matching method and device, storage medium and electronic device |
| CN115131825A (en) * | 2022-07-14 | 2022-09-30 | 北京百度网讯科技有限公司 | Human body attribute identification method and device, electronic equipment and storage medium |
| CN115240225A (en) * | 2022-07-29 | 2022-10-25 | 汇纳科技股份有限公司 | Pedestrian recognition method, system, device and storage medium based on radio frequency waves |
| CN117557776A (en) * | 2023-11-10 | 2024-02-13 | 朗信医疗科技(无锡)有限公司 | Multi-mode target detection method and device |
| CN117557776B (en) * | 2023-11-10 | 2024-07-12 | 朗信医疗科技(无锡)有限公司 | Multi-mode target detection method and device |
| CN120125886A (en) * | 2025-02-20 | 2025-06-10 | 中国科学院空间应用工程与技术中心 | Inkjet nozzle detection method, system and 3D printer based on machine learning |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110210571B (en) | Image recognition method and device, computer equipment and computer readable storage medium | |
| CN113642425A (en) | Image detection method, device, electronic device and storage medium based on multimodality | |
| JP7286208B2 (en) | Biometric face detection method, biometric face detection device, electronic device, and computer program | |
| CN108989678B (en) | Image processing method and mobile terminal | |
| WO2020253657A1 (en) | Video clip positioning method and apparatus, computer device, and storage medium | |
| WO2021082635A1 (en) | Region of interest detection method and apparatus, readable storage medium and terminal device | |
| WO2022116423A1 (en) | Object posture estimation method and apparatus, and electronic device and computer storage medium | |
| CN112005548B (en) | Method of generating depth information and electronic device supporting the same | |
| CN113343826A (en) | Training method of human face living body detection model, human face living body detection method and device | |
| EP4033458A2 (en) | Method and apparatus of face anti-spoofing, device, storage medium, and computer program product | |
| CN115205939B (en) | Training method and device for human face living body detection model, electronic equipment and storage medium | |
| JP2021531601A (en) | Neural network training, line-of-sight detection methods and devices, and electronic devices | |
| CN115049819A (en) | Watching region identification method and device | |
| CN114067394A (en) | Face living body detection method and device, electronic equipment and storage medium | |
| CN114038045A (en) | Cross-modal face recognition model construction method and device and electronic equipment | |
| CN109886864A (en) | Privacy covers processing method and processing device | |
| CN110910445B (en) | Object size detection method, device, detection equipment and storage medium | |
| WO2025185480A1 (en) | Feature extraction model training method, feature extraction model application method, and related apparatus | |
| CN112446322A (en) | Eyeball feature detection method, device, equipment and computer-readable storage medium | |
| CN113454684B (en) | Key point calibration method and device | |
| CN112818733B (en) | Information processing method, device, storage medium and terminal | |
| CN114494857A (en) | Indoor target object identification and distance measurement method based on machine vision | |
| CN112270303A (en) | Image recognition method, device and electronic device | |
| CN113989903B (en) | Face liveness detection method, device, electronic equipment and storage medium | |
| CN116958406A (en) | A three-dimensional face reconstruction method, device, electronic equipment and storage medium |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20211112 |