Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. The solutions provided by the embodiments of the present application relate to artificial intelligence technologies such as computer vision and machine learning. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the embodiments of the present invention, the construction of the depth of field estimation model specifically relates to Machine Learning (ML) in artificial intelligence. Machine learning is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It specializes in studying how a computer simulates or implements human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
The embodiments of the present invention provide a scene type determination method and apparatus, an electronic device, and a storage medium. Specifically, the embodiments of the present invention provide a scene type determination method suitable for a scene type determination apparatus, which may be integrated in an electronic device.
The electronic device may be a terminal or other devices, including but not limited to a mobile terminal and a fixed terminal, for example, the mobile terminal includes but is not limited to a smart phone, a smart watch, a tablet computer, a notebook computer, a smart car, and the like, wherein the fixed terminal includes but is not limited to a desktop computer, a smart television, and the like.
The electronic device may also be a device such as a server, and the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The scene type judging method of the embodiment of the invention can be realized by a server, and can also be realized by a terminal and the server together.
The following describes the method by taking an example in which the terminal and the server jointly implement the scene type determination method.
As shown in fig. 1, the scene type determination system provided by the embodiment of the present invention includes a terminal 10, a server 20, and the like; the terminal 10 and the server 20 are connected through a network, for example, a wired or wireless network connection, wherein the terminal 10 may serve as the device that transmits an image to be determined to the server 20.
The server 20 may be configured to obtain a depth-of-field estimation model, extract a depth feature map of an image to be determined through the depth-of-field estimation model, calculate target depth-of-field information of the image to be determined based on depth-of-field information of each pixel in the depth feature map, and determine a target scene type corresponding to the image to be determined based on a preset correspondence between the depth-of-field information and the scene type and the target depth-of-field information.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiments of the present invention will be described from the perspective of a scene type determination device, which may be specifically integrated in a server or a terminal.
As shown in fig. 2, a specific flow of the scene type determination method of this embodiment may be as follows:
201. A depth of field estimation model is obtained, wherein the depth of field estimation model is trained based on an unlabeled sample image and a depth of field estimation sample image corresponding to the sample image. The depth of field estimation sample image is obtained by the depth of field estimation model extracting a sample depth feature map from the sample image and then performing image reconstruction based on the sample depth feature map, and the sample depth feature map includes depth of field information of each pixel point in the sample image.
The sample image may be an image without labeled depth information and/or a label for indicating a scene type of the sample image, the content in the sample image may be a person, a plant, or the like, and the scene of the image may also be indoor or outdoor.
In order to enable the depth of field estimation model to perform depth of field estimation on images, Computer Vision (CV) technology is also applied in the construction process. Computer vision is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to identify, track, and measure targets, and to further process the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems capable of acquiring information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Wherein the depth of field estimation model is obtained by pre-training. Through the pre-training process, parameters and the like of the depth of field estimation model can be adjusted, so that the depth of field estimation model can achieve better depth of field estimation performance. Specifically, before step 201, the method may further include:
acquiring a depth of field estimation model to be trained and a sample image, wherein the sample image does not have a label for indicating the scene type of the sample image, and the depth of field estimation model comprises a feature extraction layer and an image reconstruction layer;
performing feature extraction and depth of field analysis on pixel points in the sample image through a feature extraction layer to obtain a sample depth feature map corresponding to the sample image, wherein the sample depth feature map comprises feature information and depth of field information of each pixel point in the sample image;
performing, through the image reconstruction layer, image reconstruction based on the feature information of each pixel point in the sample depth feature map to obtain a reconstructed image corresponding to the sample image;
determining the loss of the depth of field estimation model according to the sample image and the reconstructed image;
and adjusting parameters of the depth of field estimation model based on the loss to obtain the trained depth of field estimation model.
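To make the pre-training flow above concrete, the following is a minimal, illustrative sketch in PyTorch-style Python. The layer sizes, module names, and the plain L1 reconstruction loss are assumptions for illustration only, not the exact architecture or loss of the embodiment; the pose-based image loss and edge loss described later would replace the placeholder loss in practice.

```python
# Minimal sketch of the pre-training loop: extract a depth feature map,
# reconstruct the sample image from it, and adjust parameters from the loss.
import torch
import torch.nn as nn

class DepthEstimationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extraction layer: produces feature information for each pixel point.
        self.feature_extraction = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        # Depth head: per-pixel depth of field information (the sample depth feature map).
        self.depth_head = nn.Conv2d(32, 1, 3, padding=1)
        # Image reconstruction layer: rebuilds the image from features and depth.
        self.image_reconstruction = nn.Sequential(
            nn.Conv2d(32 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        feats = self.feature_extraction(x)
        depth = self.depth_head(feats)
        recon = self.image_reconstruction(torch.cat([feats, depth], dim=1))
        return depth, recon

model = DepthEstimationModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(sample_image):
    """sample_image: unlabeled tensor of shape (N, 3, H, W) with values in [0, 1]."""
    depth, reconstructed = model(sample_image)
    # Placeholder reconstruction loss between the sample image and the reconstructed image.
    loss = nn.functional.l1_loss(reconstructed, sample_image)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # adjust model parameters based on the loss
    return loss.item(), depth
```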
The configuration of the feature extraction layer may specifically include the number of feature extraction layers used for extracting feature information in the depth estimation model, the number of input channels of each feature extraction layer, and the like. For example, if the feature extraction layers of the depth estimation model are convolutional layers, the configuration may include the number of convolutional layers, the size of the convolution kernels in each convolutional layer, and/or the number of input channels corresponding to each convolutional layer, and so on.
It is understood that there may be only one feature extraction layer for extracting features in the depth estimation model, or there may be multiple feature extraction layers for performing feature extraction on the sample image. When multiple feature extraction layers exist in the depth of field estimation model, as the extraction operations proceed through successive feature extraction layers, the sample depth feature map extracted by each layer contains less and less content information about the sample image and more and more feature information.
In an example, in order to make the depth of field estimation sample image obtained after pixel reconstruction closer to the sample image and to reduce errors in determining the loss of the depth of field estimation model, the depth feature maps corresponding to different feature extraction modules may be combined for pixel reconstruction. Optionally, the feature extraction layer includes a depth of field analysis module and at least two feature extraction modules connected in sequence, and the step "performing feature extraction and depth of field analysis on the pixel points in the sample image through the feature extraction layer to obtain the sample depth feature map corresponding to the sample image" may include:
performing feature extraction on pixel points in the sample image through a feature extraction module to obtain at least two intermediate feature maps of the sample image, wherein the intermediate feature maps comprise feature information of the pixel points;
and determining the depth information of the pixel points based on the characteristic information of the pixel points in the intermediate characteristic graph through a depth analysis module, and obtaining a sample depth characteristic graph corresponding to the sample image based on the depth information and the intermediate characteristic graph.
Correspondingly, the step of reconstructing an image based on the feature information of each pixel point in the sample depth feature map through the image reconstruction layer to obtain a reconstructed image corresponding to the sample image includes:
and performing, through the image reconstruction layer, image reconstruction based on the feature information of the pixel points in the intermediate feature map and the sample depth feature map to obtain a reconstructed image corresponding to the sample image.
For example, if the feature extraction layer extracts features in the sample image in a convolution manner, and then synthesizes a sample depth feature map according to the features, the inverse operation may be to decompose the features in the sample depth feature map, and then perform deconvolution, so as to realize image restoration; for another example, if the feature extraction layer extracts features in the sample image by an upsampling method and then synthesizes the sample depth feature map according to the features, the inverse operation may be to decompose the features in the sample depth feature map, perform downsampling, restore the sample image, obtain a depth-of-field estimated sample image, and so on.
The process of reconstructing an image by combining the intermediate depth feature maps corresponding to different feature extraction modules may be as shown in fig. 3. When each image reconstruction layer performs image reconstruction, the reconstruction is carried out according to the intermediate feature map of the corresponding feature extraction module and the intermediate reconstructed image output by the previous image reconstruction layer.
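The following is a minimal sketch, under assumed channel counts and with two feature extraction modules, of how each image reconstruction layer may combine the intermediate feature map of its matching feature extraction module with the intermediate reconstructed image output by the previous reconstruction layer; it is illustrative only and does not represent the exact structure of fig. 3.

```python
# Skip-connected reconstruction: each reconstruction layer consumes the matching
# intermediate feature map plus the previous reconstruction layer's output.
import torch
import torch.nn as nn

class EncoderDecoderWithSkips(nn.Module):
    def __init__(self):
        super().__init__()
        # Two feature extraction modules connected in sequence.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Depth of field analysis module operating on the deepest feature map.
        self.depth_head = nn.Conv2d(64, 1, 3, padding=1)
        # Image reconstruction layers mirroring the feature extraction modules.
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64 + 1, 32, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(32 + 32, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, image):
        f1 = self.enc1(image)            # first intermediate feature map
        f2 = self.enc2(f1)               # second intermediate feature map
        depth = self.depth_head(f2)      # per-pixel depth of field information
        # Reconstruction layer 2 uses its matching feature map f2 and the depth map.
        r2 = self.dec2(torch.cat([f2, depth], dim=1))
        # Reconstruction layer 1 uses its matching feature map f1 and the
        # intermediate reconstructed image r2 from the previous layer.
        recon = self.dec1(torch.cat([r2, f1], dim=1))
        return depth, recon
```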
In practical application, the sample image may be a single image shot by a monocular camera, or a frame in a sequence of photographs or a video shot by a monocular camera or a binocular camera. When the sample image is a single image shot by a monocular camera, the relative pose needs to be predicted by performing pose transformation on the photographic subject in the sample image, and the loss of the model is computed according to the relative pose. Optionally, the step of determining the loss of the depth estimation model according to the sample image and the reconstructed image may include:
carrying out pose transformation on the shooting object in the sample image based on the pose transformation parameters to obtain a first pose estimation image corresponding to the sample image;
performing pose restoration on the first pose estimation image based on the pose transformation parameters and the reconstructed image to obtain a restored image;
calculating image loss based on image difference between the sample image and the restored image;
based on the image loss, a loss of the depth estimation model is determined.
When the pose of the shooting object in the sample image is transformed, the random pose transformation of the shooting object in the sample image can be performed by a pose transformation algorithm or a pose transformation network, or a reference image of the shooting object in the sample image shot from another arbitrary angle can be obtained, and the pose of the shooting object is estimated according to the sample image and the reference image.
Specifically, the process of calculating the pose estimation error can be represented by the following formula:

$L_p = pe(I_t, I_{t' \to t})$

where $L_p$ represents the pose estimation error, $I_t$ represents the sample image, and $I_{t' \to t}$ is calculated as follows:

$I_{t' \to t} = I_{t'}\left\langle \operatorname{proj}(D_t, T_{t \to t'}, K) \right\rangle$

where $D_t$ represents the reconstructed image and $K$ represents a preset camera intrinsic parameter. For simplicity, it is assumed that the intrinsic parameters $K$ calculated in advance are the same for all sample images; these may differ from the intrinsic parameters of the camera that actually shot the sample images, and a technician may set the parameters according to the usage conditions of the depth of field estimation model, which is not limited herein. $I_{t'}$ represents the first pose estimation image, and $T_{t \to t'}$ represents the pose transformation parameter between the sample image and the first pose estimation image; for example, the pose transformation parameter may be a pose transformation matrix representing the pose change of the photographic subject. $I_{t' \to t}$ represents $I_t$ reconstructed from $I_{t'}$ based on the depth image $D_t$ and the pose transformation parameter $T_{t \to t'}$ of $I_{t'}$ relative to $I_t$, i.e., the restored image.

Here $pe$ is an error calculation function; the specific calculation of $pe$ can be implemented according to the structural similarity algorithm (SSIM) and/or the Manhattan distance (L1 norm). $\operatorname{proj}(\cdot)$ denotes the resulting 2D coordinates of the depth $D_t$ projected onto $I_{t'}$, and $\langle \cdot \rangle$ is the sampling operation.
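As an illustration of the error calculation function $pe$ described above, the following sketch combines a simplified SSIM term with an L1 (Manhattan distance) term; the 3x3 averaging window and the 0.85 weighting are assumptions chosen for illustration, not values fixed by the embodiment.

```python
# Photometric error pe combining SSIM and L1 between the sample image and the restored image.
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighbourhoods for images with values in [0, 1]."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def pe(img_t, img_restored, alpha=0.85):
    """Per-pixel error between the sample image I_t and the restored image I_{t'->t}."""
    l1 = (img_t - img_restored).abs().mean(1, keepdim=True)               # Manhattan distance term
    ssim_term = ((1 - ssim(img_t, img_restored)) / 2).mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1

# L_p can then be taken as the mean of pe over the image:
# L_p = pe(I_t, I_t_prime_to_t).mean()
```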
In another embodiment, in order to improve the accuracy of the depth of field information obtained by the depth of field estimation model, an edge loss may be computed for the photographic subjects in the sample image and the depth of field estimation sample image, and the loss of the depth of field estimation model may be computed from the edge loss. Optionally, the step of determining the loss of the depth estimation model according to the sample image and the reconstructed image may include:
performing edge detection on each shot object in the sample image and the depth estimation sample image to obtain a first edge of each shot object in the sample image and a second edge of each shot object in the depth estimation sample image;
and determining the edge loss of the depth estimation model according to the difference between the first edge and the second edge corresponding to the same shooting object.
Specifically, the edge loss, denoted $L_s$, is calculated from the difference between the first edge and the second edge corresponding to the same photographic subject.
The step of performing edge detection on each object in the sample image and the depth-of-field estimation sample image may include:
filtering the sample image and the depth estimation sample image to obtain smoothed images;
calculating the amplitude and the direction of the gradient by using the finite difference of the first-order partial derivatives of the smoothed sample image and the depth-of-field estimation sample image;
carrying out non-maximum suppression on the gradient amplitude;
and detecting and connecting the edges of the sample image and the depth estimation sample image by adopting a threshold algorithm.
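The edge detection steps above follow the general Canny procedure; the following sketch shows one possible implementation using OpenCV, where the Gaussian kernel size and the two thresholds are illustrative assumptions and cv2.Canny internally performs the gradient computation, non-maximum suppression, and threshold-based edge linking described above.

```python
# Edge detection for the sample image and the depth estimation sample image.
import cv2

def detect_edges(image_path, low_thresh=50, high_thresh=150):
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    smoothed = cv2.GaussianBlur(image, (5, 5), sigmaX=1.4)   # filtering to obtain a smooth image
    edges = cv2.Canny(smoothed, low_thresh, high_thresh)     # gradient + NMS + threshold linking
    return edges

# first_edge = detect_edges("sample_image.png")
# second_edge = detect_edges("depth_estimation_sample_image.png")
```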
In another example, the process of edge detection may also be implemented in combination with an edge detection algorithm, such as a Canny edge detection algorithm, a Marr-Hildreth algorithm, and the like, and the implementation manner of edge detection is not limited, which is not limited by the embodiment of the present invention.
Optionally, in order to improve the accuracy of the depth estimation model, the loss of the depth estimation model may be calculated in combination with the edge loss and the image loss, for example, the loss calculation mode of the depth estimation model may be represented by the following formula:
$L = \mu L_p + \lambda L_s$
where $\mu$ and $\lambda$ are the coefficients of the image loss and the edge loss, respectively, and these coefficients can be continuously corrected during model training.
202. And extracting a depth feature map of the image to be judged through the depth estimation model, wherein the depth feature map comprises depth information of each pixel point in the image to be judged.
In an embodiment, the depth feature map may be an image generated according to the extracted feature information of the image to be determined after feature extraction is performed on the image to be determined by the depth estimation model.
The features extracted by the depth estimation model from the image to be judged may be local features of the image, such as contrast, detail definition, shadow, and highlight, which reflect local particularities of the image; they may also be global features of the image, such as line features, color features, texture features, and structural features.
Specifically, a convolution network may be set in the depth-of-field estimation model to extract the image features, and the step "extracting the depth feature map of the image to be determined through the depth-of-field estimation model" may include:
extracting local features and global features of the image to be judged based on the convolution network of the depth of field estimation model;
and performing feature synthesis according to the local features and the global features to obtain a depth feature map of the image to be judged.
The convolutional network is a network structure capable of extracting features of an image, for example, the convolutional network may include a convolutional layer, and the convolutional layer may extract features of the image through a convolution operation.
In an optional example, the convolutional network may include a plurality of convolutional layers, each convolutional layer has at least one convolution unit, different convolution units may extract different features, when feature extraction is performed on the convolutional layers, the image to be determined is scanned by the convolution units, and different features are learned by different convolution kernels to extract original image features.
In another example, the depth of field estimation model includes a feature extraction layer, and step 202 may include: performing feature extraction and depth analysis on the pixel points in the image to be judged through the feature extraction layer of the depth of field estimation model to obtain the depth feature map corresponding to the image to be judged.
The depth of field analysis process may be conversion according to the feature values of the pixel points to obtain the relative depth of field information of the image to be determined.
It can be understood that, if the depth estimation model is trained by using the sample image labeled with the actual depth value, the target depth information output by the depth estimation model may be the actual depth value of the image.
In some examples, the step of "extracting the depth feature map of the image to be determined through the depth of field estimation model" may also be to apply a Scale-invariant feature transform (SIFT) feature extraction method, a Histogram of Oriented Gradients (HOG) feature extraction method, or the like, in the depth of field estimation model.
203. And calculating to obtain target depth-of-field information of the image to be judged based on the depth-of-field information of each pixel point in the depth feature map.
When the target depth of field information is calculated, the depth of field information of each pixel point in the image to be judged can be directly summed to obtain the target depth of field information.
In another example, the depth of field information of the pixel points in the image to be judged may be summed and then divided by the number of pixel points in the image to be judged, and the resulting average value may be used as the target depth of field information.
Optionally, different calculation weights may be set for the photographic subject in the image to be judged, and then the depth of field information corresponding to different pixel points is subjected to weighted calculation, for example, the pixel point weight of the target photographic subject in the image to be judged may be set to be the highest, and the pixel point weights of other photographic subjects may be adjusted according to the area ratios of the other photographic subjects, and so on.
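The following sketch illustrates the plain averaging and the weighted variant described above for computing the target depth of field information; the specific weight values are assumptions chosen for illustration.

```python
# Computing target depth of field information from per-pixel depth of field information.
import numpy as np

def target_depth_mean(depth_map):
    """Average the depth of field information of all pixel points."""
    return float(depth_map.mean())

def target_depth_weighted(depth_map, subject_mask, subject_weight=0.7):
    """Weight pixel points of the target photographic subject higher than the rest.

    depth_map: (H, W) array of per-pixel depth of field information.
    subject_mask: (H, W) boolean array, True for the target photographic subject.
    """
    weights = np.where(subject_mask, subject_weight, 1.0 - subject_weight)
    return float((depth_map * weights).sum() / weights.sum())
```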
204. And determining the target scene type corresponding to the image to be judged based on the preset corresponding relation between the depth of field information and the scene type and the target depth of field information.
The correspondence between the depth information and the scene type may be set by a technician according to the actual application scene.
In the practical application process, if the scene type of a certain video needs to be judged, the scene type of each video frame in the video can be judged first, and meanwhile, in order to enable the result on the video to be more continuous and smooth, the scene type of the video can be determined according to the scene type of each video frame in the video. Specifically, the step of extracting the depth feature map of the image to be determined through the depth-of-field estimation model may include:
respectively taking each video frame in the video to be judged as an image to be judged, and extracting a depth feature map of each image to be judged through a depth estimation model;
correspondingly, the embodiment of the invention further comprises:
calculating the similarity between the images to be judged in the video to be judged;
determining at least one group of video frames in the video to be judged according to the similarity, wherein the similarity between the video frames in each group of video frames is greater than a preset similarity threshold;
if continuous video frames in the video to be judged exist in each group of video frames, determining the continuous video frames as a group of target video frames;
determining the scene type of each group of target video frames according to the target scene type corresponding to each image to be judged in each group of target video frames;
and adding corresponding scene type labels for each group of target video frames according to the scene type of each group of target video frames.
Each video frame in the video to be judged can have time information such as a timestamp, and when the continuous video frames are determined, whether two adjacent video frames are continuous or not can be determined directly according to the timestamps of the video frames;
or each video frame in the video to be judged can have a corresponding sequence number, and when the continuous video frames are determined, the video frames are continuous if the numbers of two adjacent video frames are continuous according to the number of each video frame;
alternatively, the determination may be performed directly according to the similarity between the video frames, and if the similarity between two adjacent video frames is greater than a determination threshold, for example, greater than 99%, the two video frames may be considered to be continuous.
In one example, when at least one group of video frames in the video to be determined is determined, two adjacent video frames in the video to be determined may be sequentially compared, if the similarity between the two video frames is greater than a preset similarity threshold, the two video frames are classified as the same group of video frames, and if the similarity between the two video frames is not greater than the preset similarity threshold, the two video frames are classified as different groups of video frames respectively.
For example, suppose video frames A, B, C, D, and E exist in the video to be judged. If the similarity between A and B is greater than the preset similarity threshold, A and B are placed in group one; the similarity between B and C is then calculated, and when it is greater than the preset similarity threshold, C is also placed in group one; if the similarity between C and D is not greater than the preset similarity threshold, D is placed in group two; and if the similarity between D and E is not greater than the preset similarity threshold, E is placed in group three. After this grouping is finished, similarity calculation is performed again on the video frames in groups one, two, and three until the grouping of the video frames no longer changes.
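The following sketch illustrates a simplified, single-pass variant of the sequential grouping described above; the histogram-correlation similarity measure is an assumption, and any image similarity measure may be substituted.

```python
# Group temporally adjacent video frames whose similarity exceeds a threshold.
import cv2

def frame_similarity(frame_a, frame_b):
    hist_a = cv2.calcHist([frame_a], [0], None, [64], [0, 256])
    hist_b = cv2.calcHist([frame_b], [0], None, [64], [0, 256])
    return cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)

def group_frames(frames, threshold=0.9):
    """frames: list of grayscale frames in temporal order."""
    if not frames:
        return []
    groups = [[frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        if frame_similarity(prev, cur) > threshold:
            groups[-1].append(cur)   # same group as the previous frame
        else:
            groups.append([cur])     # start a new group
    return groups
```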
In another example, in order to improve the efficiency of computation and save computation resources, grouping video frames in a video to be determined may also be implemented in a clustering manner, and specifically, the step "determining at least one group of video frames in the video to be determined according to the similarity" may include:
determining a preset number of video frames as a central video frame of a cluster center point from all video frames of a video to be judged;
acquiring the similarity between a video frame in a video to be judged and each central video frame;
dividing video frames in a video to be judged into clustering clusters where corresponding target center video frames are located, wherein the similarity between the video frames and the corresponding target center video frames is not lower than a preset similarity threshold;
selecting a new central video frame of each cluster based on the video frames in each cluster, and returning to the step of acquiring the similarity between the video frame in the video to be judged and each central video frame until the clustering end condition is met;
and respectively determining the video frames in each cluster as a group of video frames.
In the embodiment of the present invention, the cluster is a set of a group of video frames generated based on the clustering process.
The central video frames are cluster centers determined when the video frames in the video to be judged are clustered, the number of the central video frames can be a preset fixed number, and the central video frames selected in the first clustering are generally selected randomly. Each time a new center video frame in a cluster is selected, another video frame may be selected in each cluster as the new center video frame.
In another example, the clustering process may be implemented in the form of a clustering model, and the clustering model may cluster the videos to be determined by adjusting the number of central video frames multiple times, so as to determine the most accurate number of central video frames.
Optionally, the clustering end condition may be that the video frame in each cluster does not change, or that the center video frame corresponding to each cluster does not change, or that the number of times of returning to the step of obtaining the similarity between the video frame in the video to be determined and each center video frame in the clustering process is preset, and so on.
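The following sketch illustrates grouping video frames by clustering. It substitutes scikit-learn's KMeans for the center-selection and similarity-threshold rules described above and represents each frame by a downsampled grey-level vector; both choices are illustrative assumptions.

```python
# Cluster video frames into groups; each cluster becomes one group of video frames.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def cluster_video_frames(frames, n_centers=3):
    """frames: list of BGR frames; returns a list of frame groups (clusters)."""
    features = np.stack([
        cv2.resize(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), (32, 32)).flatten()
        for f in frames
    ]).astype(np.float32)
    labels = KMeans(n_clusters=n_centers, n_init=10, random_state=0).fit_predict(features)
    groups = [[] for _ in range(n_centers)]
    for frame, label in zip(frames, labels):
        groups[label].append(frame)
    return groups
```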
In the actual application process, if the scenes or contents of the sample images are relatively homogeneous, the depth of field estimation model may not learn other types of scenes or contents well, and if images containing such other scenes or contents appear, the depth of field estimation model may produce wrong predictions for them. To address this issue, the sample images could be augmented so that they cover all possible scenes and contents. However, with this solution, the number of sample images required increases greatly, and the time and computational resources required to train the depth estimation model also increase greatly.
In one example, the scene type to which the image to be judged belongs can be determined jointly by calculating the area ratio of the target shooting object in the image to be judged and combining the area ratio with the target depth of field information. With this solution, no additional sample images are required to train the depth of field estimation model, and the accuracy of determining the scene type can be further improved. Optionally, as shown in fig. 3, before step 204, the method may further include:
determining a target shooting object in the image to be judged, and calculating the target area ratio of the target shooting object in the image to be judged.
The target shooting object is a main shooting object in all shooting objects of the image to be judged. For example, the target photographic subject may be a photographic subject located at the center position of the image, or the target photographic subject may be a photographic subject located at the focal position of the image, or the target photographic subject may be a photographic subject having the largest area ratio in the image to be judged, or the like.
Correspondingly, after determining the target area ratio of the target shooting object, the step "determining the target scene type corresponding to the image to be judged based on the preset corresponding relationship between the depth information and the scene type and the target depth information" may include:
and determining the target scene type corresponding to the image to be judged based on the preset correspondence between the depth of field information, the area ratio, and the scene type, and on the target depth of field information and the target area ratio of the image to be judged.
The correspondence between the depth of field information, the area ratio, and the scene type may be set by a developer. The developer can adjust this correspondence according to different scene type division rules and the like; through this freely configurable correspondence, the depth of field estimation model is applicable to different scene type division rules.
The step of determining a target photographic object in the image to be judged and calculating a target area ratio of the target photographic object in the image to be judged specifically may include:
determining a shot object belonging to a preset object type in an image to be judged;
if the number of the shot objects is determined to be one, the shot objects are determined to be target shot objects, and the target area ratio of the target shot objects in the image to be judged is calculated;
if the number of the shot objects is determined to be at least two, the area ratio of each shot object in the image to be judged is calculated, a target shot object is determined from the shot objects according to the area ratio, and the area ratio corresponding to the target shot object is determined as the target area ratio.
The preset object type can be preset by a user according to factors such as a shooting purpose or a shooting scene of an image to be judged or a video to be judged, for example, if the video to be judged is a news character interview, the preset object type can be set as a character; if the video to be judged is a natural recording film, the preset object type can be set as a plant, and the like.
In another example, if there is no photographic subject belonging to the preset subject type in the image to be judged, the target area ratio may be determined to be zero.
The photographic subject belonging to the preset object type in the image to be judged can be determined through different recognition models; for example, a face in the image to be judged can be recognized through a face recognition model. Specifically, the face recognition model may be a RetinaFace face detector, an MTCNN face detection network, or the like.
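The following sketch illustrates computing the target area ratio from detected photographic subjects. The bounding boxes are assumed to come from a detector of the preset object type (for example a RetinaFace or MTCNN face detector); the (x, y, w, h) box format and the helper itself are hypothetical and only compute the ratios.

```python
# Determine the target photographic subject's area ratio from detector output.
def target_area_ratio(image_height, image_width, detected_boxes):
    """detected_boxes: list of (x, y, w, h) boxes for subjects of the preset object type."""
    image_area = float(image_height * image_width)
    if not detected_boxes:
        return 0.0                                   # no subject of the preset type
    ratios = [(w * h) / image_area for (_, _, w, h) in detected_boxes]
    return max(ratios)                               # the largest-area subject is the target

# Example with hypothetical detections:
# ratio = target_area_ratio(720, 1280, [(100, 50, 200, 300), (600, 80, 90, 120)])
```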
Therefore, the embodiment of the invention trains by using the sample image without the label, reduces the dependence on the manually labeled data, and saves the human resources.
According to the method described in the foregoing embodiment, the following will further describe in detail by taking the image to be determined as a video frame in the video to be determined.
In this embodiment, the system of fig. 1 will be explained.
As shown in fig. 4, the specific process of the scene type determination method of this embodiment may be as follows:
401. and the server pre-trains the depth of field estimation model to be trained based on the sample image.
The pre-training process may specifically include:
acquiring a depth of field estimation model to be trained and a sample image, wherein the sample image does not have a label for indicating the scene type of the sample image, and the depth of field estimation model comprises a feature extraction layer and an image reconstruction layer;
performing feature extraction and depth of field analysis on pixel points in the sample image through a feature extraction layer to obtain a sample depth feature map corresponding to the sample image, wherein the sample depth feature map comprises feature information and depth of field information of each pixel point in the sample image;
performing, through the image reconstruction layer, image reconstruction based on the feature information of each pixel point in the sample depth feature map to obtain a reconstructed image corresponding to the sample image;
determining the loss of the depth of field estimation model according to the sample image and the reconstructed image;
and adjusting parameters of the depth of field estimation model based on the loss to obtain the trained depth of field estimation model.
402. The server obtains the video to be judged sent by the terminal and obtains the depth of field estimation model.
The terminal can send the video to be judged to the server in a wired manner after generating it, or the terminal can send the video to be judged to a cloud server in a wireless manner, and when the server needs to perform scene type determination, the video to be judged can be obtained from the cloud server.
403. And the server extracts a depth feature map of the image to be judged in the video to be judged through the depth estimation model, wherein the depth feature map comprises depth information of each pixel point in the image to be judged.
The depth feature map may be an image generated according to feature information of the extracted image to be determined after feature extraction is performed on the image to be determined by the depth estimation model.
It can be understood that the depth information in the depth feature map is converted according to the feature values of the pixel points to obtain the relative depth information of the image to be determined. If the sample image marked with the actual depth value is used for training the depth estimation model, the target depth information output by the depth estimation model can be the actual depth value of the image.
404. And the server calculates to obtain the target depth of field information of the image to be judged based on the depth of field information of each pixel point in the depth feature map.
When the target depth of field information is calculated, the depth of field information of each pixel point in the image to be judged can be directly summed to obtain the target depth of field information.
In another example, the depth of field information of the pixel points in the image to be judged may be summed and then divided by the number of pixel points in the image to be judged, and the resulting average value may be used as the target depth of field information.
Optionally, different calculation weights may be set for the photographic subject in the image to be judged, and then the depth of field information corresponding to different pixel points is subjected to weighted calculation, for example, the pixel point weight of the target photographic subject in the image to be judged may be set to be the highest, and the pixel point weights of other photographic subjects may be adjusted according to the area ratios of the other photographic subjects, and so on.
The specific calculation method of the depth of field information is not limited, and technicians can set the calculation method of the target depth of field information according to actual application conditions.
405. The server determines a target shooting object in the image to be judged and calculates the target area ratio of the target shooting object in the image to be judged.
In the actual application process, if the scenes or contents of the sample images are relatively homogeneous, the depth of field estimation model may not learn other types of scenes or contents well, and if images containing such other scenes or contents appear, the depth of field estimation model may produce wrong predictions for them.
In one example, the scene type to which the image to be judged belongs can be determined jointly by calculating the area ratio of the target shooting object in the image to be judged and combining the area ratio with the target depth of field information. With this solution, no additional sample images are required to train the depth of field estimation model, and the accuracy of determining the scene type can be further improved.
The step of determining a target photographic object in the image to be judged and calculating a target area ratio of the target photographic object in the image to be judged specifically may include:
determining a shot object belonging to a preset object type in an image to be judged;
if the number of the shot objects is determined to be one, the shot objects are determined to be target shot objects, and the target area ratio of the target shot objects in the image to be judged is calculated;
if the number of the shot objects is determined to be at least two, the area ratio of each shot object in the image to be judged is calculated, a target shot object is determined from the shot objects according to the area ratio, and the area ratio corresponding to the target shot object is determined as the target area ratio.
406. The server determines the target scene type corresponding to the image to be judged based on the preset correspondence between the depth of field information, the area ratio, and the scene type, and on the target depth of field information and the target area ratio of the image to be judged.
For example, the preset correspondence between the depth of field information, the area ratio, and the scene type may be as follows: the scene is a medium shot when the depth of field information is greater than 0.09 and the area ratio is less than 0.1; the scene is a close shot when the depth of field information is greater than 0.1 and less than 0.5 and the area ratio is greater than 0.3 and less than 0.5; and so on.
In one example, if the target depth of field information and the target area ratio of the image to be determined are 0.3 and 0.4, respectively, the target scene type of the image to be determined is a close scene.
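The following sketch illustrates the correspondence lookup of step 406 using the example thresholds given above; additional scene types and thresholds would be configured by the developer in the same way.

```python
# Map target depth of field information and target area ratio to a scene type.
def determine_scene_type(target_depth, target_area_ratio):
    if 0.1 < target_depth < 0.5 and 0.3 < target_area_ratio < 0.5:
        return "close shot"
    if target_depth > 0.09 and target_area_ratio < 0.1:
        return "medium shot"
    return "unknown"   # fall through to other configured scene types

# With target depth 0.3 and target area ratio 0.4, this returns "close shot".
```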
407. The server calculates the similarity between the images to be judged in the video to be judged, and determines at least one group of video frames in the video to be judged according to the similarity, wherein the similarity between the video frames in each group of video frames is greater than a preset similarity threshold.
Optionally, the step of determining at least one group of video frames in the video to be determined according to the similarity may include: determining a preset number of video frames as a central video frame of a cluster center point from all video frames of a video to be judged;
acquiring the similarity between a video frame in a video to be judged and each central video frame;
dividing video frames in a video to be judged into clustering clusters where corresponding target center video frames are located, wherein the similarity between the video frames and the corresponding target center video frames is not lower than a preset similarity threshold;
selecting a new central video frame of each cluster based on the video frames in each cluster, and returning to the step of acquiring the similarity between the video frame in the video to be judged and each central video frame until the clustering end condition is met;
and respectively determining the video frames in each cluster as a group of video frames.
408. And the server determines the scene type of each group of target video frames according to the target scene type corresponding to each image to be judged in each group of target video frames.
In an example, the target scene types corresponding to the images to be judged in each group of target video frames may be counted, and the scene type with the largest count may be used as the scene type of that group of target video frames.
For example, if in a group of target video frames the target scene type corresponding to 3 images to be judged is a medium scene, the target scene type corresponding to 7 images to be judged is a near scene, and the target scene type corresponding to 4 images to be judged is a far scene, then the target scene type of the group of target video frames is a near scene.
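The following sketch illustrates step 408: the most frequent per-frame target scene type within a group of target video frames is taken as the scene type of the whole group.

```python
# Majority vote over the per-frame target scene types of a group of target video frames.
from collections import Counter

def group_scene_type(frame_scene_types):
    """frame_scene_types: list such as ["medium"] * 3 + ["near"] * 7 + ["far"] * 4."""
    return Counter(frame_scene_types).most_common(1)[0][0]

# For the 3/7/4 example above, this returns "near", so the group is a near scene.
```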
409. And the server adds corresponding scene type labels to each group of target video frames according to the scene type of each group of target video frames.
For example, if the scene type of the set of target video frames is a long scene, a scene type label of the long scene is set for the whole set of target video frames. When the video frames of the long shot need to be called, the whole group of target video frames can be directly acquired, and the continuity of the obtained target video frames is ensured.
In another example, in order to make the result of each frame more robust, after the scene type of each group of target video frames is determined, if the scene type corresponding to the image to be determined exists in the target video frames and is different from the scene type of the target video frames, the scene type corresponding to the image to be determined is modified into the scene type of the target video frames.
Therefore, the embodiment of the invention trains by using the sample image without the label, reduces the dependence on the manually labeled data, and saves the human resources.
In order to better implement the method, correspondingly, the embodiment of the invention further provides a scene type judging device.
Referring to fig. 5, the apparatus includes:
the model obtaining unit 501 is configured to obtain a depth-of-field estimation model, where the depth-of-field estimation model is obtained by training based on an unmarked sample image and a depth-of-field estimation sample image corresponding to the sample image, and the depth-of-field estimation sample image is obtained by extracting a sample depth feature map from the sample image for the depth-of-field estimation model and then performing image reconstruction based on the sample depth feature map, where the sample depth feature map includes depth-of-field information of each pixel point in the sample image;
the feature extraction unit 502 is configured to extract a depth feature map of the image to be determined through the depth estimation model, where the depth feature map includes depth information of each pixel point in the image to be determined;
the depth-of-field estimation unit 503 is configured to calculate target depth-of-field information of the image to be determined based on depth-of-field information of each pixel point in the depth feature map;
the type determining unit 504 is configured to determine a target scene type corresponding to the image to be determined based on a preset correspondence between the depth of field information and the scene type and the target depth of field information.
In an optional example, the image to be judged is a video frame in a video to be judged, the video to be judged includes at least two video frames, and the feature extraction unit 502 is configured to take each video frame in the video to be judged as an image to be judged and to extract, through the depth of field estimation model, a depth feature map of each image to be judged;
as shown in fig. 6, after the type determining unit 504, a video type determining unit 505 is further included, configured to calculate similarity between images to be determined in the video to be determined;
determining at least one group of video frames in the video to be judged according to the similarity, wherein the similarity between the video frames in each group of video frames is greater than a preset similarity threshold;
if continuous video frames in the video to be judged exist in each group of video frames, determining the continuous video frames as a group of target video frames;
determining the scene type of each group of target video frames according to the target scene type corresponding to each image to be judged in each group of target video frames;
and adding corresponding scene type labels for each group of target video frames according to the scene type of each group of target video frames.
In an optional example, the video type determining unit 505 may be configured to determine, from video frames of a video to be determined, a preset number of video frames as a center video frame of a cluster center point;
acquiring the similarity between a video frame in a video to be judged and each central video frame;
dividing video frames in a video to be judged into clustering clusters where corresponding target center video frames are located, wherein the similarity between the video frames and the corresponding target center video frames is not lower than a preset similarity threshold;
selecting a new central video frame of each cluster based on the video frames in each cluster, and returning to the step of acquiring the similarity between the video frame in the video to be judged and each central video frame until the clustering end condition is met;
and respectively determining the video frames in each cluster as a group of video frames.
In an optional example, before the type determining unit 504, an area calculating unit 506 is further included, configured to determine a target photographic object in the image to be determined, and calculate a target area ratio of the target photographic object in the image to be determined;
determining a target scene type corresponding to the image to be judged based on the corresponding relation between the preset depth of field information and the scene type and the target depth of field information, wherein the method comprises the following steps:
and determining the target scene type corresponding to the image to be judged based on the preset correspondence between the depth of field information, the area ratio, and the scene type, and on the target depth of field information and the target area ratio of the image to be judged.
In an optional example, the area calculating unit 506 may be configured to determine a photographic object belonging to a preset object type in the image to be determined;
if the number of the shot objects is determined to be one, the shot objects are determined to be target shot objects, and the target area ratio of the target shot objects in the image to be judged is calculated;
if the number of the shot objects is determined to be at least two, the area ratio of each shot object in the image to be judged is calculated, a target shot object is determined from the shot objects according to the area ratio, and the area ratio corresponding to the target shot object is determined as the target area ratio.
In an optional example, before the model obtaining unit 501, the model training unit 507 is further configured to obtain a depth of field estimation model to be trained and a sample image, where the sample image does not have a label indicating a scene type of the sample image, and the depth of field estimation model includes a feature extraction layer and an image reconstruction layer;
performing feature extraction and depth of field analysis on pixel points in the sample image through a feature extraction layer to obtain a sample depth feature map corresponding to the sample image, wherein the sample depth feature map comprises feature information and depth of field information of each pixel point in the sample image;
performing, through the image reconstruction layer, image reconstruction based on the feature information of each pixel point in the sample depth feature map to obtain a reconstructed image corresponding to the sample image;
determining the loss of the depth of field estimation model according to the sample image and the reconstructed image;
and adjusting parameters of the depth of field estimation model based on the loss to obtain the trained depth of field estimation model.
In an optional example, the feature extraction layer comprises a depth of field analysis module and at least two sequentially connected feature extraction modules;
the model training unit 507 may be configured to perform feature extraction on pixel points in the sample image through a feature extraction module to obtain at least two intermediate feature maps of the sample image, where the intermediate feature maps include feature information of the pixel points;
determining, through the depth-of-field analysis module, the depth-of-field information of the pixel points based on the feature information of the pixel points in the intermediate feature map, and obtaining a sample depth feature map corresponding to the sample image based on the depth-of-field information and the intermediate feature map;
and performing, through the image reconstruction layer, image reconstruction based on the feature information of the pixel points in the intermediate feature map and the sample depth feature map to obtain a reconstructed image corresponding to the sample image.
In an optional example, the model training unit 507 may be configured to perform pose transformation on the shooting object in the sample image based on the pose transformation parameter to obtain a first pose estimation image corresponding to the sample image;
performing pose restoration on the first pose estimation image based on the pose transformation parameters and the reconstructed image to obtain a restored image;
calculating image loss based on image difference between the sample image and the restored image;
based on the image loss, a loss of the depth estimation model is determined.
In an optional example, the model training unit 507 may be configured to perform edge detection on each object in the sample image and the depth estimation sample image to obtain a first edge of each object in the sample image and a second edge of each object in the depth estimation sample image;
and determining the edge loss of the depth estimation model according to the difference between the first edge and the second edge corresponding to the same shooting object.
In an optional example, the feature extraction unit 502 may be configured to perform feature extraction and depth of field analysis on the pixel points in the image to be judged through the feature extraction layer of the depth of field estimation model, so as to obtain a depth feature map corresponding to the image to be judged.
Therefore, by the scene type judging device, the unmarked sample image can be used for training, the dependence on manually marked data is reduced, and the human resources are saved.
In addition, an embodiment of the present invention further provides an electronic device, where the electronic device may be a terminal or a server, and as shown in fig. 7, a schematic structural diagram of the electronic device according to the embodiment of the present invention is shown, specifically:
the electronic device may include Radio Frequency (RF) circuitry 701, memory 702 including one or more computer-readable storage media, input unit 703, display unit 704, sensor 705, audio circuitry 706, Wireless Fidelity (WiFi) module 707, processor 708 including one or more processing cores, and power supply 709. Those skilled in the art will appreciate that the terminal structure shown in fig. 7 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the RF circuit 701 may be used for receiving and transmitting signals during a message transmission or communication process, and in particular, for receiving downlink information of a base station and then sending the received downlink information to the one or more processors 708 for processing; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 701 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, RF circuit 701 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 702 may be used to store software programs and modules, and the processor 708 executes various functional applications and data processing by operating the software programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 702 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 702 may also include a memory controller to provide the processor 708 and the input unit 703 access to the memory 702.
The input unit 703 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. In a particular embodiment, the input unit 703 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations of a user on or near it (e.g., operations performed by the user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) and drive the corresponding connection device according to a predetermined program. Optionally, the touch-sensitive surface may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 708, and can receive and execute commands sent by the processor 708. In addition, the touch-sensitive surface may be implemented in various types, such as a resistive type, a capacitive type, an infrared type, and a surface acoustic wave type. In addition to the touch-sensitive surface, the input unit 703 may include other input devices, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 704 may be used to display information input by or provided to the user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The display unit 704 may include a display panel, and optionally, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel; when a touch operation is detected on or near the touch-sensitive surface, the touch operation is communicated to the processor 708 to determine the type of the touch event, and the processor 708 then provides a corresponding visual output on the display panel according to the type of the touch event. Although in fig. 7 the touch-sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement the input and output functions.
The terminal may also include at least one sensor 705, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal, detailed description is omitted here.
The audio circuit 706, a speaker, and a microphone may provide an audio interface between the user and the terminal. The audio circuit 706 can transmit the electrical signal converted from the received audio data to the speaker, and the speaker converts the electrical signal into a sound signal and outputs it; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 706 and converted into audio data; the audio data is then output to the processor 708 for processing and transmitted to, for example, another terminal via the RF circuit 701, or output to the memory 702 for further processing. The audio circuit 706 may also include an earphone jack to provide communication between peripheral headphones and the terminal.
WiFi is a short-range wireless transmission technology. Through the WiFi module 707, the terminal can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 7 shows the WiFi module 707, it is understood that the module is not an essential component of the terminal and may be omitted as needed without changing the essence of the invention.
The processor 708 is the control center of the terminal. It connects various parts of the entire terminal using various interfaces and lines, and performs various functions of the terminal and processes data by running or executing the software programs and/or modules stored in the memory 702 and calling the data stored in the memory 702, thereby monitoring the terminal as a whole. Optionally, the processor 708 may include one or more processing cores; preferably, the processor 708 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may not be integrated into the processor 708.
The terminal also includes a power supply 709 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically coupled to the processor 708 via a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 709 may further include one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 708 in the terminal loads the executable file corresponding to the process of one or more application programs into the memory 702 according to the following instructions, and the processor 708 runs the application programs stored in the memory 702, thereby implementing various functions as follows:
acquiring a depth of field estimation model, wherein the depth of field estimation model is obtained through training based on an unlabeled sample image and a depth of field estimation sample image corresponding to the sample image, the depth of field estimation sample image is obtained by the depth of field estimation model extracting a sample depth feature map from the sample image and then performing image reconstruction based on the sample depth feature map, and the sample depth feature map comprises depth of field information of each pixel point in the sample image;
extracting a depth feature map of the image to be judged through the depth of field estimation model, wherein the depth feature map comprises depth of field information of each pixel point in the image to be judged;
calculating target depth of field information of the image to be judged based on the depth of field information of each pixel point in the depth feature map;
and determining, based on a preset correspondence between depth of field information and scene types and on the target depth of field information, the target scene type corresponding to the image to be judged.
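As a hedged illustration of these four steps, the following sketch assumes that the trained depth of field estimation model is a PyTorch module, that the target depth of field information is taken as the mean of the per-pixel depth of field values, and that the preset correspondence is a simple threshold table; the threshold values, the scene type names, and the mean aggregation are all hypothetical and merely indicate one possible realization.

```python
import torch

# Hypothetical preset correspondence between depth of field information
# (average estimated depth, arbitrary units) and scene types; the upper
# bounds and scene type names are illustrative assumptions.
SCENE_TYPE_BY_DEPTH = [
    (1.5, "close-up"),
    (10.0, "indoor"),
    (float("inf"), "landscape"),
]


def determine_scene_type(model: torch.nn.Module, image_to_judge: torch.Tensor) -> str:
    """Run the trained depth of field estimation model on the image to be judged
    and map the resulting target depth of field information to a scene type."""
    model.eval()
    with torch.no_grad():
        # Depth feature map containing depth of field information per pixel point;
        # the input is assumed to be a (C, H, W) tensor for a single image.
        depth_feature_map = model(image_to_judge.unsqueeze(0))
    # Target depth of field information: here, simply the mean over all pixel points.
    target_depth = depth_feature_map.mean().item()
    for upper_bound, scene_type in SCENE_TYPE_BY_DEPTH:
        if target_depth <= upper_bound:
            return scene_type
    return "unknown"
```

Under this assumed table, for example, an image whose average estimated depth falls below the first threshold would be classified as a close-up scene.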
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be performed by instructions, or by related hardware controlled by the instructions; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any of the scene type determination methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring a depth of field estimation model, wherein the depth of field estimation model is obtained through training based on an unlabeled sample image and a depth of field estimation sample image corresponding to the sample image, the depth of field estimation sample image is obtained by the depth of field estimation model extracting a sample depth feature map from the sample image and then performing image reconstruction based on the sample depth feature map, and the sample depth feature map comprises depth of field information of each pixel point in the sample image;
extracting a depth feature map of the image to be judged through the depth of field estimation model, wherein the depth feature map comprises depth of field information of each pixel point in the image to be judged;
calculating target depth of field information of the image to be judged based on the depth of field information of each pixel point in the depth feature map;
and determining, based on a preset correspondence between depth of field information and scene types and on the target depth of field information, the target scene type corresponding to the image to be judged.
For the specific implementation of the above operations, reference may be made to the foregoing embodiments, and details are not described herein again.
The storage medium may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Since the instructions stored in the storage medium can execute the steps in any scene type determination method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any of these methods can likewise be achieved; details are given in the foregoing embodiments and are not described herein again.
According to an aspect of the application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium and executes them, so that the electronic device performs the method provided in the various optional implementations of the above embodiments.
The scene type determination method and apparatus, the electronic device, and the storage medium provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method and core idea of the present invention. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.