
CN113051430B - Model training method, device, electronic equipment, medium and product - Google Patents

Model training method, device, electronic equipment, medium and product

Info

Publication number
CN113051430B
CN113051430B
Authority
CN
China
Prior art keywords
video
image
feature information
images
sample
Prior art date
Legal status
Active
Application number
CN202110324886.5A
Other languages
Chinese (zh)
Other versions
CN113051430A
Inventor
朱文涛
李江东
吕廷迅
班鑫
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110324886.5A
Publication of CN113051430A
Application granted
Publication of CN113051430B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a video analysis model training method, apparatus, device, medium, and product in which an image set of a sample video, rather than the entire sample video, is used as input during training of a machine learning model. The image set of the sample video contains first video images extracted from the sample video and second video images, which are preset images. Because the number of images contained in the image set of the sample video is smaller than the number of all images contained in the sample video, the machine learning model is faster to train. Since the second video images may be unrelated to the sample video, mask parameters are determined that record the positions of valid images and the positions of invalid images in the image set, and the image set and the mask parameters are both input to the machine learning model, so that the model obtains the analysis result of the sample video based only on the first video images. The trained machine learning model is therefore more accurate.

Description

Model training method, device, electronic equipment, medium and product
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, medium, and product for training a video analysis model.
Background
A video client can recommend videos of interest to the user, or can display the classification labels of videos (such as classification labels for variety shows, movies, cartoons, and the like) to the user, so that the user can find the corresponding videos based on the classification labels of the videos.
In the related art, the videos of interest to the user, or the classification labels of videos, are determined based on the video content information of the videos. The video content information of a video may be obtained by inputting the video into a machine learning model and taking the video content information output by the machine learning model. The machine learning model is trained with a plurality of videos as inputs.
The speed of training the machine learning model in the related art is slow.
Disclosure of Invention
The present disclosure provides a video analysis model training method, apparatus, device, medium, and product to at least solve the problem of slow speed of training a machine learning model in the related art. The technical scheme of the present disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a video analysis model training method, including:
extracting video images at intervals from a plurality of frames of video images contained in a sample video, to obtain a first video image;
determining an image set corresponding to the sample video, wherein the image set comprises the first video image and a first number of second video images, the second video images are preset images, the sum of the number of the first video images and the first number is a preset number, and the number of images contained in the image set corresponding to each sample video in the sample video set is the preset number;
determining mask parameters based on the image set, wherein the mask parameters are used for recording the positions of effective images and the positions of ineffective images in the image set, the first video image is the effective image, and the second video image is the ineffective image;
and taking the image set and the mask parameters as input of a machine learning model, taking the manual labeling classification labels corresponding to the sample video as training targets, and training to obtain a video analysis model.
With reference to the first aspect, in a first possible implementation manner, the determining mask parameters based on the image set includes:
Determining that the first video image is located at a first position in the set of images;
determining that the second video image is located at a second position in the set of images;
determining the mask parameter based on the first position and the second position, wherein elements in the mask parameter, which are positioned at positions corresponding to the first position, are first characters, and elements in the mask parameter, which are positioned at positions corresponding to the second position, are second characters; the image of the first character representation located at the first position of the image set is a valid image, and the image of the second character representation located at the second position of the image set is an invalid image.
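For illustration only, the following minimal Python sketch shows one way such a mask parameter could be represented as a vector, using 1 as the first character (valid image) and 0 as the second character (invalid image); the function name, the choice of characters, and the assumption that the first video images occupy the leading positions are illustrative and not taken from the disclosure.

```python
def build_mask(num_first_images, preset_number):
    """Build a mask vector for an image set of length `preset_number`.

    Positions holding first video images (valid images) get the first
    character (1); positions holding second video images (invalid
    padding images) get the second character (0).
    """
    if num_first_images > preset_number:
        raise ValueError("more extracted images than the preset number")
    # Assumption: the first video images occupy the leading positions.
    return [1] * num_first_images + [0] * (preset_number - num_first_images)


# Example: 4 frames extracted from a sample video, image set padded to 20.
print(build_mask(4, 20))  # four 1s followed by sixteen 0s
```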
With reference to the first aspect, in a second possible implementation manner, the step of extracting video images from multiple frames of video images included in the sample video at intervals to obtain the first video image includes:
receiving sample video from a storage device, wherein the sample video is video compressed and encoded by the storage device;
obtaining a target key frame from a plurality of key frames contained in the sample video;
acquiring a sampling interval;
extracting a non-key frame from the non-key frames corresponding to the target key frame contained in the sample video at intervals of the sampling interval to obtain at least one non-key frame, wherein the non-key frame is compressed based on the target key frame;
And decoding the target key frame and the at least one non-key frame to obtain the first video image.
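As a schematic illustration of the index arithmetic only (the actual decoding of key frames and non-key frames depends on the codec and is not shown), the frames to decode could be selected as follows; the helper name and the assumption that the non-key frames immediately follow their key frame are hypothetical.

```python
def select_frame_indices(key_frame_index, num_nonkey_frames, sampling_interval):
    """Indices to decode for one target key frame.

    The target key frame sits at `key_frame_index`, and the next
    `num_nonkey_frames` frames are the non-key frames compressed with
    reference to it; every `sampling_interval`-th non-key frame is kept.
    """
    indices = [key_frame_index]
    for offset in range(sampling_interval, num_nonkey_frames + 1, sampling_interval):
        indices.append(key_frame_index + offset)
    return indices


# Example: key frame at position 120 with 29 dependent non-key frames,
# sampled every 10 frames -> decode frames 120, 130 and 140.
print(select_frame_indices(120, 29, 10))
```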
With reference to the first aspect, in a third possible implementation manner, the step of acquiring the sampling interval includes:
calculating the sampling interval based on the frame rate of the sample video and the number of images processed per second; or alternatively,
and acquiring the preset sampling interval.
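One plausible reading of the two alternatives above is sketched below; the exact formula (frame rate divided by the number of images processed per second, rounded) is an assumption rather than the disclosed calculation.

```python
def get_sampling_interval(frame_rate, images_per_second, preset_interval=None):
    """Sampling interval in frames.

    If a preset interval is configured it is returned directly; otherwise
    the interval is derived from the frame rate and the number of images
    to be processed per second (assumed formula, at least 1).
    """
    if preset_interval is not None:
        return preset_interval
    return max(1, round(frame_rate / images_per_second))


print(get_sampling_interval(30, 3))      # 10: keep one frame out of every 10
print(get_sampling_interval(30, 3, 8))   # 8: the preset interval takes precedence
```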
With reference to the first aspect, in a fourth possible implementation manner, the step of taking the image set and the mask parameters corresponding to the sample video as input of a machine learning model, taking the manually labeled classification label corresponding to the sample video as a training target, and training to obtain a video analysis model includes:
inputting the image set and the mask parameters into a machine learning model;
the machine learning model performs the steps of:
obtaining a feature information set corresponding to the image set, wherein the feature information set comprises feature information corresponding to the first video image and feature information corresponding to the second video image;
screening out feature information corresponding to the first video image from the feature information set based on the mask parameters;
Determining a classification label based on the feature information corresponding to the first video image, wherein the machine learning model is used for outputting the feature information corresponding to the first video image or the classification label;
and training the machine learning model based on the comparison result of the classification label and the manually marked classification label to obtain the video analysis model.
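A minimal PyTorch-style sketch of these steps is given below, assuming a toy per-frame encoder, mean pooling over the screened features, and a linear classifier; the layer sizes, image resolution, and module layout are illustrative assumptions rather than the disclosed architecture.

```python
import torch
import torch.nn as nn


class MaskedVideoClassifier(nn.Module):
    """Toy model: encode every image in the set, keep only feature vectors of
    valid images (mask == 1), pool them, then predict a classification label."""

    def __init__(self, feature_dim=128, num_classes=10):
        super().__init__()
        # Stand-in for a real image backbone; expects 3x32x32 images.
        self.frame_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feature_dim))
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, image_set, mask):
        # image_set: (batch, preset_number, 3, 32, 32); mask: (batch, preset_number)
        b, n = image_set.shape[:2]
        feats = self.frame_encoder(image_set.reshape(b * n, *image_set.shape[2:]))
        feats = feats.reshape(b, n, -1)                            # feature information set
        mask = mask.unsqueeze(-1).float()                          # positions of valid images
        pooled = (feats * mask).sum(1) / mask.sum(1).clamp(min=1)  # screen and pool valid features
        return self.classifier(pooled)                             # predicted classification label


model = MaskedVideoClassifier()
images = torch.randn(2, 20, 3, 32, 32)     # two image sets of preset number 20
mask = torch.zeros(2, 20)
mask[:, :4] = 1                            # only the first 4 images are valid
print(model(images, mask).shape)           # torch.Size([2, 10])
```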
With reference to the first aspect, in a fifth possible implementation manner, the step of obtaining the feature information set corresponding to the image set includes:
obtaining intra-frame characteristic information corresponding to each image in the image set;
and acquiring the feature information set based on the intra-frame feature information corresponding to each image in the image set, wherein the feature information corresponding to the first video images comprises inter-frame feature information among a plurality of first video images, and the feature information corresponding to the second video images comprises inter-frame feature information among a plurality of second video images and/or inter-frame feature information between the second video images and the first video images.
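Purely as an illustration of the two kinds of features, the sketch below uses a colour histogram as a stand-in for the learned intra-frame feature and the difference between neighbouring histograms as a stand-in for inter-frame feature information; in the disclosure both are produced by the machine learning model itself.

```python
import numpy as np


def intra_frame_feature(image):
    """Stand-in intra-frame feature: a normalised intensity histogram."""
    hist, _ = np.histogram(image, bins=16, range=(0, 255))
    return hist / hist.sum()


def feature_information_set(image_set):
    """Intra-frame features for every image plus simple inter-frame features
    (here: differences between consecutive intra-frame features)."""
    intra = np.stack([intra_frame_feature(img) for img in image_set])
    inter = np.diff(intra, axis=0)      # relation between neighbouring images
    return intra, inter


image_set = np.random.randint(0, 256, size=(20, 32, 32, 3))   # 20 images of 32x32x3
intra, inter = feature_information_set(image_set)
print(intra.shape, inter.shape)                               # (20, 16) (19, 16)
```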
With reference to the first aspect, in a sixth possible implementation manner, the position of the feature information of an effective image in the feature information set is the same as the position of that effective image in the image set, and the step of screening feature information corresponding to the first video image from the feature information set based on the mask parameters includes:
Determining a third position of the feature information of the effective image characterized by the mask parameters in the feature information set;
and obtaining the characteristic information at the third position in the characteristic information set to obtain the characteristic information corresponding to the first video image.
With reference to the first aspect, in a seventh possible implementation manner, the method for training a video analysis model further includes:
extracting video images from a plurality of frames of video images contained in the video to be detected at intervals to extract and obtain a third video image;
inputting the third video image into the video analysis model, and obtaining an analysis result corresponding to the video to be detected through the video analysis model, wherein the analysis result comprises characteristic information of the third video image or a classification label of the video to be detected.
With reference to the first aspect, in an eighth possible implementation manner, after the step of determining mask parameters, the method further includes:
storing the image set corresponding to the sample video and the mask parameters into a database;
and before the step of obtaining the video analysis model by training, taking the image set and the mask parameters as input of a machine learning model and taking the manual labeling classification labels corresponding to the sample video as training targets, the method further comprises the following steps of:
And obtaining the image set and the mask parameters corresponding to the sample video from the database.
According to a second aspect of embodiments of the present disclosure, there is provided a video analysis model training apparatus, including:
the first acquisition module is configured to extract video images from a plurality of frames of video images contained in the sample video at intervals so as to extract and obtain a first video image;
the first determining module is configured to determine an image set corresponding to the sample video, wherein the image set comprises the first video image and a first number of second video images, the second video images are preset images, the sum of the number of the first video images and the first number is a preset number, and the number of images contained in the image set corresponding to each sample video in the sample video set is the preset number;
a second determining module configured to determine mask parameters based on the image set, where the mask parameters are used for recording positions of valid images and positions of invalid images in the image set, the first video image is a valid image, and the second video image is an invalid image;
the training module is configured to take the image set and the mask parameters as input of a machine learning model, take the manual labeling classification labels corresponding to the sample videos as training targets, and train to obtain a video analysis model.
With reference to the second aspect, in a first possible implementation manner, the second determining module specifically includes:
a first determining unit configured to determine that the first video image is located at a first position in the image set;
a second determining unit configured to determine that the second video image is located at a second position in the image set;
a third determining unit configured to determine the mask parameter based on the first position and the second position, wherein an element located at a position corresponding to the first position in the mask parameter is a first character, and an element located at a position corresponding to the second position is a second character; the image of the first character representation located at the first position of the image set is a valid image, and the image of the second character representation located at the second position of the image set is an invalid image.
With reference to the second aspect, in a second possible implementation manner, the first acquisition module specifically includes:
a receiving unit configured to receive a sample video from a storage device, the sample video being a video compression-encoded by the storage device;
A first acquisition unit configured to acquire a target key frame from a plurality of key frames contained in the sample video;
a second acquisition unit configured to acquire a sampling interval;
a third obtaining unit, configured to extract a non-key frame from non-key frames corresponding to the target key frame included in the sample video at intervals of the sampling interval, so as to obtain at least one non-key frame, where the non-key frame is compressed based on the target key frame;
a decoding unit configured to decode the target key frame and the at least one non-key frame to obtain the first video image.
With reference to the second aspect, in a third possible implementation manner, the second acquisition unit specifically includes:
a calculating subunit configured to calculate the sampling interval based on the frame rate of the sample video and the number of images processed per second; or alternatively,
and an acquisition subunit configured to acquire the sampling interval set in advance.
With reference to the second aspect, in a fourth possible implementation manner, the training module specifically includes:
an input unit configured to input the set of images and the mask parameters to a machine learning model;
The machine learning model includes the following modules:
the feature extraction module is configured to obtain a feature information set corresponding to the image set, wherein the feature information set comprises feature information corresponding to the first video image and feature information corresponding to the second video image;
the effective feature extraction module is configured to screen feature information corresponding to the first video image from the feature information set based on the mask parameters;
a tag prediction module configured to determine a classification tag based on feature information corresponding to the first video image, the machine learning model being configured to output the feature information corresponding to the first video image or the classification tag;
and the training unit is configured to train the machine learning model based on the comparison result of the classification label and the manual labeling classification label so as to obtain the video analysis model.
With reference to the second aspect, in a fifth possible implementation manner, the feature extraction module specifically includes:
the intra-frame feature extraction module is configured to obtain intra-frame feature information corresponding to each image in the image set;
The inter-frame feature extraction module is configured to obtain a feature information set based on intra-frame feature information corresponding to each image in the image set, wherein the feature information corresponding to the first video image comprises inter-frame feature information among a plurality of first video images, the feature information corresponding to the second video image comprises inter-frame feature information among a plurality of second video images, and/or the inter-frame feature information between the second video images and the first video images.
With reference to the second aspect, in a sixth possible implementation manner, the position of the feature information of an effective image in the feature information set is the same as the position of that effective image in the image set, and the effective feature extraction module specifically includes:
an extraction position module configured to determine a third position of feature information of the valid image characterized by the mask parameter in the feature information set;
and the characteristic extraction module is configured to obtain characteristic information at the third position in the characteristic information set so as to obtain characteristic information corresponding to the first video image.
With reference to the second aspect, in a seventh possible implementation manner, the apparatus further includes:
The second acquisition module is configured to extract video images from a plurality of frames of video images contained in the video to be detected at intervals so as to extract a third video image;
the analysis module is configured to input the third video image into the video analysis model, obtain an analysis result corresponding to the video to be detected through the video analysis model, and the analysis result comprises characteristic information of the third video image or a classification label of the video to be detected.
With reference to the second aspect, in an eighth possible implementation manner, the apparatus further includes:
a storage module configured to store the image set and the mask parameters corresponding to the sample video to a database;
and a third acquisition module configured to acquire the image set corresponding to the sample video and the mask parameters from the database.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the video analytics model training method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the video analysis model training method described in the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product that can be directly loaded into an internal memory of a computer, for example the memory comprised by the electronic device described in the third aspect, and that comprises software code which, after being loaded and executed by the computer, implements the video analysis model training method described in the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
in the video analysis model training method provided by the embodiment of the disclosure, video images can be extracted at intervals from the multiple frames of video images contained in a sample video to obtain a first video image; an image set corresponding to the sample video is determined, wherein the image set comprises the first video image and a first number of second video images, the second video images are preset images, the sum of the number of the first video images and the first number is a preset number, and the number of images contained in the image set corresponding to each sample video in the sample video set is the preset number; mask parameters are determined based on the image set, wherein the mask parameters are used for recording the positions of effective images and the positions of ineffective images in the image set, the first video image being an effective image and the second video image being an ineffective image; and the image set and the mask parameters are taken as input of a machine learning model, the manually labeled classification label corresponding to the sample video is taken as a training target, and a video analysis model is obtained through training. Because the number of images contained in the image set is smaller than the number of images contained in the sample video, training the machine learning model with the image set of the sample video is faster. In the process of training the machine learning model, the image sets corresponding to a plurality of sample videos need to be input to the machine learning model at the same time, so these image sets must contain the same number of images, namely the preset number; in the embodiment of the disclosure, if the number of first video images extracted from a sample video is smaller than the preset number, second video images are used for filling. Since the second video images may be unrelated to the sample video, the mask parameters also need to be input into the machine learning model so that the model obtains the analysis result of the sample video based on the first video images. The trained machine learning model is therefore more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a block diagram illustrating hardware to which embodiments of the present disclosure apply, according to an example embodiment;
FIG. 2 is a flowchart illustrating a video analytics model training method, according to an exemplary embodiment;
FIG. 3a is a schematic diagram illustrating one implementation of extracting target key frames from a sample video according to one exemplary embodiment;
FIG. 3b is a schematic diagram illustrating yet another implementation of extracting target key frames from a sample video according to an example embodiment;
FIG. 4 is a block diagram of a machine learning model corresponding to a first type of video analytics model, as shown in accordance with an exemplary embodiment;
FIG. 5 is a flowchart illustrating a method of obtaining analysis results for a video under test based on a video analysis model, according to an example embodiment;
FIG. 6 is a block diagram of a video analytics model training device, shown in accordance with an exemplary embodiment;
fig. 7 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The embodiment of the disclosure provides a video analysis model training method, a device, equipment, a medium and a product, and before introducing the technical scheme provided by the embodiment of the disclosure, the network environment and hardware applied by the embodiment of the disclosure are introduced.
Fig. 1 is a block diagram illustrating hardware to which embodiments of the present disclosure apply, according to an example embodiment. The hardware applied by the embodiment of the disclosure comprises: a first electronic device 11, a second electronic device 12, a third electronic device 13, at least one storage device 14.
For example, any one of the first electronic device 11, the second electronic device 12, and the third electronic device 13 may be any electronic product that can perform man-machine interaction with a user through one or more manners of a keyboard, a touch pad, a touch screen, a remote controller, a voice interaction or a handwriting device, for example, a mobile phone, a tablet computer, a palm computer, a personal computer, a wearable device, a smart television, and the like.
For example, any one of the first electronic device 11, the second electronic device 12, and the third electronic device 13 may be a server, or may be a server cluster formed by a plurality of servers, or may be a cloud computing service center.
For example, any two or more of the first electronic device 11, the second electronic device 12, and the third electronic device 13 may be the same electronic device; alternatively, the first electronic device 11, the second electronic device 12, and the third electronic device 13 may all be independent electronic devices.
Storage device 14 may be, for example, a hard disk, a database, or a server. Illustratively, the storage device 14 may be a CDN (Content Delivery Network) server.
For example, the storage device 14 may be integrated with the first electronic device 11 or independent of the first electronic device 11.
It should be noted that three storage devices 14 are shown in fig. 1, and the number of storage devices 14 may be based on actual conditions, and the number of storage devices 14 is not limited by the embodiments of the present disclosure. In fig. 1, a storage device is taken as an example of a server, a database, and a hard disk.
The storage device 14 is illustratively for storing a plurality of videos.
For example, different storage devices 14 may correspond to different products, e.g., the same enterprise has multiple products, e.g., application APP1, application APP2, application APP3, and the storage devices to which different applications correspond may be different, e.g., application APP1 corresponds to storage device 1, application APP2 corresponds to storage device 2, application APP3 corresponds to storage device 3.
For example, videos stored in different storage devices may be uploaded by users of their respective clients.
The data formats of the videos stored by the different storage devices 14 may vary; exemplary data formats of the video include, but are not limited to: file path, file object, URL (Uniform Resource Locator), JAR (Java ARchive) package, Kafka, and gRPC (Google Remote Procedure Call).
For example, the geographic locations where different storage devices 14 are located may be different.
In an alternative implementation, the first electronic device 11 may obtain the video from one or more storage devices 14 separately, i.e. the first electronic device 11 may read the video from the storage devices 14 storing the video in different data formats separately.
For example, the first electronic device 11 may store the data reading manners corresponding to different data formats. As shown in fig. 1, the first electronic device 11 can read video from storage devices 14 that store video in different data formats. The data formats stored by the three storage devices 14 shown in fig. 1 are data format 1, data format 2, and data format 3. The first electronic device then reads video from the storage device 14 storing video of data format 1 in the manner corresponding to data format 1, reads video from the storage device 14 storing video of data format 2 in the manner corresponding to data format 2, and reads video from the storage device 14 storing video of data format 3 in the manner corresponding to data format 3.
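The dispatch by data format could look like the hypothetical registry below; the reader functions and format keys are placeholders, since the actual reading manners depend on the storage devices.

```python
def read_from_file_path(source):
    return f"frames decoded from local file {source}"


def read_from_url(source):
    return f"frames downloaded and decoded from {source}"


def read_from_kafka(source):
    return f"frames consumed from Kafka source {source}"


# One reading manner per data format, as stored by the first electronic device.
READERS = {
    "file_path": read_from_file_path,
    "url": read_from_url,
    "kafka": read_from_kafka,
}


def read_video(data_format, source):
    """Read a video from a storage device using the manner matching its data format."""
    if data_format not in READERS:
        raise ValueError(f"no reading manner registered for data format {data_format!r}")
    return READERS[data_format](source)


print(read_video("url", "https://example.com/sample.mp4"))
```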
In an alternative implementation, the first electronic device 11 may read video from different storage devices 14 (video stored with the same data format by different storage devices 14).
The first electronic device 11 may extract, for each video, video images at intervals from a plurality of frames of video images contained in the video, to obtain video images.
The video obtained by the first electronic device 11 may be a sample video or a video to be tested.
The sample video is used for training a machine learning model, and the trained machine learning model can be utilized to output an analysis result of the video to be tested.
The embodiment of the disclosure refers to a video image extracted from a sample video as a first video image, and refers to a video image extracted from a video to be detected as a third video image.
Illustratively, the first electronic device 11 further includes at least one of a cache message queue 111 and a data loader 112.
Illustratively, the first electronic device 11 stores a plurality of image sets corresponding to the plurality of videos, respectively, to the cache message queue 111. The image set corresponding to the video comprises video images extracted from the video. One video corresponds to one image set.
It will be appreciated that over time, the number of videos processed by the first electronic device 11 increases, and the collection of images stored by the cache message queue 111 increases.
The data loader 112 is for reading image sets corresponding to a plurality of videos from the memory of the first electronic device 11.
Illustratively, after the data loader 112 sends the image set corresponding to a video to the second electronic device 12 or the third electronic device 13, the first electronic device 11 does not store the image set corresponding to that video.
In the embodiment of the disclosure, the multiple image sets corresponding to the multiple videos respectively may be applied to two application scenarios, where the first application scenario is an application scenario in which a machine learning model is trained to obtain a video analysis model; the second application scene is an application scene for online reasoning of the video based on the trained video analysis model.
The following description is made in connection with different application scenarios.
Application scenario one: the machine learning model is trained to obtain a video analytics model.
In the first application scenario, the video obtained by the first electronic device is referred to as a sample video.
The first electronic device may obtain manual labeling classification labels corresponding to the plurality of sample videos, and store the manual labeling classification labels in the buffer message queue.
The second electronic device may obtain, from the cache message queue of the first electronic device, manual labeling classification labels corresponding to the plurality of sample videos, respectively.
The second electronic device may obtain, through the data loader of the first electronic device, a plurality of sample videos respectively corresponding to the manually labeled classification tags.
The second electronic device 12 may obtain image sets corresponding to the plurality of sample videos from the first electronic device 11, respectively, use the image sets corresponding to the plurality of sample videos as input of the machine learning model 121, use the manually labeled classification labels corresponding to the plurality of sample videos as training targets, and train to obtain the video analysis model.
For example, the first electronic device 11 stores the image sets corresponding to the plurality of sample videos in the buffer message queue 111, and the second electronic device 12 may obtain the image sets corresponding to the plurality of sample videos from the buffer message queue 111.
It can be appreciated that, because the buffer message queue 111 already stores the image sets corresponding to the plurality of sample videos, the second electronic device 12 can directly obtain these image sets from the buffer message queue 111 while training the machine learning model. In other words, in the embodiment of the disclosure the process of extracting sample data (the process of obtaining the image sets corresponding to the plurality of sample videos) is separated from the training process of the machine learning model, so that during training there is no need to wait for sample data to be extracted. This improves the reading speed of the image sets corresponding to the plurality of sample videos and thereby improves the speed of training the machine learning model.
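This decoupling of sample-data extraction from training can be pictured with an ordinary producer/consumer queue, as in the hypothetical sketch below; the payloads are dummies and the real buffer message queue 111 need not be an in-process queue.

```python
import queue
import threading
import time

cache_message_queue = queue.Queue(maxsize=100)


def extract_samples():
    """Producer: decodes sample videos and buffers (image set, mask) pairs."""
    for video_id in range(5):
        time.sleep(0.01)                      # pretend decoding takes a while
        image_set, mask = [f"frame{video_id}_{i}" for i in range(4)], [1, 1, 1, 1]
        cache_message_queue.put((video_id, image_set, mask))


def training_loop():
    """Consumer: reads buffered samples without waiting for extraction to finish."""
    for _ in range(5):
        video_id, image_set, mask = cache_message_queue.get()
        print(f"training step with sample video {video_id}")


producer = threading.Thread(target=extract_samples)
consumer = threading.Thread(target=training_loop)
producer.start()
consumer.start()
producer.join()
consumer.join()
```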
The second electronic device 12 may obtain the image set corresponding to the same sample video from the buffer message queue 111 multiple times. The first electronic device 11 does not need to process the same sample video multiple times to obtain its image set; it only needs to process the sample video once and store the obtained image set in the buffer message queue 111.
For example, the first electronic device 11 sends the image sets corresponding to the plurality of sample videos to the second electronic device 12 in real time through the data loader 112, and since the first electronic device 11 does not store the image sets corresponding to the plurality of sample videos, if the second electronic device 12 needs the image sets corresponding to the same sample video, the first electronic device 11 needs to process the sample video again.
The buffered message queue 111 and the data loader 112 are shown in fig. 1, but the first electronic device is not limited to comprising both the buffered message queue 111 and the data loader 112. Illustratively, the first electronic device includes at least one of a cache message queue 111 and a data loader 112.
It will be appreciated that the second electronic device 12 may train the machine learning model using a GPU (Graphics Processing Unit). Because the memory of the GPU is limited, the amount of data that can be processed at one time is limited, and therefore the amount of data simultaneously input to the machine learning model is limited; at the same time, the larger the number of image sets input to the machine learning model, the more accurate the trained video analysis model.
In the embodiment of the disclosure, the image set corresponding to the sample video is used as the input of the machine learning model. Compared with using the sample video itself as input, the number of images contained in the image set is smaller than, or even far smaller than, the number of all the video images contained in the sample video, so more image sets of sample videos can be input into the machine learning model at the same time, and the video analysis model obtained through training is therefore more accurate.
The effect of the number of image sets simultaneously input to the machine learning model on the training process is described below.
It can be understood that the manually labeled classification labels of some sample videos may be wrong; this can happen in the following situations.
First case: when the sample video is labeled manually, the wrong classification label is applied.
For example, the actual classification label for video a is a cartoon classification label, but is artificially labeled as a non-cartoon classification label.
Second case: one sample video may correspond to a plurality of actual classification labels, and the sample video is assumed to correspond to a classification label A and a classification label B, and the classification label A corresponding to the sample video is needed in the process of training a machine learning model, but the sample video is manually marked as the classification label B during marking.
For example, the actual classification label corresponding to one sample video is a cartoon classification label or a pass-through classification label, and it is assumed that the pass-through classification label of the sample video is required when training a machine learning model, but is manually marked as the cartoon classification label.
If the number of the image sets input to the machine learning model is small, if the manual labeling classification labels of the sample videos corresponding to one or more image sets in the image sets are wrong, the influence of the sample video with the wrong manual labeling classification labels on the machine learning model is large, so that the updated parameters of the machine learning model are more inaccurate than before training, and the convergence rate of the machine learning model is slower.
For example, the number of image sets simultaneously input to the machine learning model is 3, and if the manually labeled classification labels of the sample videos corresponding to two image sets in the 3 image sets are wrong, parameters after the machine learning model is updated based on the 3 image sets are more inaccurate than parameters before training.
It can be understood that, among a large number of sample videos, the number with wrong manually labeled classification labels is small. If the number of image sets simultaneously input to the machine learning model is large, for example hundreds, thousands, or tens of thousands of image sets, then wrong labels on a few sample videos have little or even no effect on the parameters of the machine learning model, and the parameters of the machine learning model are more robust.
In summary, the more image sets are simultaneously input to the machine learning model, the more accurate the updated parameters of the machine learning model, the faster the machine learning model converges, and the more accurate the analysis results output by the trained video analysis model.
Application scenario two: online reasoning is performed on the video based on the trained video analysis model.
In the second application scenario, the video obtained by the first electronic device 11 is referred to as the video to be tested.
The second electronic device 12 may send the trained video analytics model to the third electronic device 13.
For example, the third electronic device 13 may have a TorchServe service platform installed. The video analysis model may be loaded onto the TorchServe service platform.
The first electronic device 11 may obtain the video to be tested from the storage device 14 and obtain an image set corresponding to the video to be tested.
For example, the first electronic device 11 may store the image set corresponding to the one or more videos to be tested in the cache message queue 111, so that the third electronic device 13 obtains the image set of the video to be tested from the cache message queue 111.
For example, the first electronic device 11 may send the image set of the video to be tested to the third electronic device 13 in real time through the data loader 112.
After receiving the image set corresponding to the video to be detected, the third electronic device 13 inputs the image set corresponding to the video to be detected into the video analysis model 131, so as to obtain an analysis result output by the video analysis model.
The analysis result of the video to be tested may be feature information of a video image extracted from the video to be tested, or a classification label of the video to be tested.
Illustratively, the classification labels of the videos to be tested are obtained based on the feature information of the video images extracted from the videos to be tested.
In the embodiment of the disclosure, the image set corresponding to the video is used as the input of the video analysis model. Compared with using the video to be tested itself as input, the image set corresponding to the video has a smaller data amount, so the processing speed of the video analysis model is faster.
The embodiment of the disclosure can separate the process of acquiring the image set of the video to be tested from the online reasoning process (that is, the process in which the video analysis model obtains the analysis result of the video to be tested). The third video image corresponding to the video to be tested is obtained before the online reasoning process is performed, and during online reasoning the image set corresponding to the video to be tested is obtained directly from the first electronic device, without waiting for the image set to be produced; this improves the reading speed of the image set corresponding to the video to be tested and thereby improves the online reasoning speed.
In an alternative implementation, the third electronic device 13 sends the analysis result of the video to be tested to the corresponding device.
For example, the corresponding device may calculate the similarity (such as cosine similarity, Euclidean distance, Pearson correlation coefficient, or Tanimoto coefficient) between a plurality of videos based on their analysis results, so that the client can recommend videos of interest to the user.
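As a small sketch of one such measure, cosine similarity between the feature vectors of two videos can be computed as follows; the feature values are made-up examples.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two video feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


video_a = [0.2, 0.9, 0.1]                    # analysis result (feature vector) of video A
video_b = [0.1, 0.8, 0.3]                    # analysis result (feature vector) of video B
print(cosine_similarity(video_a, video_b))   # close to 1.0 -> similar videos
```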
If the analysis result is the feature information of the video image extracted from the video to be tested, the corresponding device may obtain the classification label of the video to be tested based on the analysis result of the video to be tested, so as to achieve the purpose of displaying the classification label of the video to be tested for the user.
In an alternative implementation, if the analysis result is the feature information of the video image extracted from the video to be tested, the third electronic device 13 obtains the classification tag of the video to be tested based on the analysis result of the video to be tested. The third electronic device 13 sends the classification tag of the video to be tested to the corresponding device.
The corresponding device can achieve the purpose that the client displays the classification label of the video to be tested for the user based on the classification label of the video to be tested.
Those skilled in the art will appreciate that the above-described electronic devices and storage devices are by way of example only, and that other existing or future-occurring electronic devices or storage devices, as applicable to the present disclosure, are also encompassed within the scope of the present disclosure and are hereby incorporated by reference.
The video analysis model training method provided by the embodiment of the present disclosure is described below with reference to the above network environment and hardware.
Fig. 2 is a flowchart illustrating a video analysis model training method according to an exemplary embodiment. The method may be used in the first electronic device 11 and the second electronic device 12 and, as shown in fig. 2, includes the following steps S21 to S24.
In step S21, video images are extracted from a plurality of frames of video images included in the sample video at intervals to extract a first video image.
In step S22, an image set corresponding to the sample video is determined.
The image set comprises the first video image and a first number of second video images, the second video images are preset images, the sum of the number of the first video images and the first number is a preset number, and the number of images contained in the image set corresponding to each sample video in the sample video set is the preset number.
Illustratively, the first number may be any integer of 0, 1, 2, 3, ….
In step S23, based on the image set, mask parameters are determined, where the mask parameters are used to record a position of an effective image and a position of an ineffective image in the image set, the first video image is the effective image, and the second video image is the ineffective image.
In step S24, the image set and the mask parameters are used as inputs of a machine learning model, and the manually labeled classification labels corresponding to the sample video are used as training targets, so as to obtain the video analysis model through training.
It will be appreciated that, because video is played fast enough to satisfy the temporal sensitivity of human vision, the content of successive video images is very similar, and similar video images express substantially the same information. In the embodiment of the present disclosure, the video is regarded as containing multiple groups of video images, where one group contains multiple consecutive frames whose pictures are substantially similar; any one frame in such a group is referred to as a non-redundant video image, and the other frames in the group are referred to as redundant video images.
It will be appreciated that the information expressed by the plurality of frames of non-redundant video images in the video may represent the information expressed by the video, and the first video image is, for example, a non-redundant video image in the video.
The number of first video images may be one or more frames, for example.
For the first application scenario, that is, the application scenario in which the machine learning model is trained to obtain the video analysis model, the first electronic device 11 may execute steps S21 to S23 for each sample video in the sample video set, so as to obtain the image set and the mask parameters respectively corresponding to the plurality of sample videos. Step S24 may be performed by the second electronic device 12 to train the resulting video analytics model.
For example, the first electronic device 11 and the second electronic device 12 may be separate electronic devices, or the first electronic device 11 and the second electronic device 12 may be the same electronic device.
For example, after obtaining the mask parameters corresponding to the sample video, the first electronic device 11 may store the mask parameters corresponding to the sample video in the buffer message queue.
For example, the second electronic device may obtain mask parameters corresponding to the plurality of sample videos from the buffer message queue.
For example, the data loader may obtain mask parameters corresponding to the sample video from the memory of the first electronic device 11, and send the mask parameters to the second electronic device.
And for the application scene II, namely, the application scene which carries out online reasoning on the video to be tested based on the trained video analysis model, the first electronic equipment can execute the steps S21 to S22 to obtain an image set corresponding to the video to be tested. An online reasoning process is performed by the third electronic device.
In the embodiment of the present disclosure, the process of obtaining the image set of a sample video is the same as the process of obtaining the image set of a video to be tested; the following description therefore takes the image set of a sample video as an example, and the process for the video to be tested is not described again.
Illustratively, the sample video set includes a plurality of sample videos, the plurality of sample videos contained in the sample video set being obtained from one or more storage devices 14.
It can be appreciated that in the process of training the machine learning model, multiple image sets respectively corresponding to the sample videos need to be input into the machine learning model at the same time. Since the image sets respectively corresponding to the plurality of sample videos need to be input at the same time, the image sets respectively corresponding to the plurality of sample videos need to contain the same number of images.
The number of images included in the image sets corresponding to different sample videos in the sample video set is a preset number.
The preset number may be a set value, for example.
Alternatively, the preset number may be the maximum, over the sample videos in the sample video set, of the number of first video images extracted from each sample video. Assuming that the sample video set includes sample video 1, sample video 2, and sample video 3, then preset number = max{number of first video images extracted from sample video 1, number of first video images extracted from sample video 2, number of first video images extracted from sample video 3}.
It can be understood that, if the number of first video images extracted from the sample video is equal to the preset number, the image set corresponding to the sample video does not include any second video images; if the number of first video images extracted from the sample video is smaller than the preset number, the image set corresponding to the sample video additionally contains a first number of second video images.
Here, first number = preset number - number of first video images extracted from the sample video.
The second video image is described below.
The second video image may be, for example, a preset image.
For example, the second video images included in the image sets corresponding to different sample videos may be the same; for example, the second video image included in the image set corresponding to the different sample videos may be different. The process of obtaining the second video image is illustrated below.
In an alternative implementation, if the total number of the first video images extracted from the sample video is smaller than the preset number, the second video image may be obtained in the following two ways.
The first way is: and copying the extracted first video image in the sample video to obtain a second video image.
Assuming that the preset number is 20 and the number of first video images extracted from the sample video is 4 frames, an operation of copying the 4 frames of first video images may be performed 4 times to obtain 16 frames of second video images.
The second way: a second video image filled with a preset value is obtained.
Illustratively, the pixel values in the second video image are all preset values, such as 0, or 1.
Assuming that the preset number is 20 and the number of first video images extracted from the sample video is 4 frames, 16 frames of second video images with pixel values being preset values can be obtained.
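Both filling strategies can be sketched as follows; the helper name and the choice of 0 as the preset pixel value are assumptions made for the example.

```python
import itertools
import numpy as np


def pad_image_set(first_video_images, preset_number, mode="copy"):
    """Pad the extracted first video images up to `preset_number` images.

    mode="copy": repeat the extracted first video images (first way).
    mode="fill": append second video images whose pixel values are a
    preset value, here 0 (second way).
    """
    first_number = preset_number - len(first_video_images)
    if first_number <= 0:
        return list(first_video_images[:preset_number])
    if mode == "copy":
        padding = list(itertools.islice(itertools.cycle(first_video_images), first_number))
    else:
        padding = [np.zeros_like(first_video_images[0]) for _ in range(first_number)]
    return list(first_video_images) + padding


frames = [np.full((32, 32, 3), i) for i in range(4)]   # 4 extracted first video images
image_set = pad_image_set(frames, 20, mode="fill")
print(len(image_set))                                  # 20
```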
In an alternative implementation manner, the video images included in the image sets corresponding to different sample videos have the same size; if their sizes differ, the video images need to be resized to the same size.
It can be appreciated that, since the second video image may be included in the image set corresponding to the sample video, in order for the machine learning model to obtain the analysis result based on the first video image, and not based on the second video image, it is also necessary to obtain the mask parameter corresponding to the sample video.
Illustratively, the mask parameters may take a variety of forms, and embodiments of the present disclosure provide, but are not limited to, the following: any of vector, function, linked list.
Because the mask parameters corresponding to the sample video record the positions of the effective images and the positions of the ineffective images in the image set corresponding to the sample video, the machine learning model can obtain the characteristic information of the first video image based on the mask parameters, so that an analysis result is output based on the characteristic information of the first video image, and the video analysis model obtained through training is more accurate.
For example, the classification labels respectively corresponding to the plurality of sample videos output by the machine learning model may be compared with the manually labeled classification labels respectively corresponding to the plurality of sample videos to obtain the loss function. The machine learning model is trained by the loss function.
Illustratively, the loss function may be at least one of a cross-entropy loss function, a multi-label loss function, a triplet margin loss, and a metric function (such as precision, recall, or F1).
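For concreteness, a hedged PyTorch-style sketch of comparing the model output with the manually labeled classification labels might look like the following (the function name and the "multi_label" switch are assumptions for illustration):

```python
import torch.nn.functional as F

def classification_loss(logits, labels, multi_label=False):
    # logits: [batch, num_classes] scores output for a batch of sample videos.
    # labels: class indices [batch] for single-label data, or a 0/1 matrix
    #         [batch, num_classes] of manually labeled tags for multi-label data.
    if multi_label:
        # multi-label loss: one binary cross-entropy term per classification label
        return F.binary_cross_entropy_with_logits(logits, labels.float())
    # single-label case: cross-entropy loss
    return F.cross_entropy(logits, labels)
```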
It can be appreciated that, among a large number of sample videos, the number of sample videos with incorrect manually labeled classification labels is small. If the number of image sets input to the machine learning model at the same time is large, for example hundreds, thousands, or tens of thousands of samples, then incorrect manually labeled classification labels on a few sample videos have little or even no influence on the parameters of the machine learning model.
Under the condition that the amount of data the machine learning model can process at one time is fixed, the number of images contained in the image set of a sample video is smaller than the total number of images contained in the sample video, that is, the data volume of the image set is smaller than the data volume of the sample video. Therefore, more image sets can be input into the machine learning model, the throughput of the machine learning model is improved, and the trained machine learning model is more accurate.
In the video analysis model training method provided by the embodiment of the disclosure, video images are extracted at intervals from the multi-frame video images contained in a sample video to obtain first video images; an image set corresponding to the sample video is determined, where the image set includes the first video images and a first number of second video images, the second video images are set images, the sum of the number of first video images and the first number is a preset number, and the image sets respectively corresponding to different sample videos in the sample video set all contain the preset number of images; mask parameters are determined based on the image set, where the mask parameters record the positions of effective images and the positions of ineffective images in the image set, the first video images being the effective images and the second video images being the ineffective images; and the image set and the mask parameters are used as input of a machine learning model, with the manually labeled classification label corresponding to the sample video as the training target, to train and obtain a video analysis model. Since the number of images contained in the image set is smaller than the number of images contained in the sample video, training the machine learning model with the image sets of the sample videos is faster. In the process of training the machine learning model, the image sets corresponding to a plurality of sample videos need to be input to the machine learning model at the same time, so these image sets must contain the same number of images, for example the preset number; therefore, in the embodiment of the disclosure, if the number of first video images extracted from a sample video is smaller than the preset number, second video images are used for filling. Since the second video images may be independent of the sample video, the mask parameters also need to be input into the machine learning model so that the machine learning model derives the analysis result of the sample video based on the first video images. The trained machine learning model is thus more accurate.
There are various implementations of step S21 in the embodiments of the present disclosure, and the embodiments of the present disclosure provide, but are not limited to, the following three.
The first implementation of step S21 includes steps a11 to a14.
In step a11, a sample video is received from a storage device, where the sample video is a video compression-encoded by the storage device.
By way of example, the sample video may be obtained by the first electronic device 11 from the storage device 14, the sample video having already been compression encoded when the storage device 14 sent it to the first electronic device 11.
In step a12, a sampling interval is acquired.
For example, the sampling interval may be the duration of the interval between two adjacent first video images, or the number of video images spaced between two adjacent first video images.
In step a13, a key frame is extracted from a plurality of key frames included in the sample video at intervals of the sampling interval to obtain at least one target key frame.
Illustratively, each frame contained in the sample video is a key frame. If the compression coding mode of the sample video is that each video image is independently coded, that is, frames contained in the sample video are key frames, at least one target key frame can be extracted from the sample video by adopting the mode of the step A13.
Illustratively, the sample video contains key frames and non-key frames, and step a13 extracts a target key frame from a plurality of key frames contained in the sample video.
Illustratively, the position of the first extracted target key frame, i.e., the frame start position of the extraction, may be random. Alternatively, the frame start position may be preset; for example, the frame start position is the first, second, or third key frame of the sample video.
In step a14, the at least one target key frame is decoded to obtain a first video image.
In the embodiment of the disclosure, when the sample video is decoded, instead of integrally decoding the sample video, a key frame is extracted from the sample video at intervals of sampling, and the extracted target key frame is decoded to obtain a first video image. The time to decode the sample video is saved.
In an alternative implementation, in an embodiment of the present disclosure, the sample video may be decoded first, and then steps a12 to a13 are performed on the decoded sample video.
In order to better understand the method for extracting the target key frame from the sample video provided by the embodiments of the present disclosure, the following examples are described.
FIG. 3a is a schematic diagram illustrating one implementation of extracting target key frames from a sample video, according to an example embodiment.
In fig. 3a, the sampling interval is illustrated as the number of video images spaced between two adjacent first video images. Assume that the sampling interval is 5, the sample video includes 20 key frames, and the frame start position is the first key frame of the video. Then, as shown in fig. 3a, the key frames indicated by the black arrows (represented by the filled meshed quadrangles) are the extracted target key frames. The number of key frames spaced between two adjacent extracted target key frames is 5.
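A minimal sketch of this first implementation, assuming the PyAV library is used for demuxing and selective decoding (the function name, parameters, and use of PyAV are illustrative assumptions; here the sampling interval is treated as the stride between selected key frames):

```python
import av  # PyAV

def extract_keyframes_at_interval(path, sampling_interval, start_offset=0):
    # Decode only every `sampling_interval`-th key frame; other packets are
    # skipped without decoding, which is what saves decoding time.
    first_video_images = []
    container = av.open(path)
    stream = container.streams.video[0]
    keyframe_index = 0
    for packet in container.demux(stream):
        if not packet.is_keyframe:
            continue
        offset = keyframe_index - start_offset
        if offset >= 0 and offset % sampling_interval == 0:
            for frame in packet.decode():  # decode only the selected target key frame
                first_video_images.append(frame.to_ndarray(format="rgb24"))
        keyframe_index += 1
    container.close()
    return first_video_images
```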
The second implementation of step S21 includes steps a21 to a23.
In step a21, a sample video is received from a storage device, where the sample video is a video compression-encoded by the storage device.
In step a22, at least one target key frame is randomly extracted from a plurality of key frames contained in the sample video.
Since the extraction is random, the number of key frames spaced between different pairs of adjacent target key frames may be the same or may be different.
Illustratively, each frame contained in the sample video is a key frame. If the compression coding mode of the sample video is that each video image is independently coded, that is, frames contained in the sample video are key frames, at least one target key frame can be randomly extracted from the sample video by adopting the mode of the step A22.
Illustratively, the sample video contains key frames and non-key frames, and step a22 randomly extracts a target key frame from a plurality of key frames contained in the sample video.
In step a23, the at least one target key frame is decoded to obtain a first video image.
In the embodiment of the disclosure, when the sample video is decoded, the sample video is not decoded as a whole, but the target key frame randomly extracted from the sample video is decoded to obtain the first video image. The time to decode the sample video is saved.
In an alternative implementation, in an embodiment of the present disclosure, the sample video may be decoded first, and then step a22 may be performed on the decoded sample video.
For a better understanding of the method for extracting the target key frame from the sample video according to the embodiments of the present disclosure, the following examples are described.
FIG. 3b is a schematic diagram illustrating yet another implementation of extracting target key frames from a sample video, according to an example embodiment.
Assuming that the video includes 20 key frames, the frame start position is the first key frame of the sample video. Then, as shown in fig. 3b, the key frame indicated by the black arrow (represented by filled-in meshed quadrangle) is the target key frame.
As shown in fig. 3b, the target key frames extracted from the sample video are in turn: target key frame 31, target key frame 32, target key frame 33, target key frame 34.
1 key frame is spaced between the target key frame 31 and the target key frame 32, 4 key frames are spaced between the target key frame 32 and the target key frame 33, and 8 key frames are spaced between the target key frame 33 and the target key frame 34.
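A small sketch of the random selection in step A22, under the assumption that the key-frame indices of the sample video are already known (all names are illustrative):

```python
import random

def choose_random_keyframe_indices(num_keyframes, num_targets, seed=None):
    # Randomly decide which key frames of the sample video become target key frames;
    # the spacing between adjacent chosen indices may therefore vary, as in fig. 3b.
    rng = random.Random(seed)
    picked = rng.sample(range(num_keyframes), min(num_targets, num_keyframes))
    return sorted(picked)

# choose_random_keyframe_indices(20, 4) might return e.g. [0, 2, 7, 16]
```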
The third implementation of step S21 includes steps a31 to a34.
In step a31, a sample video is received from a storage device, where the sample video is a video compression-encoded by the storage device.
In step a32, a target key frame is obtained from a plurality of key frames contained in the sample video.
The number of target key frames may be one or more.
Illustratively, a key frame is extracted from the plurality of key frames contained in the sample video at a sampling interval to obtain a target key frame. This sampling interval may be the same as or different from the sampling interval acquired below for extracting the non-key frames.
Illustratively, the target key frame is randomly extracted from a plurality of key frames contained in the sample video.
Illustratively, each keyframe contained in the sample video is determined to be a target keyframe.
Illustratively, the starting position of extracting the target key frame from the plurality of key frames contained in the sample video may be random or preset.
In step a32, a sampling interval is acquired.
The sampling interval may be, for example, the duration of the interval between two adjacent non-key frames or the total number of key frames and non-key frames that are spaced between two adjacent non-key frames.
In step a33, a non-key frame is extracted from the non-key frames corresponding to the target key frame included in the sample video at intervals of the sampling interval, so as to obtain at least one non-key frame.
The non-key frames are compressed based on the target key frames.
Non-key frames corresponding to the target key frames are described below by way of example.
Assuming that a video includes video image 1, video image 2, and video image 3, it is assumed that when the video is compressed, a key frame corresponding to video image 1, a non-key frame corresponding to video image 2, and a non-key frame corresponding to video image 3 are determined.
Then, the key frame 1 corresponding to the video image 1 is obtained by performing intra-frame compression coding on the video image 1; the non-key frame 1 corresponding to the video image 2 is obtained based on the inter-frame compression coding of the video image 2 and the video image 1; the non-key frames 2 corresponding to the video image 3 are obtained based on the inter-frame compression encoding of the video image 3 and the video image 1.
The above-mentioned non-key frame 1 and non-key frame 2 correspond to the key frame 1. I.e., non-key frame 1 and non-key frame 2 are compressed based on key frame 1.
In step a34, the target key frame and the at least one non-key frame are decoded to obtain the first video image.
For example, a key frame in the sample video and a non-key frame corresponding to the key frame may be referred to as a group of frame sets, and then the sample video may be divided into multiple groups of frame sets.
For example, for each frame set to which each target key frame obtained in step a32 belongs, a non-key frame is extracted at intervals of sampling by using the target key frame as a starting position, so as to obtain at least one non-key frame in step a33.
In the embodiment of the disclosure, when the sample video is decoded, the sample video is not decoded as a whole, but the target key frame extracted from the sample video and the at least one non-key frame are decoded to obtain the first video image. The time to decode the sample video is saved.
In an alternative implementation, in an embodiment of the present disclosure, the sample video may be decoded first, and then steps a32 to a33 are performed on the decoded sample video.
In an alternative implementation, the first electronic device may obtain the video in different formats or the video in the same format from different storage devices, where the video may be a sample video or a video to be tested. The process by which the first electronic device obtains video from the storage device is described below. I.e. the method of retrieving video from a storage device may comprise the steps of:
step one, determining a data format of video stored by a storage device.
And step two, generating a video acquisition request based on the data format.
Illustratively, the video acquisition requests corresponding to different data formats are different. For example, if the storage device is an HTTP (Hypertext Transfer Protocol) server, the data format of the video stored by the HTTP server may be a blob key or a URL (Uniform Resource Locator), and the format of the video acquisition request is that of an HTTP request; if the storage device is a hard disk, the data format of the video stored in the hard disk may be a file path, and the format of the video acquisition request is that of a CPU (Central Processing Unit) read instruction.
And step three, sending the video acquisition request to the storage equipment.
And step four, receiving the video fed back by the storage device.
The video can be a sample video or a video to be detected.
The above-described implementations support obtaining video from a variety of heterogeneous data sources, which refer to storage devices that store different data formats.
In the embodiment of the disclosure, a specified file reading scheme is set for each data format, that is, for each data format, a video acquisition request of a corresponding format is generated, that is, video is read from the storage device in a corresponding manner.
In the embodiment of the disclosure, videos can be obtained from various heterogeneous data sources, so that a large number of sample videos can be obtained, and a video analysis model obtained based on training of the large number of sample videos is more accurate.
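A hedged sketch of steps one to four, assuming only two kinds of heterogeneous sources (an HTTP server addressed by URL and a hard disk addressed by file path); real deployments would add further formats such as blob keys:

```python
from urllib.request import urlopen

def fetch_video_bytes(source):
    # Steps one and two: decide the request format from how the source is addressed.
    if source.startswith(("http://", "https://")):
        # HTTP server: the video acquisition request is an HTTP request.
        with urlopen(source) as response:      # steps three and four
            return response.read()
    # Hard disk: the video acquisition request is an ordinary file read.
    with open(source, "rb") as f:
        return f.read()
```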
In an alternative implementation, in either the first implementation or the third implementation of step S21, there are a plurality of ways of obtaining a sampling interval for each sample video, and embodiments of the present disclosure provide, but are not limited to, the following three.
The first implementation way to obtain the sampling interval is: the sampling interval is calculated based on the frame rate of the sample video and the number of processed images per second.
The frame rate of a sample video refers to the frequency at which the sample video continuously appears on the display in units of video images, i.e., the number of video images contained per second. The number of processed images per second refers to the number of video images that can be processed per second.
It can be understood that the content of multiple consecutive video images in a video is very similar; to match the temporal sensitivity of human vision, the number of processed images per second needs to reach a certain value so that the video appears smooth when the user watches it. In the embodiment of the disclosure, the video is regarded as comprising a plurality of groups of video images, where one group of video images includes multiple frames that are consecutive in the video and substantially similar in content; any one frame in such a group is called a non-redundant video image, and the other video images are called redundant video images. In general, the number of video images included in the above-mentioned "group of video images" is the frame rate of the video divided by the number of processed images per second.
Illustratively, sampling interval = frame rate of video/number of processed images per second. At this time, one video image can be extracted from each group of video images contained in the video, the video images extracted from the video are not redundant with each other, and the extracted video images can represent the content of the whole video.
The machine learning model is trained based on the image set of the sample video obtained in the first implementation mode, and the obtained machine learning model is more accurate. The analysis result obtained by the video analysis model based on the image set of the video to be detected obtained by the first implementation mode is more accurate.
Illustratively, the sampling intervals for videos with different frame rates are different.
The second implementation way to obtain the sampling interval is: and calculating the sampling interval based on the duration of the sample video, the frame rate of the sample video and the preset number.
The image sets respectively corresponding to the different sample videos contain the same number of images and are the preset number.
By way of example, the sampling interval may be obtained based on the formula: sampling interval = duration of the video × frame rate of the video / preset number. The duration of the video multiplied by the frame rate of the video is the total number of video images contained in the video.
If the number of images included in the image sets respectively corresponding to the different sample videos is the same, in order to obtain the "full view" of the information expressed by the sample videos, the images can be uniformly extracted from the sample videos based on the above formula.
Illustratively, the sampling interval of the sample video varies with the total number of video images contained. For sample video containing a smaller total number of video images, there may be redundant video images in the decimated multi-frame first video image.
The third implementation way of obtaining the sampling interval is: and acquiring the preset sampling interval.
Illustratively, the sampling intervals for different videos are the same.
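The three ways of obtaining the sampling interval can be summarized in one small helper (a sketch; all parameter names are illustrative assumptions):

```python
def get_sampling_interval(frame_rate, processed_per_second=None,
                          duration=None, preset_number=None, preset_interval=None):
    if processed_per_second:                          # first way: frame rate / processed images per second
        return max(1, round(frame_rate / processed_per_second))
    if duration and preset_number:                    # second way: duration x frame rate / preset number
        return max(1, round(duration * frame_rate / preset_number))
    return preset_interval                            # third way: preset sampling interval
```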
In an alternative implementation, there are various implementations of step S23, and the embodiments of the present disclosure provide, but are not limited to, the following ways.
The first implementation of step S23 includes the following steps B11 to B13.
In step B11, it is determined that the first video image is located at a first position in the set of images.
In step B12, it is determined that the second video image is located at a second position in the set of images.
In step B13, the mask parameters are determined based on the first position and the second position.
The element in the mask parameter at the position corresponding to the first position is a first character, and the element in the mask parameter at the position corresponding to the second position is a second character; the image of the first character representation located at the first position of the image set is a valid image, and the image of the second character representation located at the second position of the image set is an invalid image.
The first character or the second character may be any number or letter or special symbol, and the first character and the second character are different.
Because the position of the character (the first character or the second character) in the mask parameter represents the position of the image (the first video image or the second video image) indicated by the character in the image set, the corresponding relation between the character and the image does not need to be carried in the mask parameter, so that the data volume of the mask parameter is less, and the throughput of the machine learning model is improved.
Illustratively, the mask parameters may further include a correspondence between the characters and the images.
Illustratively, the image set may take a variety of forms, and embodiments of the present disclosure provide, but are not limited to: any of a matrix, a function, and a linked list. The mask parameters may likewise take a variety of forms, and embodiments of the present disclosure provide, but are not limited to: any of a matrix, a function, and a linked list.
The relationship between the image set and the mask parameters will be described below by taking the image set as a matrix and the mask parameters as mask vectors as examples.
Let the number of first video images contained in the image set corresponding to the sample video be M. The first number of second video images is located after the M first video images. M is a positive integer greater than or equal to 1.
Illustratively, the order of the M first video images in the image set is: the M first video images are ordered sequentially from early to late in terms of where they appear in the sample video.
For example, the order of the M first video images in the image set may be randomly ordered.
Let the preset number be K and the first number be N, with N+M=K. The matrix form of the image set H corresponding to the sample video is as follows.
The image set H corresponding to the sample video is a K×L matrix, where L is a positive integer greater than or equal to 1.
Assume that the M first video images in the image set of the sample video are, in sequence, A_1, A_2, A_3, …, A_M, and the N second video images in the image set of the sample video are, in sequence, A_(M+1), A_(M+2), …, A_K, where the row A_i = [a_i1 a_i2 … a_iL], i = 1, 2, …, K, and each a_ij (j = 1, 2, …, L) may be a specific number or a vector.
From the image set H of the sample video, it may be determined that the first video images are located at the first row, first column; the second row, first column; the third row, first column; …; the M-th row, first column of the image set H, and the second video images are located at the (M+1)-th row, first column, …, the K-th row, first column of the image set.
Based on which a mask vector Q can be determined. The ith row and the first column in the image set H of the sample video correspond to the first row and the ith column of the mask vector Q.
The mask vector will be described by taking the first character as 1 and the second character as 0 as an example.
Mask vector Q = [1 1 … 1_(1M) 0_(1(M+1)) … 0], where the subscript "1M" of the 1 indicates the first row, M-th column of the mask vector, and the subscript "1(M+1)" of the 0 indicates the first row, (M+1)-th column of the mask vector. That is, the elements from the first row, first column to the first row, M-th column of the mask vector are all 1, and the elements from the first row, (M+1)-th column to the first row, K-th column are all 0.
In an alternative implementation, N second video images in the image set of the sample video may precede M first video images.
In an alternative implementation, the N second video images and the M first video images in the image set of the sample video are arranged to intersect, for example, the image set of the sample video includes: the first video image, the second video image, the first video image, the second video image, ….
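Assuming the layout in which the M first video images come first and the padded second video images come last, the mask vector Q can be built as in the following sketch (names are illustrative); for the alternative layouts just described, the positions of the first character would simply follow the actual arrangement:

```python
import numpy as np

def build_mask_vector(num_first_images, preset_number, first_char=1, second_char=0):
    # Positions of effective images get the first character, positions of the
    # padded (invalid) second video images get the second character.
    q = np.full(preset_number, second_char, dtype=np.float32)
    q[:num_first_images] = first_char
    return q

# M = 4 effective images, preset number K = 6  ->  [1. 1. 1. 1. 0. 0.]
```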
The process of training the machine learning model to obtain the video analysis model in step S24 is explained below.
There are various ways to train a machine learning model to obtain a video analysis model, and embodiments of the present disclosure provide, but are not limited to, the following. When training a machine learning model, an image set and mask parameters respectively corresponding to a plurality of sample videos need to be input. The following steps, steps C11 to C15, are performed for each image set and mask parameter corresponding to the sample video.
In step C11, the set of images corresponding to the sample video and the mask parameters are input to the machine learning model.
The machine learning model performs the following steps C12 to C13, or steps C12 to C14.
In an alternative implementation, the types of video analytics models trained by embodiments of the present disclosure include, but are not limited to, the following two types.
First type: the video analysis model is used for outputting classification labels of the video.
Second type: the video analysis model is used for outputting characteristic information of the video.
If the first type of video analysis model is needed, the machine learning model needs to execute steps C12 to C14; if a second type of video analysis model is required, the machine learning model needs to perform steps C12 to C13.
The following describes steps C12 to C13, or steps C12 to C14, in conjunction with the structure of the machine learning model.
Illustratively, the structure of the machine learning model corresponding to the first type of video analysis model is shown in fig. 4, and includes: a feature extraction module 41, an effective feature extraction module 42, and a tag prediction module 43. The feature extraction module is used for executing step C12, the effective feature extraction module is used for executing step C13, and the tag prediction module is used for executing step C14. Step C15 is performed by the second electronic device.
Illustratively, the machine learning model corresponding to the second type of video analysis model includes: a feature extraction module and an effective feature extraction module. The feature extraction module is used for executing step C12, and the effective feature extraction module is used for executing step C13. Steps C14 and C15 are performed by the second electronic device.
The second electronic device has a function of obtaining a classification tag based on the feature information of the first video image, for example.
In step C12, a set of feature information corresponding to the image set is obtained, where the set of feature information includes feature information corresponding to the first video image and feature information corresponding to the second video image.
In step C13, feature information corresponding to the first video image is screened out from the feature information set based on the mask parameter.
In step C14, a classification label is determined based on the feature information corresponding to the first video image.
The machine learning model is used for outputting characteristic information or the classification labels corresponding to the first video image.
In step C15, training the machine learning model based on the comparison of the classification label and the manually labeled classification label to obtain the video analysis model.
Because the mask parameters can screen the characteristic information corresponding to the first video image from the characteristic information set, the influence caused by supplementing the second video image in the image set corresponding to the sample video can be removed. Because the classification labels are determined based on the feature information corresponding to the first video image (but not the feature information corresponding to the second video image), the obtained classification labels are more accurate, and the machine learning model obtained through training is more accurate.
Illustratively, the process of training the machine learning model involves at least one of artificial neural networks, belief networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like in machine learning.
By way of example, the machine learning model may be any one of a neural network model, a logistic regression model, a linear regression model, a support vector machine (SVM) model, an AdaBoost model, an XGBoost model, and a Transformer-Encoder model.
The neural network model may be any one of a recurrent neural network-based model, a convolutional neural network-based model, and a Transformer-encoder-based classification model, for example.
By way of example, the machine learning model may be a deep hybrid model of a recurrent neural network-based model, a convolutional neural network-based model, and a Transformer-encoder-based classification model.
By way of example, the machine learning model may be any of an attention-based depth model, a memory network-based depth model, and a short text classification model based on deep learning.
The short text classification model based on deep learning is a Recurrent Neural Network (RNN) or a Convolutional Neural Network (CNN) or a variant based on the recurrent neural network or the convolutional neural network.
Illustratively, some simple domain adaptations may be made on an already pre-trained model to arrive at a machine learning model. Exemplary, "simple domain adaptation" includes, but is not limited to, secondary pre-training with large-scale unsupervised domain corpus again on an already pre-trained model, and/or model compression of an already pre-trained model by way of model distillation.
In an alternative implementation, there are a variety of implementations of step C12, and embodiments of the present disclosure provide, but are not limited to, the following.
In step D11, intra-frame feature information corresponding to each image in the image set is obtained.
Illustratively, the intra-frame feature information corresponding to an image is derived based on the image, without considering the relationship between adjacent images.
Illustratively, intra-frame feature information of a frame of video image characterizes picture features contained in the frame of video image.
In step D12, the feature information set is obtained based on the intra-frame feature information corresponding to each image in the image set.
The feature information corresponding to the first video image comprises a plurality of inter-frame feature information among the first video images, the feature information corresponding to the second video image comprises a plurality of inter-frame feature information among the second video images, and/or the inter-frame feature information between the second video images and the first video images.
Illustratively, the inter-frame feature information of the plurality of video images characterizes a distinguishing feature between the plurality of video images.
In an alternative implementation manner, the feature extraction module may include: an intra-frame feature extraction module and an inter-frame feature extraction module. The intra-frame feature extraction module is used for executing the step D11, and the inter-frame feature extraction module is used for executing the step D12.
Illustratively, the intra-frame feature extraction module may be a torchvision pretrained model, such as ResNet or ResNeXt. Alternatively, the intra-frame feature extraction module may be a lightweight network such as GhostNet, to reduce the GPU (Graphics Processing Unit) memory occupied during training of the machine learning model.
Illustratively, the inter-frame feature extraction module may be at least one of a GRU (Gated Recurrent Unit) network, an LSTM (Long Short-Term Memory) network, and a self-attention Transformer network, each of which may optionally be bidirectional.
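A hedged PyTorch sketch of such a feature extraction module, assuming a torchvision ResNet-18 backbone for the intra-frame features and a GRU for the inter-frame features (the class name, hidden size, and the version-specific "weights" argument are assumptions):

```python
import torch.nn as nn
import torchvision.models as tvm

class FrameFeatureExtractor(nn.Module):
    # Intra-frame features from a CNN backbone, inter-frame features from a GRU.
    def __init__(self, hidden_dim=512):
        super().__init__()
        backbone = tvm.resnet18(weights=None)   # older torchvision: pretrained=False
        backbone.fc = nn.Identity()             # keep the 512-d pooled feature
        self.intra = backbone
        self.inter = nn.GRU(512, hidden_dim, batch_first=True)

    def forward(self, image_sets):
        # image_sets: [batch, K, 3, H, W] -- K images per image set
        b, k = image_sets.shape[:2]
        frames = image_sets.flatten(0, 1)                 # [batch*K, 3, H, W]
        intra_feats = self.intra(frames).view(b, k, -1)   # intra-frame feature information
        inter_feats, _ = self.inter(intra_feats)          # inter-frame feature information
        return inter_feats                                # feature information set
```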
In an alternative implementation, there are a variety of implementations of step D12, and embodiments of the present disclosure provide, but are not limited to, the following two.
The first implementation manner of step D12 includes: and obtaining inter-frame characteristic information based on the intra-frame characteristic information of each two adjacent video images in the image set to obtain a characteristic information set.
The feature information set is described below in connection with the image set H of the sample video.
Suppose the image set of the sample video is H. The intra-frame feature information corresponding to each image in the image set of the sample video is, in sequence: the intra-frame feature information of A_1, the intra-frame feature information of A_2, the intra-frame feature information of A_3, …, the intra-frame feature information of A_M, the intra-frame feature information of A_(M+1), the intra-frame feature information of A_(M+2), …, the intra-frame feature information of A_K.
Any two adjacent video images in the image set are A_p and A_(p+1), where p takes any value from 1 to K-1.
Illustratively, the feature information set includes: the inter-frame feature information obtained based on the intra-frame feature information of A_p and the intra-frame feature information of A_(p+1).
Illustratively, to preserve the "overview" of the sample video, the feature information set also includes the intra-frame feature information of a target first video image in the image set. The following description takes the first video image in the image set as the target first video image. The inter-frame feature information contained in the feature information set is then the inter-frame feature information between the other first video images in the image set and the target first video image.
Illustratively, the feature information set corresponding to the sample video obtained in this way is denoted S_1.
The second implementation manner of step D12 includes: and obtaining inter-frame characteristic information based on the intra-frame characteristic information of the first image in the image set to the intra-frame characteristic information of the W-th image in the image set, wherein W sequentially takes values from 2 to K.
In order to further understand the implementation of the second step D12 by a person skilled in the art, the implementation of the second step D12 will be described below taking an example in which the image set includes 5 images. The second implementation of step D12 includes the following steps E1 to E4.
In step E1, inter-frame feature information is obtained based on the intra-frame feature information of a first image in the image set and the intra-frame feature information of a second image in the image set.
In step E2, inter-frame feature information is obtained based on the intra-frame feature information of the first image, the intra-frame feature information of the second image, and the intra-frame feature information of the third image in the image set.
In step E3, inter-frame feature information is obtained based on the intra-frame feature information of the first image, the intra-frame feature information of the second image, the intra-frame feature information of the third image, and the intra-frame feature information of the fourth image in the image set.
In step E4, inter-frame feature information is obtained based on the intra-frame feature information of the first image, the intra-frame feature information of the second image, the intra-frame feature information of the third image, the intra-frame feature information of the fourth image, and the intra-frame feature information of the fifth image in the image set.
The feature information set is described below in connection with the image set H of the sample video.
Illustratively, to preserve the "overview" of the sample video, the feature information set also includes intra-frame feature information for a first video image in the image set.
Illustratively, the feature information set corresponding to the sample video obtained in this way is denoted S_2; it contains the intra-frame feature information of the first image and the inter-frame feature information obtained in steps E1 to E4.
By way of example, the feature information set of the sample video may include the intra-frame feature information of each first video image and the inter-frame feature information between at least two first video images. As another example, the feature information set of the sample video may include the intra-frame feature information of the target first video image in the image set and the inter-frame feature information between the other first video images and at least the target first video image. It can be understood that the "full view" of the sample video can already be obtained from the intra-frame feature information of the target first video image together with the inter-frame feature information between the other first video images and at least the target first video image. Therefore, if the feature information set of the sample video contains only these items, the data volume of the feature information set is smaller while the information about the sample video remains complete, and the machine learning model obtains the analysis result based on the feature information of the sample video more quickly and accurately.
In an alternative implementation, there are a variety of implementations of step C13, and embodiments of the present disclosure provide, but are not limited to, the following three implementations.
The first implementation of step C13 includes steps F11 to F12.
And the position of the feature information of the effective image in the feature information set is the same as the position of the effective image in the image set.
In step F11, a third position of the feature information of the effective image characterized by the mask parameter in the feature information set is determined.
In step F12, feature information at the third position in the feature information set is obtained, so as to obtain feature information corresponding to the first video image.
Taking the image set H of the sample video as an example, it is determined based on the mask parameter (for example, the mask vector Q) that, in the feature information set S_1 or the feature information set S_2, the feature information of the effective images is located at the first row, first column through the M-th row, first column. Thereby, the effective feature matrix S_effective obtained in step F12, which contains the feature information corresponding to the first video images, consists of the first M rows of S_1 or of S_2.
The second implementation manner of step C13 includes: and determining the product of the characteristic information set and the mask matrix as the characteristic information of the first video image.
The image set H of the sample video is still taken as an example. Let the mask matrix be the matrix Q and the feature information set be S_2.
S_effective = feature information set S_2 × Q = (intra-frame feature information of A_1) × 1 + (inter-frame feature information of A_1 and A_2) × 1 + … + (inter-frame feature information of A_1, A_2, A_3, …, A_M) × 1 + (inter-frame feature information of A_1, A_2, A_3, …, A_M, A_(M+1)) × 0 + … + (inter-frame feature information of A_1, A_2, A_3, …, A_(M+1), A_(M+2), …, A_K) × 0.
The third implementation manner of step C13 includes: determining the ratio of the product of the feature information set and the mask matrix to the second number as the feature information of the first video image. The second number refers to the number of effective images recorded in the mask matrix.
The image set H of the sample video is still taken as an example. Let the mask matrix be the matrix Q and the feature information set be S_2.
S_effective = (feature information set S_2 × Q) / second number.
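The second and third implementations of step C13 amount to masked pooling of the feature information set, as in the following sketch (tensor shapes and names are assumptions for illustration):

```python
import torch

def effective_feature(feature_set, mask):
    # feature_set: [batch, K, D] feature information for every position in the image set
    # mask:        [batch, K] with 1 at effective-image positions, 0 elsewhere
    weighted = feature_set * mask.unsqueeze(-1)            # zero out invalid positions
    second_number = mask.sum(dim=1, keepdim=True).clamp(min=1)
    return weighted.sum(dim=1) / second_number             # (feature set x Q) / second number

# feats = torch.randn(2, 6, 512)
# mask = torch.tensor([[1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0]], dtype=torch.float)
# pooled = effective_feature(feats, mask)   # -> [2, 512]
```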
In an alternative implementation, after step S23, the method further includes: storing the image set corresponding to the sample video and the mask parameters into a database; the method further comprises the following steps before the step S24: and obtaining the image set and the mask parameters corresponding to the sample video from the database.
The database may be, for example, a cache message queue or a data loader as shown in fig. 1.
In the embodiment of the disclosure, a large number of sample videos can be processed before training a machine learning model to obtain image sets and mask parameters (referred to as sample data extraction) corresponding to the sample videos, and a sample data extraction process and a machine learning model training process are separated, so that sample data extraction is not required to be waited in the machine learning model training process, reading speeds of the image sets and the mask parameters corresponding to the sample videos are improved, and the machine learning model training speed is improved.
The above is a description of the process of training the machine learning model to obtain the video analysis model, and the following describes the process of using the video analysis model.
Fig. 5 is a flowchart illustrating a method for obtaining an analysis result of a video to be tested based on a video analysis model according to an exemplary embodiment, which may be applied to the third electronic device 13, and which may include the following steps S51 to S52 in the implementation process.
In step S51, video images are extracted from the multi-frame video images included in the video to be detected at intervals to extract a third video image.
For example, the process of extracting the third video image from the video to be detected may refer to the process of extracting the first video image from the sample video, which is not described herein.
In step S52, the third video image is input to the video analysis model, and an analysis result corresponding to the video to be detected is obtained through the video analysis model, where the analysis result includes feature information of the third video image or a classification label of the video to be detected.
The number of third video images may be one or more, for example. The description of the third video image may refer to the description of the first video image, which is not repeated here.
Illustratively, the types of video analysis models that are trained are different, and the results that are output by the video analysis models are different.
For example, if the training results in a first type of video analysis model, the video analysis model may output a classification tag for the video under test. For example, if the second type of video analysis model is trained, the video analysis model may output characteristic information of the third video image.
For example, because the video analysis model has already been trained and its parameters are fixed, in the process of using the video analysis model there is no need for the image sets corresponding to different videos to be tested to contain the same number of images (for one video to be tested, all images that need to be input to the video analysis model are contained in its image set); that is, the image set corresponding to the video to be tested may consist only of the third video images.
In this case, the effective feature extraction module in the video analysis model may be removed, so that the video analysis model includes a feature extraction module and a label prediction module, or includes only a feature extraction module.
For example, the number of images included in the image sets corresponding to the different videos to be detected obtained in step S51 may be the same, for example, still be the preset number. Illustratively, step S52 includes the following steps G1 through G3.
In step G1, determining an image set corresponding to the video to be detected, where the image set corresponding to the video to be detected includes the third video image and a third number of second video images, the second video images are set images, a sum of the number of the third video images and the third number is a preset number, and the numbers of images included in the image sets respectively corresponding to different videos to be detected in the video set to be detected are all the preset numbers.
The plurality of videos to be tested included in the video set to be tested are obtained by the first electronic device from different storage devices or the same storage device.
The description of the second video image may be referred to the previous description of the second video image, and will not be repeated here.
In step G2, determining a mask parameter based on the set of images corresponding to the video to be detected, where the mask parameter is used to record a position of an effective image and a position of an ineffective image in the set of images corresponding to the video to be detected, the third video image is the effective image, and the second video image is the ineffective image.
The determining process of the mask parameters corresponding to the video to be detected is the same as the determining process of the mask parameters corresponding to the sample video, and will not be repeated here.
In step G3, the image set and the mask parameter corresponding to the video to be tested are input to a video analysis model, and the analysis result of the video to be tested is obtained through the video analysis model.
In this case, there is no need to remove the effective feature extraction module from the video analysis model. That is, the video analysis model includes a feature extraction module, an effective feature extraction module, and a label prediction module, or includes a feature extraction module and an effective feature extraction module.
In the embodiment of the disclosure, the third video image corresponding to the video to be detected is used as the input of the video analysis model, and compared with the video to be detected is used as the input of the video analysis model, the processing speed of the video analysis model is higher, and the speed of obtaining the analysis result of the video to be detected is higher because the data amount input to the video analysis model is smaller.
The processing procedure of each module included in the video analysis model to the third video image of the video to be detected is the same as the processing procedure of each module included in the machine learning model to the image set of the sample video, and will not be repeated here.
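Putting steps G1 to G3 together, a hedged sketch of inference on one video to be tested might look as follows (the zero-filled padding, the model call signature "model(image_set, mask)", and all names are assumptions):

```python
import numpy as np

def analyze_video(model, third_video_images, preset_number):
    # Step G1: pad the extracted third video images up to the preset number.
    m = len(third_video_images)
    pad = [np.zeros_like(third_video_images[0])] * (preset_number - m)
    image_set = np.stack(list(third_video_images) + pad)            # [K, H, W, C]
    # Step G2: the mask parameter marks effective (1) and invalid (0) positions.
    mask = np.array([1.0] * m + [0.0] * (preset_number - m), dtype=np.float32)
    # Step G3: feed the image set and mask to the trained video analysis model.
    return model(image_set, mask)   # classification label or feature information
```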
In an alternative implementation, after step S51, the method further includes: storing a third video image corresponding to the video to be detected into a database; the method further comprises, before step S52: and obtaining a third video image corresponding to the video to be detected from the database.
The database may be, for example, a cache message queue or a data loader as shown in fig. 1.
According to the embodiment of the disclosure, the process of extracting the third video image corresponding to the video to be tested can be separated from the online reasoning process (that is, the process in which the video analysis model obtains the analysis result based on the third video image). In the online reasoning process, the third video image corresponding to the video to be tested is obtained directly from the database, without having to extract the third video image from the video to be tested, so that the reading speed of the third video image is improved and the online reasoning speed is improved.
The method has been described in detail in the embodiments of the present disclosure. The method in the embodiments of the present disclosure may be implemented by various types of devices, so corresponding devices are also disclosed; specific embodiments are given below.
FIG. 6 is a block diagram of a video analytics model training device, according to an exemplary embodiment. Referring to fig. 6, the apparatus includes a first acquisition module 61, a first determination module 62, a second determination module 63, and a training module 64.
A first obtaining module 61, configured to extract video images from a plurality of frames of video images included in the sample video at intervals to obtain a first video image;
a first determining module 62, configured to determine an image set corresponding to the sample video, where the image set includes the first video image and a first number of second video images, the second video images are set images, a sum of the number of the first video images and the first number is a preset number, and the numbers of images included in image sets respectively corresponding to different sample videos in the sample video set are all the preset numbers;
a second determining module 63 configured to determine mask parameters based on the image set, where the mask parameters are used for recording a position of an effective image and a position of an ineffective image in the image set, the first video image is the effective image, and the second video image is the ineffective image;
The training module 64 is configured to use the image set and the mask parameter as input of a machine learning model, and use the manual labeling classification label corresponding to the sample video as a training target to train to obtain a video analysis model.
In an alternative implementation, the second determining module specifically includes:
a first determining unit configured to determine that the first video image is located at a first position in the image set;
a second determining unit configured to determine that the second video image is located at a second position in the image set;
a third determining unit configured to determine the mask parameter based on the first position and the second position, wherein an element located at a position corresponding to the first position in the mask parameter is a first character, and an element located at a position corresponding to the second position is a second character; the image of the first character representation located at the first position of the image set is a valid image, and the image of the second character representation located at the second position of the image set is an invalid image.
In an alternative implementation manner, the first obtaining module specifically includes:
A receiving unit configured to receive a sample video from a storage device, the sample video being a video compression-encoded by the storage device;
a first acquisition unit configured to acquire a target key frame from a plurality of key frames contained in the sample video;
a second acquisition unit configured to acquire a sampling interval;
a third obtaining unit, configured to extract a non-key frame from non-key frames corresponding to the target key frame included in the sample video at intervals of the sampling interval, so as to obtain at least one non-key frame, where the non-key frame is compressed based on the target key frame;
a decoding unit configured to decode the target key frame and the at least one non-key frame to obtain the first video image.
In an alternative implementation manner, the second obtaining unit specifically includes:
a calculating subunit configured to calculate the sampling interval based on a frame rate of the sample video and the number of processed images per second; or alternatively,
an acquisition subunit configured to acquire the sampling interval set in advance.
In an alternative implementation, the training module specifically includes:
An input unit configured to input the set of images and the mask parameters to a machine learning model;
the machine learning model includes the following modules:
the feature extraction module is configured to obtain a feature information set corresponding to the image set, wherein the feature information set comprises feature information corresponding to the first video image and feature information corresponding to the second video image;
the effective feature extraction module is configured to screen feature information corresponding to the first video image from the feature information set based on the mask parameters;
a tag prediction module configured to determine a classification tag based on feature information corresponding to the first video image, the machine learning model being configured to output the feature information corresponding to the first video image or the classification tag;
and the training unit is configured to train the machine learning model based on the comparison result of the classification label and the manual labeling classification label so as to obtain the video analysis model.
In an alternative implementation, the feature extraction module specifically includes:
the intra-frame feature extraction module is configured to obtain intra-frame feature information corresponding to each image in the image set;
The inter-frame feature extraction module is configured to obtain a feature information set based on intra-frame feature information corresponding to each image in the image set, wherein the feature information corresponding to the first video image comprises inter-frame feature information among a plurality of first video images, the feature information corresponding to the second video image comprises inter-frame feature information among a plurality of second video images, and/or the inter-frame feature information between the second video images and the first video images.
In an optional implementation manner, the position of the feature information of the effective image in the feature information set is the same as the position of the effective image in the image set, and the effective feature extraction module specifically includes:
an extraction position module configured to determine a third position of feature information of the valid image characterized by the mask parameter in the feature information set;
and the characteristic extraction module is configured to obtain characteristic information at the third position in the characteristic information set so as to obtain characteristic information corresponding to the first video image.
In an alternative implementation, the method further includes:
the second acquisition module is configured to extract video images from a plurality of frames of video images contained in the video to be detected at intervals so as to extract a third video image;
The analysis module is configured to input the third video image into the video analysis model and obtain, through the video analysis model, an analysis result corresponding to the video to be detected, where the analysis result includes characteristic information of the third video image or a classification label of the video to be detected.
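At inference time the video to be detected is handled the same way: its sampled frames are padded up to the preset number and paired with a freshly built mask. A sketch under the same assumptions as the training snippet above:

```python
import torch

def prepare_inputs(third_video_images, preset_number, frame_dim):
    """Pad the sampled frames of a video under detection to the preset length
    and build the matching mask (names and zero padding are assumptions)."""
    n = len(third_video_images)
    images = torch.zeros(1, preset_number, frame_dim)
    images[0, :n] = torch.stack(third_video_images)
    mask = torch.zeros(1, preset_number, dtype=torch.long)
    mask[0, :n] = 1
    return images, mask

# images, mask = prepare_inputs(frames, preset_number=8, frame_dim=3 * 32 * 32)
# features, logits = model(images, mask)      # model from the training sketch
# predicted_label = logits.argmax(dim=-1)
```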
In an alternative implementation, the method further includes:
a storage module configured to store the image set and the mask parameters corresponding to the sample video to a database;
and a third acquisition module configured to acquire the image set corresponding to the sample video and the mask parameters from the database.
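A minimal caching sketch, assuming one .npz file per sample video stands in for the database mentioned above:

```python
import numpy as np

def store_sample(db_path, sample_id, image_set, mask):
    # One file per sample video stands in for a database row (assumption).
    np.savez(f"{db_path}/{sample_id}.npz", images=image_set, mask=mask)

def load_sample(db_path, sample_id):
    data = np.load(f"{db_path}/{sample_id}.npz")
    return data["images"], data["mask"]
```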
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method, and will not be repeated here.
Fig. 7 is a block diagram of an electronic device according to an example embodiment. The electronic device may include at least one of the first electronic device, the second electronic device, and the third electronic device.
The electronic device includes, but is not limited to: a processor 701, a memory 702, a network interface 703, an I/O controller 704, and a communication bus 705.
It should be noted that the structure shown in Fig. 7 does not constitute a limitation of the electronic device; as will be understood by those skilled in the art, the electronic device may include more or fewer components than those shown in Fig. 7, may combine some components, or may have a different arrangement of components.
The following describes the constituent elements of the electronic device in detail with reference to Fig. 7:
The processor 701 is the control center of the electronic device. It connects the various parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 702 and calling the data stored in the memory 702, thereby monitoring the electronic device as a whole. The processor 701 may include one or more processing units; for example, the processor 701 may integrate an application processor, which mainly handles the operating system, user interface, applications, and the like, with a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 701.
The processor 701 may be a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure.
The memory 702 may include a Random-Access Memory (RAM) 7021 and a Read-Only Memory (ROM) 7022, and may also include a mass storage device 7023, such as at least one disk storage. Of course, the electronic device may also include hardware required by other services.
The memory 702 is configured to store instructions executable by the processor 701. The processor 701 is configured to execute the instructions to perform the video analysis model training method described in any of the above embodiments.
A wired or wireless network interface 703 is configured to connect the electronic device to a network.
The processor 701, the memory 702, the network interface 703, and the I/O controller 704 may be interconnected by a communication bus 705, which may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, and so on.
In an exemplary embodiment, the electronic device may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the video analysis model training method described above.
In an exemplary embodiment, a computer-readable storage medium is also provided. The storage medium contains software code that can be loaded directly into an internal memory of a computer, such as the memory 702; after being loaded and executed by the computer, the computer program implements the steps shown in any embodiment of the video analysis model training method.
In an exemplary embodiment, a computer program product is also provided. The computer program product contains software code that can be loaded directly into an internal memory of a computer, for example the memory 702 included in the electronic device; after being loaded and executed by the computer, the computer program implements the steps shown in any embodiment of the video analysis model training method.
Optionally, the storage medium may be a non-transitory computer-readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. A method for training a video analysis model, comprising:
extracting video images from a plurality of frames of video images contained in a sample video at intervals to extract and obtain a first video image;
determining an image set corresponding to the sample video, wherein the image set comprises the first video image and a first number of second video images, the second video images are set images, the sum of the number of the first video images and the first number is a preset number, and the number of the images contained in the image sets respectively corresponding to different sample videos in the sample video set is the preset number;
determining mask parameters based on the image set, wherein the mask parameters are used for recording the positions of effective images and the positions of ineffective images in the image set, the first video image is the effective image, and the second video image is the ineffective image;
taking the image set and the mask parameters as input of a machine learning model, taking a manual annotation classification label corresponding to the sample video as a training target, and training to obtain the video analysis model;
The step of determining mask parameters based on the image set includes:
determining that the first video image is located at a first position in the set of images;
determining that the second video image is located at a second position in the set of images;
determining the mask parameter based on the first position and the second position, wherein elements in the mask parameter, which are positioned at positions corresponding to the first position, are first characters, and elements in the mask parameter, which are positioned at positions corresponding to the second position, are second characters; wherein a first character characterizes an image located at the first position of the image set as a valid image and a second character characterizes an image located at the second position of the image set as an invalid image;
the step of taking the image set and the mask parameters corresponding to the sample video as input of a machine learning model, taking the manual annotation classification label corresponding to the sample video as a training target, and training to obtain a video analysis model comprises the following steps:
inputting the image set and the mask parameters into a machine learning model;
the machine learning model performs the steps of:
obtaining a feature information set corresponding to the image set, wherein the feature information set comprises feature information corresponding to the first video image and feature information corresponding to the second video image;
Screening out feature information corresponding to the first video image from the feature information set based on the mask parameters;
determining a classification label based on the feature information corresponding to the first video image, wherein the machine learning model is used for outputting the feature information corresponding to the first video image or the classification label;
and training the machine learning model based on the comparison result of the classification label and the manually marked classification label to obtain the video analysis model.
2. The method of claim 1, wherein the step of extracting video images from a plurality of frames of video images included in the sample video at intervals to obtain the first video image comprises:
receiving sample video from a storage device, wherein the sample video is video compressed and encoded by the storage device;
obtaining a target key frame from a plurality of key frames contained in the sample video;
acquiring a sampling interval;
extracting a non-key frame from the non-key frames corresponding to the target key frame contained in the sample video at intervals of the sampling interval to obtain at least one non-key frame, wherein the non-key frame is compressed based on the target key frame;
And decoding the target key frame and the at least one non-key frame to obtain the first video image.
3. The video analytics model training method of claim 2, wherein the step of acquiring a sampling interval includes:
calculating the sampling interval based on the frame rate of the sample video and the number of processed images per second; or,
acquiring the preset sampling interval.
4. The method for training a video analysis model according to claim 1, wherein the step of obtaining the feature information set corresponding to the image set includes:
obtaining intra-frame characteristic information corresponding to each image in the image set;
and acquiring the feature information set based on the intra-frame feature information corresponding to each image in the image set, wherein the feature information corresponding to the first video image comprises a plurality of inter-frame feature information among the first video images, the feature information corresponding to the second video image comprises a plurality of inter-frame feature information among the second video images, and/or the inter-frame feature information between the second video images and the first video images.
5. The method according to claim 1, wherein the feature information of the effective image in the feature information set is located at the same position as the effective image in the image set, and the step of screening the feature information corresponding to the first video image from the feature information set based on the mask parameter includes:
Determining a third position of the feature information of the effective image characterized by the mask parameters in the feature information set;
and obtaining the characteristic information at the third position in the characteristic information set to obtain the characteristic information corresponding to the first video image.
6. The video analytics model training method of claim 1, further comprising:
extracting video images from a plurality of frames of video images contained in the video to be detected at intervals to extract and obtain a third video image;
inputting the third video image into the video analysis model, and obtaining an analysis result corresponding to the video to be detected through the video analysis model, wherein the analysis result comprises characteristic information of the third video image or a classification label of the video to be detected.
7. The video analytics model training method of claim 1, further comprising, after the step of determining mask parameters based on the set of images:
storing the image set corresponding to the sample video and the mask parameters into a database;
and before the step of obtaining the video analysis model by training, taking the image set and the mask parameters as input of a machine learning model and taking the manual labeling classification labels corresponding to the sample video as training targets, the method further comprises the following steps of:
And obtaining the image set and the mask parameters corresponding to the sample video from the database.
8. A video analysis model training apparatus, comprising:
the first acquisition module is configured to extract video images from a plurality of frames of video images contained in the sample video at intervals so as to extract and obtain a first video image;
the first determining module is configured to determine an image set corresponding to the sample video, wherein the image set comprises the first video image and a first number of second video images, the second video images are set images, the sum of the number of the first video images and the first number is a preset number, and the numbers of the images contained in the image sets respectively corresponding to different sample videos in the sample video set are all the preset numbers;
a second determining module configured to determine mask parameters based on the image set, where the mask parameters are used for recording positions of valid images and positions of invalid images in the image set, the first video image is a valid image, and the second video image is an invalid image;
the training module is configured to take the image set and the mask parameters as input of a machine learning model, take the manual annotation classification labels corresponding to the sample video as training targets, and train to obtain a video analysis model;
The second determination module is specifically configured to:
a first determining unit configured to determine that the first video image is located at a first position in the image set;
a second determining unit configured to determine that the second video image is located at a second position in the image set;
a third determining unit configured to determine the mask parameter based on the first position and the second position, wherein an element located at a position corresponding to the first position in the mask parameter is a first character, and an element located at a position corresponding to the second position is a second character; wherein a first character characterizes an image located at the first position of the image set as a valid image and a second character characterizes an image located at the second position of the image set as an invalid image;
the training module is specifically configured to:
an input unit configured to input the set of images and the mask parameters to a machine learning model;
the machine learning model includes the following modules:
the feature extraction module is configured to obtain a feature information set corresponding to the image set, wherein the feature information set comprises feature information corresponding to the first video image and feature information corresponding to the second video image;
The effective feature extraction module is configured to screen feature information corresponding to the first video image from the feature information set based on the mask parameters;
a tag prediction module configured to determine a classification tag based on feature information corresponding to the first video image, the machine learning model being configured to output the feature information corresponding to the first video image or the classification tag;
and the training unit is configured to train the machine learning model based on the comparison result of the classification label and the manual labeling classification label so as to obtain the video analysis model.
9. The video analytics model training device of claim 8, wherein the first acquisition module is specifically configured to:
a receiving unit configured to receive a sample video from a storage device, the sample video being a video compression-encoded by the storage device;
a first acquisition unit configured to acquire a target key frame from a plurality of key frames contained in the sample video;
a second acquisition unit configured to acquire a sampling interval;
a third obtaining unit, configured to extract a non-key frame from non-key frames corresponding to the target key frame included in the sample video at intervals of the sampling interval, so as to obtain at least one non-key frame, where the non-key frame is compressed based on the target key frame;
A decoding unit configured to decode the target key frame and the at least one non-key frame to obtain the first video image.
10. The video analytics model training device of claim 9, wherein the second acquisition unit is specifically configured to:
a calculating subunit configured to calculate the sampling interval based on a frame rate of the sample video and the number of processed images per second; or,
an acquisition subunit configured to acquire the sampling interval set in advance.
11. The video analytics model training device of claim 8, wherein the feature extraction module is specifically configured to:
the intra-frame feature extraction module is configured to obtain intra-frame feature information corresponding to each image in the image set;
the inter-frame feature extraction module is configured to obtain a feature information set based on intra-frame feature information corresponding to each image in the image set, wherein the feature information corresponding to the first video image comprises inter-frame feature information among a plurality of first video images, the feature information corresponding to the second video image comprises inter-frame feature information among a plurality of second video images, and/or the inter-frame feature information between the second video images and the first video images.
12. The video analysis model training apparatus of claim 8, wherein the location of the feature information of the valid image in the feature information set is the same as the location of the valid image in the image set, and wherein the valid feature extraction module is specifically configured to:
an extraction position module configured to determine a third position of feature information of the valid image characterized by the mask parameter in the feature information set;
and the characteristic extraction module is configured to obtain characteristic information at the third position in the characteristic information set so as to obtain characteristic information corresponding to the first video image.
13. The video analytics model training device of claim 8, further comprising:
the second acquisition module is configured to extract video images from a plurality of frames of video images contained in the video to be detected at intervals so as to extract a third video image;
the analysis module is configured to input the third video image into the video analysis model and obtain, through the video analysis model, an analysis result corresponding to the video to be detected, wherein the analysis result includes characteristic information of the third video image or a classification label of the video to be detected.
14. The video analytics model training device of claim 8, further comprising:
a storage module configured to store the image set and the mask parameters corresponding to the sample video to a database;
and a third acquisition module configured to acquire the image set corresponding to the sample video and the mask parameters from the database.
15. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the video analytics model training method of any one of claims 1 to 7.
16. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the video analytics model training method of any one of claims 1 to 7.
CN202110324886.5A 2021-03-26 2021-03-26 Model training method, device, electronic equipment, medium and product Active CN113051430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110324886.5A CN113051430B (en) 2021-03-26 2021-03-26 Model training method, device, electronic equipment, medium and product

Publications (2)

Publication Number Publication Date
CN113051430A CN113051430A (en) 2021-06-29
CN113051430B true CN113051430B (en) 2024-03-26

Family

ID=76515344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110324886.5A Active CN113051430B (en) 2021-03-26 2021-03-26 Model training method, device, electronic equipment, medium and product

Country Status (1)

Country Link
CN (1) CN113051430B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140718A (en) * 2021-11-26 2022-03-04 浙江商汤科技开发有限公司 A target tracking method, device, equipment and storage medium
CN115240100A (en) * 2022-06-21 2022-10-25 有米科技股份有限公司 Model training method and device based on video frame
CN115147703B (en) * 2022-07-28 2023-11-03 广东小白龙环保科技有限公司 A garbage segmentation method and system based on GinTrans network

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108540824A (en) * 2018-05-15 2018-09-14 北京奇虎科技有限公司 A kind of Video Rendering method and apparatus
CN109003237A (en) * 2018-07-03 2018-12-14 深圳岚锋创视网络科技有限公司 Sky filter method, device and the portable terminal of panoramic picture
CN109816011A (en) * 2019-01-21 2019-05-28 厦门美图之家科技有限公司 Generate the method and video key frame extracting method of portrait parted pattern
CN111260679A (en) * 2020-01-07 2020-06-09 广州虎牙科技有限公司 Image processing method, image segmentation model training method and related device
WO2020177108A1 (en) * 2019-03-01 2020-09-10 北京大学深圳研究生院 Video frame interpolation method, apparatus and device
CN111798543A (en) * 2020-09-10 2020-10-20 北京易真学思教育科技有限公司 Model training method, data processing method, device, equipment and storage medium
CN111860327A (en) * 2020-07-21 2020-10-30 广州道源信息科技有限公司 Image detection and analysis method based on visual computation video transmission
CN112231516A (en) * 2020-09-29 2021-01-15 北京三快在线科技有限公司 Training method of video abstract generation model, video abstract generation method and device

Also Published As

Publication number Publication date
CN113051430A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN110837579B (en) Video classification method, apparatus, computer and readable storage medium
CN111898696B (en) Pseudo tag and tag prediction model generation method, device, medium and equipment
CN113051430B (en) Model training method, device, electronic equipment, medium and product
CN109104620B (en) Short video recommendation method and device and readable medium
CN110399526B (en) Video title generation method and device and computer readable storage medium
EP3893125A1 (en) Method and apparatus for searching video segment, device, medium and computer program product
CN114282047A (en) Small sample action recognition model training method and device, electronic equipment and storage medium
CN113434716B (en) Cross-modal information retrieval method and device
CN111401318B (en) Action recognition method and device
CN114021646B (en) Image description text determining method and related equipment thereof
CN110363084A (en) A kind of class state detection method, device, storage medium and electronics
CN113254654A (en) Model training method, text recognition method, device, equipment and medium
CN111741329B (en) Video processing method, device, equipment and storage medium
CN115331150A (en) Image recognition method, image recognition device, electronic equipment and storage medium
CN118567701B (en) Intelligent operation and maintenance management system and method
CN115759293A (en) Model training method, image retrieval method, device and electronic equipment
CN114697761A (en) Processing method, processing device, terminal equipment and medium
CN115222845A (en) Method and device for generating style font picture, electronic equipment and medium
CN111292333A (en) Method and apparatus for segmenting an image
CN116796038A (en) Remote sensing data retrieval method, remote sensing data retrieval device, edge processing equipment and storage medium
CN118172713B (en) Video tag identification method, device, computer equipment and storage medium
CN115640449A (en) Media object recommendation method and device, computer equipment and storage medium
CN117934495A (en) Human body image segmentation method, device, computer equipment and readable storage medium
CN118916516B (en) Industrial field monitoring video query positioning method and device based on natural language
CN117729391B (en) A video segmentation method, device, computer equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant