CN110263213B - Video pushing method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN110263213B (Application CN201910430442.2A)
- Authority
- CN
- China
- Prior art keywords
- video
- covers
- cover
- candidate
- user operation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/74—Browsing; Visualisation therefor
- G06F16/743—Browsing; Visualisation therefor a collection of video files or sequences
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/40—Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Human Computer Interaction (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
The application relates to a video pushing method, which comprises the following steps: acquiring N candidate covers in a first video; obtaining the prediction confidence of each of the N candidate covers through a video cover determination model, the video cover determination model being a convolutional neural network model obtained by reinforcement learning according to K candidate covers in a second video and user operation data of each of the K candidate covers; acquiring a video cover of the first video from the N candidate covers according to the prediction confidences of the N candidate covers; and pushing the first video to a terminal according to the video cover of the first video. Because the video cover determination model is a convolutional neural network model obtained by reinforcement training according to the operations performed by users on the same video presented with different covers, the users' selection behaviour towards video covers is comprehensively taken into account, which improves the accuracy with which the trained model subsequently determines video covers.
Description
Technical Field
The embodiment of the application relates to the technical field of video application, in particular to a video pushing method, a video pushing device, computer equipment and a storage medium.
Background
With the continuous development of computer network applications, the video resources in video playing applications keep increasing. To help users quickly and accurately find the videos they want to watch, video providers need to determine an appropriate cover for each video.
In the related art, a server of a video provider may select one image frame from the image frames contained in a video as the video cover of the video. The cover image frame may be selected by setting corresponding weights for various quality indexes in advance. For example, a developer designs an image quality classification model in advance and trains it with pre-labeled high-quality and low-quality pictures; after model training is completed, each image frame of a video is processed by the image quality classification model to obtain its image quality, and the image frame with the highest image quality is taken as the video cover.
However, the scheme shown in the related art requires manual labeling of the image quality of the pictures used for training, and the accuracy of the trained model is subjectively affected by the labeling personnel, resulting in low accuracy when the video cover is determined through the trained model.
Disclosure of Invention
The embodiment of the application provides a video pushing method, a video pushing device, computer equipment and a storage medium, which can improve the accuracy with which a trained model determines a video cover. The technical scheme is as follows:
in one aspect, a video pushing method is provided, the method including:
acquiring N candidate covers in a first video, wherein N is an integer greater than or equal to 2;
processing the N candidate covers through a video cover determination model respectively to obtain respective prediction confidences of the N candidate covers, wherein the prediction confidence is used for indicating the probability that the corresponding candidate cover is a video cover; the video cover determination model is a convolutional neural network model obtained by reinforcement learning according to K candidate covers in a second video and user operation data of each of the K candidate covers; the user operation data is used for indicating user operations received by the second video and the candidate covers corresponding to the user operations; K is an integer greater than or equal to 2;
acquiring video covers of the first video from the N candidate covers according to the prediction confidence degrees of the N candidate covers;
pushing the first video to a terminal according to the video cover of the first video.
In another aspect, a training method for determining a model of a video cover is provided, the method comprising:
acquiring K candidate covers in a second video, wherein K is an integer greater than or equal to 2;
extracting respective image features of the K candidate covers through a convolutional neural network model; the image features are the output of a feature extraction component in the convolutional neural network;
the K candidate covers are respectively used as video covers of the second video, the second video is pushed, and user operation data of the K candidate covers are obtained; the user operation data is used for indicating user operation received by the second video and candidate covers corresponding to the user operation;
performing reinforcement learning on network parameters of a confidence output component in the convolutional neural network model according to the image features of each of the K candidate covers and the user operation data of each of the K candidate covers; the confidence output component is used for outputting a prediction confidence according to the image features extracted by the feature extraction component, and the prediction confidence is used for indicating the probability that the corresponding candidate cover is a video cover;
And when the output result of the confidence output component converges, acquiring the convolutional neural network model as a video cover determination model for determining the video cover.
In yet another aspect, a method for displaying a video cover is provided, which is used in a terminal, and the method includes:
at a first moment, receiving a first video cover of a first video pushed by a server, wherein the first video cover is any cover in N candidate covers, and N is an integer greater than or equal to 2;
displaying a video playing entry of the first video according to the first video cover;
at a second moment, receiving a second video cover of the first video pushed by the server; the second video cover is determined from the N candidate covers through a cover determination sub-model; the cover determination sub-model is a convolutional neural network model obtained by reinforcement learning according to the N candidate covers and target user operation data of each of the N candidate covers; the target user operation data is used for indicating target user operations received by the first video and the candidate covers corresponding to the target user operations; the target user operation is a user operation performed on the first video by each user in a target user group, and the target user group is the user group to which the user corresponding to the terminal belongs;
and displaying the video playing entry of the first video according to the second video cover.
In one aspect, a video pushing device is provided, the device comprising:
the candidate cover acquisition module is used for acquiring N candidate covers in the first video, wherein N is an integer greater than or equal to 2;
the confidence prediction module is used for respectively processing the N candidate covers through a video cover determination model to obtain respective prediction confidences of the N candidate covers, wherein the prediction confidence is used for indicating the probability that the corresponding candidate cover is a video cover; the video cover determination model is a convolutional neural network model obtained by reinforcement learning according to K candidate covers in a second video and user operation data of each of the K candidate covers; the user operation data is used for indicating user operations received by the second video and the candidate covers corresponding to the user operations; K is an integer greater than or equal to 2;
the video cover acquisition module is used for acquiring the video covers of the first video from the N candidate covers according to the prediction confidence degrees of the N candidate covers;
and the video pushing module is used for pushing the first video to the terminal according to the video cover of the first video.
Optionally, the video cover determining model includes at least two cover determining sub-models; and the at least two cover determination sub-models correspond to respective user groups respectively;
the confidence prediction module is configured to:
inquiring a target user group where a corresponding user of the terminal is located;
acquiring a cover determination sub-model corresponding to the target user group, wherein the cover determination sub-model corresponding to the target user group is a convolutional neural network model obtained by reinforcement learning according to K candidate covers in the second video and target user operation data of each of the K candidate covers; the target user operation data is used for indicating target user operation and candidate covers corresponding to the target user operation; the target user operation is a user operation performed on the second video by each user in the target user group;
and respectively processing the N candidate covers through the cover determination submodels corresponding to the target user group to obtain the prediction confidence degrees of the N candidate covers.
Optionally, the candidate cover acquisition module is configured to,
acquiring each key image frame in the first video;
Clustering the key image frames to obtain at least two clustering centers, wherein each clustering center comprises at least one key image frame corresponding to the same scene type;
and respectively extracting at least one key image frame from the at least two clustering centers to obtain the N candidate covers.
Optionally, when at least one key image frame is extracted from the at least two clustering centers respectively to obtain the N candidate covers, the candidate cover obtaining module is configured to,
removing the cluster centers with the number of the included key image frames smaller than the number threshold value from the at least two cluster centers to obtain N cluster centers;
and respectively extracting one key image frame from the N clustering centers to obtain the N candidate covers.
Optionally, the video cover determining model includes a feature extraction component and a confidence output component;
the feature extraction component is used for extracting the image features of the input candidate covers;
the confidence coefficient output component is used for outputting the prediction confidence coefficient of the input candidate covers according to the image features extracted by the feature extraction component.
Optionally, the feature extraction component is identical to the feature extraction component in the image classification model;
The image classification model is a convolutional neural network model obtained through sample image and classification label training of the sample image.
In yet another aspect, there is provided a training apparatus for determining a model of a video cover, the apparatus comprising:
the candidate cover acquisition module is used for acquiring K candidate covers in the second video, wherein K is an integer greater than or equal to 2;
the feature extraction module is used for extracting the image features of each of the K candidate covers through a convolutional neural network model; the image features are the output of a feature extraction component in the convolutional neural network;
the operation data acquisition module is used for respectively taking the K candidate covers as video covers of the second video, pushing the second video and acquiring user operation data of each of the K candidate covers; the user operation data is used for indicating user operation received by the second video and candidate covers corresponding to the user operation;
the reinforcement learning module is used for performing reinforcement learning on network parameters of a confidence output component in the convolutional neural network model according to the image features of each of the K candidate covers and the user operation data of each of the K candidate covers; the confidence output component is used for outputting a prediction confidence according to the image features extracted by the feature extraction component, and the prediction confidence is used for indicating the probability that the corresponding candidate cover is a video cover;
and the model acquisition module is used for acquiring the convolutional neural network model as a video cover determination model for determining video covers when the output result of the confidence output component converges.
Optionally, the apparatus further includes:
the prediction confidence acquisition module is used for acquiring the prediction confidence of each of the K candidate covers output by the confidence output component, before the model acquisition module acquires the convolutional neural network model as the video cover determination model;
and the convergence determining module is used for determining that the output result of the confidence output component converges when the sum of the prediction confidences of the K candidate covers converges.
Optionally, the confidence output component includes a vectorization function and an activation function, and the prediction confidence acquisition module is configured to:
obtaining vectorization results of each of the K candidate covers; the vectorization results of the K candidate covers are output results of the vectorization functions corresponding to the K candidate covers respectively;
and processing the vectorization results of each of the K candidate covers through the activation function to obtain the prediction confidence of each of the K candidate covers.
Optionally, the reinforcement learning module is configured to,
acquiring actual confidences of the K candidate covers according to the user operation data of each of the K candidate covers;
obtaining a policy function according to the actual confidences of each of the K candidate covers, wherein the policy function is a function for maximizing the sum of confidences obtained according to the image features of each of the K candidate covers, and the sum of confidences is the sum of the prediction confidences of each of the K candidate covers; the matrix format of the variable parameters in the policy function is the same as the matrix format of the network parameters of the confidence output component;
and acquiring the variable parameters in the policy function as the network parameters of the vectorization function.
Optionally, the operation data acquisition module is configured to,
respectively taking the K candidate covers as video covers of the second video, and pushing the second video;
acquiring user operation records of at least one user in a specified user group on the second video, wherein the user operation records correspond to respective candidate covers;
acquiring user operation data of each of the K candidate covers corresponding to the appointed user group according to the user operation record of the at least one user on the second video;
Accordingly, the model acquisition module is configured to, when the output result of the confidence output component converges, acquire the convolutional neural network model as a cover determination sub-model corresponding to the specified user group.
Optionally, the apparatus further includes: a grouping module, configured to group the users according to the user operation records of the users on videos to obtain at least one user group, before the operation data acquisition module acquires the user operation records of the at least one user in the specified user group on the second video; the specified user group is one of the at least one user group.
Optionally, the apparatus further includes:
the probability acquisition module is used for acquiring the display probability of each of the K candidate covers in the next specified-length time period according to the output result of the confidence output component when the output result of the confidence output component has not converged;
the pushing module is used for pushing the second video to each terminal by taking the K candidate covers as video covers of the second video according to the display probability of each of the K candidate covers in the next specified length time period;
The operation data acquisition module is further used for acquiring new user operation data of each of the K candidate covers in the next specified length time period;
the reinforcement learning module is further configured to reinforcement learn the network parameters of the confidence coefficient output component according to the image features of the K candidate covers and the new user operation data of the K candidate covers.
In yet another aspect, a video cover display device is provided, for use in a terminal, the device comprising:
the first receiving module is used for receiving a first video cover of a first video pushed by the server at a first moment, wherein the first video cover is any one of N candidate covers, and N is an integer greater than or equal to 2;
the first display module is used for displaying a video playing entry of the first video according to the first video cover;
the second receiving module is used for receiving a second video cover of the first video pushed by the server at a second moment; the second video cover is determined from the N candidate covers through a cover determination sub-model; the cover determination sub-model is a convolutional neural network model obtained by reinforcement learning according to the N candidate covers and target user operation data of each of the N candidate covers; the target user operation data is used for indicating target user operations received by the first video and the candidate covers corresponding to the target user operations; the target user operation is a user operation performed on the first video by each user in a target user group, and the target user group is the user group to which the user corresponding to the terminal belongs;
and the second display module is used for displaying the video playing entry of the first video according to the second video cover.
In yet another aspect, a computer device is provided, the computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by the processor to implement a video pushing method, a training method for determining a model of a video cover, or a video cover presentation method as described above.
In yet another aspect, a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions loaded and executed by a processor to implement a video pushing method, a training method for determining a model of a video cover, or a video cover presentation method as described above is provided.
The technical solutions provided by the application may include the following beneficial effects:
each candidate cover in a first video is processed through a video cover determination model to obtain the prediction confidence corresponding to each candidate cover, and the video cover of the first video is selected from the candidate covers according to the prediction confidences. Because the video cover determination model is a convolutional neural network model obtained by reinforcement training according to the operations performed by users on the same video presented with different covers, the users' selection behaviour towards video covers is comprehensively taken into account, which improves the accuracy with which the trained model subsequently determines video covers.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a framework diagram illustrating model training and video cover determination according to an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a video push in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram of a model training process for determining a video cover, according to an example embodiment;
FIG. 4 is a flow diagram of a terminal recording and uploading user operation records according to the embodiment shown in FIG. 3;
FIG. 5 is a flowchart illustrating a training and video pushing method for determining a model of a video cover, according to an example embodiment;
FIG. 6 is a schematic diagram of a video cover presentation process according to the embodiment of FIG. 5;
FIG. 7 is a diagram of video cover changes for the same video before and after model training in accordance with the embodiment of FIG. 5;
FIG. 8 is an overall framework diagram of an automated reinforcement-learning-based video cover generation and online selection method according to the embodiment of FIG. 5;
FIG. 9 is a schematic diagram of a model training process involved in the embodiment of FIG. 5;
fig. 10 is a block diagram illustrating a structure of a video pushing apparatus according to an exemplary embodiment;
FIG. 11 is a block diagram illustrating the construction of a training device for determining a model of a video cover, according to an exemplary embodiment;
FIG. 12 is a block diagram illustrating the construction of a video cover presentation device according to an exemplary embodiment;
FIG. 13 is a schematic diagram of a computer device, according to an example embodiment;
fig. 14 is a schematic diagram of a computer device according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The application provides an efficient and high-accuracy model training and model application scheme, which can train and obtain a machine learning model for determining a video cover of a video from the video, and push the video based on the video cover determined by the machine learning model. For ease of understanding, several terms referred to in the embodiments of the present application are explained below.
(1) Video cover: the picture displayed at the playing entrance of a video in the application interface of a video playing application program or in a web page is the video cover of the video. Typically, the video cover of a video is related to the content of the video; for example, the video cover may be a certain image frame in the video.
(2) Confidence of image frames: in this application, the confidence of an image frame is related to the probability that the image frame is the video cover of a specified video, i.e., the greater the probability that the image frame is the video cover of a specified video, the higher the confidence of the image frame.
With the continuous development of network video applications, more and more videos are uploaded to the network by users or video providers, and accordingly the video resources that users can choose to watch become increasingly abundant. Whether the cover of a video is appropriate is an important factor in attracting users to click and play the video. However, the uploaders of many videos on the network may not set a cover for the uploaded video, or may manually set an unsuitable cover, which requires the server of the video provider to be able to automatically set a reasonable cover for a video.
The schemes shown in the subsequent embodiments of the application provide a new model training and application scheme for determining video covers, so that the model trained through the scheme can accurately determine the image frames suitable for generating a video cover from a video, while also improving the efficiency of model training and updating.
The solutions of the various embodiments of this application involve training a machine learning model. FIG. 1 is a block diagram illustrating model training and video cover determination according to an exemplary embodiment. As shown in fig. 1, in the model training stage, the model training device 110 trains the machine learning model through the user operations of each user on different covers of the same video; in the video cover determination stage, the cover determining device 120 determines the video cover from the candidate covers of the input video according to the trained machine learning model.
The model training device 110 and the cover determining device 120 may be computer devices with machine learning capabilities, for example, the computer devices may be fixed computer devices such as a personal computer and a server, or the computer devices may be mobile computer devices such as a tablet computer, an electronic book reader, or a portable medical device.
Alternatively, the model training device 110 and the cover determining device 120 may be the same device, or the model training device 110 and the cover determining device 120 may be different devices. Also, when the model training device 110 and the cover determining device 120 are different devices, the model training device 110 and the cover determining device 120 may be the same type of device, for example, the model training device 110 and the cover determining device 120 may both be servers; alternatively, the model training device 110 and the cover determining device 120 may be different types of devices, such as a personal computer for the model training device 110 and a server for the cover determining device 120. The specific types of model training device 110 and cover determining device 120 are not limited in the embodiments of the present application.
Fig. 2 is a flow diagram illustrating video pushing according to an exemplary embodiment. The model training device (such as a server) can perform reinforcement learning on the convolutional neural network model according to the K candidate covers in the second video and the user operation data of each of the K candidate covers to obtain a video cover determination model. The user operation data is used for indicating the user operations received by the second video and the candidate covers corresponding to the user operations; K is an integer greater than or equal to 2.
As shown in fig. 2, when pushing the first video to each terminal, the server may acquire N candidate covers in the first video, N being an integer greater than or equal to 2 (S21); then respectively process the N candidate covers through the trained video cover determination model to obtain the prediction confidences of the N candidate covers, wherein the prediction confidence is used for indicating the probability that the corresponding candidate cover is a video cover (S22); then acquire the video cover of the first video from the N candidate covers according to the prediction confidences of each of the N candidate covers (S23); and push the first video to the terminal according to the video cover of the first video (S24).
In summary, in the embodiments of the present application, the convolutional neural network is first trained in a reinforcement learning manner according to the user operation data collected when K image frames of the same video are respectively used as the video cover, to obtain a video cover determination model. After model training is completed, when the first video is pushed, the N candidate covers of the first video are processed by the trained video cover determination model to obtain the probability that each of the N candidate covers is a video cover, and the video cover of the first video is determined from the N candidate covers accordingly.
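As an illustration of this push flow, the following Python sketch scores the N candidate covers with a trained model and pushes the highest-scoring one; the function and variable names are hypothetical, since the patent does not prescribe an implementation.

```python
import numpy as np

def choose_video_cover(candidate_covers, cover_model):
    """Score the N candidate covers and return the one most likely to be the video cover.

    candidate_covers: list of image arrays (the N candidates extracted from the first video)
    cover_model:      callable returning a prediction confidence in [0, 1] for one image
    """
    confidences = np.array([cover_model(cover) for cover in candidate_covers])  # S22
    best_index = int(np.argmax(confidences))                                    # S23
    return candidate_covers[best_index], confidences

def push_first_video(video_id, candidate_covers, cover_model, send_push_message):
    """Push the first video to a terminal using the cover chosen by the model (S24)."""
    cover, _ = choose_video_cover(candidate_covers, cover_model)
    send_push_message(video_id=video_id, cover=cover)
```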
FIG. 3 is a schematic diagram illustrating a training process for determining a model of a video cover, according to an exemplary embodiment. As shown in fig. 3, a developer first sets an initial convolutional neural network model that includes a feature extraction component and a confidence output component. The purpose of the model training flow shown in fig. 3 includes training the network parameters of the confidence output component. Taking the case where the model training device is a server of a video provider as an example, as shown in fig. 3, for the second video, the server may acquire K candidate covers in the second video (S31), where K is an integer greater than or equal to 2; the server extracts the image features of each of the K candidate covers through the convolutional neural network model, wherein the image features are the output of the feature extraction component in the convolutional neural network (S32); then, the server pushes the second video by taking the K candidate covers respectively as the video cover of the second video, and obtains the user operation data of each of the K candidate covers, the user operation data being used for indicating the user operations received by the second video and the candidate covers corresponding to the user operations (S33); the server performs reinforcement learning on the network parameters of the confidence output component in the convolutional neural network model according to the image features of the K candidate covers and the user operation data of the K candidate covers (S34), the confidence output component being configured to output a prediction confidence according to the image features extracted by the feature extraction component, where the prediction confidence indicates the probability that the corresponding candidate cover is a video cover. When the output result of the confidence output component converges, the server acquires the convolutional neural network model as a video cover determination model for determining video covers (S35); if the output result of the confidence output component has not converged, the server may return to step S33 and continue the video pushing and reinforcement training until the output result of the confidence output component converges.
The first video in the embodiment shown in fig. 2 and the second video in the embodiment shown in fig. 3 may be different videos; that is, after the server performs the reinforcement training by pushing the second video, the trained model may be used to determine the covers of videos other than the second video.
Alternatively, the first video and the second video may be the same video; that is, after the server performs the reinforcement training by pushing the first video, the video cover of the first video may be determined through the trained model, and the video cover determined by the model is then used in the subsequent pushing of the first video.
In the training process, the terminal side is required to have the capability of feeding back user operation information. Fig. 4 is a schematic flow chart of a terminal recording and uploading user operation records according to an embodiment of the present application. As shown in fig. 4, the terminal may receive a push message of the second video sent by the server (S41), where the video cover included in the push message is any one of the K candidate covers; the terminal displays a video playing entry of the second video according to the video cover (S42); after receiving a trigger operation on the video playing entry, the terminal acquires a user operation record (S43), the user operation record being used for indicating the user operation performed on the second video in the current terminal; and the terminal transmits the user operation record to the server (S44), so that the server acquires the user operation data of each of the K candidate covers according to the user operation records transmitted by each terminal.
In summary, in the embodiments of the present application, a convolutional neural network model including a feature extraction component and a confidence output component is used as the initial model, and the network parameters of the confidence output component are the training target. The user operation data collected when the K candidate covers of the same video are respectively used as the video cover, together with the output of the feature extraction component for each of the K candidate covers, are used to train the network parameters of the confidence output component.
In the solution shown in fig. 3, the process of determining the network parameters of the confidence output component may be performed iteratively over different time periods, and the probability of pushing each candidate cover as the video cover in the next round of parameter determination may be optimized according to the output of the confidence output component from the previous round of training.
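This period-by-period loop can be sketched as follows; the helper callables (`push_video`, `collect_operation_data`, `update_confidence_component`, `predict_confidences`) are hypothetical placeholders for the server-side steps, and the convergence tolerance is an assumption.

```python
import numpy as np

def train_by_time_periods(features, push_video, collect_operation_data,
                          update_confidence_component, predict_confidences,
                          max_periods=20, tol=1e-3):
    """Iterate reinforcement training over specified-length time periods (Fig. 3)."""
    K = len(features)
    display_prob = np.full(K, 1.0 / K)   # initially each candidate cover is pushed with equal probability
    prev_sum = None
    for _ in range(max_periods):
        push_video(display_prob)                               # push the second video with the K covers (S33)
        operation_data = collect_operation_data()              # per-cover user operation data for this period
        update_confidence_component(features, operation_data)  # reinforcement update of the output component (S34)
        confidences = predict_confidences(features)            # current prediction confidences of the K covers
        display_prob = confidences / confidences.sum()         # optimise next period's display probabilities
        total = float(confidences.sum())
        if prev_sum is not None and abs(total - prev_sum) < tol:
            break                                              # sum of prediction confidences has converged (S35)
        prev_sum = total
    return confidences
```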
Fig. 5 is a flowchart illustrating a training and video pushing method for determining a model of a video cover according to an exemplary embodiment, which may be used in computer devices, such as the model training device 110 and the cover determining device 120 shown in fig. 1, to train and obtain the video cover determining model related to the embodiment shown in fig. 2 or fig. 3, and to perform video pushing according to the determined model. Taking the above model training device 110 and cover determining device 120 as servers of video providers as an example, as shown in fig. 5, the method may include the steps of:
in step 501, the server acquires K candidate covers in the second video, where K is an integer greater than or equal to 2.
The K candidate covers may be at least two representative image frames in the second video, for example, the K candidate covers may be image frames respectively representing different scenes in the second video, or the K candidate covers may also be image frames respectively representing different people/objects in the second video.
Taking an example that the K candidate covers are image frames respectively representing different scenes in the second video, the server may acquire the K candidate covers in the second video according to the following scheme:
S501a, each key image frame in the second video is acquired.
The key image frames in the embodiment of the application are image frames corresponding to each scene in the second video respectively.
In one possible example, when acquiring each key frame in the second video, the server may first perform scene segmentation on the second video to obtain a plurality of scene segments, and then extract at least one image frame from each of the plurality of scene segments as each key image frame in the second video.
When at least one image frame is extracted from each of the several scene clips, the server may first filter out, for each scene clip, pure-color image frames, blurred image frames and repeated image frames in the scene clip, then rank the remaining image frames in the scene clip according to image quality (such as at least one of color saturation, sharpness and image content complexity), and take at least one of the top-ranked image frames as the key image frames corresponding to the scene clip.
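A sketch of this per-clip filtering and ranking is shown below. The concrete quality measures (variance-of-Laplacian sharpness and mean HSV saturation via OpenCV) and the flat-frame threshold are illustrative assumptions; the patent only names the quality indexes in general terms.

```python
import cv2
import numpy as np

def frame_quality(frame_bgr):
    """Simple image-quality score combining sharpness and colour saturation."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()                 # higher = less blurred
    saturation = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)[:, :, 1].mean()
    return sharpness + saturation

def key_frames_of_clip(clip_frames, top_n=1, min_std=5.0):
    """Drop near-solid-colour frames of a scene clip, then keep the top-ranked frames by quality."""
    candidates = [f for f in clip_frames if f.std() > min_std]        # filter pure-colour / flat frames
    candidates.sort(key=frame_quality, reverse=True)
    return candidates[:top_n]
```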
S501b, clustering is carried out on each key image frame to obtain at least two clustering centers, and each clustering center comprises at least one key image frame corresponding to the same scene type.
In the embodiment of the application, the server can perform k-means clustering on each key image frame. The k-means clustering algorithm originates from vector quantization in signal processing and is currently a popular cluster analysis method in the field of data mining. The purpose of k-means clustering is to partition n points (each point may be one observation or one sample instance) into k clusters such that each point belongs to the cluster whose mean (i.e., cluster center) is nearest to it. In the embodiment of the application, k-means clustering can be used to cluster key image frames with similar scenes together.
And S501c, respectively extracting at least one key image frame from the at least two clustering centers to obtain the K candidate covers.
In a possible example, the server may extract one image frame from each cluster center obtained by the clustering; that is, the number of cluster centers may be set equal to K, the number of image frames serving as candidate covers of the video. In this way, key image frames of similar scenes are aggregated into one class, which makes it convenient to select the most diverse cover images in a subsequent step and improves computing efficiency. For example, for the key frame pictures in each cluster center, the server ranks them according to attributes such as color saturation, sharpness and content complexity, and selects the best picture in each cluster center, thereby obtaining K image frames that form the candidate cover set.
Alternatively, the server may extract a plurality of image frames from each cluster center; for example, when the number of cluster centers is small (for example, fewer than 3), the server may extract two or three image frames from each cluster center as the K image frames.
In the embodiment of the present application, in order to further improve subsequent calculation efficiency, after the clustering is completed, the server may further screen the cluster centers to reduce the value of K. For example, the server may remove, from the at least two cluster centers, the cluster centers whose number of included key image frames is smaller than a number threshold, to obtain K cluster centers, and then extract one key image frame from each of the K cluster centers to obtain the K image frames.
The number threshold may be a value preset by a developer, or a value determined by the server according to the number of image frames contained in the video. In the latter case, the number threshold may be positively correlated with the number of image frames contained in the video: the more image frames the video contains, the larger the number threshold; conversely, the fewer image frames it contains, the smaller the number threshold.
When the number of key image frames included in a certain cluster center is small, the scene segment corresponding to that cluster center in the video can be considered short, and the image frames in that scene segment are not suitable for representing the video. Therefore, when the K image frames are acquired, the server first excludes the cluster centers containing a small number (for example, fewer than 5) of key image frames after clustering, and extracts one image frame from each of the remaining cluster centers to obtain the K image frames.
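The clustering and screening described above can be sketched as follows, reusing the `frame_quality` helper from the earlier sketch; the feature representation, number of clusters and minimum cluster size are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_covers_from_key_frames(key_frame_features, key_frames,
                                     n_clusters=8, min_cluster_size=5):
    """Cluster key frames by scene similarity and keep the best frame of each sufficiently large cluster.

    key_frame_features: (M, D) array with one feature vector per key frame
    key_frames:         list of M image arrays aligned with the feature rows
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(key_frame_features)
    candidates = []
    for cluster_id in range(n_clusters):
        member_idx = np.where(labels == cluster_id)[0]
        if len(member_idx) < min_cluster_size:        # discard clusters tied to very short scenes
            continue
        best = max(member_idx, key=lambda i: frame_quality(key_frames[i]))
        candidates.append(key_frames[best])
    return candidates                                  # the K candidate covers
```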
In addition to obtaining the K candidate covers by the above-mentioned key frame extraction and clustering, the server may obtain the K candidate covers by other manners, for example, the server may process some or all of the image frames in the first video through a pre-trained machine learning model (such as an image classification model), and select the K candidate covers according to the processing result.
In step 502, the server pushes the second video by using the K candidate covers as video covers of the second video. Correspondingly, the terminal receives a push message of the server to the second video, wherein the video cover contained in the push message is one of K candidate covers in the second video.
In one possible implementation, when the server pushes the second video to a terminal at a certain moment, one candidate cover may be randomly selected from the K candidate covers as the video cover of the second video, and a push message of the second video is sent to the terminal according to that video cover.
In one possible example, the server may directly take the selected video cover as the cover image of the second video.
Alternatively, in another possible example, the server may perform predetermined processing on the video cover to obtain a cover image of the second video, for example, the server may perform cutting, sharpening, or the like on the video cover to obtain a cover image of the second video.
That is, for the same terminal, before model training is completed, when the server pushes the second video to the terminal twice, the video covers carried in the two push messages may be different candidate covers of the second video; correspondingly, for two different terminals, when the server pushes the second video to each of them, the video covers carried in the push messages may also be different candidate covers of the second video.
The terminal may receive a push message of the second video sent by the server when displaying an interface or a webpage of an application program corresponding to the server. The application program may be a video playing application program (including a short video application program, etc.), or other application programs with a video playing function or a web page displaying function.
In step 503, the terminal displays the video playing entry of the second video according to the video cover.
After receiving the push message of the second video, the terminal may display, according to the push message, a video playing entry of the second video in an interface or a web page of the application program, for example, the video playing entry may be a picture link, or the video playing entry may be a picture control. And the picture in the picture link or the picture control is a video cover of the second video carried in the push message.
Step 504, after receiving the triggering operation on the video playing entry, the terminal obtains a user operation record, where the user operation record is used to indicate a user operation performed on the second video in the current terminal.
In this embodiment of the application, after displaying the video playing entry of the second video, the terminal further records the operations performed by the user on the second video, for example, whether the user clicks the video playing entry of the second video, the duration for which the second video is played, whether the user likes the second video, whether the user forwards the second video, and so on.
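For illustration only, a terminal-side user operation record might be structured as in the following sketch; the field names are hypothetical, since the patent does not define a concrete schema.

```python
from dataclasses import dataclass

@dataclass
class UserOperationRecord:
    """One terminal-side record of user behaviour towards the second video."""
    video_id: str
    cover_id: str            # identifier of the candidate cover shown as the video cover
    clicked: bool            # whether the video playing entry was clicked
    play_duration_s: float   # how long the video was played after the click
    liked: bool
    forwarded: bool

# Example: a record produced after the user clicked the entry and watched 42 seconds
record = UserOperationRecord(video_id="v2", cover_id="cover_3",
                             clicked=True, play_duration_s=42.0,
                             liked=False, forwarded=True)
```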
In step 505, the terminal sends the user operation record to the server, and the server receives the user operation record.
After the terminal records the user operation record, the user operation record of the second video can be uploaded to the server periodically or instantly. Correspondingly, the server receives the user operation record.
Wherein, each user operation record corresponds to the respective candidate covers.
In an exemplary embodiment, the user operation record may directly include the identification of the corresponding candidate cover. For example, when a user operation record is generated, the terminal may acquire the identifier of the candidate cover corresponding to the user operation, and add the acquired identifier of the candidate cover to the user operation record.
In another exemplary scheme, the user operation record may not directly include the identifier of the corresponding candidate cover; after receiving a user operation record sent by the terminal, the server may acquire the identifier of the candidate cover corresponding to the user operation record and store the acquired identifier together with the user operation record. For example, when generating a user operation record, the terminal may add the identifier of the push message of the corresponding video to the user operation record, and after receiving the user operation record, the server queries for the identifier of the candidate cover corresponding to the identifier of that push message.
In step 506, the server obtains the user operation data of each of the K candidate covers.
The user operation data is used for indicating user operation received by the second video and candidate covers corresponding to the user operation.
In this embodiment of the present application, the server may take a specified-length time period as a cycle and collect statistics on the user operation records of the second video uploaded by each user terminal, so as to obtain, for each specified-length time period, the user operation data of the users for the second video when each of the K candidate covers is respectively used as the cover of the second video.
Wherein, in one possible example, the user operation data includes at least one of the following data:
when the corresponding candidate cover is used as the video cover of the second video, the click rate of the second video;
when the corresponding candidate cover is used as the video cover of the second video, the playing time length of the second video after being clicked each time;
when the corresponding candidate cover is used as the video cover of the second video, the like rate of the second video;
and when the corresponding candidate cover is used as the video cover of the second video, the forwarding rate of the second video.
In an exemplary scheme, when the server acquires the user operation data of each of the K candidate covers, the server may acquire the user operation data of all the users for the second video when the K candidate covers are used as the video covers of the second video respectively in each specified length period.
For example, in a certain specified-length time period, the server receives the user operation records of 1000 users on the second video. According to these 1000 user operation records, the server can obtain the user operation data of the users for the second video when each of the K candidate covers is respectively used as the cover of the second video.
In another exemplary scheme, the server may further group the users according to the user operations of each user on the videos to obtain at least one user group, and for each user group, acquire the user operation data of each of the K candidate covers corresponding to that user group. For example, the server may acquire the user operation records of the second video from at least one user in a specified user group (each of these user operation records corresponding to a respective candidate cover), and acquire the user operation data of each of the K candidate covers corresponding to the specified user group according to the user operation records of the at least one user on the second video.
For example, the server groups the users according to the user operation records of each user in the system to obtain at least one user group. In a certain specified-length time period, the server receives the user operation records of 1000 users on the second video; for a specified user group among the at least one user group, the server can acquire the user operation records, on the second video, of those of the 1000 users (for example, 100 users) who belong to the specified user group, and obtain the user operation data of each of the K candidate covers corresponding to the specified user group.
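For illustration, the aggregation of such records into per-cover user operation data, and a hypothetical "actual confidence" (reward) derived from it, might look as follows. The weighting formula is an assumption made for the sketch, since the patent only states that the actual confidence is obtained from the user operation data; the records are assumed to have the shape of the `UserOperationRecord` sketch shown earlier.

```python
from collections import defaultdict

def aggregate_operation_data(records):
    """Aggregate terminal records into per-candidate-cover statistics for one time period."""
    stats = defaultdict(lambda: {"shown": 0, "clicks": 0, "play_s": 0.0,
                                 "likes": 0, "forwards": 0})
    for r in records:                       # records may first be filtered to one user group
        s = stats[r.cover_id]
        s["shown"] += 1
        s["clicks"] += int(r.clicked)
        s["play_s"] += r.play_duration_s
        s["likes"] += int(r.liked)
        s["forwards"] += int(r.forwarded)
    return stats

def actual_confidence(s, w_click=1.0, w_play=0.01, w_like=0.5, w_forward=0.5):
    """Hypothetical reward: weighted mix of click rate, average play time, like and forward rates."""
    shown = max(s["shown"], 1)
    return (w_click * s["clicks"] / shown
            + w_play * s["play_s"] / shown
            + w_like * s["likes"] / shown
            + w_forward * s["forwards"] / shown)
```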
Step 507, the server extracts the image features of each of the K candidate covers through a convolutional neural network model; the image features are the outputs of feature extraction components in the convolutional neural network.
In the embodiment of the present application, in the convolutional neural network model, the network parameters of the layers other than the last fully connected layer may be parameters preset by a developer.
In addition to CNN models, embodiments of the present application may also be trained using other neural network models that include at least two fully connected layers, such as recurrent neural networks (Recurrent Neural Network, RNN) or deep neural networks (Deep Neural Networks, DNN), among others. In addition, the fully connected layer in the model can be replaced by other functions for realizing vectorization of image features.
In one possible implementation, the feature extraction component in the convolutional neural network model is the same as the feature extraction component in the image classification model; the image classification model is obtained through training of a sample image and a classification label of the sample image.
In the embodiment of the application, the feature extraction part of an existing image classification model can be reused as the feature extraction component in the convolutional neural network model: the feature extraction component extracts the image features of each of the K candidate covers, and its output result is used as the feature data of each of the K candidate covers.
For example, taking a CNN model as an example, for each of the K candidate covers, the CNN model is used to extract the feature vector of the candidate cover. The CNN model is a feedforward neural network whose artificial neurons respond to surrounding units within a local receptive field, so it performs well on large-scale image processing. A CNN consists of one or more convolutional layers and fully connected layers at the top (corresponding to a classical neural network), and also includes associated weights and pooling layers. This structure enables the CNN model to exploit the two-dimensional structure of the input data. Compared with other deep learning structures, the CNN model gives better results on tasks such as image and speech recognition. Currently, mainstream network structures for image classification include the VGGNet and ResNet models. In the embodiment of the application, an image classification CNN model pre-trained on a public database can be used as the convolutional neural network model, and its network parameters are adjusted by the subsequent reinforcement learning, so that the high-level semantic features of the image frames are retained while the network features better represent the cover selection task. The feature data of each of the K candidate covers may be the output of the penultimate fully connected layer of the CNN model when it processes each of the K candidate covers.
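As a concrete illustration of step 507, the following sketch reuses a torchvision ResNet-50 pretrained on ImageNet to stand in for the publicly pretrained classification backbone; the specific backbone, the preprocessing, and the extract_cover_features helper are assumptions for illustration only. In a ResNet there is a single fully connected layer, so the "penultimate layer" output of the text corresponds here to the pooled feature vector that feeds that layer.

```python
# Minimal sketch of reusing a pretrained classifier's feature extraction component.
# The ResNet-50 backbone and the 224x224 preprocessing are assumptions.
import torch
from torchvision import models, transforms
from PIL import Image

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classification head; keep the pooled features
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_cover_features(cover_paths):
    """Return one feature vector h_i per candidate cover, stacked as (K, 2048)."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in cover_paths])
    return backbone(batch)
```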
Step 507 may be performed at any point between step 501 and step 508; the order of execution of step 502 and step 507 is not limited.
Step 508, the server performs reinforcement learning on the network parameters of the confidence output component in the convolutional neural network model according to the image features of each of the K candidate covers and the user operation data of each of the K candidate covers.
The confidence output component is configured to output a prediction confidence according to the image features extracted by the feature extraction component, where the prediction confidence is configured to indicate a probability that the corresponding candidate cover is a video cover.
Optionally, the server may obtain the actual confidence (also referred to as the reward value) of each of the K candidate covers according to the user operation data of the K candidate covers, and obtain a policy function according to the actual confidence of each of the K candidate covers. The policy function is a function that maximizes the sum of the confidences obtained from the image features of the K candidate covers, where the sum of the confidences is the sum of the prediction confidences of the K candidate covers. The matrix format of the variable parameters in the policy function is the same as the matrix format of the network parameters of the confidence output component.
For example, after a certain period of specified length ends, the server acquires the user operation data of each of the K candidate covers in that period, and obtains the network parameters corresponding to the confidence output component for that period according to the user operation data and the feature data of each of the K candidate covers.
In the embodiment of the present application, taking the CNN model as an example, for the i-th candidate cover among the K candidate covers, let the display probability of the candidate cover be P_i, where i is a positive integer less than or equal to K; then:

P_i = σ(Wh_i);

where h_i is the output of the penultimate fully connected layer of the CNN for the candidate cover, W is the network parameter of the fully connected layer following the hidden layers (namely the last fully connected layer), and σ is the sigmoid function (namely the activation function).
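A minimal sketch of this formula, assuming the features h_i are stacked row-wise and W is a single parameter vector of the last fully connected layer:

```python
import torch

def display_probabilities(h, W):
    """P_i = sigmoid(W · h_i) for each of the K candidate covers.

    h: (K, D) matrix of penultimate-layer features, one row per candidate cover.
    W: (D,) parameters of the last fully connected layer being learned.
    """
    return torch.sigmoid(h @ W)
```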
In addition, taking as an example that the user operation data includes the click rate of the second video when the corresponding candidate cover is used as the video cover of the second video and the play duration after each click of the second video, the server counts, within the period of specified length, the click rate and the play duration of the second video when each of the K candidate covers is used as the video cover.
Because the click rate of a video reflects how attractive the cover is to users, and the play duration reflects how well the cover matches the video semantically, the reward function is expressed as:

R = R_click + R_duration;

where R_click is a function taking the click rate of the video as input, and R_duration is a function taking the play duration after each click of the video as input. The task goal of the scheme is to find, in a reinforcement learning manner, a policy function P(θ) such that the sum of the confidences of the K image frames computed by the reward function is maximized, where P(θ) is defined by the CNN network. The objective function can be expressed as:

J(θ) = E_{P(θ)}[R];
Through the reinforcement learning training process, the feature representation of the covers obtained by the policy function can best fit the online user behavior.
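For illustration, the following sketch shows one reinforcement-learning update of the last-layer parameters using a simplified REINFORCE-style surrogate that weights log P_i by the observed reward; the concrete forms of R_click and R_duration, the learning rate, and the surrogate loss itself are assumptions, since the text only defines the reward as their sum and the objective as J(θ) = E_{P(θ)}[R].

```python
# Hedged sketch of the reinforcement-learning update on the last fully connected
# layer only; reward shaping and the REINFORCE-style surrogate are assumptions.
import torch

def reward(click_rate, avg_play_seconds):
    r_click = click_rate                         # assumed: identity on the click rate
    r_duration = torch.log1p(avg_play_seconds)   # assumed: log-scaled play duration
    return r_click + r_duration                  # R = R_click + R_duration

def reinforce_step(h, W, click_rate, avg_play_seconds, lr=1e-2):
    """One update of the last-layer parameters W toward maximizing J(θ) = E_{P(θ)}[R]."""
    W = W.clone().requires_grad_(True)
    p = torch.sigmoid(h @ W)                     # prediction confidences P_i
    r = reward(click_rate, avg_play_seconds)     # actual confidences (reward values)
    # Reward-weighted log-likelihood: ascending this gradient raises the confidence
    # assigned to covers that earned high rewards during this period.
    loss = -(r.detach() * torch.log(p + 1e-8)).sum()
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
    return W.detach()
```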
In step 509, the server obtains the predicted confidence levels of the K candidate covers output by the confidence level output component.
In the embodiment of the present application, the output result of the trained function is used as the prediction confidence of the corresponding candidate cover, that is, as the probability that the corresponding image should be the video cover of the second video.
In one possible example, the confidence output component includes a vectorization function and an activation function, and the server obtains the predicted confidence of each of the K candidate covers output by the confidence output component, where the process may be as follows:
The server obtains the vectorization results of each of the K candidate covers; the vectorization results of the K candidate covers are output results of the vectorization function corresponding to the K candidate covers respectively; and the server processes the vectorization results of each of the K candidate covers through the activation function to obtain the prediction confidence of each of the K candidate covers.
For example, in the embodiment of the present application, taking the CNN model as an example, the server may take the network parameter obtained by the training as the network parameter W of the last fully connected layer in the CNN model, substitute it into the above formula P_i = σ(Wh_i) to obtain the display probability of each of the K candidate covers, namely the prediction confidence of each of the K candidate covers, and accumulate the prediction confidences of the K candidate covers to obtain the sum of the prediction confidences of the K candidate covers in the period of specified length.
Step 510, the server determines whether the sum of the prediction confidences of the K candidate covers has converged; if so, the process proceeds to step 511, otherwise it returns to step 502.
The server may obtain the sum of the prediction confidence coefficients corresponding to the K candidate covers in at least one specified length period before the specified length period, and determine whether the sum of the prediction confidence coefficients of the K candidate covers converges according to the sum of the prediction confidence coefficients of the K candidate covers in the specified length period and the sum of the prediction confidence coefficients of the K candidate covers in at least one specified length period before the specified length period.
For example, when the difference between the sum of the prediction confidences in the current period of specified length and the sum of the prediction confidences in the preceding period of specified length is smaller than a difference threshold, the sum of the prediction confidences of the K candidate covers can be considered to have converged.
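A minimal sketch of this convergence test, with the difference threshold as an assumed value:

```python
def confidence_sum_converged(history, eps=1e-3):
    """True once the sum of the K prediction confidences stops changing.

    history: confidence sums of consecutive periods of specified length,
    most recent last; eps is an assumed difference threshold.
    """
    return len(history) >= 2 and abs(history[-1] - history[-2]) < eps
```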
In step 511, the server obtains the convolutional neural network model as a video cover determination model for determining a video cover.
In an exemplary aspect, when the respective user operation data of the K candidate covers are user operation data obtained according to user operation records of all users, the video cover determination model may be a model for all users.
In another exemplary aspect, when the respective user operation data of the K candidate covers are specified user operation data obtained from a user operation record of at least one user in the specified user group, the server may acquire the convolutional neural network model as the cover determination sub-model corresponding to the specified user group when the output result of the confidence output component converges.
In the embodiment of the present application, when the output result of the confidence output component does not converge, the server may further obtain, according to the output result of the confidence output component, the display probability of each of the K candidate covers in the next period of specified length; push the second video to each terminal with the K candidate covers respectively used as the cover image frame of the second video according to these display probabilities; acquire new user operation data of each of the K candidate covers within the next period of specified length; and perform reinforcement learning on the network parameters of the confidence output component according to the new user operation data and the image features of each of the K candidate covers.
In this embodiment of the present application, when the server pushes the K candidate covers as the video covers of the second video, the server may adjust the pushing strategy according to the operation behavior of the users, which reduces the accumulation requirement on user operation data and thereby accelerates the convergence speed of the model. For example, taking the CNN model as an example, after a period of specified length, the server determines that the sum of the prediction confidences of the K candidate covers has not reached the convergence state; at this time, the server calculates, according to the training result corresponding to that period, the above formula P_i = σ(Wh_i) to obtain the display probability of each of the K candidate covers, and pushes covers in the next period of specified length according to the calculated display probabilities. That is, for the i-th candidate cover, the higher its display probability, the more likely the server is to set it as the video cover of the second video when pushing the second video in the next period of specified length.
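The adjusted probing strategy can be sketched as follows, assuming the display probabilities are normalized into a sampling distribution over the K candidate covers for each push of the second video (consistent with the probing probabilities summing to 1 described later):

```python
import random

def choose_cover_for_push(display_probs):
    """Pick the index of the candidate cover to show for one push of the video."""
    total = sum(display_probs)
    weights = [p / total for p in display_probs]   # normalize to a distribution
    return random.choices(range(len(display_probs)), weights=weights, k=1)[0]
```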
In step 512, the server obtains N candidate covers in the first video, where N is an integer greater than or equal to 2.
Optionally, the server may acquire each key image frame in the first video; clustering the key image frames to obtain at least two clustering centers, wherein each clustering center comprises at least one key image frame corresponding to the same scene type; and respectively extracting at least one key image frame from the at least two clustering centers to obtain the N candidate covers.
Optionally, when extracting at least one key image frame from the at least two cluster centers to obtain the N candidate covers, the server may first reject, from the at least two cluster centers, those cluster centers whose number of included key image frames is smaller than a number threshold to obtain N cluster centers, and then extract one key image frame from each of the N cluster centers to obtain the N candidate covers.
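As an illustration, this candidate-cover mining can be sketched as below, assuming each key image frame is already represented by a feature vector; the cluster count, the size threshold, and the choice of the frame closest to each cluster center are assumptions (the text only requires extracting one key image frame per retained cluster).

```python
# Hypothetical sketch of candidate cover mining from clustered key frames.
import numpy as np
from sklearn.cluster import KMeans

def mine_candidate_covers(frame_features, num_clusters=8, min_cluster_size=3):
    """Return indices of key frames chosen as candidate covers, one per kept cluster."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(frame_features)
    candidates = []
    for c in range(num_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) < min_cluster_size:        # reject clusters with too few key frames
            continue
        dists = np.linalg.norm(frame_features[members] - km.cluster_centers_[c], axis=1)
        candidates.append(int(members[np.argmin(dists)]))  # frame closest to the center
    return candidates
```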
The step of obtaining N candidate covers in the first video by the server is similar to the step of obtaining K candidate covers in the second video, and will not be described herein.
In step 513, the server processes the N candidate covers through the video cover determination model, to obtain the prediction confidence degrees of the N candidate covers, respectively.
In one possible implementation, the video cover determination model may be a model for performing video cover determination for all user terminals.
In another possible implementation, the video cover determination model includes at least two cover determination sub-models; and the at least two cover determination sub-models correspond to respective user groups respectively; the server can inquire the target user group where the corresponding user of the terminal is located; acquiring a cover determination sub-model corresponding to the target user group, wherein the cover determination sub-model corresponding to the target user group is a convolutional neural network model obtained by reinforcement learning according to K candidate covers in the second video and target user operation data of each of the K candidate covers; the target user operation data is used for indicating target user operation and candidate covers corresponding to the target user operation; the target user operation is a user operation performed on the second video by each user in the target user group; and respectively processing the N candidate covers through the cover determination submodels corresponding to the target user group to obtain the prediction confidence degrees of the N candidate covers.
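At push time, routing to the sub-model of the user's group can be sketched as follows; the group lookup table, the dictionary of sub-models, and the callable sub-model interface are illustrative assumptions.

```python
def predict_confidences_for_user(user_id, cover_features, user_to_group, group_submodels):
    """Score the N candidate covers with the cover determination sub-model of the user's group."""
    group_id = user_to_group[user_id]        # target user group of the terminal's user
    submodel = group_submodels[group_id]     # cover determination sub-model for that group
    return submodel(cover_features)          # one prediction confidence per candidate cover
```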
Step 514, obtaining the video cover of the first video from the N candidate covers according to the prediction confidence degrees of the N candidate covers.
In this embodiment of the present application, the server may use, as the video cover of the first video, a cover with the highest prediction confidence coefficient of the N candidate covers.
Step 515, pushing the first video to the terminal according to the video cover of the first video.
After training to obtain the video cover determining model, the server can determine a video cover for each video according to the video cover determining model, and push the video according to the determined video covers.
Optionally, before determining the video cover of the first video from the N candidate covers according to the prediction confidence degrees of each of the N candidate covers, the server may further acquire an image classification of each of the N candidate covers; determining a matching cover in the N candidate covers, wherein the matching cover is a candidate cover with image classification matched with the video description information of the first video; when the cover image frame of the first video is determined from the N candidate covers according to the prediction confidence degrees of the N candidate covers, the server may acquire the candidate cover with the highest prediction confidence degree from the matching covers as the video cover of the first video.
In one possible implementation, when determining the cover of the first video, the server may also take into account the matching degree between a candidate cover and the video. For example, the server may obtain the video profile information of the first video, obtain the image classification of each of the N candidate covers, calculate the matching degree between each image classification and the video profile, and obtain, from among the candidate covers whose matching degree is higher than a matching degree threshold, the candidate cover with the highest prediction confidence as the video cover of the first video.
For example, the first video is a video of a car review program whose profile information is "car reviewer XX test-drives car Y". The server calculates the matching degree of each of the 5 candidate covers of the first video with the profile information; the 2 candidate covers that do not contain a car fall below the matching degree threshold, while the other 3 candidate covers that contain a car exceed it, so the server obtains, from among the 3 candidate covers containing a car, the candidate cover with the highest prediction confidence as the video cover of the first video.
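A minimal sketch of this filter, where the matching function between an image class and the profile text, the threshold, and the fallback to all candidates when none match are assumptions:

```python
def pick_cover_with_matching(confidences, image_classes, profile, match_fn, threshold=0.5):
    """Return the index of the most confident cover among those matching the profile."""
    matched = [i for i, cls in enumerate(image_classes) if match_fn(cls, profile) >= threshold]
    pool = matched or range(len(confidences))   # assumed fallback: all covers if none match
    return max(pool, key=lambda i: confidences[i])
```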
In one possible example, if the video cover determination model obtained by the training includes a cover determination sub-model corresponding to a specified user group (other user groups correspondingly have their own cover determination sub-models), then when pushing the first video to the terminals of the users in the specified user group, the server may determine a video cover from the N candidate covers of the first video according to that cover determination sub-model, and push the first video to those terminals according to the determined video cover.
With the above scheme, for the same first video, the terminals of users belonging to the same user group display the same video cover of the first video, while the terminals of users belonging to different user groups may display different video covers. Because the grouping of users is determined by their operations on videos, the above scheme can select, for users with different preferences, the candidate cover a user is likely to prefer from among the multiple candidate covers of the first video as the video cover. For example, if a user prefers to click on and watch videos whose covers contain men, the server may divide the user into a particular user group according to this preference, and the sub-model corresponding to that user group may subsequently be more inclined to select covers containing men when pushing other videos to the user.
The first video and the second video may be the same video or different videos. When they are the same video, the video cover of the video playing entry of the first video displayed by a certain terminal may change before and after training of the target model is completed. For example, please refer to fig. 6, which illustrates a flowchart of video cover display according to an embodiment of the present application. As shown in fig. 6, the steps of the terminal displaying the video playing entry of the first video may be as follows:
S61, at a first moment before training of the target model is completed, the terminal receives a first video cover of a first video pushed by a server, wherein the first video cover is any cover (such as a randomly selected cover) of N candidate covers, and N is an integer greater than or equal to 2.
S62, the terminal displays a video playing entry of the first video according to the first video cover;
S63, at a second moment after training of the video cover determination model is completed, the terminal can receive a second video cover of the first video pushed by the server; the second video cover is determined from the N candidate covers by a cover determination sub-model; the cover determination sub-model is a convolutional neural network model obtained by reinforcement learning according to the N candidate covers and the target user operation data of the N candidate covers; the target user operation data is used for indicating the target user operation and the candidate covers corresponding to the target user operation; the target user operation is a user operation performed by each user in a target user group on the first video, and the target user group is the user group in which the user corresponding to the terminal is located;
S64, the terminal displays the video playing entry of the first video according to the second video cover.
For example, please refer to fig. 7, which illustrates a schematic diagram of the video cover change of the same video before and after model training according to an embodiment of the present application. As shown in fig. 7, at a first moment before model training is completed, the terminal displays a video playing entry 71 of the first video, whose video cover is video cover 1, which is any one of the N candidate covers of the first video. At a second moment after model training is completed, the terminal displays a page that also contains the first video; according to the user group in which the user corresponding to the terminal is located, the server extracts a specified candidate cover from the N candidate covers of the first video through the sub-model corresponding to that user group, takes the specified candidate cover as video cover 2, and pushes video cover 2 to the terminal. As shown in fig. 7, the terminal displays the video playing entry 71 of the first video in the page, and at this time the video cover of the video playing entry 71 has changed to video cover 2.
The scheme shown in the application provides an automatic generation and online selection method for a video cover based on reinforcement learning. Reinforcement learning is a machine learning algorithm that emphasizes making selections based on current state to obtain maximum expected benefit. The method and the device can be used for probing the candidate covers in a video recommendation scene, calculating the prediction confidence of the candidate covers according to the clicking behaviors of the current user, and deciding the next probing behaviors according to the prediction confidence.
Referring to fig. 8, an overall framework diagram of a method for automatically generating and selecting video covers online based on reinforcement learning according to an embodiment of the present application is shown. As shown in fig. 8, the overall technical flow of the framework includes:
81) Offline mining of video candidate covers and saving of picture indexes.
82) For each candidate cover, a CNN (convolutional neural network) is used to extract the feature vector of the picture, that is, a CNN model extracts a one-dimensional feature representation for each candidate cover.
83) Online probing of the candidate covers of the video, accumulating the click rate and play duration data of the different candidate covers over a period of time. The probing probability of the covers is P = {P_i}, where i denotes the serial number of a cover, and ∑_i P_i = 1.
84) Online learning, based on reinforcement learning, of the sum of the prediction confidences of the candidate covers. According to the click rate and play duration data of the different candidate covers and the reward function formula R = R_click + R_duration, the actual confidence R is calculated, and the sum of the prediction confidences of the candidate covers is further calculated.
85) After the confidence scores of the candidate covers converge, the cover with the highest confidence is selected as the final display cover.
Taking an initial model as a CNN model as an example, please refer to fig. 9, which illustrates a model training flow diagram according to an embodiment of the present application. As shown in fig. 9, the model training process may be as follows:
S901, reading video data of a video 1;
s902, performing scene segmentation on the video 1, and extracting key image frames of each scene in the video 1;
s903, clustering key image frames of each scene in the video 1 to obtain K clustering centers;
s904, extracting an image frame from each cluster center to obtain K candidate covers;
S905, respectively inputting the K candidate covers into the initial model to obtain the feature data of each of the K candidate covers output by the penultimate fully connected layer;
s906, in a specified length time period, taking K candidate covers as covers of the video 1 respectively, and pushing the video 1 to each terminal;
s907, generating user operation data of each of the K candidate covers according to the operation records of the users in each terminal on the video 1, wherein the user operation data comprises click rate, play time length and the like;
s908, obtaining an objective function corresponding to the specified length time period through reinforcement training according to the characteristic data and the user operation data of each of the K candidate covers; the objective function maximizes the sum of confidence levels of the K candidate covers obtained through calculation of the user operation data;
s909, judging whether the sum of the confidence levels of the K candidate covers is converged;
S910, if the sum of the confidence coefficients of the K candidate covers is converged, setting the network parameter of the last full-connection layer in the initial model as a target parameter; the target parameter is also a parameter matrix in the target function.
S911, if the sum of the confidence coefficients of the K candidate covers is not converged, calculating respective display probabilities of the K candidate covers according to the target parameters;
s912, returning to the step S906, in the next time period with the designated length, according to the respective display probabilities of the K candidate covers, the K candidate covers are respectively used as the covers of the video 1, and the video 1 is pushed to each terminal.
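Tying steps S905 to S912 together, a hedged end-to-end sketch might look as follows; it reuses the illustrative helpers sketched earlier (extract_cover_features, display_probabilities, reinforce_step, confidence_sum_converged), and the serving callback that aggregates user operation records for one period is an assumption.

```python
import torch

def train_cover_model(cover_paths, serve_period_fn, max_periods=50):
    """serve_period_fn(probs) pushes video 1 for one period of specified length
    using the given per-cover display probabilities and returns two (K,) tensors:
    the click rate and the average play duration per candidate cover."""
    h = extract_cover_features(cover_paths)              # S905: penultimate-layer features
    W = torch.zeros(h.shape[1])                          # parameters of the last fully connected layer
    probs = torch.full((len(cover_paths),), 1.0 / len(cover_paths))
    history = []
    for _ in range(max_periods):
        click_rate, play_secs = serve_period_fn(probs)   # S906-S907: probe covers, collect records
        W = reinforce_step(h, W, click_rate, play_secs)  # S908: reinforcement update of W
        probs = display_probabilities(h, W)              # S911: new display probabilities
        history.append(float(probs.sum()))               # sum of prediction confidences
        if confidence_sum_converged(history):            # S909-S910: convergence test
            break
    return W, probs
```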
In summary, according to the scheme shown in the embodiment of the present application, a convolutional neural network model including a feature extraction component and a confidence output component is used as the initial model, the network parameters of the confidence output component are used as the training target, and the video cover determination model is obtained by training, through a reinforcement learning method, on the user operation data collected when the K candidate covers corresponding to the same video are respectively used as the video cover, together with the output results of the feature extraction component after processing each of the K candidate covers.
In addition, the scheme disclosed by the embodiment of the application automatically generates the cover map candidate set comprising a plurality of image frames, so that a user can conveniently and quickly position a target video, and the video click rate is improved.
In addition, the scheme shown in the embodiment of the application obtains the confidence degree of the plurality of candidate covers serving as the covers based on the operation behaviors of the user, wherein the confidence degree can reflect the attraction degree of the covers to the user and the matching degree with the video theme, and reflect the partial order relation among the plurality of candidate covers serving as the covers in the same video.
In addition, according to the scheme disclosed by the embodiment of the application, through the reinforcement learning end-to-end learning process, the earlier-stage feature design and the extraction work are avoided, and the cover which better accords with the user preference is obtained.
In addition, according to the scheme disclosed by the embodiment of the application, the heuristic strategy is adjusted according to the real-time clicking behaviors of the user while different cover diagrams are tested and displayed, so that the accumulation requirement on the clicking rate data of the user is reduced, and the convergence speed of the model is accelerated.
In addition, in the embodiment of the application, when the server pushes the K candidate covers as the video covers, the pushing strategy can be adjusted according to the operation behaviors of the user, so that the accumulation requirement on the operation data of the user is reduced, and the effect of accelerating the convergence speed of the model is achieved.
The scheme is based on reinforcement learning and automatically selects the cover most attractive to users according to their click and play behaviors in the video recommendation system. Its beneficial effect is that, in a video recommendation scenario, the candidate set for video cover display can be expanded and the most suitable cover for display can be selected automatically, without manually labeled data or feature engineering design, thereby improving the click rate and play duration of videos.
In the above embodiments, only the server of the video provider is taken as an example for the model training device, and in other exemplary solutions, the model training device may also be another device other than the server, for example, a management device connected to the server, or a stand-alone personal computer device, or the like, or the model training device may also be a cloud computing center, or the like. The specific form of the model training device is not limited in the application.
Through the scheme shown in the embodiment of the application, the training and using method of the model can be applied to artificial intelligence (Artificial Intelligence, AI) for automatically determining the video covers for users so as to push the proper video covers to each user, or respectively push the video covers possibly preferred by the users for different users.
Fig. 10 is a block diagram showing the structure of a video pushing apparatus according to an exemplary embodiment. The video pushing apparatus may be used in a computer device to perform all or part of the steps performed by the server in the embodiments shown in fig. 2 or fig. 5. The video pushing device may include:
a candidate cover acquisition module 1001, configured to acquire N candidate covers in the first video, where N is an integer greater than or equal to 2;
the confidence coefficient prediction module 1002 is configured to process the N candidate covers through a video cover determination model to obtain the respective prediction confidence coefficients of the N candidate covers, where the prediction confidence coefficients are used to indicate the probability that the corresponding candidate covers are video covers; the video cover determination model is a convolutional neural network model obtained by reinforcement learning according to K candidate covers in the second video and the user operation data of each of the K candidate covers; the user operation data is used for indicating the user operation received by the second video and the candidate covers corresponding to the user operation; K is an integer greater than or equal to 2;
a video cover acquisition module 1003, configured to acquire a video cover of the first video from the N candidate covers according to respective prediction confidence degrees of the N candidate covers;
The video pushing module 1004 is configured to push the first video to a terminal according to the video cover of the first video.
Optionally, the video cover determining model includes at least two cover determining sub-models; and the at least two cover determination sub-models correspond to respective user groups respectively;
the confidence prediction module 1002 is configured to,
inquiring a target user group where a corresponding user of the terminal is located;
acquiring a cover determination sub-model corresponding to the target user group, wherein the cover determination sub-model corresponding to the target user group is a convolutional neural network model obtained by reinforcement learning according to K candidate covers in the second video and target user operation data of each of the K candidate covers; the target user operation data is used for indicating target user operation and candidate covers corresponding to the target user operation; the target user operation is a user operation performed on the second video by each user in the target user group;
and respectively processing the N candidate covers through the cover determination submodels corresponding to the target user group to obtain the prediction confidence degrees of the N candidate covers.
Optionally, the candidate cover acquisition module 1001 is configured to,
acquiring each key image frame in the first video;
clustering the key image frames to obtain at least two clustering centers, wherein each clustering center comprises at least one key image frame corresponding to the same scene type;
and respectively extracting at least one key image frame from the at least two clustering centers to obtain the N candidate covers.
Optionally, when at least one key image frame is extracted from the at least two cluster centers, respectively, and the N candidate covers are obtained, the candidate cover obtaining module 1001 is configured to,
removing the cluster centers with the number of the included key image frames smaller than the number threshold value from the at least two cluster centers to obtain N cluster centers;
and respectively extracting one key image frame from the N clustering centers to obtain the N candidate covers.
Optionally, the video cover determining model includes a feature extraction component and a confidence output component;
the feature extraction component is used for extracting the image features of the input candidate covers;
the confidence coefficient output component is used for outputting the prediction confidence coefficient of the input candidate covers according to the image features extracted by the feature extraction component.
Optionally, the feature extraction component is identical to the feature extraction component in the image classification model;
the image classification model is a convolutional neural network model obtained through sample image and classification label training of the sample image.
In summary, in this embodiment of the present application, the convolutional neural network is trained in advance, in a reinforcement learning manner, on the user operation data collected when K image frames corresponding to the same video are respectively used as the video cover, to obtain a video cover determination model; after model training is completed, when the first video is pushed, the N candidate covers of the first video are processed by the trained video cover determination model to obtain the probability of each of the N candidate covers being a video cover, and the video cover of the first video is determined from the N candidate covers.
FIG. 11 is a block diagram illustrating the structure of a training device for determining a model of a video cover, according to an exemplary embodiment. The apparatus may be used in a computer device to perform all or part of the steps performed by a server in the embodiments shown in fig. 3 or 5. The apparatus may include:
The candidate cover acquisition module 1101 is configured to acquire K candidate covers in the second video, where K is an integer greater than or equal to 2;
the feature extraction module 1102 is configured to extract image features of each of the K candidate covers through a convolutional neural network model; the image features are the output of a feature extraction component in the convolutional neural network;
an operation data obtaining module 1103, configured to respectively take the K candidate covers as video covers of the second video, push the second video, and obtain user operation data of each of the K candidate covers; the user operation data is used for indicating user operation received by the second video and candidate covers corresponding to the user operation;
the reinforcement learning module 1104 is configured to reinforcement learn network parameters of the confidence coefficient output component in the convolutional neural network model according to the image features of the K candidate covers and the user operation data of the K candidate covers; the confidence coefficient output component is used for outputting prediction confidence coefficient according to the image features extracted by the feature extraction component, and the prediction confidence coefficient is used for indicating the probability that the corresponding candidate cover is a video cover;
A model obtaining module 1105, configured to obtain the convolutional neural network model as a video cover determination model for determining a video cover when the output result of the confidence output component converges.
Optionally, the apparatus further includes:
a prediction confidence coefficient obtaining module, configured to obtain, before the model obtaining module 1105, the prediction confidence coefficient of each of the K candidate covers output by the confidence coefficient output component;
and the convergence determining module is used for determining that the output result of the confidence output component converges when the sum of the prediction confidence coefficients of the K candidate covers converges.
Optionally, the confidence output component includes a vectorization function and an activation function, the predictive confidence acquisition module for,
obtaining vectorization results of each of the K candidate covers; the vectorization results of the K candidate covers are output results of the vectorization functions corresponding to the K candidate covers respectively;
and processing the vectorization results of each of the K candidate covers through the activation function to obtain the prediction confidence of each of the K candidate covers.
Optionally, the reinforcement learning module 1104 is configured to,
Acquiring actual confidence coefficients of the K candidate covers according to the user operation data of the K candidate covers;
obtaining a policy function according to the actual confidence of each of the K candidate covers, wherein the policy function is a function that maximizes the sum of the confidences obtained according to the image features of the K candidate covers, and the sum of the confidences is the sum of the prediction confidences of the K candidate covers; the matrix format of the variable parameters in the policy function is the same as the matrix format of the network parameters of the confidence output component;
and obtaining the variable parameters in the policy function as the network parameters of the vectorization function.
Optionally, the operation data acquisition module 1103 is configured to,
respectively taking the K candidate covers as video covers of the second video, and pushing the second video;
acquiring user operation records of at least one user in a specified user group on the second video, wherein the user operation records correspond to respective candidate covers;
acquiring user operation data of each of the K candidate covers corresponding to the appointed user group according to the user operation record of the at least one user on the second video;
The obtaining the convolutional neural network model as a video cover determining model for determining a video cover when the output result of the confidence output component converges includes:
and when the output result of the confidence output component converges, acquiring the convolutional neural network model as a cover determination sub-model corresponding to the specified user group.
Optionally, the apparatus further includes: the grouping module is configured to group each user according to the user operation record of each user on each video before the operation data obtaining module 1103 obtains the user operation record of the second video by at least one user in a specified user group, so as to obtain at least one user group, where the at least one user group includes the specified user group.
Optionally, the apparatus further includes:
the probability acquisition module is used for acquiring the display probability of each of the K candidate covers in the next period of specified length according to the output result of the confidence output component when the output result of the confidence output component does not converge;
the pushing module is used for pushing the second video to each terminal by taking the K candidate covers as video covers of the second video according to the display probability of each of the K candidate covers in the next specified length time period;
The operation data obtaining module 1103 is further configured to obtain new user operation data of each of the K candidate covers in the next specified length period;
the reinforcement learning module 1104 is further configured to reinforcement learn the network parameters of the confidence output component according to the image features of each of the K candidate covers and the new user operation data of each of the K candidate covers.
In summary, according to the scheme shown in the embodiment of the present application, a convolutional neural network model including a feature extraction component and a confidence output component is used as the initial model, the network parameters of the confidence output component are used as the training target, and the video cover determination model is obtained by training, through a reinforcement learning method, on the user operation data collected when the K candidate covers corresponding to the same video are respectively used as the video cover, together with the output results of the feature extraction component after processing each of the K candidate covers.
In addition, the scheme disclosed by the embodiment of the application automatically generates the cover map candidate set comprising a plurality of image frames, so that a user can conveniently and quickly position a target video, and the video click rate is improved.
In addition, the scheme shown in the embodiment of the application obtains the confidence degree of the plurality of candidate covers serving as the covers based on the operation behaviors of the user, wherein the confidence degree can reflect the attraction degree of the covers to the user and the matching degree with the video theme, and reflect the partial order relation among the plurality of candidate covers serving as the covers in the same video.
In addition, according to the scheme disclosed by the embodiment of the application, through the reinforcement learning end-to-end learning process, the earlier-stage feature design and the extraction work are avoided, and the cover which better accords with the user preference is obtained.
In addition, according to the scheme disclosed by the embodiment of the application, the heuristic strategy is adjusted according to the real-time clicking behaviors of the user while different cover diagrams are tested and displayed, so that the accumulation requirement on the clicking rate data of the user is reduced, and the convergence speed of the model is accelerated.
In addition, in the embodiment of the application, when the server pushes the K candidate covers as the video covers, the pushing strategy can be adjusted according to the operation behaviors of the user, so that the accumulation requirement on the operation data of the user is reduced, and the effect of accelerating the convergence speed of the model is achieved.
FIG. 12 is a block diagram illustrating the construction of a video cover presentation device according to an exemplary embodiment. The video cover presentation apparatus may be used in a computer device to perform all or part of the steps performed by the terminal in the embodiments shown in fig. 2, 3 or 5. The video cover presentation apparatus may include:
the first receiving module 1201 is configured to receive, at a first time, a first video cover of a first video pushed by a server, where the first video cover is any one of N candidate covers, and N is an integer greater than or equal to 2;
a first display module 1202, configured to display a video playing entry of the first video according to the first video cover;
the second receiving module 1203 is configured to receive, at a second moment, a second video cover of the first video pushed by the server; the second video cover is determined from the N candidate covers through a cover determination sub-model; the cover determination sub-model is a convolutional neural network model obtained by reinforcement learning according to the N candidate covers and the target user operation data of the N candidate covers; the target user operation data is used for indicating the target user operation received by the first video and the candidate covers corresponding to the target user operation; the target user operation is a user operation performed on the first video by each user in a target user group, and the target user group is the user group in which the user corresponding to the terminal is located;
And the second display module 1204 is configured to display the video playing entry of the first video according to the second video cover.
Fig. 13 is a schematic diagram of a computer device according to an exemplary embodiment. The computer apparatus 1300 includes a Central Processing Unit (CPU) 1301, a system memory 1304 including a Random Access Memory (RAM) 1302 and a Read Only Memory (ROM) 1303, and a system bus 1305 connecting the system memory 1304 and the central processing unit 1301. The computer device 1300 also includes a basic input/output system (I/O system) 1306, which helps to transfer information between the various devices within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
The basic input/output system 1306 includes a display 1308 for displaying information, and an input device 1309, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1308 and the input device 1309 are connected to the central processing unit 1301 through an input output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a keyboard, mouse, or electronic stylus, among a plurality of other devices. Similarly, the input output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown), such as a hard disk or CD-ROM drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the one described above. The system memory 1304 and mass storage device 1307 described above may be referred to collectively as memory.
The computer device 1300 may be connected to the internet or other network device through a network interface unit 1311 connected to the system bus 1305.
The memory further includes one or more programs stored in the memory, and the central processor 1301 implements all or part of the steps performed by the server in the methods shown in fig. 2, 3, or 5 by executing the one or more programs.
Fig. 14 shows a block diagram of a terminal 1400 provided in an exemplary embodiment of the present application. The terminal 1400 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1400 may also be referred to as a user device, a portable terminal, a laptop terminal, a desktop terminal, and the like.
In general, terminal 1400 includes: a processor 1401 and a memory 1402.
Processor 1401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1401 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 1401 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1401 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 1401 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1402 may include one or more computer-readable storage media, which may be non-transitory. Memory 1402 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1402 is used to store at least one instruction for execution by processor 1401 to implement all or a portion of the steps performed by a terminal in the method embodiments illustrated in fig. 2, 3, or 5 described above.
In some embodiments, terminal 1400 may optionally further include: a peripheral interface 1403 and at least one peripheral. The processor 1401, memory 1402, and peripheral interface 1403 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1403 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1404, a touch display screen 1405, a camera 1406, audio circuitry 1407, a positioning component 1408, and a power source 1409.
Peripheral interface 1403 may be used to connect at least one Input/Output (I/O) related peripheral to processor 1401 and memory 1402. In some embodiments, processor 1401, memory 1402, and peripheral interface 1403 are integrated on the same chip or circuit board; in some other embodiments, either or both of processor 1401, memory 1402, and peripheral interface 1403 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1404 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1404 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1404 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1404 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuit 1404 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuit 1404 may also include NFC (Near Field Communication, short range wireless communication) related circuits, which are not limited in this application.
The display screen 1405 is used to display UI (user interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1405 is a touch display screen, the display screen 1405 also has the ability to collect touch signals at or above the surface of the display screen 1405. The touch signal may be input to the processor 1401 as a control signal for processing. At this time, the display 1405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 1405 may be one, providing a front panel of the terminal 1400; in other embodiments, the display 1405 may be at least two, respectively disposed on different surfaces of the terminal 1400 or in a folded design; in still other embodiments, the display 1405 may be a flexible display disposed on a curved surface or a folded surface of the terminal 1400. Even more, the display 1405 may be arranged in a non-rectangular irregular pattern, i.e. a shaped screen. The display 1405 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera component 1406 is used to capture images or video. Optionally, camera assembly 1406 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so as to realize a background blurring function through fusion of the main camera and the depth-of-field camera, panoramic shooting and VR (Virtual Reality) shooting functions through fusion of the main camera and the wide-angle camera, or other fusion shooting functions. In some embodiments, camera assembly 1406 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuitry 1407 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1401 for processing, or inputting the electric signals to the radio frequency circuit 1404 for voice communication. For purposes of stereo acquisition or noise reduction, a plurality of microphones may be provided at different portions of the terminal 1400, respectively. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1401 or the radio frequency circuit 1404 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuitry 1407 may also include a headphone jack.
The locating component 1408 is used to locate the current geographic location of the terminal 1400 to enable navigation or LBS (Location Based Service, location-based services). The positioning component 1408 may be a positioning component based on the united states GPS (Global Positioning System ), the chinese beidou system, or the russian galileo system.
A power supply 1409 is used to power the various components in the terminal 1400. The power supply 1409 may use alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 1409 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
In some embodiments, terminal 1400 also includes one or more sensors 1410. The one or more sensors 1410 include, but are not limited to: acceleration sensor 1411, gyroscope sensor 1412, pressure sensor 1413, fingerprint sensor 1414, optical sensor 1415, and proximity sensor 1416.
The acceleration sensor 1411 may detect the magnitudes of acceleration on three coordinate axes of a coordinate system established with the terminal 1400. For example, the acceleration sensor 1411 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 1401 may control the touch display screen 1405 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1411. The acceleration sensor 1411 may also be used to collect game or user motion data.
The gyroscope sensor 1412 may detect the body direction and rotation angle of the terminal 1400, and may cooperate with the acceleration sensor 1411 to collect the user's 3D actions on the terminal 1400. Based on the data collected by the gyroscope sensor 1412, the processor 1401 may implement the following functions: motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1413 may be disposed on a side frame of the terminal 1400 and/or at a lower layer of the touch display screen 1405. When the pressure sensor 1413 is disposed on a side frame of the terminal 1400, a user's grip signal on the terminal 1400 can be detected, and the processor 1401 performs left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1413. When the pressure sensor 1413 is disposed at the lower layer of the touch display screen 1405, the processor 1401 controls operability controls on the UI according to the user's pressure operation on the touch display screen 1405. The operability controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The fingerprint sensor 1414 is used to collect a user's fingerprint, and the processor 1401 identifies the user's identity from the fingerprint collected by the fingerprint sensor 1414, or the fingerprint sensor 1414 itself identifies the user's identity from the collected fingerprint. When the user's identity is recognized as a trusted identity, the processor 1401 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 1414 may be provided on the front, back, or side of the terminal 1400. When a physical key or vendor logo is provided on the terminal 1400, the fingerprint sensor 1414 may be integrated with the physical key or vendor logo.
The optical sensor 1415 is used to collect the ambient light intensity. In one embodiment, the processor 1401 may control the display brightness of the touch screen 1405 based on the intensity of ambient light collected by the optical sensor 1415. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 1405 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 1405 is turned down. In another embodiment, the processor 1401 may also dynamically adjust the shooting parameters of the camera assembly 1406 based on the ambient light intensity collected by the optical sensor 1415.
The proximity sensor 1416, also referred to as a distance sensor, is typically provided on the front panel of the terminal 1400. The proximity sensor 1416 is used to collect the distance between the user and the front of the terminal 1400. In one embodiment, when the proximity sensor 1416 detects that the distance between the user and the front of the terminal 1400 gradually decreases, the processor 1401 controls the touch display screen 1405 to switch from the bright-screen state to the off-screen state; when the proximity sensor 1416 detects that the distance between the user and the front of the terminal 1400 gradually increases, the processor 1401 controls the touch display screen 1405 to switch from the off-screen state to the bright-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 14 is not limiting, and that the terminal 1400 may include more or fewer components than those illustrated, may combine certain components, or may employ a different arrangement of components.
In exemplary embodiments, a non-transitory computer-readable storage medium, such as a memory, is also provided, including a computer program (instructions) executable by a processor of a computer device to perform all or part of the steps of the methods shown in the various embodiments of the present application. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Claims (13)
1. A video pushing method, the method comprising:
acquiring N candidate covers in a first video, wherein N is an integer greater than or equal to 2;
inquiring a target user group where a user corresponding to the terminal is located;
acquiring a cover determination sub-model corresponding to the target user group, wherein the cover determination sub-model belongs to a video cover determination model, and the video cover determination model is a convolutional neural network model obtained by reinforcement learning according to K candidate covers in a second video and user operation data of the K candidate covers; the user operation data is used for indicating a user operation received by the second video and the candidate covers corresponding to the user operation, K being an integer greater than or equal to 2; the video cover determination model comprises at least two cover determination sub-models, and the at least two cover determination sub-models respectively correspond to respective user groups; the cover determination sub-model corresponding to the target user group is a convolutional neural network model obtained by reinforcement learning according to the K candidate covers in the second video and target user operation data of the K candidate covers; the target user operation data is used for indicating a target user operation and candidate covers corresponding to the target user operation; the target user operation is a user operation performed on the second video by each user in the target user group;
processing the N candidate covers respectively through the cover determination sub-model corresponding to the target user group to obtain respective prediction confidences of the N candidate covers, wherein the prediction confidence is used for indicating the probability that the corresponding candidate cover is a video cover;
acquiring a video cover of the first video from the N candidate covers according to the respective prediction confidences of the N candidate covers; and
pushing the first video to the terminal according to the video cover of the first video.
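As an illustrative, non-limiting sketch of the push flow recited in claim 1: the Python snippet below is a reading aid only, and the `DummySubmodel` class, its intensity-based scoring, and the array shapes are placeholders standing in for the trained convolutional neural network sub-model of the target user group.

```python
import numpy as np

def select_cover(candidate_covers, submodel):
    """Pick the candidate cover with the highest prediction confidence (claim 1)."""
    assert len(candidate_covers) >= 2                      # N >= 2
    confidences = np.array([submodel.predict(c) for c in candidate_covers])
    return candidate_covers[int(np.argmax(confidences))], confidences

class DummySubmodel:
    """Stand-in for the cover determination sub-model of a user group."""
    def predict(self, cover):
        # A real sub-model would run the CNN on the cover image;
        # here the mean pixel intensity is used purely for illustration.
        return float(np.mean(cover))

covers = [np.random.rand(224, 224, 3) for _ in range(4)]   # N = 4 candidate covers
best_cover, conf = select_cover(covers, DummySubmodel())
# `best_cover` would be attached to the first video when pushing it to the terminal.
```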
2. The method of claim 1, wherein the obtaining N candidate covers in the first video comprises:
acquiring each key image frame in the first video;
clustering the key image frames to obtain at least two cluster centers, wherein each cluster center comprises at least one key image frame corresponding to the same scene type; and
extracting at least one key image frame from the at least two cluster centers, respectively, to obtain the N candidate covers.
3. The method of claim 2, wherein the extracting at least one key image frame from the at least two cluster centers, respectively, to obtain the N candidate covers, comprises:
removing, from the at least two cluster centers, cluster centers in which the number of included key image frames is smaller than a number threshold, to obtain N cluster centers; and
extracting one key image frame from each of the N cluster centers to obtain the N candidate covers.
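As an illustrative sketch of claims 2 and 3, assuming k-means clustering over flattened frame pixels (the description mentions k-means clustering; the feature representation, cluster count, and size threshold below are assumptions, not values fixed by the claims):

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_candidate_covers(key_frames, n_clusters=8, min_cluster_size=3):
    """Cluster key frames, drop small clusters, and keep one frame per remaining cluster."""
    feats = np.array([f.reshape(-1) for f in key_frames])          # one vector per key frame
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)

    candidates = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if len(members) < min_cluster_size:                         # below the number threshold
            continue
        # Extract one key frame from each remaining cluster: the frame closest to the cluster mean.
        center = feats[members].mean(axis=0)
        best = members[np.argmin(np.linalg.norm(feats[members] - center, axis=1))]
        candidates.append(key_frames[best])
    return candidates                                                # the N candidate covers

key_frames = [np.random.rand(64, 64, 3) for _ in range(100)]
covers = pick_candidate_covers(key_frames)
```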
4. The method of any of claims 1 to 3, wherein the video cover determination model includes a feature extraction component and a confidence output component;
the feature extraction component is used for extracting the image features of the input candidate covers;
the confidence output component is used for outputting the prediction confidences of the input candidate covers according to the image features extracted by the feature extraction component.
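A minimal sketch of this two-component split, written in PyTorch as an assumed framework; the claims only require a convolutional neural network, and the layer sizes below are placeholders.

```python
import torch
import torch.nn as nn

class CoverDeterminationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extraction component: convolution and pooling layers that
        # turn a candidate cover into an image-feature vector.
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Confidence output component: maps the image features to the prediction
        # confidence, i.e. the probability that the cover is the video cover.
        self.confidence_head = nn.Sequential(nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, covers):                    # covers: (N, 3, H, W)
        features = self.feature_extractor(covers)
        return self.confidence_head(features).squeeze(-1)

model = CoverDeterminationModel()
confidences = model(torch.rand(4, 3, 224, 224))   # one prediction confidence per candidate
```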
5. A training method for a video cover determination model, the method comprising:
acquiring K candidate covers in a second video, wherein K is an integer greater than or equal to 2;
extracting respective image features of the K candidate covers through a convolutional neural network model; the image features are the output of a feature extraction component in the convolutional neural network model;
pushing the second video with the K candidate covers respectively used as video covers of the second video, and acquiring user operation data of each of the K candidate covers; the user operation data is used for indicating a user operation received by the second video and the candidate covers corresponding to the user operation;
performing reinforcement learning on network parameters of a confidence output component in the convolutional neural network model according to the respective image features and the respective user operation data of the K candidate covers; the confidence output component is used for outputting a prediction confidence according to the image features extracted by the feature extraction component, and the prediction confidence is used for indicating the probability that the corresponding candidate cover is a video cover;
when the output result of the confidence output component converges, acquiring the convolutional neural network model as a video cover determination model for determining a video cover; and
when the output result of the confidence output component does not converge, acquiring respective display probabilities of the K candidate covers in a next time period of a specified length according to the output result of the confidence output component; pushing the second video to each terminal with the K candidate covers respectively used as video covers of the second video according to those display probabilities; acquiring new user operation data of each of the K candidate covers in the next time period of the specified length; and performing reinforcement learning on the network parameters of the confidence output component according to the respective image features of the K candidate covers and the new user operation data of each of the K candidate covers.
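To make the control flow of claim 5 easier to follow, the toy simulation below pushes covers under the current display probabilities, observes simulated user operations, updates, and stops once the confidence output stabilizes. The `ToyModel` class, the click-through-rate reward, the proportional display-probability rule, and all constants are illustrative assumptions and do not reproduce the claimed CNN or its reinforcement-learning update.

```python
import numpy as np

class ToyModel:
    """Toy stand-in for the confidence output component: one weight per candidate cover."""
    def __init__(self, k):
        self.w = np.zeros(k)
    def predict_all(self):
        return 1.0 / (1.0 + np.exp(-self.w))              # current prediction confidences
    def update(self, observed, lr=0.5):
        self.w += lr * (observed - self.predict_all())     # nudge toward observed behaviour

def simulated_user_ops(display_probs, true_ctr, impressions=1000):
    """Push each cover with its display probability and return the observed CTR per cover."""
    shows = np.random.multinomial(impressions, display_probs)
    clicks = np.random.binomial(shows, true_ctr)
    return clicks / np.maximum(shows, 1)

true_ctr = np.array([0.02, 0.05, 0.08, 0.03])              # unknown to the model (K = 4)
model, prev = ToyModel(len(true_ctr)), None
for _ in range(30):                                         # training rounds
    conf = model.predict_all()
    if prev is not None and np.max(np.abs(conf - prev)) < 1e-3:
        break                                               # output converged: keep the model
    prev = conf
    display_probs = conf / conf.sum()                       # display probabilities for the next window
    model.update(simulated_user_ops(display_probs, true_ctr))
```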
6. The method of claim 5, wherein the performing reinforcement learning on the network parameters of the confidence output component in the convolutional neural network model according to the respective image features and the respective user operation data of the K candidate covers comprises:
acquiring respective actual confidences of the K candidate covers according to the respective user operation data of the K candidate covers;
obtaining a policy function according to the respective actual confidences of the K candidate covers, wherein the policy function is a function for maximizing a confidence sum obtained according to the respective image features of the K candidate covers, the confidence sum being the sum of the respective prediction confidences of the K candidate covers; the matrix format of the variable parameters in the policy function is the same as the matrix format of the network parameters of the confidence output component; and
acquiring the variable parameters in the policy function as the network parameters of the confidence output component.
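As a hedged illustration of claim 6, the sketch below assumes that a cover's actual confidence is its click-through rate and uses a simple cross-entropy fit toward those actual confidences as a stand-in for the policy-function optimization; the parameter vector `theta` keeps the same shape as the confidence output component's weights, mirroring the matrix-format condition in the claim.

```python
import numpy as np

def actual_confidences(clicks, impressions):
    """Derive each cover's actual confidence from its user operation data (assumed: CTR)."""
    return clicks / np.maximum(impressions, 1)

def fit_confidence_parameters(features, target_conf, lr=0.1, steps=500):
    """Find variable parameters with the same shape as the confidence component's weights."""
    theta = np.zeros(features.shape[1])                  # same matrix format as the network parameters
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-features @ theta))   # predicted confidences of the K covers
        theta -= lr * features.T @ (pred - target_conf)  # cross-entropy gradient step
    return theta                                         # adopted as the new network parameters

image_features = np.random.rand(4, 32)                    # image features of K = 4 candidate covers
target = actual_confidences(np.array([20, 50, 80, 30]), np.array([1000, 1000, 1000, 1000]))
theta = fit_confidence_parameters(image_features, target)
```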
7. The method of claim 5, wherein
the pushing the second video with the K candidate covers respectively used as video covers of the second video, and acquiring user operation data of each of the K candidate covers, comprises:
pushing the second video with the K candidate covers respectively used as video covers of the second video;
acquiring user operation records of at least one user in a specified user group on the second video, wherein the user operation records correspond to respective candidate covers;
acquiring user operation data of each of the K candidate covers corresponding to the specified user group according to the user operation records of the at least one user on the second video;
the acquiring the convolutional neural network model as a video cover determination model for determining a video cover when the output result of the confidence output component converges comprises:
when the output result of the confidence output component converges, acquiring the convolutional neural network model as the cover determination sub-model corresponding to the specified user group.
8. The method of claim 7, wherein, before the acquiring user operation records of at least one user in the specified user group on the second video, the method further comprises:
grouping users according to the user operation records of the users on videos to obtain at least one user group, wherein the at least one user group comprises the specified user group.
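As an illustrative reading of claim 8, assuming users are grouped by k-means over behaviour vectors built from their operation records (the vectorisation and group count below are assumptions, not fixed by the claim):

```python
import numpy as np
from sklearn.cluster import KMeans

def group_users(user_behavior_vectors, n_groups=4):
    """user_behavior_vectors: (num_users, num_features) statistics from operation records."""
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(
        user_behavior_vectors)
    return {g: np.where(labels == g)[0] for g in range(n_groups)}   # user indices per group

# e.g. per-user [watch rate, like rate, share rate] derived from their operation records
behavior = np.random.rand(500, 3)
groups = group_users(behavior)      # each group later gets its own cover determination sub-model
```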
9. A video cover presentation method for use in a terminal, the method comprising:
at a first moment, receiving a first video cover of a first video pushed by a server, wherein the first video cover is any one of N candidate covers, and N is an integer greater than or equal to 2;
displaying a video playing entry of the first video according to the first video cover;
at a second moment, receiving a second video cover of the first video pushed by the server; the second video cover is determined from the N candidate covers according to the respective prediction confidences of the N candidate covers, and the prediction confidence is used for indicating the probability that the corresponding candidate cover is the video cover; the respective prediction confidences of the N candidate covers are obtained by processing the N candidate covers respectively through a cover determination sub-model corresponding to a target user group; the target user group is the user group, queried by the server, in which the user corresponding to the terminal is located; a video cover determination model comprises at least two cover determination sub-models, the at least two cover determination sub-models respectively correspond to respective user groups, and the cover determination sub-model is a convolutional neural network model obtained by reinforcement learning according to the N candidate covers and target user operation data of the N candidate covers; the target user operation data is used for indicating a target user operation received by the first video and candidate covers corresponding to the target user operation; the target user operation is a user operation performed on the first video by each user in the target user group; and
displaying the video playing entry of the first video according to the second video cover.
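A minimal terminal-side sketch of claim 9; the message format and the `render` callback are hypothetical, and only the two-moment replacement of the displayed cover is taken from the claim.

```python
def on_cover_push(message, ui_state, render):
    """Handle a cover pushed by the server for the first video and refresh its playing entry."""
    video_id, cover = message["video_id"], message["cover"]
    ui_state[video_id] = cover        # keep the latest cover for this video's playing entry
    render(video_id, cover)           # redraw the entry with the current cover

ui = {}
# First moment: the server pushes any one of the N candidate covers.
on_cover_push({"video_id": "v1", "cover": "cover_3.jpg"}, ui, print)
# Second moment: the server pushes the cover selected by the group's sub-model, replacing it.
on_cover_push({"video_id": "v1", "cover": "cover_1.jpg"}, ui, print)
```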
10. A video pushing device, the device comprising:
the candidate cover acquisition module is used for acquiring N candidate covers in the first video, wherein N is an integer greater than or equal to 2;
the confidence prediction module is used for inquiring a target user group where a user corresponding to the terminal is located; acquiring a cover determination sub-model corresponding to the target user group, wherein the cover determination sub-model belongs to a video cover determination model, and the video cover determination model is a convolutional neural network model obtained by reinforcement learning according to K candidate covers in a second video and user operation data of the K candidate covers; the user operation data is used for indicating a user operation received by the second video and the candidate covers corresponding to the user operation, K being an integer greater than or equal to 2; the video cover determination model comprises at least two cover determination sub-models, and the at least two cover determination sub-models respectively correspond to respective user groups; the cover determination sub-model corresponding to the target user group is a convolutional neural network model obtained by reinforcement learning according to the K candidate covers in the second video and target user operation data of the K candidate covers; the target user operation data is used for indicating a target user operation and candidate covers corresponding to the target user operation; the target user operation is a user operation performed on the second video by each user in the target user group; and processing the N candidate covers respectively through the cover determination sub-model corresponding to the target user group to obtain respective prediction confidences of the N candidate covers, wherein the prediction confidence is used for indicating the probability that the corresponding candidate cover is a video cover;
the video cover acquisition module is used for acquiring a video cover of the first video from the N candidate covers according to the respective prediction confidences of the N candidate covers; and
the video pushing module is used for pushing the first video to the terminal according to the video cover of the first video.
11. A video cover presentation apparatus for use in a terminal, the apparatus comprising:
the first receiving module is used for receiving a first video cover of a first video pushed by the server at a first moment, wherein the first video cover is any one of N candidate covers, and N is an integer greater than or equal to 2;
the first display module is used for displaying a video playing entry of the first video according to the first video cover;
the second receiving module is used for receiving, at a second moment, a second video cover of the first video pushed by the server; the second video cover is determined from the N candidate covers according to the respective prediction confidences of the N candidate covers, and the prediction confidence is used for indicating the probability that the corresponding candidate cover is the video cover; the respective prediction confidences of the N candidate covers are obtained by processing the N candidate covers respectively through a cover determination sub-model corresponding to a target user group; the target user group is the user group, queried by the server, in which the user corresponding to the terminal is located; a video cover determination model comprises at least two cover determination sub-models, the at least two cover determination sub-models respectively correspond to respective user groups, and the cover determination sub-model is a convolutional neural network model obtained by reinforcement learning according to the N candidate covers and target user operation data of the N candidate covers; the target user operation data is used for indicating a target user operation received by the first video and candidate covers corresponding to the target user operation; the target user operation is a user operation performed on the first video by each user in the target user group; and
the second display module is used for displaying the video playing entry of the first video according to the second video cover.
12. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set that is loaded and executed by the processor to implement the method of any one of claims 1 to 9.
13. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910430442.2A CN110263213B (en) | 2019-05-22 | 2019-05-22 | Video pushing method, device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110263213A CN110263213A (en) | 2019-09-20 |
CN110263213B true CN110263213B (en) | 2023-07-18 |
Family
ID=67915144
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910430442.2A Active CN110263213B (en) | 2019-05-22 | 2019-05-22 | Video pushing method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110263213B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110572711B (en) * | 2019-09-27 | 2023-03-24 | 北京达佳互联信息技术有限公司 | Video cover generation method and device, computer equipment and storage medium |
CN111401161A (en) * | 2020-03-04 | 2020-07-10 | 青岛海信网络科技股份有限公司 | Intelligent building management and control system for realizing behavior recognition based on intelligent video analysis algorithm |
CN111368204A (en) * | 2020-03-09 | 2020-07-03 | 北京字节跳动网络技术有限公司 | Content pushing method and device, electronic equipment and computer readable medium |
CN111601160A (en) * | 2020-05-29 | 2020-08-28 | 北京百度网讯科技有限公司 | Method and device for editing video |
CN111984821B (en) * | 2020-06-22 | 2024-11-05 | 汉海信息技术(上海)有限公司 | Method, device, storage medium and electronic device for determining video dynamic cover |
CN111639630B (en) * | 2020-06-23 | 2023-07-18 | 北京字节跳动网络技术有限公司 | Operation modifying method and device |
CN112689187A (en) * | 2020-12-17 | 2021-04-20 | 北京达佳互联信息技术有限公司 | Video processing method and device, electronic equipment and storage medium |
CN112860941A (en) * | 2021-02-04 | 2021-05-28 | 百果园技术(新加坡)有限公司 | Cover recommendation method, device, equipment and medium |
CN114926705B (en) * | 2022-05-12 | 2024-05-28 | 网易(杭州)网络有限公司 | Cover design model training method, medium, device and computing equipment |
CN116402249B (en) * | 2023-03-06 | 2024-02-23 | 贝壳找房(北京)科技有限公司 | Recommendation system overflow effect evaluation method, recommendation system overflow effect evaluation equipment and storage medium |
CN116610830B (en) * | 2023-04-24 | 2024-12-27 | 深圳云视智景科技有限公司 | Image generation method, device, equipment and computer storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503693A (en) * | 2016-11-28 | 2017-03-15 | 北京字节跳动科技有限公司 | The offer method and device of video front cover |
CN107832725A (en) * | 2017-11-17 | 2018-03-23 | 北京奇虎科技有限公司 | Video front cover extracting method and device based on evaluation index |
CN107958030A (en) * | 2017-11-17 | 2018-04-24 | 北京奇虎科技有限公司 | Video front cover recommended models optimization method and device |
CN108010038A (en) * | 2017-12-19 | 2018-05-08 | 北京奇虎科技有限公司 | Live dress ornament based on adaptive threshold fuzziness is dressed up method and device |
CN109729426A (en) * | 2017-10-27 | 2019-05-07 | 优酷网络技术(北京)有限公司 | A kind of generation method and device of video cover image |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110263213B (en) | Video pushing method, device, computer equipment and storage medium | |
CN108304441B (en) | Network resource recommendation method and device, electronic equipment, server and storage medium | |
WO2020233464A1 (en) | Model training method and apparatus, storage medium, and device | |
CN111897996B (en) | Topic label recommendation method, device, equipment and storage medium | |
CN111432245B (en) | Multimedia information playing control method, device, equipment and storage medium | |
CN111368127B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN110516113B (en) | Video classification method, video classification model training method and device | |
CN111836073B (en) | Method, device and equipment for determining video definition and storage medium | |
CN112235635A (en) | Animation display method, animation display device, electronic equipment and storage medium | |
CN112115282A (en) | Question answering method, device, equipment and storage medium based on search | |
CN111428158A (en) | Method and device for recommending position, electronic equipment and readable storage medium | |
US20230289560A1 (en) | Machine learning techniques to predict content actions | |
CN114996515A (en) | Training method of video feature extraction model, text generation method and device | |
CN114462580A (en) | Text recognition model training method, text recognition method, device and device | |
CN111753813B (en) | Image processing method, device, equipment and storage medium | |
CN113609387A (en) | Playing content recommendation method and device, electronic equipment and storage medium | |
CN116580707A (en) | Method and device for generating action video based on voice | |
CN112256976B (en) | Matching method and related device | |
CN111310701B (en) | Gesture recognition method, device, equipment and storage medium | |
CN111259252B (en) | User identification recognition method and device, computer equipment and storage medium | |
CN113486260A (en) | Interactive information generation method and device, computer equipment and storage medium | |
CN111414496A (en) | Artificial intelligence-based multimedia file detection method and device | |
CN114996514B (en) | Text generation method, device, computer equipment and medium | |
CN110795465B (en) | User scale prediction method, device, server and storage medium | |
HK40025802A (en) | Playback control method and device of multimedia information, apparatus and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||