CN108830235B

CN108830235B - Method and apparatus for generating information

Info

Publication number: CN108830235B
Application number: CN201810644175.4A
Authority: CN
Inventors: 李伟健; 李映虹; 王长虎
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Douyin Vision Co Ltd; Douyin Vision Beijing Co Ltd
Priority date: 2018-06-21
Filing date: 2018-06-21
Publication date: 2020-11-24
Anticipated expiration: 2038-06-21
Also published as: CN108830235A; WO2019242222A1

Abstract

The embodiment of the application discloses a method and a device for generating information. One embodiment of the method comprises: acquiring a target video; inputting a target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing the foreground included in the video, and the second recognition result is used for representing the background included in the video. This embodiment improves the diversity of information generation.

Description

Method and apparatus for generating information

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a method and a device for generating information.

Background

Video typically includes foreground and background. The foreground can include shot contents (such as people, animals, behaviors and the like) corresponding to the shot video; the background may include a shooting scene (e.g., sky, court, forest, etc.) corresponding to the video obtained by shooting.

Currently, video identification usually identifies only the foreground of a video.

Disclosure of Invention

The embodiment of the application provides a method and a device for generating information.

In a first aspect, an embodiment of the present application provides a method for generating information, where the method includes: acquiring a target video; inputting a target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing the foreground included in the video, and the second recognition result is used for representing the background included in the video.

In some embodiments, the video recognition model comprises a first video recognition submodel, a second video recognition submodel, and a feature extraction network; inputting the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the recognition result pair comprises: inputting the target video into a feature extraction network to obtain the video features of the target video; and respectively inputting the obtained video features into the first video identification submodel and the second video identification submodel to obtain an identification result pair which corresponds to the target video and comprises a first identification result and a second identification result.

In some embodiments, the video recognition model is trained by: acquiring a training sample set, wherein training samples comprise sample videos and sample identification result pairs labeled in advance aiming at the sample videos; and taking the sample video of the training samples in the training sample set as input, taking the sample recognition result pair corresponding to the input sample video as expected output, and training by using a machine learning method to obtain the video recognition model.

In some embodiments, a video recognition model is obtained by training with a machine learning method, where a sample video of a training sample in a training sample set is used as an input, a sample recognition result pair corresponding to the input sample video is used as an output, and the method includes: dividing a training sample set into a preset number of training sample groups; selecting a training sample set from a preset number of training sample sets as a candidate training sample set, and performing the following training steps based on the candidate training sample set and a predetermined initial model: for the candidate training sample group, taking a sample video of a training sample as input, taking a sample recognition result pair corresponding to the input sample video as output, and training an initial model by using a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; in response to determining that the completion condition is satisfied, generating a video recognition model based on the obtained initial video recognition model; and in response to determining that the completion condition is not met, selecting a training sample set from the unselected training sample sets as a new candidate training sample set, and taking the most recently obtained initial video recognition model as a new initial model, and continuing to perform the training step.

In some embodiments, the completion condition includes, but is not limited to, at least one of: the training sample groups which are not selected are not included in the preset number of training sample groups; and inputting the sample video of the training sample in the candidate training sample group into the initial model to obtain an actual recognition result pair, wherein the loss value of the actual recognition result pair relative to the sample recognition result pair corresponding to the input sample video is smaller than a preset loss threshold value.

In a second aspect, an embodiment of the present application provides an apparatus for generating information, where the apparatus includes: an acquisition unit configured to acquire a target video; the input unit is configured to input a target video into a pre-trained video recognition model, and obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing a corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing a foreground included in the video, and the second recognition result is used for representing a background included in the video.

In some embodiments, the video recognition model comprises a first video recognition submodel, a second video recognition submodel, and a feature extraction network; and the input unit includes: the first input module is configured to input the target video into a feature extraction network to obtain video features of the target video; and the second input module is configured to input the obtained video features into the first video identification submodel and the second video identification submodel respectively to obtain an identification result pair which corresponds to the target video and comprises the first identification result and the second identification result.

In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement the method of any of the embodiments of the method for generating information described above.

In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method of any of the above-described methods for generating information.

According to the method and the device for generating information, the target video is obtained, and then the target video is input into the pre-trained video recognition model, so that the recognition result pair corresponding to the target video is obtained, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises the first recognition result and the second recognition result, the first recognition result is used for representing the foreground included in the video, and the second recognition result is used for representing the background included in the video, so that the pre-trained video recognition model can be used for simultaneously recognizing the foreground and the background of the target video, and the diversity of information generation is improved.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a method for generating information according to the present application;

FIG. 3 is a schematic illustration of an application scenario of a method for generating information according to the present application;

FIG. 4 is a flow diagram of yet another embodiment of a method for generating information according to the present application;

FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for generating information according to the present application;

FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating information or the apparatus for generating information of the present application may be applied.

As shown in fig. 1, the system architecture 100 may include

terminal devices

101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the

terminal devices

101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the

terminal devices

101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a model training application, a video recognition application, a web browser application, social platform software, etc., may be installed on the

terminal devices

101, 102, 103.

The

terminal apparatuses

101, 102, and 103 may be hardware or software. When the

terminal devices

101, 102, and 103 are hardware, they may be various electronic devices with a display screen, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, mpeg Audio Layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, mpeg Audio Layer 4), laptop portable computers, desktop computers, and the like. When the

terminal apparatuses

101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.

When the

terminals

101, 102, 103 are hardware, a video capture device may also be installed thereon. The video acquisition equipment can be various equipment capable of realizing the function of acquiring video, such as a camera, a sensor and the like. The user may capture video using a video capture device on the

terminal

101, 102, 103.

The server 105 may be a server that provides various services, such as a background server that processes video displayed on the

terminal devices

101, 102, 103. The background server may perform processing such as analysis on the received data such as the target video, and may feed back a processing result (e.g., a pair of recognition results) to the terminal device.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In particular, in the case where data used in the target face video or the process of generating the recognition result does not need to be acquired from a remote place, the above system architecture may not include a network, but only include a terminal device or a server.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating information in accordance with the present application is shown. The method for generating information comprises the following steps:

step 201, acquiring a target video.

In the present embodiment, an execution subject (e.g., a server shown in fig. 1) of the method for generating information may acquire the target video by a wired connection manner or a wireless connection manner. The target video may be a video to be identified.

The execution main body may acquire a target video transmitted by an electronic device (for example, the terminal device shown in fig. 1) communicatively connected to the execution main body, or may acquire a target video stored locally in advance.

Step 202, inputting the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video.

In this embodiment, based on the target video obtained in step 201, the executing entity may input the target video into a pre-trained video recognition model, and obtain a recognition result pair corresponding to the target video. The video identification model can be used for representing the corresponding relation between the video and the identification result pair. The pair of recognition results includes a first recognition result and a second recognition result. The first recognition result may be used to characterize a foreground included in the video. The second recognition result may be used to characterize a background included with the video. The recognition results (the first recognition result and the second recognition result) in the recognition result pair may include, but are not limited to, at least one of: text, numbers, symbols, images, video.

Here, it is understood that the foreground included in the video generally refers to the shooting content (e.g., people, animals, behaviors, etc.) corresponding to the video, and the background included in the video generally refers to the shooting scene (e.g., sky, court, forest, etc.) to which the shooting content belongs.

In some optional implementations of this embodiment, the video recognition model may be obtained by training through the following steps: firstly, a training sample set is obtained, wherein the training samples comprise sample videos and sample recognition result pairs which are labeled in advance aiming at the sample videos. Then, the sample video of the training sample in the training sample set is used as input, the sample recognition result pair corresponding to the input sample video is used as expected output, and the machine learning method is used for training to obtain the video recognition model.

Specifically, as an example, a training sample may be selected from a training sample set, and the following steps are performed: inputting a sample video of the selected training sample into an initial model (such as a Convolutional Neural Network (CNN), a residual error Network (ResNet), and the like) to obtain a recognition result pair; taking a sample identification result pair corresponding to the input sample video as expected output of the initial model, and adjusting parameters of the initial model based on the obtained identification result pair and the sample identification result pair; determining whether the unselected training samples exist in the training sample set; and determining the adjusted initial model as the video recognition model in response to the non-selected training samples not existing. It should be noted that the selection manner of the training samples is not limited in the present application. For example, the selection may be random, or a training sample with better definition of the sample video may be preferentially selected.

In some optional implementation manners of this embodiment, the video recognition model may also be obtained by training through the following steps:

first, a training sample set is obtained, and the training sample set is divided into a preset number of training sample groups.

Here, the training sample set may be divided into a preset number of training sample groups in various ways. For example, the training sample set may be divided into a preset number of training sample groups in an equal division manner, or the training sample set may be divided so that the number of training samples included in each of the preset number of training sample groups is greater than or equal to a preset threshold. It should be noted that the preset number can be preset by a technician.

Then, a training sample set may be selected from a preset number of training sample sets as a candidate training sample set, and based on the candidate training sample set and a predetermined initial model, the following training steps are performed: for the candidate training sample group, taking a sample video of a training sample as input, taking a sample recognition result pair corresponding to the input sample video as output, and training an initial model by using a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; in response to determining that the completion condition is satisfied, a video recognition model is generated based on the obtained initial video recognition model.

Here, one of the obtained initial video recognition models may be selected as a video recognition model, or the obtained initial video recognition models may be processed (fused) to obtain a video recognition model.

It should be noted that the selection manner of the training sample set is not limited in the present application. For example, the selection may be random, or a training sample group with a large number of training samples may be preferentially selected.

In addition, in response to determining that the completion condition is not satisfied, selecting a training sample set from the unselected training sample sets as a new candidate training sample set, and taking the most recently obtained initial video recognition model as a new initial model, and continuing to perform the training step.

It should be noted that the execution subject of the above steps for obtaining the video recognition model may be the same as or different from the execution subject of the method for generating information. If the training data is the same as the training data, the executing agent of the step for obtaining the video recognition model can store the trained video recognition model locally after training the video recognition model. If not, the executing agent for the step of obtaining the video recognition model may send the trained video recognition model to the executing agent for the method of generating information after the video recognition model is trained.

In some optional implementations of this embodiment, the completion condition may include, but is not limited to, at least one of the following: the training sample groups which are not selected are not included in the preset number of training sample groups; and inputting the sample video of the training sample in the candidate training sample group into the initial model to obtain an actual recognition result pair, wherein the loss value of the actual recognition result pair relative to the sample recognition result pair corresponding to the input sample video is smaller than a preset loss threshold value.

With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating information according to the present embodiment. In the application scenario of fig. 3, the terminal 301 first sends a target video (video obtained by shooting a kite) 302 to the server 303. Then, the server 303 acquires the target video 302, inputs the target video 302 into a video recognition model 304 trained in advance, and obtains a recognition result pair 305 corresponding to the target video 302. The video identification model can be used for representing the corresponding relation between the video and the identification result pair. The recognition result pair 305 includes a first recognition result (kite) 3051 and a second recognition result (sky) 3052, the first recognition result 3051 may be used to characterize a foreground included in the video, and the second recognition result 3052 may be used to characterize a background included in the video.

In the method provided by the embodiment of the application, the target video is acquired, and then the target video is input into the pre-trained video recognition model, so that the recognition result pair corresponding to the target video is acquired, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises the first recognition result and the second recognition result, the first recognition result is used for representing the foreground included in the video, and the second recognition result is used for representing the background included in the video, so that the pre-trained video recognition model can be utilized to simultaneously recognize the foreground and the background of the target video, and the diversity of information generation is improved.

With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating information is shown. The flow 400 of the method for generating information comprises the steps of:

step 401, a target video is obtained.

In the present embodiment, an execution subject (e.g., a server shown in fig. 1) of the method for generating information may acquire the target video by a wired connection manner or a wireless connection manner.

It should be noted that step 401 may be implemented in a similar manner to step 201 in the foregoing embodiment. Accordingly, the above description regarding step 201 is also applicable to step 401 of this embodiment, and is not repeated here.

Step 402, inputting the target video into a feature extraction network of a pre-trained video recognition model to obtain the video features of the target video.

In this embodiment, the video recognition model may include a feature extraction network, and based on the target video obtained in step 401, the executing entity may input the target video into the feature extraction network of the video recognition model to obtain the video features of the target video.

It will be appreciated that the target video is essentially a sequence of target images arranged in chronological order. Thus, the video features of the target video may be embodied by image features of the target images in the target image sequence.

In this embodiment, the feature extraction network may be configured to extract image features of a target image corresponding to a target video, and generate and output video features corresponding to the target video based on the image features.

Specifically, the executing entity may directly determine the obtained image feature as the video feature corresponding to the target video, or may process the obtained image feature and determine the processed image feature as the video feature corresponding to the target video. As an example, the executing entity may fuse the obtained image features, obtain fused features, and determine the fused features as video features corresponding to the target video.

The feature extraction network may include a structure (e.g., a convolutional layer) for extracting image features, but may also include other structures (e.g., a pooling layer), and is not limited herein.

Step 403, inputting the obtained video features into a first video identification submodel and a second video identification submodel of the video identification model respectively, and obtaining an identification result pair corresponding to the target video and including a first identification result and a second identification result.

In this embodiment, the video recognition model may further include a first video recognition submodel and a second video recognition submodel, and further, the execution main body may input the obtained video features into the first video recognition submodel and the second video recognition submodel of the video recognition model, respectively, to obtain a recognition result pair corresponding to the target video and including the first recognition result and the second recognition result. Wherein the first recognition result can be used for characterizing a foreground included in the video. The second recognition result may be used to characterize a background included with the video. The recognition results (the first recognition result and the second recognition result) in the recognition result pair may include, but are not limited to, at least one of: text, numbers, symbols, images, video.

In this embodiment, the first video recognition submodel is connected to the feature extraction network, and is configured to generate a first recognition result based on the input video features. The second video submodel is connected with the feature extraction network and used for generating a second recognition result based on the input video features. Here, the first video identification submodel and the second video submodel may include structures for generating results (e.g., classifier, fully connected layer), but may also include other structures (e.g., output layer), and are not limited herein.

As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the process 400 of the method for generating information in this embodiment highlights the steps of inputting a target video into a feature extraction network, obtaining video features of the target video, and inputting the obtained video features as shared features into a first video identification submodel and a second video identification submodel, respectively, so as to obtain an identification result pair. Therefore, the scheme described in this embodiment can generate the first recognition result and the second recognition result by using the overall features (including the foreground features and the background features) of the target video, for the first recognition result, the reference data of the background features is added, and for the second recognition result, the reference data of the foreground features is added, so that more accurate video recognition can be realized, and the accuracy of information generation is improved.

With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating information, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.

As shown in fig. 5, the apparatus 500 for generating information of the present embodiment includes: an acquisition unit 501 and an input unit 502. Wherein the obtaining unit 501 is configured to obtain a target video; the input unit 502 is configured to input a target video into a pre-trained video recognition model, and obtain a recognition result pair corresponding to the target video, where the video recognition model is used to represent a correspondence between the video and the recognition result pair, and the recognition result pair includes a first recognition result and a second recognition result, the first recognition result is used to represent a foreground included in the video, and the second recognition result is used to represent a background included in the video.

In this embodiment, the acquiring unit 501 of the apparatus 500 for generating information may acquire the target video by a wired connection manner or a wireless connection manner. The target video may be a video to be identified.

The acquisition unit 501 may acquire a target video transmitted by an electronic device (for example, the terminal device shown in fig. 1) connected to the acquisition unit in communication therewith, or may acquire a target video stored locally in advance.

In this embodiment, based on the target video obtained in the obtaining unit 501, the input unit 502 may input the target video into a video recognition model trained in advance, and obtain a recognition result pair corresponding to the target video. The video identification model can be used for representing the corresponding relation between the video and the identification result pair. The pair of recognition results includes a first recognition result and a second recognition result. The first recognition result may be used to characterize a foreground included in the video. The second recognition result may be used to characterize a background included with the video. The recognition results (the first recognition result and the second recognition result) in the recognition result pair may include, but are not limited to, at least one of: text, numbers, symbols, images, video.

In some optional implementations of this embodiment, the video recognition model may include a first video recognition submodel, a second video recognition submodel, and a feature extraction network; and the input unit 502 may include: a first input module (not shown in the figure) configured to input the target video into a feature extraction network to obtain video features of the target video; and a second input module (not shown in the figure) configured to input the obtained video features into the first video identification submodel and the second video identification submodel, respectively, and obtain an identification result pair corresponding to the target video and including the first identification result and the second identification result.

In some optional implementations of this embodiment, the video recognition model may be obtained by training through the following steps: acquiring a training sample set, wherein training samples comprise sample videos and sample identification result pairs labeled in advance aiming at the sample videos; and taking the sample video of the training samples in the training sample set as input, taking the sample recognition result pair corresponding to the input sample video as expected output, and training by using a machine learning method to obtain the video recognition model.

In some optional implementation manners of this embodiment, taking a sample video of a training sample in a training sample set as an input, taking a sample recognition result pair corresponding to the input sample video as an output, and training by using a machine learning method to obtain a video recognition model, includes: dividing a training sample set into a preset number of training sample groups; selecting a training sample set from a preset number of training sample sets as a candidate training sample set, and performing the following training steps based on the candidate training sample set and a predetermined initial model: for the candidate training sample group, taking a sample video of a training sample as input, taking a sample recognition result pair corresponding to the input sample video as output, and training an initial model by using a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; in response to determining that the completion condition is satisfied, generating a video recognition model based on the obtained initial video recognition model; and in response to determining that the completion condition is not met, selecting a training sample set from the unselected training sample sets as a new candidate training sample set, and taking the most recently obtained initial video recognition model as a new initial model, and continuing to perform the training step.

In some optional implementations of this embodiment, the completion condition may include, but is not limited to, at least one of: the training sample groups which are not selected are not included in the preset number of training sample groups; and inputting the sample video of the training sample in the candidate training sample group into the initial model to obtain an actual recognition result pair, wherein the loss value of the actual recognition result pair relative to the sample recognition result pair corresponding to the input sample video is smaller than a preset loss threshold value.

The device 500 provided by the above embodiment of the present application acquires a target video through the acquisition unit 501, and then the input unit 502 inputs the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, where the video recognition model is used to represent a correspondence between the video and the recognition result pair, the recognition result pair includes a first recognition result and a second recognition result, the first recognition result is used to represent a foreground included in the video, and the second recognition result is used to represent a background included in the video, so that the pre-trained video recognition model can be utilized to simultaneously recognize the foreground and the background of the target video, and the diversity of information generation is improved.

Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit and an input unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, the acquisition unit may also be described as a "unit that acquires a target video".

As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a target video; inputting a target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing the foreground included in the video, and the second recognition result is used for representing the background included in the video.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A method for generating information, comprising:

acquiring a target video;

inputting the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing the foreground included in the video, the second recognition result is used for representing the background included in the video, and the video recognition model comprises a first video recognition submodel, a second video recognition submodel and a feature extraction network; and

the inputting the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video includes:

inputting the target video into the feature extraction network to obtain the video features of the target video;

and respectively inputting the obtained video features into the first video identification submodel and the second video identification submodel to obtain an identification result pair which corresponds to the target video and comprises a first identification result and a second identification result.

2. The method of claim 1, wherein the video recognition model is trained by:

acquiring a training sample set, wherein training samples comprise sample videos and sample identification result pairs labeled in advance aiming at the sample videos;

and taking the sample video of the training samples in the training sample set as input, taking the sample recognition result pair corresponding to the input sample video as expected output, and training by using a machine learning method to obtain the video recognition model.

3. The method according to claim 2, wherein the training using a machine learning method to obtain a video recognition model by using a sample video of the training samples in the training sample set as an input and a sample recognition result pair corresponding to the input sample video as an output comprises:

dividing a training sample set into a preset number of training sample groups;

selecting a training sample set from a preset number of training sample sets as a candidate training sample set, and performing the following training steps based on the candidate training sample set and a predetermined initial model: for the candidate training sample group, taking a sample video of a training sample as input, taking a sample recognition result pair corresponding to the input sample video as output, and training an initial model by using a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; in response to determining that the completion condition is satisfied, generating a video recognition model based on the obtained initial video recognition model;

and in response to determining that the completion condition is not met, selecting a training sample set from the unselected training sample sets as a new candidate training sample set, and taking the initial video identification model obtained last time as a new initial model, and continuing to perform the training step.

4. The method of claim 3, wherein the completion condition includes, but is not limited to, at least one of: the training sample groups which are not selected are not included in the preset number of training sample groups; and inputting the sample video of the training sample in the candidate training sample group into the initial model to obtain an actual recognition result pair, wherein the loss value of the actual recognition result pair relative to the sample recognition result pair corresponding to the input sample video is smaller than a preset loss threshold value.

5. An apparatus for generating information, comprising:

an acquisition unit configured to acquire a target video;

the input unit is configured to input the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing a corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing a foreground included in the video, the second recognition result is used for representing a background included in the video, and the video recognition model comprises a first video recognition sub-model, a second video recognition sub-model and a feature extraction network; and

the input unit includes:

a first input module configured to input the target video into the feature extraction network, and obtain video features of the target video;

and the second input module is configured to input the obtained video features into the first video identification submodel and the second video identification submodel respectively to obtain an identification result pair which corresponds to the target video and comprises a first identification result and a second identification result.

6. The apparatus of claim 5, wherein the video recognition model is trained by:

7. The apparatus according to claim 6, wherein the training using a machine learning method to obtain the video recognition model by using the sample video of the training samples in the training sample set as an input and the sample recognition result pair corresponding to the input sample video as an output comprises:

dividing a training sample set into a preset number of training sample groups;

8. The apparatus of claim 7, wherein the completion condition includes, but is not limited to, at least one of: the training sample groups which are not selected are not included in the preset number of training sample groups; and inputting the sample video of the training sample in the candidate training sample group into the initial model to obtain an actual recognition result pair, wherein the loss value of the actual recognition result pair relative to the sample recognition result pair corresponding to the input sample video is smaller than a preset loss threshold value.

9. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.

10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-4.