
CN108830235B - Method and apparatus for generating information - Google Patents

Method and apparatus for generating information

Info

Publication number
CN108830235B
Authority
CN
China
Prior art keywords
video
sample
training
recognition result
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810644175.4A
Other languages
Chinese (zh)
Other versions
CN108830235A (en)
Inventor
李伟健
李映虹
王长虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Douyin Vision Beijing Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201810644175.4A priority Critical patent/CN108830235B/en
Publication of CN108830235A publication Critical patent/CN108830235A/en
Priority to PCT/CN2018/116184 priority patent/WO2019242222A1/en
Application granted granted Critical
Publication of CN108830235B publication Critical patent/CN108830235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition
    • G06V 30/19 - Recognition using electronic means
    • G06V 30/192 - Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V 30/194 - References adjustable by an adaptive method, e.g. learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the present application discloses a method and an apparatus for generating information. One embodiment of the method comprises: acquiring a target video; and inputting the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing the foreground included in the video, and the second recognition result is used for representing the background included in the video. This embodiment improves the diversity of generated information.

Description

Method and apparatus for generating information
Technical Field
The embodiments of the present application relate to the field of computer technology, and in particular to a method and an apparatus for generating information.
Background
A video typically includes a foreground and a background. The foreground may include the shot content of the video (e.g., people, animals, or actions); the background may include the scene in which the video was shot (e.g., the sky, a court, or a forest).
Currently, video recognition typically identifies only the foreground of a video.
Disclosure of Invention
The embodiments of the present application provide a method and an apparatus for generating information.
In a first aspect, an embodiment of the present application provides a method for generating information, where the method includes: acquiring a target video; inputting a target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing the foreground included in the video, and the second recognition result is used for representing the background included in the video.
In some embodiments, the video recognition model comprises a first video recognition submodel, a second video recognition submodel, and a feature extraction network; and inputting the target video into the pre-trained video recognition model to obtain the recognition result pair corresponding to the target video comprises: inputting the target video into the feature extraction network to obtain the video features of the target video; and inputting the obtained video features into the first video recognition submodel and the second video recognition submodel respectively to obtain a recognition result pair that corresponds to the target video and comprises the first recognition result and the second recognition result.
In some embodiments, the video recognition model is trained by: acquiring a training sample set, wherein each training sample comprises a sample video and a sample recognition result pair pre-labeled for the sample video; and taking the sample videos of the training samples in the training sample set as input, taking the sample recognition result pairs corresponding to the input sample videos as expected output, and training with a machine learning method to obtain the video recognition model.
In some embodiments, the training with a machine learning method to obtain the video recognition model, in which the sample videos of the training samples in the training sample set are used as input and the sample recognition result pairs corresponding to the input sample videos are used as output, includes: dividing the training sample set into a preset number of training sample groups; selecting a training sample group from the preset number of training sample groups as a candidate training sample group, and performing the following training steps based on the candidate training sample group and a predetermined initial model: for the candidate training sample group, taking the sample videos of the training samples as input and the sample recognition result pairs corresponding to the input sample videos as output, training the initial model with a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; in response to determining that the completion condition is satisfied, generating the video recognition model based on the obtained initial video recognition model; and, in response to determining that the completion condition is not satisfied, selecting a training sample group from the unselected training sample groups as a new candidate training sample group, taking the most recently obtained initial video recognition model as a new initial model, and continuing to perform the training steps.
In some embodiments, the completion condition includes, but is not limited to, at least one of: the preset number of training sample groups no longer includes any unselected training sample group; and the actual recognition result pair, obtained by inputting the sample video of a training sample in the candidate training sample group into the initial model, has a loss value relative to the sample recognition result pair corresponding to the input sample video that is smaller than a preset loss threshold.
In a second aspect, an embodiment of the present application provides an apparatus for generating information, where the apparatus includes: an acquisition unit configured to acquire a target video; the input unit is configured to input a target video into a pre-trained video recognition model, and obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing a corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing a foreground included in the video, and the second recognition result is used for representing a background included in the video.
In some embodiments, the video recognition model comprises a first video recognition submodel, a second video recognition submodel, and a feature extraction network; and the input unit includes: a first input module configured to input the target video into the feature extraction network to obtain video features of the target video; and a second input module configured to input the obtained video features into the first video recognition submodel and the second video recognition submodel respectively to obtain a recognition result pair that corresponds to the target video and comprises the first recognition result and the second recognition result.
In some embodiments, the video recognition model is trained by: acquiring a training sample set, wherein each training sample comprises a sample video and a sample recognition result pair pre-labeled for the sample video; and taking the sample videos of the training samples in the training sample set as input, taking the sample recognition result pairs corresponding to the input sample videos as expected output, and training with a machine learning method to obtain the video recognition model.
In some embodiments, the training with a machine learning method to obtain the video recognition model, in which the sample videos of the training samples in the training sample set are used as input and the sample recognition result pairs corresponding to the input sample videos are used as output, includes: dividing the training sample set into a preset number of training sample groups; selecting a training sample group from the preset number of training sample groups as a candidate training sample group, and performing the following training steps based on the candidate training sample group and a predetermined initial model: for the candidate training sample group, taking the sample videos of the training samples as input and the sample recognition result pairs corresponding to the input sample videos as output, training the initial model with a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; in response to determining that the completion condition is satisfied, generating the video recognition model based on the obtained initial video recognition model; and, in response to determining that the completion condition is not satisfied, selecting a training sample group from the unselected training sample groups as a new candidate training sample group, taking the most recently obtained initial video recognition model as a new initial model, and continuing to perform the training steps.
In some embodiments, the completion condition includes, but is not limited to, at least one of: the preset number of training sample groups no longer includes any unselected training sample group; and the actual recognition result pair, obtained by inputting the sample video of a training sample in the candidate training sample group into the initial model, has a loss value relative to the sample recognition result pair corresponding to the input sample video that is smaller than a preset loss threshold.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement the method of any of the embodiments of the method for generating information described above.
In a fourth aspect, the present application provides a computer-readable medium on which a computer program is stored, and the computer program, when executed by a processor, implements any of the above-described methods for generating information.
According to the method and apparatus for generating information provided by the embodiments of the present application, a target video is acquired and then input into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video. The video recognition model is used for representing the corresponding relation between the video and the recognition result pair; the recognition result pair comprises a first recognition result, used for representing the foreground included in the video, and a second recognition result, used for representing the background included in the video. The pre-trained video recognition model can thus be used to recognize the foreground and the background of the target video simultaneously, which improves the diversity of generated information.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating information according to the present application;
FIG. 3 is a schematic illustration of an application scenario of a method for generating information according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating information according to the present application;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for generating information according to the present application;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating information or the apparatus for generating information of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a model training application, a video recognition application, a web browser application, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices with a display screen, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited herein.
When the terminal devices 101, 102, 103 are hardware, a video capture device may also be installed thereon. The video capture device may be any device capable of capturing video, such as a camera or a sensor. A user may capture video using the video capture device on the terminal devices 101, 102, 103.
The server 105 may be a server that provides various services, such as a background server that processes video displayed on the terminal devices 101, 102, 103. The background server may perform processing such as analysis on the received data such as the target video, and may feed back a processing result (e.g., a pair of recognition results) to the terminal device.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the numbers of terminal devices, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as needed for implementation. In particular, when the target video or the data used in generating the recognition result pair does not need to be acquired from a remote location, the above system architecture may include no network and only a terminal device or a server.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating information in accordance with the present application is shown. The method for generating information comprises the following steps:
step 201, acquiring a target video.
In the present embodiment, the executing entity of the method for generating information (e.g., the server shown in fig. 1) may acquire the target video through a wired or wireless connection. The target video may be a video to be recognized.
The executing entity may acquire a target video transmitted by an electronic device (for example, a terminal device shown in fig. 1) communicatively connected to it, or may acquire a target video stored locally in advance.
Step 202, inputting the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video.
In this embodiment, based on the target video obtained in step 201, the executing entity may input the target video into a pre-trained video recognition model, and obtain a recognition result pair corresponding to the target video. The video identification model can be used for representing the corresponding relation between the video and the identification result pair. The pair of recognition results includes a first recognition result and a second recognition result. The first recognition result may be used to characterize a foreground included in the video. The second recognition result may be used to characterize a background included with the video. The recognition results (the first recognition result and the second recognition result) in the recognition result pair may include, but are not limited to, at least one of: text, numbers, symbols, images, video.
Here, it is understood that the foreground included in the video generally refers to the shooting content (e.g., people, animals, behaviors, etc.) corresponding to the video, and the background included in the video generally refers to the shooting scene (e.g., sky, court, forest, etc.) to which the shooting content belongs.
In some optional implementations of this embodiment, the video recognition model may be obtained by training through the following steps. First, a training sample set is obtained, where each training sample comprises a sample video and a sample recognition result pair pre-labeled for the sample video. Then, the sample videos of the training samples in the training sample set are used as input, the sample recognition result pairs corresponding to the input sample videos are used as expected output, and the video recognition model is obtained by training with a machine learning method.
Specifically, as an example, a training sample may be selected from the training sample set and the following steps performed: inputting the sample video of the selected training sample into an initial model (such as a convolutional neural network (CNN) or a residual network (ResNet)) to obtain a recognition result pair; taking the sample recognition result pair corresponding to the input sample video as the expected output of the initial model, and adjusting the parameters of the initial model based on the obtained recognition result pair and the sample recognition result pair; determining whether unselected training samples remain in the training sample set; and, in response to no unselected training samples remaining, determining the adjusted initial model as the video recognition model. It should be noted that the manner of selecting training samples is not limited in the present application. For example, the selection may be random, or training samples whose sample videos have higher definition may be selected preferentially.
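For illustration only, the following is a minimal PyTorch-style sketch of this per-sample training loop. The data format (a video tensor plus scalar class-index labels), the optimizer, the combined cross-entropy loss, and the assumption that the model returns a (foreground logits, background logits) pair are choices made for the example and are not prescribed by the patent.

```python
# Hypothetical sketch of the per-sample training procedure described above.
import torch
import torch.nn as nn


def train_video_recognition_model(initial_model, training_samples, lr=1e-3):
    """training_samples: iterable of (sample_video, (fg_label, bg_label))."""
    optimizer = torch.optim.SGD(initial_model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for sample_video, (fg_label, bg_label) in training_samples:
        # Forward pass: the initial model outputs a recognition result pair
        # (foreground logits, background logits) for the input sample video.
        fg_logits, bg_logits = initial_model(sample_video.unsqueeze(0))

        # The pre-labeled sample recognition result pair is the expected output;
        # the loss combines both recognition results.
        loss = (criterion(fg_logits, fg_label.unsqueeze(0))
                + criterion(bg_logits, bg_label.unsqueeze(0)))

        # Adjust the parameters of the initial model based on the two pairs.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Once no unselected training samples remain, the adjusted initial model
    # is taken as the video recognition model.
    return initial_model
```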
In some optional implementation manners of this embodiment, the video recognition model may also be obtained by training through the following steps:
first, a training sample set is obtained, and the training sample set is divided into a preset number of training sample groups.
Here, the training sample set may be divided into a preset number of training sample groups in various ways. For example, the training sample set may be divided into a preset number of training sample groups in an equal division manner, or the training sample set may be divided so that the number of training samples included in each of the preset number of training sample groups is greater than or equal to a preset threshold. It should be noted that the preset number can be preset by a technician.
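As a simple illustration (not taken from the patent), the equal-division strategy could look like the following, where any remainder samples are spread over the first few groups:

```python
# Illustrative equal-division of a training sample set into a preset number of
# training sample groups; the helper name and signature are assumptions.
def divide_into_groups(training_samples, preset_number):
    group_size, remainder = divmod(len(training_samples), preset_number)
    groups, start = [], 0
    for i in range(preset_number):
        # The first `remainder` groups absorb one extra sample each.
        end = start + group_size + (1 if i < remainder else 0)
        groups.append(training_samples[start:end])
        start = end
    return groups
```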
Then, a training sample group may be selected from the preset number of training sample groups as a candidate training sample group, and the following training steps performed based on the candidate training sample group and a predetermined initial model: for the candidate training sample group, taking the sample videos of the training samples as input and the sample recognition result pairs corresponding to the input sample videos as output, training the initial model with a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; and, in response to determining that the completion condition is satisfied, generating the video recognition model based on the obtained initial video recognition model.
Here, one of the obtained initial video recognition models may be selected as a video recognition model, or the obtained initial video recognition models may be processed (fused) to obtain a video recognition model.
It should be noted that the manner of selecting the training sample group is not limited in the present application. For example, the selection may be random, or a training sample group containing a larger number of training samples may be selected preferentially.
In addition, in response to determining that the completion condition is not satisfied, a training sample group is selected from the unselected training sample groups as a new candidate training sample group, the most recently obtained initial video recognition model is taken as a new initial model, and the training steps are continued.
It should be noted that the executing entity of the above steps for obtaining the video recognition model may be the same as or different from the executing entity of the method for generating information. If they are the same, the executing entity of the steps for obtaining the video recognition model may store the trained video recognition model locally after training. If they are different, the executing entity of the steps for obtaining the video recognition model may send the trained video recognition model to the executing entity of the method for generating information after training.
In some optional implementations of this embodiment, the completion condition may include, but is not limited to, at least one of the following: the preset number of training sample groups no longer includes any unselected training sample group; and the actual recognition result pair, obtained by inputting the sample video of a training sample in the candidate training sample group into the initial model, has a loss value relative to the sample recognition result pair corresponding to the input sample video that is smaller than a preset loss threshold.
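Tying the pieces above together, the group-wise procedure and the two completion conditions could be sketched as follows. The helper callables train_on_group and group_loss (for example, the per-sample loop sketched earlier and a loss computed between actual and sample recognition result pairs on the candidate group) are assumptions introduced for the example.

```python
# Hypothetical sketch of the group-wise training steps and completion conditions.
def train_by_groups(initial_model, groups, train_on_group, group_loss, loss_threshold):
    unselected = list(groups)
    model = initial_model
    while True:
        candidate_group = unselected.pop(0)             # candidate training sample group
        model = train_on_group(model, candidate_group)  # initial video recognition model

        # Completion condition 1: no unselected training sample group remains.
        # Completion condition 2: the loss on the candidate group is below the
        # preset loss threshold.
        if not unselected or group_loss(model, candidate_group) < loss_threshold:
            # Generate the video recognition model from the latest initial model.
            return model
        # Otherwise the most recently obtained initial video recognition model
        # becomes the new initial model and training continues on the next group.
```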
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating information according to the present embodiment. In the application scenario of fig. 3, the terminal device 301 first sends a target video (a video obtained by shooting a kite) 302 to the server 303. Then, the server 303 acquires the target video 302, inputs the target video 302 into a pre-trained video recognition model 304, and obtains a recognition result pair 305 corresponding to the target video 302. The video recognition model may be used to represent the corresponding relation between the video and the recognition result pair. The recognition result pair 305 includes a first recognition result (kite) 3051 and a second recognition result (sky) 3052; the first recognition result 3051 may be used to characterize a foreground included in the video, and the second recognition result 3052 may be used to characterize a background included in the video.
In the method provided by the above embodiment of the present application, a target video is acquired and then input into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video. The video recognition model is used for representing the corresponding relation between the video and the recognition result pair; the recognition result pair comprises a first recognition result, used for representing the foreground included in the video, and a second recognition result, used for representing the background included in the video. The pre-trained video recognition model can thus be used to recognize the foreground and the background of the target video simultaneously, which improves the diversity of generated information.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating information is shown. The flow 400 of the method for generating information comprises the steps of:
step 401, a target video is obtained.
In the present embodiment, the executing entity of the method for generating information (e.g., the server shown in fig. 1) may acquire the target video through a wired or wireless connection.
It should be noted that step 401 may be implemented in a similar manner to step 201 in the foregoing embodiment. Accordingly, the above description regarding step 201 is also applicable to step 401 of this embodiment, and is not repeated here.
Step 402, inputting the target video into a feature extraction network of a pre-trained video recognition model to obtain the video features of the target video.
In this embodiment, the video recognition model may include a feature extraction network, and based on the target video obtained in step 401, the executing entity may input the target video into the feature extraction network of the video recognition model to obtain the video features of the target video.
It will be appreciated that the target video is essentially a sequence of target images arranged in chronological order. Thus, the video features of the target video may be embodied by image features of the target images in the target image sequence.
In this embodiment, the feature extraction network may be configured to extract image features of a target image corresponding to a target video, and generate and output video features corresponding to the target video based on the image features.
Specifically, the executing entity may directly determine the obtained image feature as the video feature corresponding to the target video, or may process the obtained image feature and determine the processed image feature as the video feature corresponding to the target video. As an example, the executing entity may fuse the obtained image features, obtain fused features, and determine the fused features as video features corresponding to the target video.
The feature extraction network may include a structure (e.g., a convolutional layer) for extracting image features, but may also include other structures (e.g., a pooling layer), and is not limited herein.
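A minimal sketch of such a feature extraction network is shown below, assuming a small per-frame convolutional encoder and mean pooling over frames as the fusion step; the backbone, feature dimension, and fusion choice are illustrative assumptions rather than the architecture prescribed by the patent.

```python
# A minimal, illustrative feature extraction network: a small 2D CNN encodes each
# frame, and mean pooling over frames fuses the image features into a video feature.
import torch
import torch.nn as nn


class FeatureExtractionNetwork(nn.Module):
    def __init__(self, feature_dim=128):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # structure for extracting image features
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # other structure, e.g. a pooling layer
            nn.Flatten(),
            nn.Linear(32, feature_dim),
        )

    def forward(self, video):
        # video: (batch, frames, channels, height, width), i.e. a target image sequence.
        b, t, c, h, w = video.shape
        frame_features = self.frame_encoder(video.reshape(b * t, c, h, w))
        frame_features = frame_features.reshape(b, t, -1)
        # Fuse the per-frame image features into the video feature of the target video.
        return frame_features.mean(dim=1)
```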
Step 403, inputting the obtained video features into a first video recognition submodel and a second video recognition submodel of the video recognition model, respectively, to obtain a recognition result pair corresponding to the target video and including a first recognition result and a second recognition result.
In this embodiment, the video recognition model may further include a first video recognition submodel and a second video recognition submodel. The executing entity may input the obtained video features into the first video recognition submodel and the second video recognition submodel of the video recognition model, respectively, to obtain a recognition result pair corresponding to the target video and including the first recognition result and the second recognition result. The first recognition result may be used to characterize a foreground included in the video, and the second recognition result may be used to characterize a background included in the video. The recognition results (the first recognition result and the second recognition result) in the recognition result pair may include, but are not limited to, at least one of: text, numbers, symbols, images, video.
In this embodiment, the first video recognition submodel is connected to the feature extraction network and is configured to generate a first recognition result based on the input video features. The second video recognition submodel is connected to the feature extraction network and is configured to generate a second recognition result based on the input video features. Here, the first video recognition submodel and the second video recognition submodel may include structures for generating results (e.g., a classifier or a fully connected layer), and may also include other structures (e.g., an output layer); they are not limited herein.
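Continuing the sketch, the complete video recognition model could be assembled as below, reusing the FeatureExtractionNetwork from the previous example and attaching two fully connected recognition submodels to the shared video feature; the class counts and layer choices are assumptions for illustration.

```python
# Illustrative assembly of the video recognition model: the shared feature
# extraction network feeds a first and a second video recognition submodel.
class VideoRecognitionModel(nn.Module):
    def __init__(self, num_foreground_classes, num_background_classes, feature_dim=128):
        super().__init__()
        self.feature_extraction_network = FeatureExtractionNetwork(feature_dim)
        # Structures for generating results, here simple fully connected layers.
        self.first_submodel = nn.Linear(feature_dim, num_foreground_classes)
        self.second_submodel = nn.Linear(feature_dim, num_background_classes)

    def forward(self, video):
        video_feature = self.feature_extraction_network(video)   # shared video feature
        first_result = self.first_submodel(video_feature)        # characterizes the foreground
        second_result = self.second_submodel(video_feature)      # characterizes the background
        return first_result, second_result
```

Because both submodels read the same shared feature, the foreground result can draw on background information and vice versa, which is the effect described in the following paragraph.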
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the process 400 of the method for generating information in this embodiment highlights the steps of inputting the target video into the feature extraction network to obtain the video features of the target video, and inputting the obtained video features, as shared features, into the first video recognition submodel and the second video recognition submodel respectively to obtain the recognition result pair. The scheme described in this embodiment can therefore generate the first recognition result and the second recognition result from the overall features of the target video (including both foreground features and background features): the first recognition result additionally benefits from the background features as reference data, and the second recognition result additionally benefits from the foreground features as reference data. More accurate video recognition can thus be realized, improving the accuracy of information generation.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating information, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating information of the present embodiment includes: an acquisition unit 501 and an input unit 502. The acquisition unit 501 is configured to acquire a target video; the input unit 502 is configured to input the target video into a pre-trained video recognition model and obtain a recognition result pair corresponding to the target video, where the video recognition model is used to represent a correspondence between the video and the recognition result pair, the recognition result pair includes a first recognition result and a second recognition result, the first recognition result is used to represent a foreground included in the video, and the second recognition result is used to represent a background included in the video.
In this embodiment, the acquisition unit 501 of the apparatus 500 for generating information may acquire the target video through a wired or wireless connection. The target video may be a video to be recognized.
The acquisition unit 501 may acquire a target video transmitted by an electronic device (for example, a terminal device shown in fig. 1) communicatively connected to it, or may acquire a target video stored locally in advance.
In this embodiment, based on the target video acquired by the acquisition unit 501, the input unit 502 may input the target video into a pre-trained video recognition model and obtain a recognition result pair corresponding to the target video. The video recognition model may be used to represent the corresponding relation between the video and the recognition result pair. The recognition result pair includes a first recognition result and a second recognition result. The first recognition result may be used to characterize a foreground included in the video. The second recognition result may be used to characterize a background included in the video. The recognition results (the first recognition result and the second recognition result) in the recognition result pair may include, but are not limited to, at least one of: text, numbers, symbols, images, video.
Here, it is understood that the foreground included in the video generally refers to the shooting content (e.g., people, animals, behaviors, etc.) corresponding to the video, and the background included in the video generally refers to the shooting scene (e.g., sky, court, forest, etc.) to which the shooting content belongs.
In some optional implementations of this embodiment, the video recognition model may include a first video recognition submodel, a second video recognition submodel, and a feature extraction network; and the input unit 502 may include: a first input module (not shown in the figure) configured to input the target video into the feature extraction network to obtain video features of the target video; and a second input module (not shown in the figure) configured to input the obtained video features into the first video recognition submodel and the second video recognition submodel respectively to obtain a recognition result pair that corresponds to the target video and comprises the first recognition result and the second recognition result.
In some optional implementations of this embodiment, the video recognition model may be obtained by training through the following steps: acquiring a training sample set, wherein each training sample comprises a sample video and a sample recognition result pair pre-labeled for the sample video; and taking the sample videos of the training samples in the training sample set as input, taking the sample recognition result pairs corresponding to the input sample videos as expected output, and training with a machine learning method to obtain the video recognition model.
In some optional implementation manners of this embodiment, taking the sample videos of the training samples in the training sample set as input, taking the sample recognition result pairs corresponding to the input sample videos as output, and training with a machine learning method to obtain the video recognition model includes: dividing the training sample set into a preset number of training sample groups; selecting a training sample group from the preset number of training sample groups as a candidate training sample group, and performing the following training steps based on the candidate training sample group and a predetermined initial model: for the candidate training sample group, taking the sample videos of the training samples as input and the sample recognition result pairs corresponding to the input sample videos as output, training the initial model with a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; in response to determining that the completion condition is satisfied, generating the video recognition model based on the obtained initial video recognition model; and, in response to determining that the completion condition is not satisfied, selecting a training sample group from the unselected training sample groups as a new candidate training sample group, taking the most recently obtained initial video recognition model as a new initial model, and continuing to perform the training steps.
In some optional implementations of this embodiment, the completion condition may include, but is not limited to, at least one of: the preset number of training sample groups no longer includes any unselected training sample group; and the actual recognition result pair, obtained by inputting the sample video of a training sample in the candidate training sample group into the initial model, has a loss value relative to the sample recognition result pair corresponding to the input sample video that is smaller than a preset loss threshold.
In the apparatus 500 provided by the above embodiment of the present application, the acquisition unit 501 acquires a target video, and the input unit 502 then inputs the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video. The video recognition model is used for representing the corresponding relation between the video and the recognition result pair; the recognition result pair comprises a first recognition result, used for representing the foreground included in the video, and a second recognition result, used for representing the background included in the video. The pre-trained video recognition model can thus be used to recognize the foreground and the background of the target video simultaneously, which improves the diversity of generated information.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit and an input unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, the acquisition unit may also be described as a "unit that acquires a target video".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a target video; and input the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing the foreground included in the video, and the second recognition result is used for representing the background included in the video.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method for generating information, comprising:
acquiring a target video;
inputting the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing the corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing the foreground included in the video, the second recognition result is used for representing the background included in the video, and the video recognition model comprises a first video recognition submodel, a second video recognition submodel and a feature extraction network; and
the inputting the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video includes:
inputting the target video into the feature extraction network to obtain the video features of the target video;
and inputting the obtained video features into the first video recognition submodel and the second video recognition submodel respectively to obtain a recognition result pair that corresponds to the target video and comprises the first recognition result and the second recognition result.
2. The method of claim 1, wherein the video recognition model is trained by:
acquiring a training sample set, wherein each training sample comprises a sample video and a sample recognition result pair pre-labeled for the sample video;
and taking the sample video of the training samples in the training sample set as input, taking the sample recognition result pair corresponding to the input sample video as expected output, and training by using a machine learning method to obtain the video recognition model.
3. The method according to claim 2, wherein the training using a machine learning method to obtain a video recognition model by using a sample video of the training samples in the training sample set as an input and a sample recognition result pair corresponding to the input sample video as an output comprises:
dividing a training sample set into a preset number of training sample groups;
selecting a training sample group from the preset number of training sample groups as a candidate training sample group, and performing the following training steps based on the candidate training sample group and a predetermined initial model: for the candidate training sample group, taking the sample videos of the training samples as input and the sample recognition result pairs corresponding to the input sample videos as output, training the initial model with a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; in response to determining that the completion condition is satisfied, generating a video recognition model based on the obtained initial video recognition model;
and in response to determining that the completion condition is not satisfied, selecting a training sample group from the unselected training sample groups as a new candidate training sample group, taking the most recently obtained initial video recognition model as a new initial model, and continuing to perform the training steps.
4. The method of claim 3, wherein the completion condition includes, but is not limited to, at least one of: the preset number of training sample groups no longer includes any unselected training sample group; and the actual recognition result pair, obtained by inputting the sample video of a training sample in the candidate training sample group into the initial model, has a loss value relative to the sample recognition result pair corresponding to the input sample video that is smaller than a preset loss threshold.
5. An apparatus for generating information, comprising:
an acquisition unit configured to acquire a target video;
the input unit is configured to input the target video into a pre-trained video recognition model to obtain a recognition result pair corresponding to the target video, wherein the video recognition model is used for representing a corresponding relation between the video and the recognition result pair, the recognition result pair comprises a first recognition result and a second recognition result, the first recognition result is used for representing a foreground included in the video, the second recognition result is used for representing a background included in the video, and the video recognition model comprises a first video recognition sub-model, a second video recognition sub-model and a feature extraction network; and
the input unit includes:
a first input module configured to input the target video into the feature extraction network, and obtain video features of the target video;
and a second input module configured to input the obtained video features into the first video recognition submodel and the second video recognition submodel respectively to obtain a recognition result pair that corresponds to the target video and comprises the first recognition result and the second recognition result.
6. The apparatus of claim 5, wherein the video recognition model is trained by:
acquiring a training sample set, wherein each training sample comprises a sample video and a sample recognition result pair pre-labeled for the sample video;
and taking the sample video of the training samples in the training sample set as input, taking the sample recognition result pair corresponding to the input sample video as expected output, and training by using a machine learning method to obtain the video recognition model.
7. The apparatus according to claim 6, wherein the training using a machine learning method to obtain the video recognition model by using the sample video of the training samples in the training sample set as an input and the sample recognition result pair corresponding to the input sample video as an output comprises:
dividing a training sample set into a preset number of training sample groups;
selecting a training sample set from a preset number of training sample sets as a candidate training sample set, and performing the following training steps based on the candidate training sample set and a predetermined initial model: for the candidate training sample group, taking a sample video of a training sample as input, taking a sample recognition result pair corresponding to the input sample video as output, and training an initial model by using a machine learning method to obtain an initial video recognition model; determining whether a preset completion condition for indicating completion of training is satisfied; in response to determining that the completion condition is satisfied, generating a video recognition model based on the obtained initial video recognition model;
and in response to determining that the completion condition is not satisfied, selecting a training sample group from the unselected training sample groups as a new candidate training sample group, taking the most recently obtained initial video recognition model as a new initial model, and continuing to perform the training steps.
8. The apparatus of claim 7, wherein the completion condition comprises at least one of: the preset number of training sample groups includes no training sample group that has not been selected; and a loss value of an actual recognition result pair, obtained by inputting a sample video of a training sample in the candidate training sample group into the initial model, relative to the sample recognition result pair corresponding to the input sample video is smaller than a preset loss threshold.
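Claims 6 to 8 (mirroring method claims 2 to 4) describe the training procedure: the training sample set is divided into a preset number of groups, the model is trained on one candidate group at a time, and training stops either when no unselected group remains or when the loss of the actual recognition result pairs against the labelled sample recognition result pairs drops below a preset threshold. The following is a minimal sketch of that loop, assuming the VideoRecognitionModel sketch above and a toy data layout in which each training sample is a (sample_video, (foreground_label, background_label)) tuple of batched tensors; the optimizer, loss function, group count and threshold are illustrative assumptions.

import torch
import torch.nn as nn

def train_video_recognition_model(training_samples, initial_model,
                                  group_count=10, loss_threshold=0.05, lr=1e-3):
    # Divide the training sample set into a preset number of training sample groups.
    group_size = max(1, len(training_samples) // group_count)
    groups = [training_samples[i:i + group_size]
              for i in range(0, len(training_samples), group_size)]

    criterion = nn.CrossEntropyLoss()
    model = initial_model
    for group_index, candidate_group in enumerate(groups):
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        # Train the current initial model on the candidate training sample group:
        # sample video as input, labelled sample recognition result pair as the
        # expected output.
        for sample_video, (fg_label, bg_label) in candidate_group:
            optimizer.zero_grad()
            fg_pred, bg_pred = model(sample_video)  # actual recognition result pair
            loss = criterion(fg_pred, fg_label) + criterion(bg_pred, bg_label)
            loss.backward()
            optimizer.step()

        # Completion condition 1: no unselected training sample group remains.
        no_group_left = group_index == len(groups) - 1
        # Completion condition 2: the loss of the actual recognition result pairs
        # relative to the labelled pairs is below the preset threshold (averaging
        # over the candidate group is one possible reading of the claim).
        with torch.no_grad():
            total = 0.0
            for sample_video, (fg_label, bg_label) in candidate_group:
                fg_pred, bg_pred = model(sample_video)
                total += (criterion(fg_pred, fg_label) + criterion(bg_pred, bg_label)).item()
            mean_loss = total / max(1, len(candidate_group))
        if no_group_left or mean_loss < loss_threshold:
            return model  # the trained video recognition model
        # Otherwise the model obtained so far becomes the new initial model and the
        # next unselected group becomes the new candidate group.
    return model

Re-using the partially trained model as the new initial model for the next group is what makes the group-wise loop a continuation of training rather than a restart.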
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-4.
10. A computer-readable medium, on which a computer program is stored which, when executed by a processor, implements the method according to any one of claims 1-4.
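Claims 9 and 10 cover running the method on an electronic device or from a program stored on a computer-readable medium. Purely as an illustration of the inference flow those claims rely on (acquire a target video, feed it to the trained model, obtain the recognition result pair), reusing the VideoRecognitionModel sketch above with a random tensor standing in for an acquired target video:

import torch

model = VideoRecognitionModel()  # in practice, a model trained as sketched above
model.eval()
target_video = torch.randn(1, 3, 16, 112, 112)  # placeholder for the acquired target video
with torch.no_grad():
    first_result, second_result = model(target_video)  # foreground / background scores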
CN201810644175.4A 2018-06-21 2018-06-21 Method and apparatus for generating information Active CN108830235B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810644175.4A CN108830235B (en) 2018-06-21 2018-06-21 Method and apparatus for generating information
PCT/CN2018/116184 WO2019242222A1 (en) 2018-06-21 2018-11-19 Method and device for use in generating information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810644175.4A CN108830235B (en) 2018-06-21 2018-06-21 Method and apparatus for generating information

Publications (2)

Publication Number Publication Date
CN108830235A (en) 2018-11-16
CN108830235B (en) 2020-11-24

Family

ID=64142947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810644175.4A Active CN108830235B (en) 2018-06-21 2018-06-21 Method and apparatus for generating information

Country Status (2)

Country Link
CN (1) CN108830235B (en)
WO (1) WO2019242222A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830235B (en) * 2018-06-21 2020-11-24 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN110288089B (en) * 2019-06-28 2021-07-09 北京百度网讯科技有限公司 Method and apparatus for transmitting information
CN111667003B (en) * 2020-06-05 2023-11-03 北京百度网讯科技有限公司 Data cleaning method, device, equipment and storage medium
CN111950344B (en) * 2020-06-28 2023-06-27 北京百度网讯科技有限公司 Biological category identification method and device, storage medium and electronic equipment
CN111768007B (en) * 2020-06-28 2023-08-08 北京百度网讯科技有限公司 Method and device for mining data
CN114067786A (en) * 2020-07-28 2022-02-18 腾讯科技(深圳)有限公司 Speech recognition method, device, electronic device and storage medium
CN112101282B (en) * 2020-09-25 2024-04-26 北京瞰天科技有限公司 Water target identification method and device, electronic equipment and storage medium
CN112215908B (en) * 2020-10-12 2022-12-02 国家计算机网络与信息安全管理中心 Compressed domain-oriented video content comparison system, optimization method, and comparison method
CN114581296A (en) * 2020-11-16 2022-06-03 上海哔哩哔哩科技有限公司 Image generation method and device
CN112541705B (en) * 2020-12-23 2024-01-23 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating user behavior evaluation model
CN112819078B (en) * 2021-02-04 2023-12-15 上海明略人工智能(集团)有限公司 Iteration method and device for picture identification model
CN112949456B (en) * 2021-02-26 2023-12-12 北京达佳互联信息技术有限公司 Video feature extraction model training and video feature extraction method and device
CN112995665A (en) * 2021-03-10 2021-06-18 慧视云创(深圳)智能科技有限公司 Video coding method and device for camera device
CN113204695B (en) * 2021-05-12 2023-09-26 北京百度网讯科技有限公司 Website identification method and device
CN113361575B (en) * 2021-05-28 2023-10-20 北京百度网讯科技有限公司 Model training method and device and electronic equipment
CN113378921B (en) * 2021-06-09 2024-11-05 北京百度网讯科技有限公司 Data screening method, device and electronic equipment
CN113642727B (en) * 2021-08-06 2024-05-28 北京百度网讯科技有限公司 Training method of neural network model and processing method and device of multimedia information
CN113705682B (en) * 2021-08-27 2024-05-14 微民保险代理有限公司 User behavior feature processing method and device
CN113723344B (en) * 2021-09-08 2025-02-14 北京有竹居网络技术有限公司 Video recognition method, device, readable medium and electronic device
CN113988192A (en) * 2021-10-29 2022-01-28 湖北三江航天万山特种车辆有限公司 Shift fault identification method, model training method and device for hydraulic transmission
CN114091128B (en) * 2021-11-23 2024-06-28 北京百度网讯科技有限公司 Method and device for determining layout scheme and electronic equipment
CN116156271B (en) * 2022-12-14 2024-06-21 北京奇艺世纪科技有限公司 Video title generation method and device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777114A (en) * 2009-01-08 2010-07-14 北京中星微电子有限公司 Intelligent analysis system and intelligent analysis method for video monitoring, and system and method for detecting and tracking head and shoulder
CN105488044A (en) * 2014-09-16 2016-04-13 华为技术有限公司 Data processing method and device
CN105825234A (en) * 2016-03-16 2016-08-03 电子科技大学 Superpixel and background model fused foreground detection method
CN106383912A (en) * 2016-10-14 2017-02-08 上海谦问万答吧云计算科技有限公司 Picture retrieval method and apparatus
CN107154051A (en) * 2016-03-03 2017-09-12 株式会社理光 Background wipes out method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073948A (en) * 2012-01-17 2018-05-25 华为技术有限公司 A kind of photo sort management, server, apparatus and system
US9251613B2 (en) * 2013-10-28 2016-02-02 Cyberlink Corp. Systems and methods for automatically applying effects based on media content characteristics
US9754416B2 (en) * 2014-12-23 2017-09-05 Intel Corporation Systems and methods for contextually augmented video creation and sharing
CN107133354B (en) * 2017-05-25 2020-11-10 北京小米移动软件有限公司 Method and device for acquiring image description information
CN107909145A (en) * 2017-12-05 2018-04-13 苏州天瞳威视电子科技有限公司 A kind of training method of convolutional neural networks model
CN108090497B (en) * 2017-12-28 2020-07-07 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN108830235B (en) * 2018-06-21 2020-11-24 北京字节跳动网络技术有限公司 Method and apparatus for generating information

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777114A (en) * 2009-01-08 2010-07-14 北京中星微电子有限公司 Intelligent analysis system and intelligent analysis method for video monitoring, and system and method for detecting and tracking head and shoulder
CN105488044A (en) * 2014-09-16 2016-04-13 华为技术有限公司 Data processing method and device
CN107154051A (en) * 2016-03-03 2017-09-12 株式会社理光 Background wipes out method and device
CN105825234A (en) * 2016-03-16 2016-08-03 电子科技大学 Superpixel and background model fused foreground detection method
CN106383912A (en) * 2016-10-14 2017-02-08 上海谦问万答吧云计算科技有限公司 Picture retrieval method and apparatus

Also Published As

Publication number Publication date
CN108830235A (en) 2018-11-16
WO2019242222A1 (en) 2019-12-26

Similar Documents

Publication Publication Date Title
CN108830235B (en) Method and apparatus for generating information
CN108805091B (en) Method and apparatus for generating a model
CN107578017B (en) Method and apparatus for generating image
CN109492128B (en) Method and apparatus for generating a model
WO2020000879A1 (en) Image recognition method and apparatus
CN109034069B (en) Method and apparatus for generating information
CN108960316B (en) Method and apparatus for generating a model
CN108986169B (en) Method and apparatus for processing image
US11436863B2 (en) Method and apparatus for outputting data
CN109101919B (en) Method and apparatus for generating information
CN109829432B (en) Method and apparatus for generating information
CN109993150B (en) Method and device for identifying age
CN109376267B (en) Method and apparatus for generating a model
CN109447156B (en) Method and apparatus for generating a model
CN107609506B (en) Method and apparatus for generating image
CN109145828B (en) Method and apparatus for generating video category detection model
CN110009059B (en) Method and apparatus for generating a model
CN109981787B (en) Method and device for displaying information
CN109583389B (en) Drawing recognition method and device
CN109377508B (en) Image processing method and device
CN110059623B (en) Method and apparatus for generating information
CN110084317B (en) Method and device for recognizing images
CN109145783B (en) Method and apparatus for generating information
CN108921138B (en) Method and apparatus for generating information
CN110046571B (en) Method and device for identifying age

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Li Weijian

Inventor after: Li Yinghong

Inventor after: Wang Changhu

Inventor before: Li Weijian

Inventor before: Li Yinghong

Inventor before: Wang Changhu

CB03 Change of inventor or designer information
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee after: Tiktok vision (Beijing) Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee after: Douyin Vision Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee before: Tiktok vision (Beijing) Co.,Ltd.

CP01 Change in the name or title of a patent holder