CN110392281A - Video synthesis method, apparatus, computer device and storage medium - Google Patents
Video synthesis method, apparatus, computer device and storage medium
- Publication number: CN110392281A
- Application number: CN201810359953.5A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- H04N21/232—Content retrieval operation locally within server, e.g. reading video streams from disk arrays
- H04N21/23418—Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
- H04N21/23424—Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
- H04N21/432—Content retrieval operation from a local storage medium, e.g. hard-disk
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
- H04N21/439—Processing of audio elementary streams
- H04N21/44008—Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
- H04N21/44016—Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
Abstract
This application relates to a video synthesis method. The method includes: obtaining target information; obtaining video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers; looking up, in the video analysis data, the target video identifiers and target segment ranges corresponding to the target information; obtaining the corresponding target videos according to the target video identifiers; extracting the corresponding target video segments from the corresponding target videos according to the target segment ranges; and splicing the extracted target video segments to obtain a synthesized video. Producing a synthesized video with this method is simple, low-cost, and saves time and effort. A video synthesis apparatus, a computer device and a storage medium are also proposed.
Description
Technical field
This application relates to the field of computer processing technology, and in particular to a video synthesis method, apparatus, computer device and storage medium.
Background
Video synthesis refers to combining multiple video clips into a single video. Traditionally, video synthesis is manual: a video is clipped and then assembled by hand with video editing software. For example, a guichu video (a remix genre) is a video synthesized by manual clipping and editing. Producing such a synthesized video requires selecting and clipping source material, which means repeatedly watching the entire video and usually spending a great deal of time and energy. Moreover, manual clipping locates the desired segments by dragging through the video, so the result is limited by the producer's subjective operation and the extracted video clips are often inaccurate.
Summary of the invention
In view of the above problems, it is necessary to propose a video synthesis method, apparatus, computer device and storage medium that make video production simple, low-cost and highly accurate.
A video synthesis method, the method comprising:
obtaining target information;
obtaining video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers;
looking up, in the video analysis data, a target video identifier and a target segment range corresponding to the target information;
obtaining the corresponding target video according to the target video identifier;
extracting the corresponding target video segment from the corresponding target video according to the target segment range; and
splicing the extracted target video segments to obtain a synthesized video.
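For orientation, a minimal Python sketch of the lookup these steps describe (the record type, field names and units are hypothetical illustrations, not the patent's data format):

```python
from dataclasses import dataclass

@dataclass
class AnalysisRecord:
    """One entry of video analysis data (illustrative field names)."""
    candidate_info: str  # e.g. a word, a pinyin string, or an object name
    video_id: str        # candidate video identifier
    begin_ms: int        # candidate segment start, milliseconds
    end_ms: int          # candidate segment end, milliseconds

def look_up(target_info, analysis_data):
    """Find every (video identifier, segment range) recorded for the target information."""
    return [(r.video_id, r.begin_ms, r.end_ms)
            for r in analysis_data if r.candidate_info == target_info]
```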
A video synthesis apparatus, the apparatus comprising:
a first information obtaining module, configured to obtain target information;
a matching relationship obtaining module, configured to obtain video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers;
a lookup module, configured to look up, in the video analysis data, a target video identifier and a target segment range corresponding to the target information;
a first video obtaining module, configured to obtain the corresponding target video according to the target video identifier;
a first extraction module, configured to extract the corresponding target video segment from the corresponding target video according to the target segment range; and
a first splicing module, configured to splice the extracted target video segments to obtain a synthesized video.
In one embodiment, the target information is used to determine a corresponding target pronunciation. The lookup module is further configured to look up, in the video analysis data, the identifiers of multiple target videos containing the target pronunciation, and to obtain the target segment range corresponding to the target pronunciation in each target video. The first splicing module is further configured to splice the extracted target video segments corresponding to the target pronunciation to obtain a synthesized video in which the target pronunciation repeats.
In one embodiment, the target information is used to determine a corresponding target picture. The lookup module is further configured to look up, in the video analysis data, the identifiers of multiple target videos containing the target picture, and to obtain the target segment range corresponding to the target picture in each target video. The first splicing module is further configured to splice the extracted target video segments corresponding to the target picture to obtain a synthesized video in which the target picture repeats.
In one embodiment, the first splicing module is further configured to obtain the duration of each extracted target video segment and to compute the total duration of the target video segments; when the total duration is less than a preset duration, the extracted target video segments are duplicated to obtain duplicated target video segments, and the duplicated segments are spliced together with the original extracted segments to obtain the synthesized video.
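A minimal sketch of this duplication step, assuming clips are (video id, start ms, end ms) triples (names and shapes are illustrative only):

```python
def pad_to_duration(clips, preset_ms):
    """Duplicate extracted clips until their total duration reaches preset_ms."""
    total = sum(end - begin for _, begin, end in clips)
    if total <= 0:
        return list(clips)
    padded = list(clips)
    while total < preset_ms:
        for clip in clips:             # replay the original clips in order
            padded.append(clip)
            total += clip[2] - clip[1]
            if total >= preset_ms:
                break
    return padded
```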
In one embodiment, when the target information includes multiple target words, the lookup module is further configured to look up, in the video analysis data, the target video identifier and the corresponding target segment range for each target word; the first splicing module is further configured to splice the target video segments extracted for each target word to obtain the synthesized video.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining target information;
obtaining video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers;
looking up, in the video analysis data, a target video identifier and a target segment range corresponding to the target information;
obtaining the corresponding target video according to the target video identifier;
extracting the corresponding target video segment from the corresponding target video according to the target segment range; and
splicing the extracted target video segments to obtain a synthesized video.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining target information;
obtaining video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers;
looking up, in the video analysis data, a target video identifier and a target segment range corresponding to the target information;
obtaining the corresponding target video according to the target video identifier;
extracting the corresponding target video segment from the corresponding target video according to the target segment range; and
splicing the extracted target video segments to obtain a synthesized video.
With the above video synthesis method, apparatus, computer device and storage medium, target information is obtained and video analysis data is obtained; since the video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers, the target video identifiers and target segment ranges corresponding to the target information can be looked up in the video analysis data according to the matching relationships. The corresponding target videos are obtained according to the target video identifiers, the corresponding target video segments are extracted from the corresponding target videos according to the target segment ranges, and the extracted target video segments are spliced to obtain a synthesized video. With this method, each target video segment is obtained automatically from the target information and then spliced automatically; the whole process needs no manual participation, production is simple, low-cost, and saves time and effort; and because the target video segments are determined automatically from the target information, manual-operation errors are avoided and extraction accuracy is improved.
A video synthesis method, the method comprising:
obtaining target information;
obtaining a target video;
looking up, in the target video according to the target information, a target segment range corresponding to the target information;
extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and
splicing the extracted target video segments to obtain a synthesized video.
A video synthesis apparatus, the apparatus comprising:
a second information obtaining module, configured to obtain target information;
a second video obtaining module, configured to obtain a target video;
a determining module, configured to determine, in the target video according to the target information, a target segment range corresponding to the target information;
a second extraction module, configured to extract, from the target video according to the target segment range, a target video segment corresponding to the target information; and
a second splicing module, configured to splice the extracted target video segments to obtain a synthesized video.
In one embodiment, the target information is used to determine a corresponding target pronunciation. The determining module is further configured to obtain the target segment ranges corresponding to the target pronunciation in the target video; the second splicing module is further configured to splice the extracted target video segments corresponding to the target pronunciation to obtain a synthesized video in which the target pronunciation repeats.
In one embodiment, the target information is used to determine a corresponding target picture. The determining module is further configured to obtain the target segment ranges corresponding to the target picture in the target video; the second splicing module is further configured to splice the extracted target video segments corresponding to the target picture to obtain a synthesized video in which the target picture repeats.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining target information;
obtaining a target video;
looking up, in the target video according to the target information, a target segment range corresponding to the target information;
extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and
splicing the extracted target video segments to obtain a synthesized video.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining target information;
obtaining a target video;
looking up, in the target video according to the target information, a target segment range corresponding to the target information;
extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and
splicing the extracted target video segments to obtain a synthesized video.
With the above video synthesis method, apparatus, computer device and storage medium, target information and a target video are obtained; the target segment range corresponding to the target information is looked up in the target video according to the target information, the target video segment corresponding to the target information is extracted from the target video according to the target segment range, and the extracted target video segments are spliced to obtain a synthesized video. With this method, the target segment ranges corresponding to the target information are found automatically in the obtained target video, the segments are extracted automatically according to those ranges, and the extracted segments are spliced; the whole process needs no manual participation, production is simple, low-cost, and saves time and effort; and because the target video segments are determined automatically from the target information, manual-operation errors are avoided and extraction accuracy is improved.
Brief description of the drawings
Fig. 1 is a diagram of the application environment of the video synthesis method in one embodiment;
Fig. 2 is a flowchart of the video synthesis method in one embodiment;
Fig. 3 is a schematic diagram of looking up the video identifiers and segment ranges corresponding to target information in one embodiment;
Fig. 4 is a flowchart of determining the matching relationships in one embodiment;
Fig. 5 is a schematic flow diagram of the video synthesis method in one embodiment;
Fig. 6 is a schematic diagram of obtaining candidate text information and segment ranges by speech recognition in one embodiment;
Fig. 7 is a schematic diagram of obtaining candidate objects and segment ranges by object recognition in one embodiment;
Fig. 8 is a schematic diagram of recognizing speech utterance objects and segment ranges in one embodiment;
Fig. 9 is a flowchart of obtaining target information in one embodiment;
Fig. 10 is a flowchart of splicing to obtain a synthesized video in one embodiment;
Fig. 11 is a flowchart of the video synthesis method in another embodiment;
Fig. 12 is a flowchart of the video synthesis method in yet another embodiment;
Fig. 13 is a structural block diagram of the video synthesis system in one embodiment;
Fig. 14 is a structural block diagram of the video synthesis apparatus in one embodiment;
Fig. 15 is a structural block diagram of the video synthesis apparatus in another embodiment;
Fig. 16 is a structural block diagram of the computer device in one embodiment.
Detailed description of the embodiments
In order to make the objects, technical solutions and advantages of this application clearer, the application is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
Fig. 1 shows the application environment of the video synthesis method in one embodiment. Referring to Fig. 1, the video synthesis method is applied in a video synthesis system. The system includes a terminal 110 and a server 120 connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as an independent server or as a cluster of multiple servers. The terminal 110 sends a video synthesis request containing target information to the server 120. After the server 120 obtains the target information, it obtains video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers; looks up, in the video analysis data, the target video identifiers and target segment ranges corresponding to the target information; obtains the corresponding target videos according to the target video identifiers; extracts the corresponding target video segments from the corresponding target videos according to the target segment ranges; splices the extracted target video segments to obtain a synthesized video; and sends the synthesized video to the terminal 110.
As shown in Fig. 2, in one embodiment, a video synthesis method is provided. The method can be applied to a server and can also be applied to a terminal; this embodiment is described as applied to a terminal. The video synthesis method specifically includes the following steps:
Step S202: obtain target information.
Here, target information is the information used as a query condition. The target information may be a word, the pinyin of a word such as "zhibo", or the name of a person, etc. The target information may be obtained by receiving information input by a user, or it may be information selected automatically; for example, a piece of text information can be recognized and the word with the highest frequency of occurrence in the text used as the target information.
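A hedged sketch of the automatic option, picking the most frequent word as the target information (the whitespace tokenizer and stop-word set are placeholders; the patent does not specify them):

```python
from collections import Counter

def pick_target_info(text, stop_words=frozenset()):
    """Use the word with the highest frequency in the text as the target information."""
    words = [w for w in text.split() if w and w not in stop_words]
    return Counter(words).most_common(1)[0][0]

print(pick_target_info("master asks and master answers master"))  # -> "master"
```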
Step S204: obtain video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers.
Here, video analysis data is data obtained by performing recognition analysis on the audio data or video data of videos. It records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers. Candidate information is pre-stored information used as a lookup criterion; it may be a word, the pinyin of a word, or the name of a person, etc. A video identifier uniquely identifies one video. A candidate segment range is the position, in the candidate video, of the video clip corresponding to the candidate information. A video clip range may be expressed by the start time point and end time point of the clip, or by the starting video frame identifier and ending video frame identifier of the clip; a video frame identifier is the number of a video frame and uniquely identifies one frame. Specifically, recognition analysis is performed on candidate videos in advance to obtain candidate information, the video clip range corresponding to each candidate information in the candidate video is obtained, and the candidate information, the video clip range and the corresponding candidate video identifier are stored in association, i.e. the matching relationship of the three is stored for convenient later lookup.
In one embodiment, before the video analysis data is obtained, the method further includes: obtaining a selected video range, and obtaining, according to the selected video range, the video analysis data corresponding to the videos in that range, so that target video segments corresponding to the target information are looked up only within the selected video range. Of course, all videos in the database may also be used as the lookup range. Since there may be many candidate videos in the database, setting a video retrieval range helps find the target videos corresponding to the target information and the corresponding target segment ranges more quickly.
Step S206: look up, in the video analysis data, the target video identifiers and target segment ranges corresponding to the target information.
Here, the video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers, so the target video identifiers corresponding to the target information, and the target segment ranges of the target information in those target videos, can be found from the target information. A target segment range may be expressed by the start time point and end time point of the video clip, or by the starting and ending video frame identifiers of the clip. There may be one or multiple target video identifiers corresponding to the target information, and there may be one or multiple target segment ranges corresponding to the target information within one target video.
Step S208: obtain the corresponding target video according to the target video identifier.
Here, since a video identifier uniquely identifies one video, once the target video identifier has been obtained, the target video corresponding to the target video identifier can be obtained, ready for subsequent clipping and splicing.
Step S210: extract the corresponding target video segment from the corresponding target video according to the target segment range.
Here, the corresponding target video segment is extracted from the corresponding target video according to the target segment range of the target information in that video. For example, if the target segment range is the 10th millisecond (ms) to the 20th millisecond, the video clip between 10 ms and 20 ms is extracted from the target video.
Step S212: splice the extracted target video segments to obtain a synthesized video.
Here, a synthesized video is a video obtained by splicing multiple video clips together. After each target video segment corresponding to the target information has been extracted, the segments are spliced to obtain the synthesized video. The splicing order may be random, may follow the order in which the segments were extracted, or may follow any other custom order; for example, the segments may be spliced in order of their lengths.
As shown in Fig. 3, in one embodiment, assume the target information is "master" and the videos found to contain "master" are the three videos A, B and C. In video A the target segment range corresponding to "master" is 10 ms (milliseconds) to 20 ms; in video B it is 10 ms to 25 ms; in video C there are two target segment ranges, 20 ms to 35 ms and 60 ms to 75 ms. The video clips corresponding to "master" are extracted from the respective videos according to these target segment ranges and then spliced to obtain a synthesized video S; as shown in Fig. 3, after the clips in each video have been extracted, splicing yields a 55 ms synthesized video S.
With the above video synthesis method, target information is obtained and video analysis data is obtained; since the video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers, the target video identifiers and target segment ranges corresponding to the target information can be looked up in the video analysis data according to the matching relationships, the corresponding target videos obtained according to the target video identifiers, the corresponding target video segments extracted from them according to the target segment ranges, and the extracted segments spliced to obtain a synthesized video. With this method, each target video segment is obtained automatically from the target information and spliced automatically; the whole process needs no manual participation, production is simple, low-cost, and saves time and effort; and because the target video segments are determined automatically from the target information, manual-operation errors are avoided and extraction accuracy is improved.
In one embodiment, looking up the target video identifier and target segment range corresponding to the target information in the video analysis data comprises: the target information being used to determine a corresponding target pronunciation, looking up, in the video analysis data, the identifiers of multiple target videos containing the target pronunciation, and obtaining the target segment range corresponding to the target pronunciation in each target video. Splicing the extracted target video segments to obtain the synthesized video comprises: splicing the extracted target video segments corresponding to the target pronunciation to obtain a synthesized video in which the target pronunciation repeats.
Here, the target pronunciation is the pronunciation corresponding to the target information. For example, if the target information is "live streaming", the target pronunciation is the pronunciation of those words. The video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers. In one embodiment, each candidate information corresponds to a candidate pronunciation, and the candidate segment range corresponding to the candidate information is the segment range of that candidate pronunciation in the video. Looking up the multiple target video identifiers corresponding to the target information in the video analysis data therefore means looking up the identifiers of the multiple target videos containing the target pronunciation; the target segment range corresponding to the target pronunciation in each target video is then obtained, and the extracted target video segments corresponding to the target pronunciation are spliced and combined to obtain a synthesized video that repeats the target pronunciation.
In one embodiment, looking up the target video identifier and target segment range corresponding to the target information in the video analysis data comprises: the target information being used to determine a corresponding target picture, looking up, in the video analysis data, the identifiers of multiple target videos containing the target picture, and obtaining the target segment range corresponding to the target picture in each target video. Splicing the extracted target video segments to obtain the synthesized video comprises: splicing the extracted target video segments corresponding to the target picture to obtain a synthesized video in which the target picture repeats.
Here, the target picture is the picture corresponding to the target information. For example, if the target information is "dog", the target picture is a picture containing a dog; if the target information is "Sun Wukong", the target picture is a picture containing Sun Wukong. The video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers. In one embodiment, each candidate information corresponds to a candidate picture, and the candidate segment range corresponding to the candidate information is the segment range of that candidate picture in the video. Looking up the multiple target video identifiers corresponding to the target information in the video analysis data therefore means looking up the identifiers of the multiple target videos containing the target picture; the target segment range corresponding to the target picture in each target video is then obtained, each target video segment is extracted according to its target segment range, and the extracted target video segments are spliced to obtain a synthesized video that repeats the target picture.
In one embodiment, before the target information is obtained, the method further includes: determining the matching relationships among candidate information, candidate segment ranges and candidate video identifiers. As shown in Fig. 4, this specifically includes the following steps:
Step S214: obtain candidate videos.
Here, a candidate video is a stored video used for lookup; the required target video can be found among the candidate videos according to the target information.
Step S216 is at least one of steps S216A, S216B and S216C.
Step S216A: perform speech recognition on the audio corresponding to the candidate video to obtain multiple corresponding candidate text informations, and use each candidate text information as candidate information.
Here, a video includes audio and video images: the audio is the sound in the video, and the video images are the pictures in the video. Speech recognition means recognizing the content of the sound as text. Candidate text information is text information obtained by speech recognition; it may be a recognized word or the pinyin of a word, and it may contain a single word or several words. In some embodiments, speech recognition on the audio directly yields one long text containing many words; word segmentation is applied to this long text to obtain multiple candidate words, and each candidate word is used as candidate text information.
In one embodiment, the principle of speech recognition is as follows. Speech recognition has three important modules: the acoustic model, the dictionary and the language model. The basic flow is this: the audio stream is divided into segments; each segment is first matched by the acoustic model to find which sounds it corresponds to, forming hypotheses about certain sounds; by looking up the dictionary, these sounds can map to different words; then, going from one word to the next, the collocation relationships between different words in the language model are used; by repeatedly scoring candidate paths with the acoustic model and the language model, and finishing when all segments of the audio stream have been compared, the highest-scoring result is selected as the recognition result.
Step S216B: perform image recognition on the image frames corresponding to the candidate video to obtain multiple corresponding candidate objects, and use each candidate object as candidate information.
Here, an image frame is a video image: a video is composed of frame-by-frame video images, and one image frame corresponds to one video image. Multiple candidate objects are obtained by performing image recognition on the image frames in the candidate video; a candidate object may be a recognized person, a recognized animal, or another recognized object. Candidate objects can be recognized with object recognition models, which are models obtained through learning and training and are used to recognize the target objects contained in an image. For example, to recognize whether an image contains "Sun Wukong", model training is first performed in advance (for example, training with a convolutional neural network model) to extract image features of "Sun Wukong", forming a recognition model for this character; the model is then used to recognize whether the video pictures contain Sun Wukong. In one embodiment, multiple object recognition models are preset to recognize different objects respectively, and the recognized candidate objects are used as candidate information.
Step S216C: perform speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, and use each candidate speech utterance object as candidate information.
Here, a speech utterance object is the producer of a sound; recognizing the speech utterance object means recognizing who uttered the speech. The audio in the candidate video is recognized to obtain the speech utterance object corresponding to each segment of audio data. For example, if a dialogue among four characters appears in a video, the character uttering each sentence is recognized; each recognized utterer is a recognized speech utterance object, and that speech utterance object is used as candidate information. For instance, if the recognized speech utterance object is "Donald Trump", then "Donald Trump" is used as candidate information.
Step S218: obtain the candidate segment range corresponding to each candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier.
Here, the candidate segment range is the position range, in the candidate video, of the video clip corresponding to the candidate information. When the candidate information is candidate text information, the segment corresponding to the pronunciation of the candidate text information in the candidate video is obtained. When the candidate information is a candidate object, the candidate segment range corresponding to the recognized candidate object in the candidate video is obtained, i.e. the segment range in which the candidate object appears in the candidate video. When the candidate information is a candidate speech utterance object, the candidate segment range corresponding to each candidate speech utterance object in the candidate video is obtained, i.e. the range in which the audio data of that speech utterance object occurs in the candidate video.
The candidate segment range is expressed with a segment start identifier and a segment end identifier. In one embodiment, the start time point is used as the segment start identifier and the end time point as the segment end identifier; for example, if the video clip corresponding to the candidate information spans the period from the 14th second to the 16th second of the candidate video, the video between the 14th and the 16th second is the corresponding clip. In another embodiment, the starting video frame identifier of the clip is used as the segment start identifier and the ending video frame identifier as the segment end identifier; for example, if the video clip corresponding to the candidate information spans the 20th to the 50th frame of the candidate video, the video between the 20th and the 50th frame is the corresponding clip.
Step S220: store the candidate information, the candidate segment range and the video identifier of the candidate video in association to obtain the matching relationship.
Here, for ease of later lookup, the candidate information, the candidate segment range and the video identifier of the candidate video are stored in association, giving the matching relationship of the three. The concrete storage layout can be customized. In one embodiment, candidate information, candidate video identifier and candidate segment range are stored in one-to-one correspondence: assuming video A contains multiple candidate informations, each candidate information is associated one-to-one with that candidate video identifier and the corresponding segment range. Concretely, the matching relationship of the three can be stored in a database with the following storage structure, as shown in Table 1.
Table 1

| Field name | Type         | Description                  |
|------------|--------------|------------------------------|
| id         | int          | Number, primary key          |
| pronounce  | varchar(256) | Candidate information        |
| duration   | int          | Pronunciation duration (ms)  |
| fileName   | varchar(256) | Video identifier             |
| beginTime  | int          | Start time (ms)              |
| endTime    | int          | End time (ms)                |
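A hedged sketch of this storage structure with Python's built-in sqlite3 module (SQLite and the exact DDL are illustrative choices; the patent does not name a database engine, only the Table 1 fields):

```python
import sqlite3

conn = sqlite3.connect("video_analysis.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS analysis (
        id        INTEGER PRIMARY KEY,  -- number, primary key
        pronounce VARCHAR(256),         -- candidate information
        duration  INTEGER,              -- pronunciation duration (ms)
        fileName  VARCHAR(256),         -- video identifier
        beginTime INTEGER,              -- start time (ms)
        endTime   INTEGER               -- end time (ms)
    )""")

# the lookup of step S206: every video and segment range matching the target info
rows = conn.execute("SELECT fileName, beginTime, endTime FROM analysis "
                    "WHERE pronounce = ?", ("master",)).fetchall()
```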
In another embodiment, to allow fast later lookup according to the target information, the same candidate information is stored in association with all of its corresponding candidate video identifiers, while each candidate video identifier is in turn associated with the corresponding candidate segment ranges. For example, assume the candidate video identifiers corresponding to candidate information S are the three identifiers A, B and C; candidate information S has two corresponding candidate segment ranges in video A, namely A1 and A2; three in video B, namely B1, B2 and B3; and one in video C, namely C1. The three can then be associated in a two-level form: candidate information S is associated directly with videos A, B and C, and videos A, B and C are each associated with their corresponding video clip ranges.
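A minimal sketch of the two-level association as a nested in-memory index (an illustrative structure with made-up millisecond values, not the patent's storage format):

```python
# level 1: candidate information -> video identifiers
# level 2: video identifier -> list of (begin_ms, end_ms) clip ranges
index = {
    "S": {
        "A": [(10, 20), (30, 40)],            # ranges A1, A2
        "B": [(10, 25), (50, 60), (80, 95)],  # ranges B1, B2, B3
        "C": [(20, 35)],                      # range C1
    },
}

def lookup(target_info):
    """First level finds the videos, second level finds their segment ranges."""
    return index.get(target_info, {})

print(lookup("S")["A"])  # -> [(10, 20), (30, 40)]
```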
Fig. 5 shows a schematic flow diagram of video synthesis in one embodiment, consisting of two parts. The first part (the right-hand side of the figure) obtains candidate videos, analyzes the audio or video of each candidate video to obtain multiple candidate informations, obtains the candidate segment range corresponding to each candidate information, and stores the matching relationships among the candidate information, the candidate segment ranges and the corresponding candidate video identifiers in the database. The second part (the left-hand side of the figure) obtains target information, looks up in the database the target video identifiers corresponding to the target information and the target segment range corresponding to each target video identifier, extracts the corresponding target video segments from the corresponding target videos according to the target segment ranges, and splices the extracted target video segments to obtain a synthesized video.
In one embodiment, the video synthesis method is applied in the scenario of guichu videos. A guichu video is a fairly common type of original video on video websites: highly synchronized, rapidly repeating material is cut to the rhythm of BGM (background music with a fairly fast beat) so that the twitch-like repetition achieves a brainwashing or comedic effect; in other words, a video (or audio) is clipped and re-synthesized into a highly rhythmic, highly synchronized piece built from picture (or sound) fragments repeated at very high frequency. Since a guichu video usually centers on the picture of some word being repeated over and over, it can be synthesized quickly by taking the core word of the guichu video as the target information, looking up in the database the target video identifiers corresponding to the target information and the target segment range corresponding to each target video identifier, extracting the corresponding target video segments from the corresponding target videos according to the target segment ranges, and splicing the extracted segments to synthesize the guichu video. In another embodiment, after the guichu video has been synthesized, background music can additionally be added to it to build a fast-paced comedic feel.
In one embodiment, the target information is target text information. Step S216A, performing speech recognition on the audio corresponding to the candidate video to obtain multiple corresponding candidate text informations, includes: performing speech recognition on the audio in the candidate video to obtain a speech text, and performing word segmentation on the speech text to obtain multiple candidate text informations. Step S218, obtaining the candidate segment range corresponding to each candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining the current candidate text information; obtaining the audio period corresponding to the current candidate text information, the audio period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
Here, the target text information may be a target word or the pinyin of a target word. Speech recognition on the audio in the candidate video yields a speech text containing multiple words, and word segmentation on the speech text yields multiple candidate text informations. In one embodiment, before word segmentation the recognized text is preprocessed, the preprocessing including the removal of stop words, i.e. meaningless function words such as the Chinese particles "de" (的, 地).
The current candidate text information is the text information currently being processed. The audio period corresponding to it, i.e. the audio period of its pronunciation in the candidate video, is obtained; the period includes a pronunciation start time point, used as the segment start identifier, and a pronunciation end time point, used as the segment end identifier. After the audio period of the current candidate text information has been determined, the next candidate text information is taken as the current candidate text information, and the step of obtaining the corresponding audio period is entered again. For example, as shown in Fig. 6, assume speech recognition on the audio of a video M yields five candidate words, word 1 to word 5, with the following audio periods: word 1 from 5 ms to 15 ms, word 2 from 20 ms to 25 ms, word 3 from 35 ms to 45 ms, word 4 from 55 ms to 70 ms, and word 5 from 75 ms to 90 ms. Each candidate word is subsequently stored in association with the corresponding video identifier and the corresponding audio period.
In one embodiment, the target information is the target pinyin corresponding to a target word. Step S216A, performing speech recognition on the audio corresponding to the candidate video to obtain multiple corresponding candidate text informations, includes: performing speech recognition on the audio in the candidate video to obtain a pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words. Step S218, obtaining the candidate segment range corresponding to each candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining the current candidate pinyin; obtaining the audio period corresponding to the audio data of the current candidate pinyin, the audio period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
Here, speech recognition on the audio in the candidate video yields a pinyin text containing the pinyin of multiple words, and word segmentation on the pinyin text yields the candidate pinyin corresponding to each of multiple candidate words. The current candidate pinyin is the pinyin currently being processed; the audio period of its audio data, i.e. of the audio corresponding to the pronunciation of the current candidate pinyin, is obtained. After the audio period of the current candidate pinyin has been determined, the candidate pinyin of the next candidate word is taken as the current candidate pinyin, and the step of obtaining the corresponding audio period is entered again.
By using the target pinyin as the target information, the video clips corresponding to all words that share the same pinyin, even though they are different words, can be extracted, which helps gather more clips. For example, for the target pinyin "shifu" the qualifying words include "master worker", "master", "restaurant", "actually paid", "poetry", and so on, all of which are pronounced "shifu" in Chinese. Using pinyin as the target information amounts to performing a fuzzy query: the video clips of all words with the same pinyin are extracted, which helps extract more qualifying video clips.
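A minimal sketch of this fuzzy matching with the third-party pypinyin package, deriving toneless pinyin from recognized words (an illustrative route; the patent recognizes pinyin directly from the audio):

```python
from pypinyin import lazy_pinyin  # pip install pypinyin

def matches_pinyin(word, target_pinyin):
    """True if the word's toneless pinyin equals the target pinyin."""
    return "".join(lazy_pinyin(word)) == target_pinyin

# different words, identical pinyin "shifu": all qualify for extraction
for w in ["师傅", "师父", "实付"]:
    print(w, matches_pinyin(w, "shifu"))  # all True
```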
In one embodiment, the target information further includes the tones corresponding to the target pinyin of the target word. Step S216A, performing speech recognition on the audio in the candidate video to obtain the candidate pinyin corresponding to multiple candidate words, includes: performing speech recognition on the audio in the candidate video to obtain the pinyin text and the tone of each pinyin in the pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words together with the tones corresponding to each candidate pinyin. Step S218, obtaining the candidate segment range corresponding to each candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining the current candidate pinyin and the tones corresponding to the current candidate pinyin; obtaining the corresponding audio period according to the current candidate pinyin and its tones, the audio period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
Here, in order to extract precisely the video clips corresponding to words with the same pronunciation, the target information contains not only the target pinyin but also the tones corresponding to the target pinyin. When speech recognition is performed on the audio in the candidate video, besides recognizing the pinyin text, the tone of each pinyin in the pinyin text must also be recognized. Word segmentation on the pinyin text then yields the candidate pinyin corresponding to each of multiple candidate words together with the corresponding tones. The current candidate pinyin and its corresponding tones are obtained, and the corresponding audio period is obtained according to them.
By using the target pinyin together with its corresponding tones as the target information, the video clips corresponding to words with both the same pinyin and the same tones can be extracted. For example, for the target pinyin "shifu" with the corresponding tones being the first tone and the falling (fourth) tone, the qualifying words include "master worker", "master", "poetry", etc. Including the tones in the target information helps extract the video clips of identically pronounced words; compared with using the target word directly as the target information, it helps extract more qualifying video clips.
In one embodiment, the target information is a target object. Step S216A of performing image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects includes: recognizing the objects contained in the image frames corresponding to the candidate video to obtain the corresponding multiple candidate objects. Step S218 of obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate object; obtaining the video frames in which the current candidate object appears in the candidate video; taking multiple consecutive video frames containing the current candidate object as one video clip, the video clip including an identifier of a start video frame and an identifier of an end video frame; and using the identifier of the start video frame as the segment start identifier and the identifier of the end video frame as the segment end identifier.
Here the target information is a target object. The target object may be a specified person, a specified animal, or another specified object. Multiple candidate objects are obtained by recognizing the objects contained in the image frames. In one embodiment, the candidate objects are recognized using trained object recognition models; an object recognition model for recognizing a given object is obtained by training on features extracted for that object. For example, assuming a candidate object is "dog", a recognition model for "dog" is trained on features extracted from images of "dog", and that model is then used to determine whether a video picture contains a "dog".
Since one video usually contains multiple objects, multiple object recognition models need to be preset so that each object in the video can be recognized, and the video frames in which each object appears can then be identified. As shown in Fig. 7, in one embodiment it is assumed that a video N contains 4 persons in total, namely person A, person B, person C, and person D, and these 4 persons are the candidate objects to be recognized. The preset object recognition models are used to recognize the 4 persons in the video and to identify the video frames in which each person appears. As shown in Fig. 7, the recognition result is that person A appears between the 10th frame and the 20th frame, person B appears between the 25th frame and the 40th frame, person C appears between the 45th frame and the 60th frame, and person D appears between the 70th frame and the 85th frame. The recognized persons are subsequently stored in association with the corresponding video identifier and the corresponding segment ranges.
The current candidate object refers to the object currently to be processed. The video frames in which the current candidate object appears in the candidate video are obtained, multiple consecutive video frames containing the current candidate object are taken as one video clip, and the start video frame identifier of that clip is used as the segment start identifier while its end video frame identifier is used as the segment end identifier. The recognized candidate objects are stored together with the candidate video identifiers and the corresponding video clips, so that the target video identifier and target fragment range corresponding to a target object can later be looked up according to the target object, the target video segments containing the target object can be extracted from the target videos according to the target fragment ranges, and the extracted target video segments can be synthesized into a synthetic video corresponding to the target object.
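The grouping of consecutive video frames into clips can be illustrated with the following minimal sketch; the per-frame presence flags are assumed to come from an object recognition model such as the ones described above.

```python
# A sketch of turning per-frame detections into candidate segment ranges.
def frames_to_segments(flags: list[bool]) -> list[tuple[int, int]]:
    """Group runs of consecutive frames containing the object into
    (start_frame, end_frame) pairs, i.e. segment start/end identifiers."""
    segments, start = [], None
    for i, present in enumerate(flags):
        if present and start is None:
            start = i                        # a run of frames begins
        elif not present and start is not None:
            segments.append((start, i - 1))  # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(flags) - 1))
    return segments

# Person A detected in frames 10..20 of a 100-frame video, as in Fig. 7:
flags = [10 <= i <= 20 for i in range(100)]
print(frames_to_segments(flags))  # [(10, 20)]
```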
In one embodiment, the target information is a speech utterance object. Step S216A of performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate utterance objects includes: performing speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects. Step S218 of obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate speech utterance object; obtaining the audio time period corresponding to the current candidate speech utterance object, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
Here the target information is a speech utterance object, that is, the sender of a voice. Speech utterance object recognition is performed on the audio in the candidate video to identify the multiple speech utterance objects that appear in the candidate video, i.e. the candidate speech utterance objects, and the audio time period corresponding to the audio data of each speech utterance object is obtained. An audio time period here may be a single period or multiple periods; if a person speaks several times at different positions in the candidate video, there are correspondingly multiple audio time periods. Each audio time period records the time point at which the voice starts and the time point at which it ends.
A speech utterance object is recognized by a pre-established speech utterance object recognition model. Since different persons have different speech utterance features, the speech utterance object corresponding to a piece of audio data can be identified by learning the speech utterance features of different persons. For example, as shown in Fig. 8, assume that four persons converse in a video K, namely "Lucy", "Joy", "Coco", and "Mary". The pre-established speech utterance object recognition model for each person is used to recognize the audio in the video, identify the audio data corresponding to each person, and obtain the audio time periods of the respective audio data. As shown in Fig. 8, the audio time period corresponding to "Lucy" is 20 ms to 30 ms; "Joy" has two audio time periods, 35 ms to 45 ms and 60 ms to 75 ms; the audio time period corresponding to "Coco" is 50 ms to 60 ms; and the audio time period corresponding to "Mary" is 80 ms to 90 ms.
As shown in Fig. 9, in one embodiment, obtaining the target information includes:
Step S202A: obtain an uploaded video.
In order to obtain the target information automatically from an uploaded video, the terminal first obtains the uploaded video.
Step S202B: perform speech recognition on the audio in the uploaded video to obtain a selection text, and count the frequency at which each word appears in the selection text.
A video contains audio, and speech recognition is performed on the audio in the uploaded video; the function of speech recognition is to convert the content contained in the audio into text. After the selection text is recognized, word segmentation is performed on the selection text to obtain individual words, and the frequency at which each word appears in the selection text is counted. Frequency refers to the number of occurrences within a unit time; the higher the frequency of a word, the more characteristic the word is.
Step S202C: determine a target word according to the frequency at which each word appears in the selection text, and use the target word as the target information.
After the frequency at which each word appears in the selection text is obtained, the target word is determined according to these frequencies: the word with the highest frequency may be used directly as the target word, or all words whose frequency exceeds a preset frequency may be used as target words, in which case there may be multiple target words. The determined target word is the target information.
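A minimal sketch of steps S202B and S202C follows, assuming the word list has already been produced by speech recognition and word segmentation (a package such as jieba could play the segmentation role); the words and the threshold are illustrative.

```python
# A sketch of picking the target word(s) by frequency.
from collections import Counter

def pick_target_words(words: list[str], min_count: int = 3) -> list[str]:
    freq = Counter(words)
    top_word, _ = freq.most_common(1)[0]
    # All words at or above the preset threshold; otherwise fall back to
    # the single most frequent word.
    frequent = [w for w, c in freq.items() if c >= min_count]
    return frequent or [top_word]

words = ["live", "stream", "live", "game", "live", "stream"]
print(pick_target_words(words))  # ['live']: only "live" appears 3 times
```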
As shown in Fig. 10, in one embodiment, step 212 of splicing the extracted target video segments to obtain the synthetic video includes:
Step S212A: obtain the duration of each extracted target video segment, and calculate the total duration corresponding to the multiple target video segments.
Before the video is synthesized, the duration of each target video segment is calculated first, and the total duration of all target video segments is then calculated. For example, assume that 3 segments are extracted, the 1st segment lasting 3 seconds, the second 8 seconds, and the third 10 seconds; the corresponding total duration is then 3 + 8 + 10 = 21 seconds.
Step S212B: judge whether the total duration is less than a preset duration; if so, proceed to step S212C, and if not, proceed to step S212D.
The total duration corresponding to the target video segments is calculated in order to judge the duration of the synthetic video; if the total duration is too short, it needs to be extended. Specifically, if the total duration is less than the preset duration, the extracted target video segments are replicated to obtain replicated target video segments. If the total duration is not less than the preset duration, no replication is needed.
Step S212C: replicate the extracted target video segments to obtain replicated target video segments, and splice the replicated target video segments with the extracted original target video segments to obtain the synthetic video.
A replicated target video segment is a copy of a target video segment. An original target video segment is a target video segment that was extracted directly; it is called the original target video segment to distinguish it from the replicated target video segments. The replicated target video segments obtained by copying are spliced with the original target video segments to obtain the synthetic video.
Step S212D: splice the target video segments to obtain the synthetic video.
When the total duration is not less than the preset duration, the extracted target video segments are spliced directly to obtain the synthetic video.
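Steps S212A to S212D can be sketched as follows; durations are in seconds, clips are represented abstractly, and the actual concatenation is left to a video-editing library.

```python
# A sketch of the duration check and replication from steps S212A-S212D.
def assemble(clips: list[dict], preset_duration: float) -> list[dict]:
    total = sum(c["duration"] for c in clips)  # e.g. 3 + 8 + 10 = 21 seconds
    ordered = list(clips)                      # original target video segments
    while total < preset_duration:             # total too short: replicate
        for clip in clips:
            ordered.append(clip)               # a replicated target segment
            total += clip["duration"]
            if total >= preset_duration:
                break
    return ordered                             # splice in this order

clips = [{"duration": 3}, {"duration": 8}, {"duration": 10}]
print(len(assemble(clips, 30)))  # 5: two replicated segments were appended
```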
In one embodiment, looking up the target video identifier and target fragment range corresponding to the target information from the video analysis data includes: when the target information contains multiple target words, looking up, from the video analysis data, the target video identifier and corresponding target fragment range for each target word; and splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted target video segments corresponding to each target word to obtain the synthetic video.
The target information may contain multiple target words, and a target word may exist in the form of text or in the form of pinyin. When the target information contains multiple target words, the target video identifier and corresponding target fragment range for each target word are looked up from the video analysis data, the target video segments corresponding to each target word are extracted according to the target fragment ranges, and the extracted target video segments are then spliced to obtain the synthetic video.
The splicing order can be customized. The target video segments corresponding to the same target word may preferentially be spliced together and then joined with the target video segments corresponding to the other target words; alternatively, the target video segments corresponding to different target words may be interleaved, and other orders, such as random splicing, may of course also be used. For example, assume that the target information contains three target words S1, S2, and S3; S1 corresponds to 3 target video segments, namely A1, B1, and C1; S2 corresponds to 4 target video segments, namely A2, B2, C2, and D2; and S3 corresponds to 3 target video segments, namely A3, B3, and C3. The splicing order may then be A1-B1-C1-A2-B2-C2-D2-A3-B3-C3 or A1-A2-A3-B1-B2-B3-C1-C2-C3-D2, and other splicing orders may of course also be used; both orders are sketched in the code below.
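A sketch of the two splicing orders from the example, using the clip names above; zip_longest performs the round-robin interleaving, and the gaps (S1 and S3 have no fourth clip) are skipped.

```python
# A sketch of the grouped and interleaved splicing orders.
from itertools import chain, zip_longest

groups = [["A1", "B1", "C1"], ["A2", "B2", "C2", "D2"], ["A3", "B3", "C3"]]

# Order 1: keep each target word's clips together.
grouped = list(chain.from_iterable(groups))
# Order 2: interleave across target words (round-robin), skipping gaps.
interleaved = [c for c in chain.from_iterable(zip_longest(*groups)) if c]

print(grouped)      # A1-B1-C1-A2-B2-C2-D2-A3-B3-C3
print(interleaved)  # A1-A2-A3-B1-B2-B3-C1-C2-C3-D2
```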
As shown in Fig. 11, in one embodiment a video synthesizing method is proposed, the method including:
Step S1101: obtain a candidate video.
Step S1102: perform speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information, and use each piece of candidate text information as candidate information.
Step S1103: perform image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects, and use each candidate object as candidate information.
Step S1104: perform speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, and use each candidate speech utterance object as candidate information.
Step S1105: obtain the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier.
Step S1106: store the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
Step S1107: obtain target information.
Step S1108: obtain video analysis data, the video analysis data recording the matching relationship among the candidate information, the candidate segment ranges, and the candidate video identifiers.
Step S1109: look up the target video identifier and target fragment range corresponding to the target information from the video analysis data.
Step S1110: obtain the corresponding target video according to the target video identifier.
Step S1111: extract the corresponding target video segment from the corresponding target video according to the target fragment range.
Step S1112: splice the extracted target video segments to obtain the synthetic video.
The above video synthesizing method performs recognition analysis on the candidate videos in advance, stores the relationships among the candidate information, the candidate video identifiers, and the candidate segment ranges, and then performs lookups against the stored relationships. In the following embodiment, a method is given that performs the recognition in real time to obtain the target video segments corresponding to the target information.
As shown in Fig. 12, in one embodiment a video synthesizing method is proposed, the method including:
Step S1202: obtain target information.
Target information refers to the information used as a query condition. The target information may be a word, the pinyin of a word, the name of a person, and so on. The target information may be obtained by receiving information input by a user, or it may be information screened out automatically; for example, a piece of text information may be analyzed and the word with the highest frequency of occurrence in the text may be used as the target information.
Step S1204: obtain a target video.
A target video refers to a specified video used to search for target video segments corresponding to the target information. There may be one target video or multiple target videos. The target video may be obtained by receiving an uploaded video, or it may be one of several videos selected from multiple provided videos.
Step S1206: determine, according to the target information, the target fragment range corresponding to the target information in the target video.
After the target information and the target video have been determined, the audio or the video images of the target video are analyzed to determine the target fragment range corresponding to the target information in the target video. In one embodiment, when speech recognition performed on the audio in the target video yields text information identical to the target information, the audio time period corresponding to the pronunciation of that text information is obtained, the audio time period including a pronunciation start time point and a pronunciation end time point, and the corresponding target fragment range is determined according to the audio time period. The target fragment range refers to the position of the target video segment corresponding to the target information within the target video. A target fragment range may be expressed as the start time point and end time point of a video clip, or as the start video frame identifier and end video frame identifier of a video clip. There may be one target fragment range corresponding to the target information in a target video, or there may be multiple.
Step S1208: extract the target video segment corresponding to the target information from the target video according to the target fragment range.
The corresponding target video segment is extracted from the corresponding target video according to the target fragment range of the target information within the target video. For example, assuming the target fragment range is from the 10th millisecond (ms) to the 20th millisecond (ms), the video clip between 10 ms and 20 ms is extracted from the target video.
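As a sketch of this extraction and the subsequent splicing, the following assumes the moviepy 1.x API (VideoFileClip.subclip and concatenate_videoclips); any video-editing library with sub-clip and concatenation operations would serve equally well, and the file names are illustrative.

```python
# A sketch of steps S1208/S1210, assuming the moviepy 1.x API.
from moviepy.editor import VideoFileClip, concatenate_videoclips

def extract_and_splice(path: str, ranges_ms: list[tuple[int, int]], out: str):
    video = VideoFileClip(path)
    # moviepy takes seconds; the fragment ranges here are in milliseconds.
    clips = [video.subclip(start / 1000, end / 1000) for start, end in ranges_ms]
    concatenate_videoclips(clips).write_videofile(out)

# e.g. the 10 ms .. 20 ms fragment range from the example above:
# extract_and_splice("target.mp4", [(10, 20)], "synthetic.mp4")
```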
Step S1210: splice the extracted target video segments to obtain the synthetic video.
A synthetic video refers to a video obtained by splicing multiple video clips. After the target video segments corresponding to the target information have been extracted, they are spliced to obtain the synthetic video. The splicing may be random, may follow the order in which the target video segments were extracted, or may of course follow another customized order; for example, the segments may be spliced in order of their lengths.
With the above video synthesizing method, the target information and the target video are obtained, the target fragment range corresponding to the target information is looked up in the target video according to the target information, the target video segments corresponding to the target information are then extracted from the target video according to the target fragment range, and the extracted target video segments are spliced to obtain the synthetic video. The method can automatically find, in the obtained target videos, the target fragment ranges corresponding to the target information, automatically perform the extraction according to the target fragment ranges, and then splice the extracted target video segments; the whole process requires no manual participation, and production is simple, low-cost, and saves both time and effort.
In one embodiment, determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: the target information being used to determine a corresponding target pronunciation, and obtaining the target fragment range corresponding to the target pronunciation in the target video; splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation.
A target pronunciation refers to the pronunciation corresponding to the target information. For example, if the target information is "live streaming", the target pronunciation is the pronunciation of the words "live streaming". The target fragment ranges corresponding to the target pronunciation are looked up in the given one or more target videos, the corresponding target video segments are extracted according to the target fragment ranges, and the extracted multiple target video segments corresponding to the target pronunciation are then spliced to obtain a synthetic video that repeats the target pronunciation.
In one embodiment, determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: the target information being used to determine a corresponding target picture, and obtaining the target fragment range corresponding to the target picture in the target video; splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
A target picture refers to the picture corresponding to the target information. For example, if the target information is "dog", the target picture is a picture containing a "dog"; if the target information is "Sun Wukong", the target picture is a picture containing "Sun Wukong". The target fragment ranges corresponding to the target picture are looked up in the given one or more target videos, and the corresponding target video segments are extracted according to the target fragment ranges, every video frame in each target video segment containing the target picture. The extracted multiple target video segments corresponding to the target picture are then spliced to obtain a synthetic video that repeats the target picture.
In one embodiment, the target information is target text information. Step S1206 of looking up the target fragment range corresponding to the target information in the target video according to the target information includes: performing speech recognition on the audio in the target video; when text information identical to the target text information is recognized, obtaining the start time point and end time point of the audio data of that text information within the target video; and determining the target fragment range corresponding to the target text information according to the start time point and the end time point.
Here the target information is target text information, and the target text information may be a target word or the pinyin of a target word. Speech recognition means recognizing the content in the audio as text; when text information identical to the target text information is recognized, the start time point and end time point of the corresponding audio data within the target video are obtained. The audio data refers to the data corresponding to the pronunciation of the text information. For example, assuming the target text information is "zhibo", the data corresponding to the pronunciation of "zhibo" in the target video is the audio data. The target fragment range corresponding to the target text information is determined according to the start time point and the end time point.
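Determining the fragment range from the recognized speech can be sketched as follows, assuming the recognizer yields word-level alignments as (word, start_ms, end_ms) triples, which many speech recognition toolkits expose, and reusing pypinyin so that the target text may be given either as a word or as its pinyin; the alignment data is illustrative.

```python
# A sketch of mapping word-level ASR alignments to target fragment ranges.
from pypinyin import lazy_pinyin

def find_fragment_ranges(alignments, target_text):
    ranges = []
    for word, start_ms, end_ms in alignments:
        # Match either the literal word or its pinyin rendering.
        if word == target_text or "".join(lazy_pinyin(word)) == target_text:
            ranges.append((start_ms, end_ms))
    return ranges

alignments = [("今天", 0, 400), ("直播", 400, 900), ("开始", 900, 1300)]
print(find_fragment_ranges(alignments, "zhibo"))  # [(400, 900)]
```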
In one embodiment, the target information is a target object. Looking up the target fragment range corresponding to the target information in the target video according to the target information includes: performing image recognition on the image frames in the target video to identify the video frames containing the target object; taking multiple consecutive video frames containing the target object as one video clip; and determining the target fragment range corresponding to the target object according to the identifier of the start video frame and the identifier of the end video frame of the video clip.
Here the target information is a target object, which may be a specified person, a specified animal, or another specified object. A video frame is a video image; a video is composed of frame-by-frame video frames, each video frame being one video image. The video frames containing the target object are identified by performing image recognition on the image frames in the target video. The target object in a video image can be recognized using a trained target object recognition model, the target object recognition model being used to recognize the target object contained in an image. In one embodiment, the target object recognition model is obtained by extracting the features of the target object and training a convolutional neural network model. The target object recognition model can be realized using existing techniques, and no limitation is placed here on how the target object recognition model is obtained. A target object may appear multiple times in a video, in which case multiple target fragment ranges are correspondingly obtained. The consecutive video frames containing the target object are taken as one video clip, and the corresponding target fragment range is then determined according to the start video frame identifier and the end video frame identifier of the video clip. For example, assuming the target object is "Sun Wukong" and "Sun Wukong" appears 5 separate times in the target video, there are correspondingly 5 target fragment ranges.
In one embodiment, the target information is a target speech utterance object. Looking up the target fragment range corresponding to the target information in the target video according to the target information includes: performing speech utterance object recognition on the audio in the target video to identify the audio fragments corresponding to the target utterance object; and determining the target fragment range according to the start time point and end time point corresponding to each audio fragment.
Here the target information is a target speech utterance object, that is, the sender of a voice. Speech utterance object recognition is performed on the audio in the target video to identify the audio data corresponding to the target speech utterance object, and the audio fragments corresponding to that audio data are obtained. An audio fragment here may be a single fragment or multiple fragments; for example, if the target speech utterance object speaks several times at different positions in the target video, there are correspondingly multiple audio fragments. An audio fragment includes a start time point and an end time point, and the target fragment range, i.e. the range corresponding to the target video segment, is determined according to the start time point and the end time point. The target speech utterance object is recognized by a pre-established target speech utterance object recognition model; since different persons have different speech utterance features, the audio data corresponding to the target utterance object can be identified by learning the speech utterance features of the target utterance object.
It should be understood that, although the steps in the flowcharts of Fig. 2 to Fig. 12 are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless expressly stated otherwise herein, there is no strict ordering restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in Fig. 2 to Fig. 12 may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential: they may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
As shown in Fig. 13, in one embodiment a video synthesizing device is proposed, the device including:
a first information obtaining module 1302, configured to obtain target information;
a matching relationship obtaining module 1304, configured to obtain video analysis data, the video analysis data recording the matching relationship among candidate information, candidate segment ranges, and candidate video identifiers;
a lookup module 1306, configured to look up the target video identifier and target fragment range corresponding to the target information from the video analysis data;
a first video obtaining module 1308, configured to obtain the corresponding target video according to the target video identifier;
a first extraction module 1310, configured to extract the corresponding target video segment from the corresponding target video according to the target fragment range; and
a first splicing module 1312, configured to splice the extracted target video segments to obtain a synthetic video.
In one embodiment, the target information is used to determine a corresponding target pronunciation; the lookup module 1306 is further configured to look up, from the video analysis data, the identifiers of the multiple target videos containing the target pronunciation and to obtain the target fragment range corresponding to the target pronunciation in each target video; and the first splicing module 1312 is further configured to splice the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation.
In one embodiment, the target information is used to determine a corresponding target picture; the lookup module 1306 is further configured to look up, from the video analysis data, the identifiers of the multiple target videos containing the target picture and to obtain the target fragment range corresponding to the target picture in each target video; and the first splicing module 1312 is further configured to splice the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
As shown in Fig. 14, in one embodiment, the above video synthesizing device further includes:
a candidate video obtaining module 1314, configured to obtain a candidate video;
a recognition module 1316, where the recognition module 1316 includes at least one of a first recognition module 1316A, a second recognition module 1316B, and a third recognition module 1316C:
the first recognition module 1316A, configured to perform speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information, and to use each piece of candidate text information as the candidate information; and/or
the second recognition module 1316B, configured to perform image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects, and to use each candidate object as the candidate information; and/or
the third recognition module 1316C, configured to perform speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, and to use each candidate speech utterance object as the candidate information;
a segment range obtaining module 1318, configured to obtain the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier; and
a storage module 1320, configured to store the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
In one embodiment, the target information is target text information; the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a speech text, and to perform word segmentation on the speech text to obtain multiple pieces of candidate text information; and the segment range obtaining module is further configured to obtain current candidate text information and to obtain the audio time period corresponding to the current candidate text information, the audio time period including a pronunciation start time point and a pronunciation end time point, and to use the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information is the target pinyin corresponding to a target word; the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a pinyin text, and to perform word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words; and the segment range obtaining module is further configured to obtain a current candidate pinyin and to obtain the audio time period corresponding to the current candidate pinyin, the audio time period including a pronunciation start time point and a pronunciation end time point, and to use the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information further includes the tone corresponding to the target pinyin of the target word; the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and to perform word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words and the tone corresponding to each candidate pinyin; and the segment range obtaining module is further configured to obtain a current candidate pinyin and its corresponding tone, to obtain the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period including a pronunciation start time point and a pronunciation end time point, and to use the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information is a target object; the second recognition module is further configured to recognize the objects contained in the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects; and the segment range obtaining module is further configured to obtain a current candidate object, to obtain the video frames in which the current candidate object appears in the candidate video, to take multiple consecutive video frames containing the current candidate object as one video clip, the video clip including an identifier of a start video frame and an identifier of an end video frame, and to use the identifier of the start video frame as the segment start identifier and the identifier of the end video frame as the segment end identifier.
In one embodiment, the target information is a speech utterance object; the third recognition module is further configured to perform speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects; and the segment range obtaining module is further configured to obtain a current candidate speech utterance object and to obtain the audio time period corresponding to the current candidate speech utterance object, the audio time period including a pronunciation start time point and a pronunciation end time point, and to use the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the first information obtaining module is further configured to obtain an uploaded video, to perform speech recognition on the uploaded video to obtain a selection text, to count the frequency at which each word appears in the selection text, to determine a target word according to the frequency at which each word appears in the selection text, and to use the target word as the target information.
In one embodiment, the first splicing module is further configured to obtain the duration of each extracted target video segment and calculate the total duration corresponding to the multiple target video segments, and, when the total duration is less than a preset duration, to replicate the extracted target video segments to obtain replicated target video segments and to splice the replicated target video segments with the extracted original target video segments to obtain the synthetic video.
In one embodiment, the lookup module is further configured to, when the target information contains multiple target words, look up, from the video analysis data, the target video identifier and corresponding target fragment range for each target word; and the first splicing module is further configured to splice the extracted target video segments corresponding to each target word to obtain the synthetic video.
As shown in Fig. 15, in one embodiment a video synthesizing device is proposed, the device including:
a second information obtaining module 1502, configured to obtain target information;
a second video obtaining module 1504, configured to obtain a target video;
a determining module 1506, configured to determine, according to the target information, the target fragment range corresponding to the target information in the target video;
a second extraction module 1508, configured to extract the target video segment corresponding to the target information from the target video according to the target fragment range; and
a second splicing module 1510, configured to splice the extracted target video segments to obtain a synthetic video.
In one embodiment, the target information is used to determine a corresponding target pronunciation; the determining module is further configured to obtain the target fragment range corresponding to the target pronunciation in the target video; and the second splicing module is further configured to splice the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation.
In one embodiment, the target information is used to determine a corresponding target picture; the determining module is further configured to obtain the target fragment range corresponding to the target picture in the target video; and the second splicing module is further configured to splice the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
In one embodiment, the target information is target text information; the determining module is further configured to perform speech recognition on the audio in the target video and, when text information identical to the target text information is recognized, to obtain the start time point and end time point of the audio data of that text information within the target video, and to determine the target fragment range corresponding to the target text information according to the start time point and the end time point.
In one embodiment, the target information is a target object; the determining module is further configured to perform image recognition on the video frames in the target video to identify the video frames containing the target object, to take multiple consecutive video frames containing the target object as one video clip, and to determine the target fragment range corresponding to the target object according to the start video frame identifier and the end video frame identifier of the video clip.
In one embodiment, the target information is a target speech utterance object; the determining module is further configured to perform speech utterance object recognition on the audio in the target video to identify the audio fragments corresponding to the target utterance object, and to determine the target fragment range according to the start time point and end time point corresponding to each audio fragment.
Fig. 16 shows the internal structure diagram of a computer device in one embodiment. The computer device may be a terminal or a server. As shown in Fig. 16, the computer device includes a processor, a memory, and a network interface connected via a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the video synthesizing method. A computer program may also be stored in the internal memory, and when that computer program is executed by the processor, it causes the processor to execute the video synthesizing method. Those skilled in the art will understand that the structure shown in Fig. 16 is merely a block diagram of the partial structure relevant to the solution of this application and does not limit the computer device to which the solution of this application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
In one embodiment, the video synthesizing method provided by this application may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in Fig. 16. The memory of the computer device may store the program modules that make up the video synthesizing device, for example the first information obtaining module 1302, the matching relationship obtaining module 1304, the lookup module 1306, the first video obtaining module 1308, the first extraction module 1310, and the first splicing module 1312 of Fig. 13. The computer program composed of these program modules causes the processor to execute the steps in the video synthesizing device of each embodiment of this application described in this specification. For example, the computer device shown in Fig. 16 may obtain the target information through the first information obtaining module 1302 of the video synthesizing device shown in Fig. 13; obtain, through the matching relationship obtaining module 1304, the video analysis data recording the matching relationship among the candidate information, the candidate segment ranges, and the candidate video identifiers; look up, through the lookup module, the target video identifier and target fragment range corresponding to the target information from the video analysis data; obtain, through the first video obtaining module 1308, the corresponding target video according to the target video identifier; extract, through the first extraction module 1310, the corresponding target video segment from the corresponding target video according to the target fragment range; and splice, through the first splicing module 1312, the extracted target video segments to obtain the synthetic video.
In one embodiment a computer device is proposed, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the following steps: obtaining target information; obtaining video analysis data, the video analysis data recording the matching relationship among candidate information, candidate segment ranges, and candidate video identifiers; looking up the target video identifier and target fragment range corresponding to the target information from the video analysis data; obtaining the corresponding target video according to the target video identifier; extracting the corresponding target video segment from the corresponding target video according to the target fragment range; and splicing the extracted target video segments to obtain a synthetic video.
In one embodiment, looking up the target video identifier and target fragment range corresponding to the target information from the video analysis data includes: the target information being used to determine a corresponding target pronunciation, looking up, from the video analysis data, the identifiers of the multiple target videos containing the target pronunciation and obtaining the target fragment range corresponding to the target pronunciation in each target video; or the target information being used to determine a corresponding target picture, looking up, from the video analysis data, the identifiers of the multiple target videos containing the target picture and obtaining the target fragment range corresponding to the target picture in each target video. Splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation; or splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
In one embodiment, before the target information is obtained, the method further includes: obtaining a candidate video; performing speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information, and using each piece of candidate text information as the candidate information; and/or performing image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects, and using each candidate object as the candidate information; and/or performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, and using each candidate speech utterance object as the candidate information; obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier; and storing the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
In one embodiment, the target information is target text information; performing speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information includes: performing speech recognition on the audio in the candidate video to obtain a speech text, and performing word segmentation on the speech text to obtain multiple pieces of candidate text information. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining current candidate text information and obtaining the audio time period corresponding to the current candidate text information, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information is the target pinyin corresponding to a target word; performing speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information includes: performing speech recognition on the audio in the candidate video to obtain a pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate pinyin and obtaining the audio time period corresponding to the current candidate pinyin, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information further includes the tone corresponding to the target pinyin of the target word; performing speech recognition on the audio in the candidate video to obtain the candidate pinyins corresponding to multiple candidate words includes: performing speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words and the tone corresponding to each candidate pinyin. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate pinyin and its corresponding tone, and obtaining the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information is a target object; performing image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects includes: recognizing the objects contained in the image frames corresponding to the candidate video to obtain the corresponding multiple candidate objects. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate object, obtaining the video frames in which the current candidate object appears in the candidate video, and taking multiple consecutive video frames containing the current candidate object as one video clip, the video clip including an identifier of a start video frame and an identifier of an end video frame; and using the identifier of the start video frame as the segment start identifier and the identifier of the end video frame as the segment end identifier.
In one embodiment, the target information is a speech utterance object; performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate utterance objects includes: performing speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate speech utterance object and obtaining the audio time period corresponding to the current candidate speech utterance object, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, obtaining the target information includes: obtaining an uploaded video; performing speech recognition on the uploaded video to obtain a selection text and counting the frequency at which each word appears in the selection text; and determining a target word according to the frequency at which each word appears in the selection text, and using the target word as the target information.
In one embodiment, splicing the extracted target video segments to obtain the synthetic video includes: obtaining the duration of each extracted target video segment and calculating the total duration corresponding to the multiple target video segments; and, when the total duration is less than a preset duration, replicating the extracted target video segments to obtain replicated target video segments and splicing the replicated target video segments with the extracted original target video segments to obtain the synthetic video.
In one embodiment, looking up the target video identifier and target fragment range corresponding to the target information from the video analysis data includes: when the target information contains multiple target words, looking up, from the video analysis data, the target video identifier and corresponding target fragment range for each target word; and splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted target video segments corresponding to each target word to obtain the synthetic video.
In one embodiment a computer device is proposed, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the following steps: obtaining target information; obtaining a target video; determining, according to the target information, the target fragment range corresponding to the target information in the target video; extracting the target video segment corresponding to the target information from the target video according to the target fragment range; and splicing the extracted target video segments to obtain a synthetic video.
In one embodiment, determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: the target information being used to determine a corresponding target pronunciation, obtaining the target fragment range corresponding to the target pronunciation in the target video; or the target information being used to determine a corresponding target picture, obtaining the target fragment range corresponding to the target picture in the target video. Splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation; or splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
In one embodiment, the target information is target text information; determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: performing speech recognition on the audio in the target video and, when text information identical to the target text information is recognized, obtaining the start time point and end time point of the audio data of that text information within the target video; and determining the target fragment range corresponding to the target text information according to the start time point and the end time point.
In one embodiment, the target information is a target object; determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: performing image recognition on the video frames in the target video to identify the video frames containing the target object, and taking multiple consecutive video frames containing the target object as one video clip; and determining the target fragment range corresponding to the target object according to the start video frame identifier and the end video frame identifier of the video clip.
In one embodiment, the target information is a target speech utterance object; determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: performing speech utterance object recognition on the audio in the target video to identify the audio fragments corresponding to the target utterance object, and determining the target fragment range according to the start time point and end time point corresponding to each audio fragment.
In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the following steps: obtaining target information; obtaining video analysis data, the video analysis data recording a matching relationship among candidate information, candidate segment ranges, and candidate video identifiers; searching the video analysis data for a target video identifier and a target segment range corresponding to the target information; obtaining a corresponding target video according to the target video identifier; extracting a corresponding target video segment from the corresponding target video according to the target segment range; and splicing each extracted target video segment to obtain a synthesized video.
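The following end-to-end sketch illustrates the lookup–extract–splice flow of these steps. It is not the patented implementation: the analysis-data layout is a plain dictionary, the video file names are hypothetical, and the clip operations assume the moviepy 1.x API (`VideoFileClip.subclip`, `concatenate_videoclips`).

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Hypothetical video analysis data: candidate information mapped to a list
# of (candidate video identifier, segment start second, segment end second).
analysis_data = {
    "你好": [("vid_001.mp4", 1.2, 1.8), ("vid_002.mp4", 7.0, 7.5)],
}

def synthesize(target_info, analysis_data, output_path="synth.mp4"):
    """Look up the target video identifiers and target segment ranges for
    the target information, extract the segments, and splice them."""
    matches = analysis_data.get(target_info, [])
    segments = [VideoFileClip(video_id).subclip(start, end)
                for video_id, start, end in matches]
    if segments:
        concatenate_videoclips(segments).write_videofile(output_path)

synthesize("你好", analysis_data)
```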
In one embodiment, searching the video analysis data for the target video identifier and the target segment range corresponding to the target information comprises: when the target information is used to determine a corresponding target pronunciation, searching the video analysis data for the identifiers of multiple target videos that contain the target pronunciation, and obtaining the target segment range corresponding to the target pronunciation in each target video; or, when the target information is used to determine a corresponding target picture, searching the video analysis data for the identifiers of multiple target videos that contain the target picture, and obtaining the target segment range corresponding to the target picture in each target video. Splicing each extracted target video segment to obtain a synthesized video comprises: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthesized video that repeats the target pronunciation; or splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthesized video that repeats the target picture.
In one embodiment, before obtaining the target information, the method further comprises: obtaining a candidate video; performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items, each candidate text information item serving as candidate information; and/or performing image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects, each candidate object serving as candidate information; and/or performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, each candidate speech utterance object serving as candidate information; obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark; and storing the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
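A sketch of how the matching relationship might be stored (illustrative; the speech, image, and speaker recognizers are stubbed out as an input list): each candidate information item maps to the candidate video identifier together with a segment range made of a segment start mark and a segment end mark.

```python
from collections import defaultdict

def index_candidate_video(video_id, recognized_items, index):
    """Associate each candidate information item with its segment range and
    the candidate video identifier.

    `recognized_items` is assumed to be the combined recognizer output:
    (candidate_info, start_mark, end_mark) tuples.
    """
    for candidate_info, start_mark, end_mark in recognized_items:
        index[candidate_info].append((video_id, start_mark, end_mark))

index = defaultdict(list)
index_candidate_video("vid_001.mp4", [("你好", 1.2, 1.8), ("世界", 2.0, 2.6)], index)
index_candidate_video("vid_002.mp4", [("你好", 7.0, 7.5)], index)
print(index["你好"])  # [('vid_001.mp4', 1.2, 1.8), ('vid_002.mp4', 7.0, 7.5)]
```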
In one embodiment, the target information is target text information. Performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items comprises: performing speech recognition on the audio in the candidate video to obtain a speech text, and performing word segmentation on the speech text to obtain the multiple candidate text information items. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate text information item and the audio time period corresponding to it, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
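For illustration of this word-segmentation step (the patent does not prescribe a library): jieba is a common Chinese word-segmentation package and is used here as a stand-in, and per-character timestamps from the recognizer are an assumption.

```python
import jieba  # common Chinese word-segmentation library, used as a stand-in

def candidate_texts_with_spans(char_times, text):
    """Cut the recognized speech text into candidate words and attach each
    word's pronunciation start / end time points.

    `char_times` is assumed to come from the recognizer: one
    (start_time, end_time) pair per character of `text`.
    """
    candidates, pos = [], 0
    for word in jieba.cut(text):
        start = char_times[pos][0]                # pronunciation start time
        end = char_times[pos + len(word) - 1][1]  # pronunciation end time
        candidates.append((word, start, end))
        pos += len(word)
    return candidates

times = [(1.2, 1.5), (1.5, 1.8), (2.0, 2.3), (2.3, 2.6)]
print(candidate_texts_with_spans(times, "你好世界"))
# e.g. [('你好', 1.2, 1.8), ('世界', 2.0, 2.6)]
```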
In one embodiment, the target information is the target pinyin corresponding to a target word. Performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items comprises: performing speech recognition on the audio in the candidate video to obtain a pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate pinyin and the audio time period corresponding to it, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
In one embodiment, the target information further includes the tone corresponding to the target pinyin of the target word. Performing speech recognition on the audio in the candidate video to obtain the candidate pinyin corresponding to the multiple candidate words comprises: performing speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of the multiple candidate words and the tone corresponding to each candidate pinyin. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate pinyin and its corresponding tone, and obtaining the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
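For illustration of the pinyin and tone variants of the target information (the patent does not name a library): the pypinyin package can produce both toneless pinyin and tone-numbered pinyin for a word, which could then be matched against the candidate pinyin and its corresponding tone.

```python
from pypinyin import lazy_pinyin, Style

word = "你好"
print(lazy_pinyin(word))                     # toneless pinyin: ['ni', 'hao']
print(lazy_pinyin(word, style=Style.TONE3))  # with tone digits: ['ni3', 'hao3']
```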
In one embodiment, the target information is a target object. Performing image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects comprises: recognizing the objects contained in the picture frames corresponding to the candidate video to obtain the multiple candidate objects. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate object and the video frames in which the current candidate object appears in the candidate video, and taking multiple consecutive video frames containing the current candidate object as one video clip, the video clip comprising the mark of a start video frame and the mark of an end video frame; and taking the mark of the start video frame as the segment start mark and the mark of the end video frame as the segment end mark.
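A small illustrative conversion (assuming a constant frame rate, which the patent does not require): the start frame mark and end frame mark of such a video clip translate into the time range needed when the segment is later extracted.

```python
def frame_marks_to_seconds(start_frame, end_frame, fps=25.0):
    """Convert a clip's start / end frame marks into a (start, end) time
    range in seconds, assuming a constant frame rate; end is inclusive."""
    return start_frame / fps, (end_frame + 1) / fps

print(frame_marks_to_seconds(75, 124))  # (3.0, 5.0) at 25 fps
```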
In one embodiment, the target information is a speech utterance object. Performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate utterance objects comprises: performing speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate speech utterance object and the audio time period corresponding to it, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
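As an illustrative sketch (the speaker-recognition model is assumed, not specified by the patent): speaker-diarization output of the form (speaker label, start time, end time) can be filtered into the segment start / end marks of one candidate speech utterance object.

```python
def utterance_segment_ranges(diarized, target_speaker):
    """Collect the segment start / end marks for one candidate speech
    utterance object from assumed speaker-diarization output."""
    return [(start, end) for speaker, start, end in diarized
            if speaker == target_speaker]

diarized = [("spk_A", 0.0, 2.1), ("spk_B", 2.1, 4.0), ("spk_A", 4.0, 5.5)]
print(utterance_segment_ranges(diarized, "spk_A"))  # [(0.0, 2.1), (4.0, 5.5)]
```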
In one embodiment, obtaining the target information comprises: obtaining an uploaded video; performing speech recognition on the uploaded video to obtain a selection text, and counting the frequency with which each word appears in the selection text; and determining a target word according to the frequency with which each word appears in the selection text, the target word serving as the target information.
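A sketch of this frequency-based selection (illustrative only; jieba is again a stand-in for the word segmentation): the most frequent word in the recognized selection text is taken as the target information.

```python
from collections import Counter
import jieba

def pick_target_word(selection_text, top_n=1):
    """Count how often each word occurs in the selection text of the
    uploaded video and return the most frequent word(s)."""
    counts = Counter(w for w in jieba.cut(selection_text) if w.strip())
    return [word for word, _ in counts.most_common(top_n)]

print(pick_target_word("你好 世界 你好 你好"))  # e.g. ['你好']
```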
In one embodiment, splicing each extracted target video segment to obtain a synthesized video comprises: obtaining the duration of each extracted target video segment and calculating the total duration corresponding to the multiple target video segments; and when the total duration is less than a preset duration, duplicating the extracted target video segments to obtain duplicated target video segments, and splicing the duplicated target video segments with the originally extracted target video segments to obtain the synthesized video.
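A minimal sketch of this padding step (illustrative; the segment representation is an assumption): segments are duplicated in round-robin order until the total duration reaches the preset duration.

```python
def pad_segments_to_duration(segments, preset_duration):
    """If the total duration of the extracted target video segments is less
    than the preset duration, append duplicated segments until it is not.

    `segments` is a list of (segment, duration_in_seconds) pairs.
    """
    total = sum(d for _, d in segments)
    result = list(segments)
    i = 0
    while total < preset_duration and segments:
        duplicate = segments[i % len(segments)]  # duplicated target video segment
        result.append(duplicate)
        total += duplicate[1]
        i += 1
    return result

clips = [("clip_a", 0.6), ("clip_b", 0.9)]
print(pad_segments_to_duration(clips, 3.0))
# originals plus duplicates until total duration >= 3.0 s
```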
In one embodiment, searching the video analysis data for the target video identifier and the target segment range corresponding to the target information comprises: when the target information includes multiple target words, searching the video analysis data for the target video identifier and the corresponding target segment range for each target word. Splicing each extracted target video segment to obtain a synthesized video comprises: splicing the extracted target video segments corresponding to each target word to obtain the synthesized video.
In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the following steps: obtaining target information; obtaining a target video; determining, in the target video according to the target information, a target segment range corresponding to the target information; extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and splicing each extracted target video segment to obtain a synthesized video.
In one embodiment, determining, in the target video according to the target information, the target segment range corresponding to the target information comprises: when the target information is used to determine a corresponding target pronunciation, obtaining the target segment range corresponding to the target pronunciation in the target video; or, when the target information is used to determine a corresponding target picture, obtaining the target segment range corresponding to the target picture in the target video. Splicing each extracted target video segment to obtain a synthesized video comprises: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthesized video that repeats the target pronunciation; or splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthesized video that repeats the target picture.
In one embodiment, the target information is target text information. Determining, in the target video according to the target information, the target segment range corresponding to the target information comprises: performing speech recognition on the audio in the target video; when text information identical to the target text information is recognized, obtaining the start time point and the end time point, in the target video, of the audio data corresponding to that text information; and determining the target segment range corresponding to the target text information according to the start time point and the end time point.
In one embodiment, the target information is a target object. Determining, in the target video according to the target information, the target segment range corresponding to the target information comprises: performing image recognition on the video frames in the target video to identify the video frames that contain the target object, and taking multiple consecutive video frames containing the target object as one video clip; and determining the target segment range corresponding to the target object according to the start frame identifier and the end frame identifier of the video clip.
In one embodiment, the target information is a target speech utterance object. Determining, in the target video according to the target information, the target segment range corresponding to the target information comprises: performing speech utterance object recognition on the audio in the target video to identify the audio fragment corresponding to the target utterance object, and determining the target segment range according to the start time point and end time point corresponding to the audio fragment.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above embodiment methods can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the patent scope of the application. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application patent shall be subject to the appended claims.
Claims (30)
1. A video synthesis method, the method comprising:
obtaining target information;
obtaining video analysis data, the video analysis data recording a matching relationship among candidate information, candidate segment ranges, and candidate video identifiers;
searching the video analysis data for a target video identifier and a target segment range corresponding to the target information;
obtaining a corresponding target video according to the target video identifier;
extracting a corresponding target video segment from the corresponding target video according to the target segment range; and
splicing each extracted target video segment to obtain a synthesized video.
2. The method according to claim 1, wherein searching the video analysis data for the target video identifier and the target segment range corresponding to the target information comprises:
when the target information is used to determine a corresponding target pronunciation, searching the video analysis data for the identifiers of multiple target videos containing the target pronunciation, and obtaining the target segment range corresponding to the target pronunciation in each target video; or
when the target information is used to determine a corresponding target picture, searching the video analysis data for the identifiers of multiple target videos containing the target picture, and obtaining the target segment range corresponding to the target picture in each target video;
and wherein splicing each extracted target video segment to obtain a synthesized video comprises:
splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthesized video that repeats the target pronunciation; or
splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthesized video that repeats the target picture.
3. The method according to claim 1, wherein, before obtaining the target information, the method further comprises:
obtaining a candidate video;
performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items, each candidate text information item serving as candidate information; and/or
performing image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects, each candidate object serving as candidate information; and/or
performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, each candidate speech utterance object serving as candidate information;
obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark; and
storing the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
4. The method according to claim 3, wherein the target information is target text information;
performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items comprises:
performing speech recognition on the audio in the candidate video to obtain a speech text, and performing word segmentation on the speech text to obtain the multiple candidate text information items;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate text information item and the audio time period corresponding to the current candidate text information item, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and
taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
5. The method according to claim 3, wherein the target information is the target pinyin corresponding to a target word;
performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items comprises:
performing speech recognition on the audio in the candidate video to obtain a pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate pinyin and the audio time period corresponding to the current candidate pinyin, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and
taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
6. The method according to claim 5, wherein the target information further includes the tone corresponding to the target pinyin of the target word;
performing speech recognition on the audio in the candidate video to obtain the candidate pinyin corresponding to the multiple candidate words comprises:
performing speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of the multiple candidate words and the tone corresponding to each candidate pinyin;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate pinyin and the tone corresponding to the current candidate pinyin, and obtaining the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and
taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
7. The method according to claim 3, wherein the target information is a target object;
performing image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects comprises:
recognizing the objects contained in the picture frames corresponding to the candidate video to obtain the multiple candidate objects;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate object and the video frames in which the current candidate object appears in the candidate video, and taking multiple consecutive video frames containing the current candidate object as one video clip, the video clip comprising the mark of a start video frame and the mark of an end video frame; and
taking the mark of the start video frame as the segment start mark and the mark of the end video frame as the segment end mark.
8. The method according to claim 3, wherein the target information is a speech utterance object;
performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate utterance objects comprises:
performing speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate speech utterance object and the audio time period corresponding to the current candidate speech utterance object, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and
taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
9. The method according to claim 1, wherein obtaining the target information comprises:
obtaining an uploaded video;
performing speech recognition on the uploaded video to obtain a selection text, and counting the frequency with which each word appears in the selection text; and
determining a target word according to the frequency with which each word appears in the selection text, the target word serving as the target information.
10. The method according to claim 1, wherein splicing each extracted target video segment to obtain a synthesized video comprises:
obtaining the duration of each extracted target video segment and calculating the total duration corresponding to the multiple target video segments; and
when the total duration is less than a preset duration, duplicating the extracted target video segments to obtain duplicated target video segments, and splicing the duplicated target video segments with the originally extracted target video segments to obtain the synthesized video.
11. The method according to claim 1, wherein searching the video analysis data for the target video identifier and the target segment range corresponding to the target information comprises:
when the target information includes multiple target words, searching the video analysis data for the target video identifier and the corresponding target segment range for each target word;
and wherein splicing each extracted target video segment to obtain a synthesized video comprises:
splicing the extracted target video segments corresponding to each target word to obtain the synthesized video.
12. A video synthesis method, the method comprising:
obtaining target information;
obtaining a target video;
determining, in the target video according to the target information, a target segment range corresponding to the target information;
extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and
splicing each extracted target video segment to obtain a synthesized video.
13. The method according to claim 12, wherein determining, in the target video according to the target information, the target segment range corresponding to the target information comprises:
when the target information is used to determine a corresponding target pronunciation, obtaining the target segment range corresponding to the target pronunciation in the target video; or
when the target information is used to determine a corresponding target picture, obtaining the target segment range corresponding to the target picture in the target video;
and wherein splicing each extracted target video segment to obtain a synthesized video comprises:
splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthesized video that repeats the target pronunciation; or
splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthesized video that repeats the target picture.
14. The method according to claim 12, wherein the target information is target text information;
and determining, in the target video according to the target information, the target segment range corresponding to the target information comprises:
performing speech recognition on the audio in the target video; when text information identical to the target text information is recognized, obtaining the start time point and the end time point, in the target video, of the audio data corresponding to that text information; and
determining the target segment range corresponding to the target text information according to the start time point and the end time point.
15. The method according to claim 12, wherein the target information is a target object;
and determining, in the target video according to the target information, the target segment range corresponding to the target information comprises:
performing image recognition on the video frames in the target video to identify the video frames containing the target object, and taking multiple consecutive video frames containing the target object as one video clip; and
determining the target segment range corresponding to the target object according to the start frame identifier and the end frame identifier of the video clip.
16. The method according to claim 12, wherein the target information is a target speech utterance object;
and determining, in the target video according to the target information, the target segment range corresponding to the target information comprises:
performing speech utterance object recognition on the audio in the target video to identify the audio fragment corresponding to the target utterance object, and determining the target segment range according to the start time point and end time point corresponding to the audio fragment.
17. A video synthesis apparatus, the apparatus comprising:
a first information obtaining module, configured to obtain target information;
a matching relationship obtaining module, configured to obtain video analysis data, the video analysis data recording a matching relationship among candidate information, candidate segment ranges, and candidate video identifiers;
a searching module, configured to search the video analysis data for a target video identifier and a target segment range corresponding to the target information;
a first video obtaining module, configured to obtain a corresponding target video according to the target video identifier;
a first extraction module, configured to extract a corresponding target video segment from the corresponding target video according to the target segment range; and
a first splicing module, configured to splice each extracted target video segment to obtain a synthesized video.
18. The apparatus according to claim 17, wherein the apparatus further comprises:
a candidate video obtaining module, configured to obtain a candidate video;
a first recognition module, configured to perform speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items, each candidate text information item serving as candidate information; and/or
a second recognition module, configured to perform image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects, each candidate object serving as candidate information; and/or
a third recognition module, configured to perform speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, each candidate speech utterance object serving as candidate information;
a segment range obtaining module, configured to obtain the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark; and
a storage module, configured to store the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
19. The apparatus according to claim 18, wherein the target information is target text information;
the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a speech text, and to perform word segmentation on the speech text to obtain the multiple candidate text information items; and
the segment range obtaining module is further configured to obtain a current candidate text information item and the audio time period corresponding to the current candidate text information item, the audio time period comprising a pronunciation start time point and a pronunciation end time point, and to take the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
20. The apparatus according to claim 18, wherein the target information is the target pinyin corresponding to a target word;
the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a pinyin text, and to perform word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words; and
the segment range obtaining module is further configured to obtain a current candidate pinyin and the audio time period corresponding to the current candidate pinyin, the audio time period comprising a pronunciation start time point and a pronunciation end time point, and to take the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
21. The apparatus according to claim 20, wherein the target information further includes the tone corresponding to the target pinyin of the target word;
the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and to perform word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words and the tone corresponding to each candidate pinyin; and
the segment range obtaining module is further configured to obtain a current candidate pinyin and the tone corresponding to the current candidate pinyin, to obtain the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period comprising a pronunciation start time point and a pronunciation end time point, and to take the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
22. The apparatus according to claim 18, wherein the target information is a target object;
the second recognition module is further configured to recognize the objects contained in the picture frames corresponding to the candidate video to obtain the multiple candidate objects; and
the segment range obtaining module is further configured to obtain a current candidate object and the video frames in which the current candidate object appears in the candidate video, to take multiple consecutive video frames containing the current candidate object as one video clip, the video clip comprising the mark of a start video frame and the mark of an end video frame, and to take the mark of the start video frame as the segment start mark and the mark of the end video frame as the segment end mark.
23. The apparatus according to claim 18, wherein the target information is a speech utterance object;
the third recognition module is further configured to perform speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects; and
the segment range obtaining module is further configured to obtain a current candidate speech utterance object and the audio time period corresponding to the current candidate speech utterance object, the audio time period comprising a pronunciation start time point and a pronunciation end time point, and to take the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
24. The apparatus according to claim 17, wherein the first information obtaining module is further configured to obtain an uploaded video, to perform speech recognition on the uploaded video to obtain a selection text, to count the frequency with which each word appears in the selection text, and to determine a target word according to the frequency with which each word appears in the selection text, the target word serving as the target information.
25. A video synthesis apparatus, the apparatus comprising:
a second information obtaining module, configured to obtain target information;
a second video obtaining module, configured to obtain a target video;
a determining module, configured to determine, in the target video according to the target information, a target segment range corresponding to the target information;
a second extraction module, configured to extract, from the target video according to the target segment range, a target video segment corresponding to the target information; and
a second splicing module, configured to splice each extracted target video segment to obtain a synthesized video.
26. The apparatus according to claim 25, wherein the target information is target text information; and the determining module is further configured to perform speech recognition on the audio in the target video, to obtain, when text information identical to the target text information is recognized, the start time point and the end time point, in the target video, of the audio data corresponding to that text information, and to determine the target segment range corresponding to the target text information according to the start time point and the end time point.
27. The apparatus according to claim 25, wherein the target information is a target object; and the determining module is further configured to perform image recognition on the video frames in the target video to identify the video frames containing the target object, to take multiple consecutive video frames containing the target object as one video clip, and to determine the target segment range corresponding to the target object according to the start frame identifier and the end frame identifier of the video clip.
28. The apparatus according to claim 25, wherein the target information is a target speech utterance object; and the determining module is further configured to perform speech utterance object recognition on the audio in the target video to identify the audio fragment corresponding to the target utterance object, and to determine the target segment range according to the start time point and end time point corresponding to the audio fragment.
29. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 16.
30. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 16.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810359953.5A CN110392281B (en) | 2018-04-20 | 2018-04-20 | Video synthesis method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810359953.5A CN110392281B (en) | 2018-04-20 | 2018-04-20 | Video synthesis method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110392281A true CN110392281A (en) | 2019-10-29 |
CN110392281B CN110392281B (en) | 2022-03-18 |
Family
ID=68282737
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810359953.5A Active CN110392281B (en) | 2018-04-20 | 2018-04-20 | Video synthesis method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110392281B (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110913271A (en) * | 2019-11-29 | 2020-03-24 | Oppo广东移动通信有限公司 | Video processing method, mobile terminal and non-volatile computer-readable storage medium |
CN111031333A (en) * | 2019-12-02 | 2020-04-17 | 北京达佳互联信息技术有限公司 | Video processing method, device, system and storage medium |
CN111031394A (en) * | 2019-12-30 | 2020-04-17 | 广州酷狗计算机科技有限公司 | Video production method, device, equipment and storage medium |
CN111246224A (en) * | 2020-03-24 | 2020-06-05 | 成都忆光年文化传播有限公司 | Video live broadcast method and video live broadcast system |
CN111918146A (en) * | 2020-07-28 | 2020-11-10 | 广州筷子信息科技有限公司 | Video synthesis method and system |
WO2021093737A1 (en) * | 2019-11-15 | 2021-05-20 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating video, electronic device, and computer readable medium |
WO2021120685A1 (en) * | 2019-12-20 | 2021-06-24 | 苏宁云计算有限公司 | Video generation method and apparatus, and computer system |
CN113055612A (en) * | 2019-12-27 | 2021-06-29 | 浙江宇视科技有限公司 | Video playing method, device, electronic equipment, system and medium |
CN113259754A (en) * | 2020-02-12 | 2021-08-13 | 北京达佳互联信息技术有限公司 | Video generation method and device, electronic equipment and storage medium |
CN113268635A (en) * | 2021-05-19 | 2021-08-17 | 北京达佳互联信息技术有限公司 | Video processing method, device, server and computer readable storage medium |
CN113286173A (en) * | 2021-05-19 | 2021-08-20 | 北京沃东天骏信息技术有限公司 | Video editing method and device |
CN113572977A (en) * | 2021-07-06 | 2021-10-29 | 上海哔哩哔哩科技有限公司 | Video production method and device |
CN113613068A (en) * | 2021-08-03 | 2021-11-05 | 北京字跳网络技术有限公司 | Video processing method and device, electronic equipment and storage medium |
CN113676772A (en) * | 2021-08-16 | 2021-11-19 | 上海哔哩哔哩科技有限公司 | Video generation method and device |
WO2021259322A1 (en) * | 2020-06-23 | 2021-12-30 | 广州筷子信息科技有限公司 | System and method for generating video |
CN114257862A (en) * | 2020-09-24 | 2022-03-29 | 北京字跳网络技术有限公司 | Video generation method, device, equipment and storage medium |
CN117495757A (en) * | 2022-07-22 | 2024-02-02 | 数坤(深圳)智能网络科技有限公司 | Ultrasound image regularization method, device, equipment and computer-readable storage medium |
CN118042248A (en) * | 2024-04-11 | 2024-05-14 | 深圳市捷易科技有限公司 | Video generation method, device, equipment and readable storage medium |
Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070168864A1 (en) * | 2006-01-11 | 2007-07-19 | Koji Yamamoto | Video summarization apparatus and method |
CN101021855A (en) * | 2006-10-11 | 2007-08-22 | 鲍东山 | Video searching system based on content |
CN102427507A (en) * | 2011-09-30 | 2012-04-25 | 北京航空航天大学 | Football video highlight automatic synthesis method based on event model |
CN103260082A (en) * | 2013-05-21 | 2013-08-21 | 王强 | Video processing method and device |
CN103956166A (en) * | 2014-05-27 | 2014-07-30 | 华东理工大学 | Multimedia courseware retrieval system based on voice keyword recognition |
CN104244086A (en) * | 2014-09-03 | 2014-12-24 | 陈飞 | Video real-time splicing device and method based on real-time conversation semantic analysis |
CN104796781A (en) * | 2015-03-31 | 2015-07-22 | 小米科技有限责任公司 | Video clip extraction method and device |
CN105224925A (en) * | 2015-09-30 | 2016-01-06 | 努比亚技术有限公司 | Video process apparatus, method and mobile terminal |
CN105578222A (en) * | 2016-02-01 | 2016-05-11 | 百度在线网络技术(北京)有限公司 | Information push method and device |
CN106021496A (en) * | 2016-05-19 | 2016-10-12 | 海信集团有限公司 | Video search method and video search device |
CN106297799A (en) * | 2016-08-09 | 2017-01-04 | 乐视控股(北京)有限公司 | Voice recognition processing method and device |
CN106331479A (en) * | 2016-08-22 | 2017-01-11 | 北京金山安全软件有限公司 | Video processing method and device and electronic equipment |
CN106372246A (en) * | 2016-09-20 | 2017-02-01 | 深圳市同行者科技有限公司 | Audio playing method and device |
US20170062006A1 (en) * | 2015-08-26 | 2017-03-02 | Twitter, Inc. | Looping audio-visual file generation based on audio and video analysis |
CN106663099A (en) * | 2014-04-10 | 2017-05-10 | 谷歌公司 | Methods, systems, and media for searching for video content |
CN106921749A (en) * | 2017-03-31 | 2017-07-04 | 北京京东尚科信息技术有限公司 | For the method and apparatus of pushed information |
WO2017157276A1 (en) * | 2016-03-14 | 2017-09-21 | 腾讯科技(深圳)有限公司 | Method and device for stitching multimedia files |
CN107517405A (en) * | 2017-07-31 | 2017-12-26 | 努比亚技术有限公司 | The method, apparatus and computer-readable recording medium of a kind of Video processing |
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070168864A1 (en) * | 2006-01-11 | 2007-07-19 | Koji Yamamoto | Video summarization apparatus and method |
CN101021855A (en) * | 2006-10-11 | 2007-08-22 | 鲍东山 | Video searching system based on content |
CN102427507A (en) * | 2011-09-30 | 2012-04-25 | 北京航空航天大学 | Football video highlight automatic synthesis method based on event model |
CN103260082A (en) * | 2013-05-21 | 2013-08-21 | 王强 | Video processing method and device |
CN106663099A (en) * | 2014-04-10 | 2017-05-10 | 谷歌公司 | Methods, systems, and media for searching for video content |
CN103956166A (en) * | 2014-05-27 | 2014-07-30 | 华东理工大学 | Multimedia courseware retrieval system based on voice keyword recognition |
CN104244086A (en) * | 2014-09-03 | 2014-12-24 | 陈飞 | Video real-time splicing device and method based on real-time conversation semantic analysis |
CN104796781A (en) * | 2015-03-31 | 2015-07-22 | 小米科技有限责任公司 | Video clip extraction method and device |
US20170062006A1 (en) * | 2015-08-26 | 2017-03-02 | Twitter, Inc. | Looping audio-visual file generation based on audio and video analysis |
CN105224925A (en) * | 2015-09-30 | 2016-01-06 | 努比亚技术有限公司 | Video process apparatus, method and mobile terminal |
CN105578222A (en) * | 2016-02-01 | 2016-05-11 | 百度在线网络技术(北京)有限公司 | Information push method and device |
WO2017157276A1 (en) * | 2016-03-14 | 2017-09-21 | 腾讯科技(深圳)有限公司 | Method and device for stitching multimedia files |
CN106021496A (en) * | 2016-05-19 | 2016-10-12 | 海信集团有限公司 | Video search method and video search device |
CN106297799A (en) * | 2016-08-09 | 2017-01-04 | 乐视控股(北京)有限公司 | Voice recognition processing method and device |
CN106331479A (en) * | 2016-08-22 | 2017-01-11 | 北京金山安全软件有限公司 | Video processing method and device and electronic equipment |
CN106372246A (en) * | 2016-09-20 | 2017-02-01 | 深圳市同行者科技有限公司 | Audio playing method and device |
CN106921749A (en) * | 2017-03-31 | 2017-07-04 | 北京京东尚科信息技术有限公司 | For the method and apparatus of pushed information |
CN107517405A (en) * | 2017-07-31 | 2017-12-26 | 努比亚技术有限公司 | The method, apparatus and computer-readable recording medium of a kind of Video processing |
Non-Patent Citations (1)
Title |
---|
杨朝欢 (Yang Chaohuan): "Duplicate Video Detection Based on Deep Learning", China Masters' Theses Full-text Database (Basic Sciences) * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11818424B2 (en) | 2019-11-15 | 2023-11-14 | Beijing Bytedance Network Technology Co., Ltd. | Method and apparatus for generating video, electronic device, and computer readable medium |
WO2021093737A1 (en) * | 2019-11-15 | 2021-05-20 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating video, electronic device, and computer readable medium |
CN110913271B (en) * | 2019-11-29 | 2022-01-18 | Oppo广东移动通信有限公司 | Video processing method, mobile terminal and non-volatile computer-readable storage medium |
CN110913271A (en) * | 2019-11-29 | 2020-03-24 | Oppo广东移动通信有限公司 | Video processing method, mobile terminal and non-volatile computer-readable storage medium |
CN111031333A (en) * | 2019-12-02 | 2020-04-17 | 北京达佳互联信息技术有限公司 | Video processing method, device, system and storage medium |
CN111031333B (en) * | 2019-12-02 | 2022-04-22 | 北京达佳互联信息技术有限公司 | Video processing method, device, system and storage medium |
WO2021120685A1 (en) * | 2019-12-20 | 2021-06-24 | 苏宁云计算有限公司 | Video generation method and apparatus, and computer system |
CN113055612A (en) * | 2019-12-27 | 2021-06-29 | 浙江宇视科技有限公司 | Video playing method, device, electronic equipment, system and medium |
CN111031394A (en) * | 2019-12-30 | 2020-04-17 | 广州酷狗计算机科技有限公司 | Video production method, device, equipment and storage medium |
CN113259754B (en) * | 2020-02-12 | 2023-09-19 | 北京达佳互联信息技术有限公司 | Video generation method, device, electronic equipment and storage medium |
CN113259754A (en) * | 2020-02-12 | 2021-08-13 | 北京达佳互联信息技术有限公司 | Video generation method and device, electronic equipment and storage medium |
CN111246224A (en) * | 2020-03-24 | 2020-06-05 | 成都忆光年文化传播有限公司 | Video live broadcast method and video live broadcast system |
WO2021259322A1 (en) * | 2020-06-23 | 2021-12-30 | 广州筷子信息科技有限公司 | System and method for generating video |
CN111918146A (en) * | 2020-07-28 | 2020-11-10 | 广州筷子信息科技有限公司 | Video synthesis method and system |
CN111918146B (en) * | 2020-07-28 | 2021-06-01 | 广州筷子信息科技有限公司 | Video synthesis method and system |
CN114257862B (en) * | 2020-09-24 | 2024-05-14 | 北京字跳网络技术有限公司 | Video generation method, device, equipment and storage medium |
CN114257862A (en) * | 2020-09-24 | 2022-03-29 | 北京字跳网络技术有限公司 | Video generation method, device, equipment and storage medium |
CN113286173A (en) * | 2021-05-19 | 2021-08-20 | 北京沃东天骏信息技术有限公司 | Video editing method and device |
CN113268635A (en) * | 2021-05-19 | 2021-08-17 | 北京达佳互联信息技术有限公司 | Video processing method, device, server and computer readable storage medium |
CN113268635B (en) * | 2021-05-19 | 2024-01-02 | 北京达佳互联信息技术有限公司 | Video processing method, device, server and computer readable storage medium |
CN113286173B (en) * | 2021-05-19 | 2023-08-04 | 北京沃东天骏信息技术有限公司 | Video editing method and device |
CN113572977B (en) * | 2021-07-06 | 2024-02-27 | 上海哔哩哔哩科技有限公司 | Video production method and device |
CN113572977A (en) * | 2021-07-06 | 2021-10-29 | 上海哔哩哔哩科技有限公司 | Video production method and device |
CN113613068A (en) * | 2021-08-03 | 2021-11-05 | 北京字跳网络技术有限公司 | Video processing method and device, electronic equipment and storage medium |
CN113676772B (en) * | 2021-08-16 | 2023-08-08 | 上海哔哩哔哩科技有限公司 | Video generation method and device |
CN113676772A (en) * | 2021-08-16 | 2021-11-19 | 上海哔哩哔哩科技有限公司 | Video generation method and device |
CN117495757A (en) * | 2022-07-22 | 2024-02-02 | 数坤(深圳)智能网络科技有限公司 | Ultrasound image regularization method, device, equipment and computer-readable storage medium |
CN117495757B (en) * | 2022-07-22 | 2025-05-06 | 数坤(深圳)智能网络科技有限公司 | Ultrasound image normalization method, device, equipment and computer readable storage medium |
CN118042248A (en) * | 2024-04-11 | 2024-05-14 | 深圳市捷易科技有限公司 | Video generation method, device, equipment and readable storage medium |
CN118042248B (en) * | 2024-04-11 | 2024-07-05 | 深圳市捷易科技有限公司 | Video generation method, device, equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110392281B (en) | 2022-03-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110392281A (en) | Image synthesizing method, device, computer equipment and storage medium | |
CN114401438B (en) | Video generation method and device for virtual digital person, storage medium and terminal | |
CN109658916B (en) | Speech synthesis method, speech synthesis device, storage medium and computer equipment | |
CN104157285B (en) | Audio recognition method, device and electronic equipment | |
DE69514382T2 (en) | VOICE RECOGNITION | |
Maity et al. | IITKGP-MLILSC speech database for language identification | |
CN108711422A (en) | Audio recognition method, device, computer readable storage medium and computer equipment | |
DE602004012909T2 (en) | A method and apparatus for modeling a speech recognition system and estimating a word error rate based on a text | |
US20170133038A1 (en) | Method and apparatus for keyword speech recognition | |
CN108281139A (en) | Speech transcription method and apparatus, robot | |
CN109979440B (en) | Keyword sample determination method, voice recognition method, device, equipment and medium | |
CN111402865B (en) | Method for generating voice recognition training data and method for training voice recognition model | |
CN110121116A (en) | Video generation method and device | |
KR20150052600A (en) | System for grasping speech meaning of recording audio data based on keyword spotting, and indexing method and method thereof using the system | |
CN113593522B (en) | Voice data labeling method and device | |
DE10054583C2 (en) | Method and apparatus for recording, searching and playing back notes | |
CN110298463A (en) | Meeting room preordering method, device, equipment and storage medium based on speech recognition | |
CN110442855B (en) | Voice analysis method and system | |
CN112397053B (en) | Voice recognition method and device, electronic equipment and readable storage medium | |
CN106297766B (en) | Phoneme synthesizing method and system | |
CN108364655A (en) | Method of speech processing, medium, device and computing device | |
Lee et al. | Voice imitating text-to-speech neural networks | |
CN119479620A (en) | Streaming voice interaction method and related device, equipment and storage medium | |
CN110059174A (en) | Inquiry guidance method and device | |
CN108899016A (en) | Speech text regularization method, apparatus, device and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |