CN110392281A - Video synthesis method, apparatus, computer device and storage medium - Google Patents
Video synthesis method, apparatus, computer device and storage medium
- Publication number: CN110392281A
- Application number: CN201810359953.5A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- H04N21/232—Content retrieval operation locally within server, e.g. reading video streams from disk arrays
- H04N21/23418—Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
- H04N21/23424—Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
- H04N21/432—Content retrieval operation from a local storage medium, e.g. hard-disk
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
- H04N21/439—Processing of audio elementary streams
- H04N21/44008—Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
- H04N21/44016—Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
Abstract
This application relates to a video synthesis method. The method includes: obtaining target information; obtaining video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers; looking up, in the video analysis data, the target video identifiers and target segment ranges corresponding to the target information; obtaining the corresponding target videos according to the target video identifiers; extracting the corresponding target video segments from the corresponding target videos according to the target segment ranges; and splicing the extracted target video segments to obtain a synthesized video. Producing a synthesized video with this method is simple, low-cost, and saves time and effort. A video synthesis apparatus, a computer device and a storage medium are also proposed.
Description
Technical field
This application relates to the field of computer processing technology, and in particular to a video synthesis method, apparatus, computer device and storage medium.
Background
Video synthesis refers to combining multiple video clips into a single video. Traditionally, video synthesis is manual: a video is clipped and then assembled by hand with video editing software. For example, a guichu video (a remix genre) is a video synthesized by manual clipping and editing. Producing such a synthesized video requires selecting and clipping source material, which means repeatedly watching the entire video and usually spending a great deal of time and energy. Moreover, manual clipping locates the desired segments by dragging through the video, so the result is limited by the producer's subjective operation and the extracted video clips are often inaccurate.
Summary of the invention
In view of the above problems, it is necessary to propose a video synthesis method, apparatus, computer device and storage medium that make video production simple, low-cost and highly accurate.
A video synthesis method, the method comprising:
obtaining target information;
obtaining video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers;
looking up, in the video analysis data, a target video identifier and a target segment range corresponding to the target information;
obtaining the corresponding target video according to the target video identifier;
extracting the corresponding target video segment from the corresponding target video according to the target segment range; and
splicing the extracted target video segments to obtain a synthesized video.
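For orientation, a minimal Python sketch of the lookup these steps describe (the record type, field names and units are hypothetical illustrations, not the patent's data format):

```python
from dataclasses import dataclass

@dataclass
class AnalysisRecord:
    """One entry of video analysis data (illustrative field names)."""
    candidate_info: str  # e.g. a word, a pinyin string, or an object name
    video_id: str        # candidate video identifier
    begin_ms: int        # candidate segment start, milliseconds
    end_ms: int          # candidate segment end, milliseconds

def look_up(target_info, analysis_data):
    """Find every (video identifier, segment range) recorded for the target information."""
    return [(r.video_id, r.begin_ms, r.end_ms)
            for r in analysis_data if r.candidate_info == target_info]
```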
A video synthesis apparatus, the apparatus comprising:
a first information obtaining module, configured to obtain target information;
a matching relationship obtaining module, configured to obtain video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers;
a lookup module, configured to look up, in the video analysis data, a target video identifier and a target segment range corresponding to the target information;
a first video obtaining module, configured to obtain the corresponding target video according to the target video identifier;
a first extraction module, configured to extract the corresponding target video segment from the corresponding target video according to the target segment range; and
a first splicing module, configured to splice the extracted target video segments to obtain a synthesized video.
In one embodiment, the target information is used to determine a corresponding target pronunciation. The lookup module is further configured to look up, in the video analysis data, the identifiers of multiple target videos containing the target pronunciation, and to obtain the target segment range corresponding to the target pronunciation in each target video. The first splicing module is further configured to splice the extracted target video segments corresponding to the target pronunciation to obtain a synthesized video in which the target pronunciation repeats.
In one embodiment, the target information is used to determine a corresponding target picture. The lookup module is further configured to look up, in the video analysis data, the identifiers of multiple target videos containing the target picture, and to obtain the target segment range corresponding to the target picture in each target video. The first splicing module is further configured to splice the extracted target video segments corresponding to the target picture to obtain a synthesized video in which the target picture repeats.
In one embodiment, the first splicing module is further configured to obtain the duration of each extracted target video segment and to compute the total duration of the target video segments; when the total duration is less than a preset duration, the extracted target video segments are duplicated to obtain duplicated target video segments, and the duplicated segments are spliced together with the original extracted segments to obtain the synthesized video.
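A minimal sketch of this duplication step, assuming clips are (video id, start ms, end ms) triples (names and shapes are illustrative only):

```python
def pad_to_duration(clips, preset_ms):
    """Duplicate extracted clips until their total duration reaches preset_ms."""
    total = sum(end - begin for _, begin, end in clips)
    if total <= 0:
        return list(clips)
    padded = list(clips)
    while total < preset_ms:
        for clip in clips:             # replay the original clips in order
            padded.append(clip)
            total += clip[2] - clip[1]
            if total >= preset_ms:
                break
    return padded
```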
In one embodiment, when the target information includes multiple target words, the lookup module is further configured to look up, in the video analysis data, the target video identifier and the corresponding target segment range for each target word; the first splicing module is further configured to splice the target video segments extracted for each target word to obtain the synthesized video.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining target information;
obtaining video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers;
looking up, in the video analysis data, a target video identifier and a target segment range corresponding to the target information;
obtaining the corresponding target video according to the target video identifier;
extracting the corresponding target video segment from the corresponding target video according to the target segment range; and
splicing the extracted target video segments to obtain a synthesized video.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining target information;
obtaining video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers;
looking up, in the video analysis data, a target video identifier and a target segment range corresponding to the target information;
obtaining the corresponding target video according to the target video identifier;
extracting the corresponding target video segment from the corresponding target video according to the target segment range; and
splicing the extracted target video segments to obtain a synthesized video.
With the above video synthesis method, apparatus, computer device and storage medium, target information is obtained and video analysis data is obtained; since the video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers, the target video identifiers and target segment ranges corresponding to the target information can be looked up in the video analysis data according to the matching relationships. The corresponding target videos are obtained according to the target video identifiers, the corresponding target video segments are extracted from the corresponding target videos according to the target segment ranges, and the extracted target video segments are spliced to obtain a synthesized video. With this method, each target video segment is obtained automatically from the target information and then spliced automatically; the whole process needs no manual participation, production is simple, low-cost, and saves time and effort; and because the target video segments are determined automatically from the target information, manual-operation errors are avoided and extraction accuracy is improved.
A video synthesis method, the method comprising:
obtaining target information;
obtaining a target video;
looking up, in the target video according to the target information, a target segment range corresponding to the target information;
extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and
splicing the extracted target video segments to obtain a synthesized video.
A video synthesis apparatus, the apparatus comprising:
a second information obtaining module, configured to obtain target information;
a second video obtaining module, configured to obtain a target video;
a determining module, configured to determine, in the target video according to the target information, a target segment range corresponding to the target information;
a second extraction module, configured to extract, from the target video according to the target segment range, a target video segment corresponding to the target information; and
a second splicing module, configured to splice the extracted target video segments to obtain a synthesized video.
In one embodiment, the target information is used to determine a corresponding target pronunciation. The determining module is further configured to obtain the target segment ranges corresponding to the target pronunciation in the target video; the second splicing module is further configured to splice the extracted target video segments corresponding to the target pronunciation to obtain a synthesized video in which the target pronunciation repeats.
In one embodiment, the target information is used to determine a corresponding target picture. The determining module is further configured to obtain the target segment ranges corresponding to the target picture in the target video; the second splicing module is further configured to splice the extracted target video segments corresponding to the target picture to obtain a synthesized video in which the target picture repeats.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining target information;
obtaining a target video;
looking up, in the target video according to the target information, a target segment range corresponding to the target information;
extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and
splicing the extracted target video segments to obtain a synthesized video.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining target information;
obtaining a target video;
looking up, in the target video according to the target information, a target segment range corresponding to the target information;
extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and
splicing the extracted target video segments to obtain a synthesized video.
With the above video synthesis method, apparatus, computer device and storage medium, target information and a target video are obtained; the target segment range corresponding to the target information is looked up in the target video according to the target information, the target video segment corresponding to the target information is extracted from the target video according to the target segment range, and the extracted target video segments are spliced to obtain a synthesized video. With this method, the target segment ranges corresponding to the target information are found automatically in the obtained target video, the segments are extracted automatically according to those ranges, and the extracted segments are spliced; the whole process needs no manual participation, production is simple, low-cost, and saves time and effort; and because the target video segments are determined automatically from the target information, manual-operation errors are avoided and extraction accuracy is improved.
Brief description of the drawings
Fig. 1 is a diagram of the application environment of the video synthesis method in one embodiment;
Fig. 2 is a flowchart of the video synthesis method in one embodiment;
Fig. 3 is a schematic diagram of looking up the video identifiers and segment ranges corresponding to target information in one embodiment;
Fig. 4 is a flowchart of determining the matching relationships in one embodiment;
Fig. 5 is a schematic flow diagram of the video synthesis method in one embodiment;
Fig. 6 is a schematic diagram of obtaining candidate text information and segment ranges by speech recognition in one embodiment;
Fig. 7 is a schematic diagram of obtaining candidate objects and segment ranges by object recognition in one embodiment;
Fig. 8 is a schematic diagram of recognizing speech utterance objects and segment ranges in one embodiment;
Fig. 9 is a flowchart of obtaining target information in one embodiment;
Fig. 10 is a flowchart of splicing to obtain a synthesized video in one embodiment;
Fig. 11 is a flowchart of the video synthesis method in another embodiment;
Fig. 12 is a flowchart of the video synthesis method in yet another embodiment;
Fig. 13 is a structural block diagram of the video synthesis system in one embodiment;
Fig. 14 is a structural block diagram of the video synthesis apparatus in one embodiment;
Fig. 15 is a structural block diagram of the video synthesis apparatus in another embodiment;
Fig. 16 is a structural block diagram of the computer device in one embodiment.
Detailed description of the embodiments
In order to make the objects, technical solutions and advantages of this application clearer, the application is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
Fig. 1 shows the application environment of the video synthesis method in one embodiment. Referring to Fig. 1, the video synthesis method is applied in a video synthesis system. The system includes a terminal 110 and a server 120 connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as an independent server or as a cluster of multiple servers. The terminal 110 sends a video synthesis request containing target information to the server 120. After the server 120 obtains the target information, it obtains video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers; looks up, in the video analysis data, the target video identifiers and target segment ranges corresponding to the target information; obtains the corresponding target videos according to the target video identifiers; extracts the corresponding target video segments from the corresponding target videos according to the target segment ranges; splices the extracted target video segments to obtain a synthesized video; and sends the synthesized video to the terminal 110.
As shown in Fig. 2, in one embodiment, a video synthesis method is provided. The method can be applied to a server and can also be applied to a terminal; this embodiment is described as applied to a terminal. The video synthesis method specifically includes the following steps:
Step S202: obtain target information.
Here, target information is the information used as a query condition. The target information may be a word, the pinyin of a word such as "zhibo", or the name of a person, etc. The target information may be obtained by receiving information input by a user, or it may be information selected automatically; for example, a piece of text information can be recognized and the word with the highest frequency of occurrence in the text used as the target information.
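A hedged sketch of the automatic option, picking the most frequent word as the target information (the whitespace tokenizer and stop-word set are placeholders; the patent does not specify them):

```python
from collections import Counter

def pick_target_info(text, stop_words=frozenset()):
    """Use the word with the highest frequency in the text as the target information."""
    words = [w for w in text.split() if w and w not in stop_words]
    return Counter(words).most_common(1)[0][0]

print(pick_target_info("master asks and master answers master"))  # -> "master"
```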
Step S204: obtain video analysis data, the video analysis data recording the matching relationships among candidate information, candidate segment ranges and candidate video identifiers.
Here, video analysis data is data obtained by performing recognition analysis on the audio data or video data of videos. It records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers. Candidate information is pre-stored information used as a lookup criterion; it may be a word, the pinyin of a word, or the name of a person, etc. A video identifier uniquely identifies one video. A candidate segment range is the position, in the candidate video, of the video clip corresponding to the candidate information. A video clip range may be expressed by the start time point and end time point of the clip, or by the starting video frame identifier and ending video frame identifier of the clip; a video frame identifier is the number of a video frame and uniquely identifies one frame. Specifically, recognition analysis is performed on candidate videos in advance to obtain candidate information, the video clip range corresponding to each candidate information in the candidate video is obtained, and the candidate information, the video clip range and the corresponding candidate video identifier are stored in association, i.e. the matching relationship of the three is stored for convenient later lookup.
In one embodiment, before the video analysis data is obtained, the method further includes: obtaining a selected video range, and obtaining, according to the selected video range, the video analysis data corresponding to the videos in that range, so that target video segments corresponding to the target information are looked up only within the selected video range. Of course, all videos in the database may also be used as the lookup range. Since there may be many candidate videos in the database, setting a video retrieval range helps find the target videos corresponding to the target information and the corresponding target segment ranges more quickly.
Step S206: look up, in the video analysis data, the target video identifiers and target segment ranges corresponding to the target information.
Here, the video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers, so the target video identifiers corresponding to the target information, and the target segment ranges of the target information in those target videos, can be found from the target information. A target segment range may be expressed by the start time point and end time point of the video clip, or by the starting and ending video frame identifiers of the clip. There may be one or multiple target video identifiers corresponding to the target information, and there may be one or multiple target segment ranges corresponding to the target information within one target video.
Step S208: obtain the corresponding target video according to the target video identifier.
Here, since a video identifier uniquely identifies one video, once the target video identifier has been obtained, the target video corresponding to the target video identifier can be obtained, ready for subsequent clipping and splicing.
Step S210: extract the corresponding target video segment from the corresponding target video according to the target segment range.
Here, the corresponding target video segment is extracted from the corresponding target video according to the target segment range of the target information in that video. For example, if the target segment range is the 10th millisecond (ms) to the 20th millisecond, the video clip between 10 ms and 20 ms is extracted from the target video.
Step S212: splice the extracted target video segments to obtain a synthesized video.
Here, a synthesized video is a video obtained by splicing multiple video clips together. After each target video segment corresponding to the target information has been extracted, the segments are spliced to obtain the synthesized video. The splicing order may be random, may follow the order in which the segments were extracted, or may follow any other custom order; for example, the segments may be spliced in order of their lengths.
As shown in Fig. 3, in one embodiment, assume the target information is "master" and the videos found to contain "master" are the three videos A, B and C. In video A the target segment range corresponding to "master" is 10 ms (milliseconds) to 20 ms; in video B it is 10 ms to 25 ms; in video C there are two target segment ranges, 20 ms to 35 ms and 60 ms to 75 ms. The video clips corresponding to "master" are extracted from the respective videos according to these target segment ranges and then spliced to obtain a synthesized video S; as shown in Fig. 3, after the clips in each video have been extracted, splicing yields a 55 ms synthesized video S.
With the above video synthesis method, target information is obtained and video analysis data is obtained; since the video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers, the target video identifiers and target segment ranges corresponding to the target information can be looked up in the video analysis data according to the matching relationships, the corresponding target videos obtained according to the target video identifiers, the corresponding target video segments extracted from them according to the target segment ranges, and the extracted segments spliced to obtain a synthesized video. With this method, each target video segment is obtained automatically from the target information and spliced automatically; the whole process needs no manual participation, production is simple, low-cost, and saves time and effort; and because the target video segments are determined automatically from the target information, manual-operation errors are avoided and extraction accuracy is improved.
In one embodiment, looking up the target video identifier and target segment range corresponding to the target information in the video analysis data comprises: the target information being used to determine a corresponding target pronunciation, looking up, in the video analysis data, the identifiers of multiple target videos containing the target pronunciation, and obtaining the target segment range corresponding to the target pronunciation in each target video. Splicing the extracted target video segments to obtain the synthesized video comprises: splicing the extracted target video segments corresponding to the target pronunciation to obtain a synthesized video in which the target pronunciation repeats.
Here, the target pronunciation is the pronunciation corresponding to the target information. For example, if the target information is "live streaming", the target pronunciation is the pronunciation of those words. The video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers. In one embodiment, each candidate information corresponds to a candidate pronunciation, and the candidate segment range corresponding to the candidate information is the segment range of that candidate pronunciation in the video. Looking up the multiple target video identifiers corresponding to the target information in the video analysis data therefore means looking up the identifiers of the multiple target videos containing the target pronunciation; the target segment range corresponding to the target pronunciation in each target video is then obtained, and the extracted target video segments corresponding to the target pronunciation are spliced and combined to obtain a synthesized video that repeats the target pronunciation.
In one embodiment, looking up the target video identifier and target segment range corresponding to the target information in the video analysis data comprises: the target information being used to determine a corresponding target picture, looking up, in the video analysis data, the identifiers of multiple target videos containing the target picture, and obtaining the target segment range corresponding to the target picture in each target video. Splicing the extracted target video segments to obtain the synthesized video comprises: splicing the extracted target video segments corresponding to the target picture to obtain a synthesized video in which the target picture repeats.
Here, the target picture is the picture corresponding to the target information. For example, if the target information is "dog", the target picture is a picture containing a dog; if the target information is "Sun Wukong", the target picture is a picture containing Sun Wukong. The video analysis data records the matching relationships among candidate information, candidate segment ranges and candidate video identifiers. In one embodiment, each candidate information corresponds to a candidate picture, and the candidate segment range corresponding to the candidate information is the segment range of that candidate picture in the video. Looking up the multiple target video identifiers corresponding to the target information in the video analysis data therefore means looking up the identifiers of the multiple target videos containing the target picture; the target segment range corresponding to the target picture in each target video is then obtained, each target video segment is extracted according to its target segment range, and the extracted target video segments are spliced to obtain a synthesized video that repeats the target picture.
In one embodiment, before the target information is obtained, the method further includes: determining the matching relationships among candidate information, candidate segment ranges and candidate video identifiers. As shown in Fig. 4, this specifically includes the following steps:
Step S214: obtain candidate videos.
Here, a candidate video is a stored video used for lookup; the required target video can be found among the candidate videos according to the target information.
Step S216 is at least one of steps S216A, S216B and S216C.
Step S216A: perform speech recognition on the audio corresponding to the candidate video to obtain multiple corresponding candidate text informations, and use each candidate text information as candidate information.
Here, a video includes audio and video images: the audio is the sound in the video, and the video images are the pictures in the video. Speech recognition means recognizing the content of the sound as text. Candidate text information is text information obtained by speech recognition; it may be a recognized word or the pinyin of a word, and it may contain a single word or several words. In some embodiments, speech recognition on the audio directly yields one long text containing many words; word segmentation is applied to this long text to obtain multiple candidate words, and each candidate word is used as candidate text information.
In one embodiment, the principle of speech recognition is as follows. Speech recognition has three important modules: the acoustic model, the dictionary and the language model. The basic flow is this: the audio stream is divided into segments; each segment is first matched by the acoustic model to find which sounds it corresponds to, forming hypotheses about certain sounds; by looking up the dictionary, these sounds can map to different words; then, going from one word to the next, the collocation relationships between different words in the language model are used; by repeatedly scoring candidate paths with the acoustic model and the language model, and finishing when all segments of the audio stream have been compared, the highest-scoring result is selected as the recognition result.
Step S216B: perform image recognition on the image frames corresponding to the candidate video to obtain multiple corresponding candidate objects, and use each candidate object as candidate information.
Here, an image frame is a video image: a video is composed of frame-by-frame video images, and one image frame corresponds to one video image. Multiple candidate objects are obtained by performing image recognition on the image frames in the candidate video; a candidate object may be a recognized person, a recognized animal, or another recognized object. Candidate objects can be recognized with object recognition models, which are models obtained through learning and training and are used to recognize the target objects contained in an image. For example, to recognize whether an image contains "Sun Wukong", model training is first performed in advance (for example, training with a convolutional neural network model) to extract image features of "Sun Wukong", forming a recognition model for this character; the model is then used to recognize whether the video pictures contain Sun Wukong. In one embodiment, multiple object recognition models are preset to recognize different objects respectively, and the recognized candidate objects are used as candidate information.
Step S216C: perform speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, and use each candidate speech utterance object as candidate information.
Here, a speech utterance object is the producer of a sound; recognizing the speech utterance object means recognizing who uttered the speech. The audio in the candidate video is recognized to obtain the speech utterance object corresponding to each segment of audio data. For example, if a dialogue among four characters appears in a video, the character uttering each sentence is recognized; each recognized utterer is a recognized speech utterance object, and that speech utterance object is used as candidate information. For instance, if the recognized speech utterance object is "Donald Trump", then "Donald Trump" is used as candidate information.
Step S218: obtain the candidate segment range corresponding to each candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier.
Here, the candidate segment range is the position range, in the candidate video, of the video clip corresponding to the candidate information. When the candidate information is candidate text information, the segment corresponding to the pronunciation of the candidate text information in the candidate video is obtained. When the candidate information is a candidate object, the candidate segment range corresponding to the recognized candidate object in the candidate video is obtained, i.e. the segment range in which the candidate object appears in the candidate video. When the candidate information is a candidate speech utterance object, the candidate segment range corresponding to each candidate speech utterance object in the candidate video is obtained, i.e. the range in which the audio data of that speech utterance object occurs in the candidate video.
The candidate segment range is expressed with a segment start identifier and a segment end identifier. In one embodiment, the start time point is used as the segment start identifier and the end time point as the segment end identifier; for example, if the video clip corresponding to the candidate information spans the period from the 14th second to the 16th second of the candidate video, the video between the 14th and the 16th second is the corresponding clip. In another embodiment, the starting video frame identifier of the clip is used as the segment start identifier and the ending video frame identifier as the segment end identifier; for example, if the video clip corresponding to the candidate information spans the 20th to the 50th frame of the candidate video, the video between the 20th and the 50th frame is the corresponding clip.
Step S220: store the candidate information, the candidate segment range and the video identifier of the candidate video in association to obtain the matching relationship.
Here, for ease of later lookup, the candidate information, the candidate segment range and the video identifier of the candidate video are stored in association, giving the matching relationship of the three. The concrete storage layout can be customized. In one embodiment, candidate information, candidate video identifier and candidate segment range are stored in one-to-one correspondence: assuming video A contains multiple candidate informations, each candidate information is associated one-to-one with that candidate video identifier and the corresponding segment range. Concretely, the matching relationship of the three can be stored in a database with the following storage structure, as shown in Table 1.
Table 1

| Field name | Type         | Description                  |
|------------|--------------|------------------------------|
| id         | int          | Number, primary key          |
| pronounce  | varchar(256) | Candidate information        |
| duration   | int          | Pronunciation duration (ms)  |
| fileName   | varchar(256) | Video identifier             |
| beginTime  | int          | Start time (ms)              |
| endTime    | int          | End time (ms)                |
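A hedged sketch of this storage structure with Python's built-in sqlite3 module (SQLite and the exact DDL are illustrative choices; the patent does not name a database engine, only the Table 1 fields):

```python
import sqlite3

conn = sqlite3.connect("video_analysis.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS analysis (
        id        INTEGER PRIMARY KEY,  -- number, primary key
        pronounce VARCHAR(256),         -- candidate information
        duration  INTEGER,              -- pronunciation duration (ms)
        fileName  VARCHAR(256),         -- video identifier
        beginTime INTEGER,              -- start time (ms)
        endTime   INTEGER               -- end time (ms)
    )""")

# the lookup of step S206: every video and segment range matching the target info
rows = conn.execute("SELECT fileName, beginTime, endTime FROM analysis "
                    "WHERE pronounce = ?", ("master",)).fetchall()
```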
In another embodiment, to allow fast later lookup according to the target information, the same candidate information is stored in association with all of its corresponding candidate video identifiers, while each candidate video identifier is in turn associated with the corresponding candidate segment ranges. For example, assume the candidate video identifiers corresponding to candidate information S are the three identifiers A, B and C; candidate information S has two corresponding candidate segment ranges in video A, namely A1 and A2; three in video B, namely B1, B2 and B3; and one in video C, namely C1. The three can then be associated in a two-level form: candidate information S is associated directly with videos A, B and C, and videos A, B and C are each associated with their corresponding video clip ranges.
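A minimal sketch of the two-level association as a nested in-memory index (an illustrative structure with made-up millisecond values, not the patent's storage format):

```python
# level 1: candidate information -> video identifiers
# level 2: video identifier -> list of (begin_ms, end_ms) clip ranges
index = {
    "S": {
        "A": [(10, 20), (30, 40)],            # ranges A1, A2
        "B": [(10, 25), (50, 60), (80, 95)],  # ranges B1, B2, B3
        "C": [(20, 35)],                      # range C1
    },
}

def lookup(target_info):
    """First level finds the videos, second level finds their segment ranges."""
    return index.get(target_info, {})

print(lookup("S")["A"])  # -> [(10, 20), (30, 40)]
```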
Fig. 5 shows a schematic flow diagram of video synthesis in one embodiment, consisting of two parts. The first part (the right-hand side of the figure) obtains candidate videos, analyzes the audio or video of each candidate video to obtain multiple candidate informations, obtains the candidate segment range corresponding to each candidate information, and stores the matching relationships among the candidate information, the candidate segment ranges and the corresponding candidate video identifiers in the database. The second part (the left-hand side of the figure) obtains target information, looks up in the database the target video identifiers corresponding to the target information and the target segment range corresponding to each target video identifier, extracts the corresponding target video segments from the corresponding target videos according to the target segment ranges, and splices the extracted target video segments to obtain a synthesized video.
In one embodiment, the video synthesis method is applied in the scenario of guichu videos. A guichu video is a fairly common type of original video on video websites: highly synchronized, rapidly repeating material is cut to the rhythm of BGM (background music with a fairly fast beat) so that the twitch-like repetition achieves a brainwashing or comedic effect; in other words, a video (or audio) is clipped and re-synthesized into a highly rhythmic, highly synchronized piece built from picture (or sound) fragments repeated at very high frequency. Since a guichu video usually centers on the picture of some word being repeated over and over, it can be synthesized quickly by taking the core word of the guichu video as the target information, looking up in the database the target video identifiers corresponding to the target information and the target segment range corresponding to each target video identifier, extracting the corresponding target video segments from the corresponding target videos according to the target segment ranges, and splicing the extracted segments to synthesize the guichu video. In another embodiment, after the guichu video has been synthesized, background music can additionally be added to it to build a fast-paced comedic feel.
In one embodiment, the target information is target text information. Step S216A, performing speech recognition on the audio corresponding to the candidate video to obtain multiple corresponding candidate text informations, includes: performing speech recognition on the audio in the candidate video to obtain a speech text, and performing word segmentation on the speech text to obtain multiple candidate text informations. Step S218, obtaining the candidate segment range corresponding to each candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining the current candidate text information; obtaining the audio period corresponding to the current candidate text information, the audio period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
Here, the target text information may be a target word or the pinyin of a target word. Speech recognition on the audio in the candidate video yields a speech text containing multiple words, and word segmentation on the speech text yields multiple candidate text informations. In one embodiment, before word segmentation the recognized text is preprocessed, the preprocessing including the removal of stop words, i.e. meaningless function words such as the Chinese particles "de" (的, 地).
The current candidate text information is the text information currently being processed. The audio period corresponding to it, i.e. the audio period of its pronunciation in the candidate video, is obtained; the period includes a pronunciation start time point, used as the segment start identifier, and a pronunciation end time point, used as the segment end identifier. After the audio period of the current candidate text information has been determined, the next candidate text information is taken as the current candidate text information, and the step of obtaining the corresponding audio period is entered again. For example, as shown in Fig. 6, assume speech recognition on the audio of a video M yields five candidate words, word 1 to word 5, with the following audio periods: word 1 from 5 ms to 15 ms, word 2 from 20 ms to 25 ms, word 3 from 35 ms to 45 ms, word 4 from 55 ms to 70 ms, and word 5 from 75 ms to 90 ms. Each candidate word is subsequently stored in association with the corresponding video identifier and the corresponding audio period.
In one embodiment, the target information is the target pinyin corresponding to a target word. Step S216A, performing speech recognition on the audio corresponding to the candidate video to obtain multiple corresponding candidate text informations, includes: performing speech recognition on the audio in the candidate video to obtain a pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words. Step S218, obtaining the candidate segment range corresponding to each candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining the current candidate pinyin; obtaining the audio period corresponding to the audio data of the current candidate pinyin, the audio period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
Here, speech recognition on the audio in the candidate video yields a pinyin text containing the pinyin of multiple words, and word segmentation on the pinyin text yields the candidate pinyin corresponding to each of multiple candidate words. The current candidate pinyin is the pinyin currently being processed; the audio period of its audio data, i.e. of the audio corresponding to the pronunciation of the current candidate pinyin, is obtained. After the audio period of the current candidate pinyin has been determined, the candidate pinyin of the next candidate word is taken as the current candidate pinyin, and the step of obtaining the corresponding audio period is entered again.
By using the target pinyin as the target information, the video clips corresponding to all words that share the same pinyin, even though they are different words, can be extracted, which helps gather more clips. For example, for the target pinyin "shifu" the qualifying words include "master worker", "master", "restaurant", "actually paid", "poetry", and so on, all of which are pronounced "shifu" in Chinese. Using pinyin as the target information amounts to performing a fuzzy query: the video clips of all words with the same pinyin are extracted, which helps extract more qualifying video clips.
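A minimal sketch of this fuzzy matching with the third-party pypinyin package, deriving toneless pinyin from recognized words (an illustrative route; the patent recognizes pinyin directly from the audio):

```python
from pypinyin import lazy_pinyin  # pip install pypinyin

def matches_pinyin(word, target_pinyin):
    """True if the word's toneless pinyin equals the target pinyin."""
    return "".join(lazy_pinyin(word)) == target_pinyin

# different words, identical pinyin "shifu": all qualify for extraction
for w in ["师傅", "师父", "实付"]:
    print(w, matches_pinyin(w, "shifu"))  # all True
```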
In one embodiment, the target information further includes the tones corresponding to the target pinyin of the target word. Step S216A, performing speech recognition on the audio in the candidate video to obtain the candidate pinyin corresponding to multiple candidate words, includes: performing speech recognition on the audio in the candidate video to obtain the pinyin text and the tone of each pinyin in the pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words together with the tones corresponding to each candidate pinyin. Step S218, obtaining the candidate segment range corresponding to each candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining the current candidate pinyin and the tones corresponding to the current candidate pinyin; obtaining the corresponding audio period according to the current candidate pinyin and its tones, the audio period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
Here, in order to extract precisely the video clips corresponding to words with the same pronunciation, the target information contains not only the target pinyin but also the tones corresponding to the target pinyin. When speech recognition is performed on the audio in the candidate video, besides recognizing the pinyin text, the tone of each pinyin in the pinyin text must also be recognized. Word segmentation on the pinyin text then yields the candidate pinyin corresponding to each of multiple candidate words together with the corresponding tones. The current candidate pinyin and its corresponding tones are obtained, and the corresponding audio period is obtained according to them.
By using the target pinyin together with its corresponding tones as the target information, the video clips corresponding to words with both the same pinyin and the same tones can be extracted. For example, for the target pinyin "shifu" with the corresponding tones being the first tone and the falling (fourth) tone, the qualifying words include "master worker", "master", "poetry", etc. Including the tones in the target information helps extract the video clips of identically pronounced words; compared with using the target word directly as the target information, it helps extract more qualifying video clips.
In one embodiment, the target information is a target object. Step S216A of performing image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects includes: recognizing the objects contained in the image frames corresponding to the candidate video to obtain the corresponding multiple candidate objects. Step S218 of obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate object; obtaining the video frames in which the current candidate object appears in the candidate video; taking multiple consecutive video frames containing the current candidate object as one video clip, the video clip including an identifier of a start video frame and an identifier of an end video frame; and using the identifier of the start video frame as the segment start identifier and the identifier of the end video frame as the segment end identifier.
Here the target information is a target object. The target object may be a specified person, a specified animal, or another specified object. Multiple candidate objects are obtained by recognizing the objects contained in the image frames. In one embodiment, the candidate objects are recognized using trained object recognition models; an object recognition model for recognizing a given object is obtained by training on features extracted for that object. For example, assuming a candidate object is "dog", a recognition model for "dog" is trained on features extracted from images of "dog", and that model is then used to determine whether a video picture contains a "dog".
Since one video usually contains multiple objects, multiple object recognition models need to be preset so that each object in the video can be recognized, and the video frames in which each object appears can then be identified. As shown in Fig. 7, in one embodiment it is assumed that a video N contains 4 persons in total, namely person A, person B, person C, and person D, and these 4 persons are the candidate objects to be recognized. The preset object recognition models are used to recognize the 4 persons in the video and to identify the video frames in which each person appears. As shown in Fig. 7, the recognition result is that person A appears between the 10th frame and the 20th frame, person B appears between the 25th frame and the 40th frame, person C appears between the 45th frame and the 60th frame, and person D appears between the 70th frame and the 85th frame. The recognized persons are subsequently stored in association with the corresponding video identifier and the corresponding segment ranges.
The current candidate object refers to the object currently to be processed. The video frames in which the current candidate object appears in the candidate video are obtained, multiple consecutive video frames containing the current candidate object are taken as one video clip, and the start video frame identifier of that clip is used as the segment start identifier while its end video frame identifier is used as the segment end identifier. The recognized candidate objects are stored together with the candidate video identifiers and the corresponding video clips, so that the target video identifier and target fragment range corresponding to a target object can later be looked up according to the target object, the target video segments containing the target object can be extracted from the target videos according to the target fragment ranges, and the extracted target video segments can be synthesized into a synthetic video corresponding to the target object.
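The grouping of consecutive video frames into clips can be illustrated with the following minimal sketch; the per-frame presence flags are assumed to come from an object recognition model such as the ones described above.

```python
# A sketch of turning per-frame detections into candidate segment ranges.
def frames_to_segments(flags: list[bool]) -> list[tuple[int, int]]:
    """Group runs of consecutive frames containing the object into
    (start_frame, end_frame) pairs, i.e. segment start/end identifiers."""
    segments, start = [], None
    for i, present in enumerate(flags):
        if present and start is None:
            start = i                        # a run of frames begins
        elif not present and start is not None:
            segments.append((start, i - 1))  # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(flags) - 1))
    return segments

# Person A detected in frames 10..20 of a 100-frame video, as in Fig. 7:
flags = [10 <= i <= 20 for i in range(100)]
print(frames_to_segments(flags))  # [(10, 20)]
```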
In one embodiment, the target information is a speech utterance object. Step S216A of performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate utterance objects includes: performing speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects. Step S218 of obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate speech utterance object; obtaining the audio time period corresponding to the current candidate speech utterance object, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
Here the target information is a speech utterance object, that is, the sender of a voice. Speech utterance object recognition is performed on the audio in the candidate video to identify the multiple speech utterance objects that appear in the candidate video, i.e. the candidate speech utterance objects, and the audio time period corresponding to the audio data of each speech utterance object is obtained. An audio time period here may be a single period or multiple periods; if a person speaks several times at different positions in the candidate video, there are correspondingly multiple audio time periods. Each audio time period records the time point at which the voice starts and the time point at which it ends.
A speech utterance object is recognized by a pre-established speech utterance object recognition model. Since different persons have different speech utterance features, the speech utterance object corresponding to a piece of audio data can be identified by learning the speech utterance features of different persons. For example, as shown in Fig. 8, assume that four persons converse in a video K, namely "Lucy", "Joy", "Coco", and "Mary". The pre-established speech utterance object recognition model for each person is used to recognize the audio in the video, identify the audio data corresponding to each person, and obtain the audio time periods of the respective audio data. As shown in Fig. 8, the audio time period corresponding to "Lucy" is 20 ms to 30 ms; "Joy" has two audio time periods, 35 ms to 45 ms and 60 ms to 75 ms; the audio time period corresponding to "Coco" is 50 ms to 60 ms; and the audio time period corresponding to "Mary" is 80 ms to 90 ms.
As shown in Fig. 9, in one embodiment, obtaining the target information includes:
Step S202A: obtain an uploaded video.
In order to obtain the target information automatically from an uploaded video, the terminal first obtains the uploaded video.
Step S202B: perform speech recognition on the audio in the uploaded video to obtain a selection text, and count the frequency at which each word appears in the selection text.
A video contains audio, and speech recognition is performed on the audio in the uploaded video; the function of speech recognition is to convert the content contained in the audio into text. After the selection text is recognized, word segmentation is performed on the selection text to obtain individual words, and the frequency at which each word appears in the selection text is counted. Frequency refers to the number of occurrences within a unit time; the higher the frequency of a word, the more characteristic the word is.
Step S202C: determine a target word according to the frequency at which each word appears in the selection text, and use the target word as the target information.
After the frequency at which each word appears in the selection text is obtained, the target word is determined according to these frequencies: the word with the highest frequency may be used directly as the target word, or all words whose frequency exceeds a preset frequency may be used as target words, in which case there may be multiple target words. The determined target word is the target information.
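A minimal sketch of steps S202B and S202C follows, assuming the word list has already been produced by speech recognition and word segmentation (a package such as jieba could play the segmentation role); the words and the threshold are illustrative.

```python
# A sketch of picking the target word(s) by frequency.
from collections import Counter

def pick_target_words(words: list[str], min_count: int = 3) -> list[str]:
    freq = Counter(words)
    top_word, _ = freq.most_common(1)[0]
    # All words at or above the preset threshold; otherwise fall back to
    # the single most frequent word.
    frequent = [w for w, c in freq.items() if c >= min_count]
    return frequent or [top_word]

words = ["live", "stream", "live", "game", "live", "stream"]
print(pick_target_words(words))  # ['live']: only "live" appears 3 times
```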
As shown in Fig. 10, in one embodiment, step 212 of splicing the extracted target video segments to obtain the synthetic video includes:
Step S212A: obtain the duration of each extracted target video segment, and calculate the total duration corresponding to the multiple target video segments.
Before the video is synthesized, the duration of each target video segment is calculated first, and the total duration of all target video segments is then calculated. For example, assume that 3 segments are extracted, the 1st segment lasting 3 seconds, the second 8 seconds, and the third 10 seconds; the corresponding total duration is then 3 + 8 + 10 = 21 seconds.
Step S212B: judge whether the total duration is less than a preset duration; if so, proceed to step S212C, and if not, proceed to step S212D.
The total duration corresponding to the target video segments is calculated in order to judge the duration of the synthetic video; if the total duration is too short, it needs to be extended. Specifically, if the total duration is less than the preset duration, the extracted target video segments are replicated to obtain replicated target video segments. If the total duration is not less than the preset duration, no replication is needed.
Step S212C: replicate the extracted target video segments to obtain replicated target video segments, and splice the replicated target video segments with the extracted original target video segments to obtain the synthetic video.
A replicated target video segment is a copy of a target video segment. An original target video segment is a target video segment that was extracted directly; it is called the original target video segment to distinguish it from the replicated target video segments. The replicated target video segments obtained by copying are spliced with the original target video segments to obtain the synthetic video.
Step S212D: splice the target video segments to obtain the synthetic video.
When the total duration is not less than the preset duration, the extracted target video segments are spliced directly to obtain the synthetic video.
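Steps S212A to S212D can be sketched as follows; durations are in seconds, clips are represented abstractly, and the actual concatenation is left to a video-editing library.

```python
# A sketch of the duration check and replication from steps S212A-S212D.
def assemble(clips: list[dict], preset_duration: float) -> list[dict]:
    total = sum(c["duration"] for c in clips)  # e.g. 3 + 8 + 10 = 21 seconds
    ordered = list(clips)                      # original target video segments
    while total < preset_duration:             # total too short: replicate
        for clip in clips:
            ordered.append(clip)               # a replicated target segment
            total += clip["duration"]
            if total >= preset_duration:
                break
    return ordered                             # splice in this order

clips = [{"duration": 3}, {"duration": 8}, {"duration": 10}]
print(len(assemble(clips, 30)))  # 5: two replicated segments were appended
```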
In one embodiment, looking up the target video identifier and target fragment range corresponding to the target information from the video analysis data includes: when the target information contains multiple target words, looking up, from the video analysis data, the target video identifier and corresponding target fragment range for each target word; and splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted target video segments corresponding to each target word to obtain the synthetic video.
The target information may contain multiple target words, and a target word may exist in the form of text or in the form of pinyin. When the target information contains multiple target words, the target video identifier and corresponding target fragment range for each target word are looked up from the video analysis data, the target video segments corresponding to each target word are extracted according to the target fragment ranges, and the extracted target video segments are then spliced to obtain the synthetic video.
The splicing order can be customized. The target video segments corresponding to the same target word may preferentially be spliced together and then joined with the target video segments corresponding to the other target words; alternatively, the target video segments corresponding to different target words may be interleaved, and other orders, such as random splicing, may of course also be used. For example, assume that the target information contains three target words S1, S2, and S3; S1 corresponds to 3 target video segments, namely A1, B1, and C1; S2 corresponds to 4 target video segments, namely A2, B2, C2, and D2; and S3 corresponds to 3 target video segments, namely A3, B3, and C3. The splicing order may then be A1-B1-C1-A2-B2-C2-D2-A3-B3-C3 or A1-A2-A3-B1-B2-B3-C1-C2-C3-D2, and other splicing orders may of course also be used; both orders are sketched in the code below.
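A sketch of the two splicing orders from the example, using the clip names above; zip_longest performs the round-robin interleaving, and the gaps (S1 and S3 have no fourth clip) are skipped.

```python
# A sketch of the grouped and interleaved splicing orders.
from itertools import chain, zip_longest

groups = [["A1", "B1", "C1"], ["A2", "B2", "C2", "D2"], ["A3", "B3", "C3"]]

# Order 1: keep each target word's clips together.
grouped = list(chain.from_iterable(groups))
# Order 2: interleave across target words (round-robin), skipping gaps.
interleaved = [c for c in chain.from_iterable(zip_longest(*groups)) if c]

print(grouped)      # A1-B1-C1-A2-B2-C2-D2-A3-B3-C3
print(interleaved)  # A1-A2-A3-B1-B2-B3-C1-C2-C3-D2
```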
As shown in Fig. 11, in one embodiment a video synthesizing method is proposed, the method including:
Step S1101: obtain a candidate video.
Step S1102: perform speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information, and use each piece of candidate text information as candidate information.
Step S1103: perform image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects, and use each candidate object as candidate information.
Step S1104: perform speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, and use each candidate speech utterance object as candidate information.
Step S1105: obtain the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier.
Step S1106: store the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
Step S1107: obtain target information.
Step S1108: obtain video analysis data, the video analysis data recording the matching relationship among the candidate information, the candidate segment ranges, and the candidate video identifiers.
Step S1109: look up the target video identifier and target fragment range corresponding to the target information from the video analysis data.
Step S1110: obtain the corresponding target video according to the target video identifier.
Step S1111: extract the corresponding target video segment from the corresponding target video according to the target fragment range.
Step S1112: splice the extracted target video segments to obtain the synthetic video.
The above video synthesizing method performs recognition analysis on the candidate videos in advance, stores the relationships among the candidate information, the candidate video identifiers, and the candidate segment ranges, and then performs lookups against the stored relationships. In the following embodiment, a method is given that performs the recognition in real time to obtain the target video segments corresponding to the target information.
As shown in Fig. 12, in one embodiment a video synthesizing method is proposed, the method including:
Step S1202: obtain target information.
Target information refers to the information used as a query condition. The target information may be a word, the pinyin of a word, the name of a person, and so on. The target information may be obtained by receiving information input by a user, or it may be information screened out automatically; for example, a piece of text information may be analyzed and the word with the highest frequency of occurrence in the text may be used as the target information.
Step S1204: obtain a target video.
A target video refers to a specified video used to search for target video segments corresponding to the target information. There may be one target video or multiple target videos. The target video may be obtained by receiving an uploaded video, or it may be one of several videos selected from multiple provided videos.
Step S1206: determine, according to the target information, the target fragment range corresponding to the target information in the target video.
After the target information and the target video have been determined, the audio or the video images of the target video are analyzed to determine the target fragment range corresponding to the target information in the target video. In one embodiment, when speech recognition performed on the audio in the target video yields text information identical to the target information, the audio time period corresponding to the pronunciation of that text information is obtained, the audio time period including a pronunciation start time point and a pronunciation end time point, and the corresponding target fragment range is determined according to the audio time period. The target fragment range refers to the position of the target video segment corresponding to the target information within the target video. A target fragment range may be expressed as the start time point and end time point of a video clip, or as the start video frame identifier and end video frame identifier of a video clip. There may be one target fragment range corresponding to the target information in a target video, or there may be multiple.
Step S1208: extract the target video segment corresponding to the target information from the target video according to the target fragment range.
The corresponding target video segment is extracted from the corresponding target video according to the target fragment range of the target information within the target video. For example, assuming the target fragment range is from the 10th millisecond (ms) to the 20th millisecond (ms), the video clip between 10 ms and 20 ms is extracted from the target video.
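As a sketch of this extraction and the subsequent splicing, the following assumes the moviepy 1.x API (VideoFileClip.subclip and concatenate_videoclips); any video-editing library with sub-clip and concatenation operations would serve equally well, and the file names are illustrative.

```python
# A sketch of steps S1208/S1210, assuming the moviepy 1.x API.
from moviepy.editor import VideoFileClip, concatenate_videoclips

def extract_and_splice(path: str, ranges_ms: list[tuple[int, int]], out: str):
    video = VideoFileClip(path)
    # moviepy takes seconds; the fragment ranges here are in milliseconds.
    clips = [video.subclip(start / 1000, end / 1000) for start, end in ranges_ms]
    concatenate_videoclips(clips).write_videofile(out)

# e.g. the 10 ms .. 20 ms fragment range from the example above:
# extract_and_splice("target.mp4", [(10, 20)], "synthetic.mp4")
```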
Step S1210: splice the extracted target video segments to obtain the synthetic video.
A synthetic video refers to a video obtained by splicing multiple video clips. After the target video segments corresponding to the target information have been extracted, they are spliced to obtain the synthetic video. The splicing may be random, may follow the order in which the target video segments were extracted, or may of course follow another customized order; for example, the segments may be spliced in order of their lengths.
With the above video synthesizing method, the target information and the target video are obtained, the target fragment range corresponding to the target information is looked up in the target video according to the target information, the target video segments corresponding to the target information are then extracted from the target video according to the target fragment range, and the extracted target video segments are spliced to obtain the synthetic video. The method can automatically find, in the obtained target videos, the target fragment ranges corresponding to the target information, automatically perform the extraction according to the target fragment ranges, and then splice the extracted target video segments; the whole process requires no manual participation, and production is simple, low-cost, and saves both time and effort.
In one embodiment, determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: the target information being used to determine a corresponding target pronunciation, and obtaining the target fragment range corresponding to the target pronunciation in the target video; splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation.
A target pronunciation refers to the pronunciation corresponding to the target information. For example, if the target information is "live streaming", the target pronunciation is the pronunciation of the words "live streaming". The target fragment ranges corresponding to the target pronunciation are looked up in the given one or more target videos, the corresponding target video segments are extracted according to the target fragment ranges, and the extracted multiple target video segments corresponding to the target pronunciation are then spliced to obtain a synthetic video that repeats the target pronunciation.
In one embodiment, determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: the target information being used to determine a corresponding target picture, and obtaining the target fragment range corresponding to the target picture in the target video; splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
A target picture refers to the picture corresponding to the target information. For example, if the target information is "dog", the target picture is a picture containing a "dog"; if the target information is "Sun Wukong", the target picture is a picture containing "Sun Wukong". The target fragment ranges corresponding to the target picture are looked up in the given one or more target videos, and the corresponding target video segments are extracted according to the target fragment ranges, every video frame in each target video segment containing the target picture. The extracted multiple target video segments corresponding to the target picture are then spliced to obtain a synthetic video that repeats the target picture.
In one embodiment, the target information is target text information. Step S1206 of looking up the target fragment range corresponding to the target information in the target video according to the target information includes: performing speech recognition on the audio in the target video; when text information identical to the target text information is recognized, obtaining the start time point and end time point of the audio data of that text information within the target video; and determining the target fragment range corresponding to the target text information according to the start time point and the end time point.
Here the target information is target text information, and the target text information may be a target word or the pinyin of a target word. Speech recognition means recognizing the content in the audio as text; when text information identical to the target text information is recognized, the start time point and end time point of the corresponding audio data within the target video are obtained. The audio data refers to the data corresponding to the pronunciation of the text information. For example, assuming the target text information is "zhibo", the data corresponding to the pronunciation of "zhibo" in the target video is the audio data. The target fragment range corresponding to the target text information is determined according to the start time point and the end time point.
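Determining the fragment range from the recognized speech can be sketched as follows, assuming the recognizer yields word-level alignments as (word, start_ms, end_ms) triples, which many speech recognition toolkits expose, and reusing pypinyin so that the target text may be given either as a word or as its pinyin; the alignment data is illustrative.

```python
# A sketch of mapping word-level ASR alignments to target fragment ranges.
from pypinyin import lazy_pinyin

def find_fragment_ranges(alignments, target_text):
    ranges = []
    for word, start_ms, end_ms in alignments:
        # Match either the literal word or its pinyin rendering.
        if word == target_text or "".join(lazy_pinyin(word)) == target_text:
            ranges.append((start_ms, end_ms))
    return ranges

alignments = [("今天", 0, 400), ("直播", 400, 900), ("开始", 900, 1300)]
print(find_fragment_ranges(alignments, "zhibo"))  # [(400, 900)]
```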
In one embodiment, the target information is a target object. Looking up the target fragment range corresponding to the target information in the target video according to the target information includes: performing image recognition on the image frames in the target video to identify the video frames containing the target object; taking multiple consecutive video frames containing the target object as one video clip; and determining the target fragment range corresponding to the target object according to the identifier of the start video frame and the identifier of the end video frame of the video clip.
Here the target information is a target object, which may be a specified person, a specified animal, or another specified object. A video frame is a video image; a video is composed of frame-by-frame video frames, each video frame being one video image. The video frames containing the target object are identified by performing image recognition on the image frames in the target video. The target object in a video image can be recognized using a trained target object recognition model, the target object recognition model being used to recognize the target object contained in an image. In one embodiment, the target object recognition model is obtained by extracting the features of the target object and training a convolutional neural network model. The target object recognition model can be realized using existing techniques, and no limitation is placed here on how the target object recognition model is obtained. A target object may appear multiple times in a video, in which case multiple target fragment ranges are correspondingly obtained. The consecutive video frames containing the target object are taken as one video clip, and the corresponding target fragment range is then determined according to the start video frame identifier and the end video frame identifier of the video clip. For example, assuming the target object is "Sun Wukong" and "Sun Wukong" appears 5 separate times in the target video, there are correspondingly 5 target fragment ranges.
In one embodiment, the target information is a target speech utterance object. Looking up the target fragment range corresponding to the target information in the target video according to the target information includes: performing speech utterance object recognition on the audio in the target video to identify the audio fragments corresponding to the target utterance object; and determining the target fragment range according to the start time point and end time point corresponding to each audio fragment.
Here the target information is a target speech utterance object, that is, the sender of a voice. Speech utterance object recognition is performed on the audio in the target video to identify the audio data corresponding to the target speech utterance object, and the audio fragments corresponding to that audio data are obtained. An audio fragment here may be a single fragment or multiple fragments; for example, if the target speech utterance object speaks several times at different positions in the target video, there are correspondingly multiple audio fragments. An audio fragment includes a start time point and an end time point, and the target fragment range, i.e. the range corresponding to the target video segment, is determined according to the start time point and the end time point. The target speech utterance object is recognized by a pre-established target speech utterance object recognition model; since different persons have different speech utterance features, the audio data corresponding to the target utterance object can be identified by learning the speech utterance features of the target utterance object.
It should be understood that, although the steps in the flowcharts of Fig. 2 to Fig. 12 are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless expressly stated otherwise herein, there is no strict ordering restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in Fig. 2 to Fig. 12 may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential: they may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
As shown in Fig. 13, in one embodiment a video synthesizing device is proposed, the device including:
a first information obtaining module 1302, configured to obtain target information;
a matching relationship obtaining module 1304, configured to obtain video analysis data, the video analysis data recording the matching relationship among candidate information, candidate segment ranges, and candidate video identifiers;
a lookup module 1306, configured to look up the target video identifier and target fragment range corresponding to the target information from the video analysis data;
a first video obtaining module 1308, configured to obtain the corresponding target video according to the target video identifier;
a first extraction module 1310, configured to extract the corresponding target video segment from the corresponding target video according to the target fragment range; and
a first splicing module 1312, configured to splice the extracted target video segments to obtain a synthetic video.
In one embodiment, the target information is used to determine a corresponding target pronunciation; the lookup module 1306 is further configured to look up, from the video analysis data, the identifiers of the multiple target videos containing the target pronunciation and to obtain the target fragment range corresponding to the target pronunciation in each target video; and the first splicing module 1312 is further configured to splice the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation.
In one embodiment, the target information is used to determine a corresponding target picture; the lookup module 1306 is further configured to look up, from the video analysis data, the identifiers of the multiple target videos containing the target picture and to obtain the target fragment range corresponding to the target picture in each target video; and the first splicing module 1312 is further configured to splice the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
As shown in Fig. 14, in one embodiment, the above video synthesizing device further includes:
a candidate video obtaining module 1314, configured to obtain a candidate video;
a recognition module 1316, where the recognition module 1316 includes at least one of a first recognition module 1316A, a second recognition module 1316B, and a third recognition module 1316C:
the first recognition module 1316A, configured to perform speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information, and to use each piece of candidate text information as the candidate information; and/or
the second recognition module 1316B, configured to perform image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects, and to use each candidate object as the candidate information; and/or
the third recognition module 1316C, configured to perform speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, and to use each candidate speech utterance object as the candidate information;
a segment range obtaining module 1318, configured to obtain the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier; and
a storage module 1320, configured to store the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
In one embodiment, the target information is target text information; the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a speech text, and to perform word segmentation on the speech text to obtain multiple pieces of candidate text information; and the segment range obtaining module is further configured to obtain current candidate text information and to obtain the audio time period corresponding to the current candidate text information, the audio time period including a pronunciation start time point and a pronunciation end time point, and to use the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information is the target pinyin corresponding to a target word; the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a pinyin text, and to perform word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words; and the segment range obtaining module is further configured to obtain a current candidate pinyin and to obtain the audio time period corresponding to the current candidate pinyin, the audio time period including a pronunciation start time point and a pronunciation end time point, and to use the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information further includes the tone corresponding to the target pinyin of the target word; the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and to perform word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words and the tone corresponding to each candidate pinyin; and the segment range obtaining module is further configured to obtain a current candidate pinyin and its corresponding tone, to obtain the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period including a pronunciation start time point and a pronunciation end time point, and to use the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information is a target object; the second recognition module is further configured to recognize the objects contained in the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects; and the segment range obtaining module is further configured to obtain a current candidate object, to obtain the video frames in which the current candidate object appears in the candidate video, to take multiple consecutive video frames containing the current candidate object as one video clip, the video clip including an identifier of a start video frame and an identifier of an end video frame, and to use the identifier of the start video frame as the segment start identifier and the identifier of the end video frame as the segment end identifier.
In one embodiment, the target information is a speech utterance object; the third recognition module is further configured to perform speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects; and the segment range obtaining module is further configured to obtain a current candidate speech utterance object and to obtain the audio time period corresponding to the current candidate speech utterance object, the audio time period including a pronunciation start time point and a pronunciation end time point, and to use the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the first information obtaining module is further configured to obtain an uploaded video, to perform speech recognition on the uploaded video to obtain a selection text, to count the frequency at which each word appears in the selection text, to determine a target word according to the frequency at which each word appears in the selection text, and to use the target word as the target information.
In one embodiment, the first splicing module is further configured to obtain the duration of each extracted target video segment and calculate the total duration corresponding to the multiple target video segments, and, when the total duration is less than a preset duration, to replicate the extracted target video segments to obtain replicated target video segments and to splice the replicated target video segments with the extracted original target video segments to obtain the synthetic video.
In one embodiment, the lookup module is further configured to, when the target information contains multiple target words, look up, from the video analysis data, the target video identifier and corresponding target fragment range for each target word; and the first splicing module is further configured to splice the extracted target video segments corresponding to each target word to obtain the synthetic video.
As shown in Fig. 15, in one embodiment a video synthesizing device is proposed, the device including:
a second information obtaining module 1502, configured to obtain target information;
a second video obtaining module 1504, configured to obtain a target video;
a determining module 1506, configured to determine, according to the target information, the target fragment range corresponding to the target information in the target video;
a second extraction module 1508, configured to extract the target video segment corresponding to the target information from the target video according to the target fragment range; and
a second splicing module 1510, configured to splice the extracted target video segments to obtain a synthetic video.
In one embodiment, the target information is used to determine a corresponding target pronunciation; the determining module is further configured to obtain the target fragment range corresponding to the target pronunciation in the target video; and the second splicing module is further configured to splice the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation.
In one embodiment, the target information is used to determine a corresponding target picture; the determining module is further configured to obtain the target fragment range corresponding to the target picture in the target video; and the second splicing module is further configured to splice the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
In one embodiment, the target information is target text information; the determining module is further configured to perform speech recognition on the audio in the target video and, when text information identical to the target text information is recognized, to obtain the start time point and end time point of the audio data of that text information within the target video, and to determine the target fragment range corresponding to the target text information according to the start time point and the end time point.
In one embodiment, the target information is a target object; the determining module is further configured to perform image recognition on the video frames in the target video to identify the video frames containing the target object, to take multiple consecutive video frames containing the target object as one video clip, and to determine the target fragment range corresponding to the target object according to the start video frame identifier and the end video frame identifier of the video clip.
In one embodiment, the target information is a target speech utterance object; the determining module is further configured to perform speech utterance object recognition on the audio in the target video to identify the audio fragments corresponding to the target utterance object, and to determine the target fragment range according to the start time point and end time point corresponding to each audio fragment.
Fig. 16 shows the internal structure diagram of a computer device in one embodiment. The computer device may be a terminal or a server. As shown in Fig. 16, the computer device includes a processor, a memory, and a network interface connected via a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the video synthesizing method. A computer program may also be stored in the internal memory, and when that computer program is executed by the processor, it causes the processor to execute the video synthesizing method. Those skilled in the art will understand that the structure shown in Fig. 16 is merely a block diagram of the partial structure relevant to the solution of this application and does not limit the computer device to which the solution of this application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
In one embodiment, the video synthesizing method provided by this application may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in Fig. 16. The memory of the computer device may store the program modules that make up the video synthesizing device, for example the first information obtaining module 1302, the matching relationship obtaining module 1304, the lookup module 1306, the first video obtaining module 1308, the first extraction module 1310, and the first splicing module 1312 of Fig. 13. The computer program composed of these program modules causes the processor to execute the steps in the video synthesizing device of each embodiment of this application described in this specification. For example, the computer device shown in Fig. 16 may obtain the target information through the first information obtaining module 1302 of the video synthesizing device shown in Fig. 13; obtain, through the matching relationship obtaining module 1304, the video analysis data recording the matching relationship among the candidate information, the candidate segment ranges, and the candidate video identifiers; look up, through the lookup module, the target video identifier and target fragment range corresponding to the target information from the video analysis data; obtain, through the first video obtaining module 1308, the corresponding target video according to the target video identifier; extract, through the first extraction module 1310, the corresponding target video segment from the corresponding target video according to the target fragment range; and splice, through the first splicing module 1312, the extracted target video segments to obtain the synthetic video.
In one embodiment a computer device is proposed, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the following steps: obtaining target information; obtaining video analysis data, the video analysis data recording the matching relationship among candidate information, candidate segment ranges, and candidate video identifiers; looking up the target video identifier and target fragment range corresponding to the target information from the video analysis data; obtaining the corresponding target video according to the target video identifier; extracting the corresponding target video segment from the corresponding target video according to the target fragment range; and splicing the extracted target video segments to obtain a synthetic video.
In one embodiment, looking up the target video identifier and target fragment range corresponding to the target information from the video analysis data includes: the target information being used to determine a corresponding target pronunciation, looking up, from the video analysis data, the identifiers of the multiple target videos containing the target pronunciation and obtaining the target fragment range corresponding to the target pronunciation in each target video; or the target information being used to determine a corresponding target picture, looking up, from the video analysis data, the identifiers of the multiple target videos containing the target picture and obtaining the target fragment range corresponding to the target picture in each target video. Splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation; or splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
In one embodiment, before the target information is obtained, the method further includes: obtaining a candidate video; performing speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information, and using each piece of candidate text information as the candidate information; and/or performing image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects, and using each candidate object as the candidate information; and/or performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, and using each candidate speech utterance object as the candidate information; obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier; and storing the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
In one embodiment, the target information is target text information; performing speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information includes: performing speech recognition on the audio in the candidate video to obtain a speech text, and performing word segmentation on the speech text to obtain multiple pieces of candidate text information. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining current candidate text information and obtaining the audio time period corresponding to the current candidate text information, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information is the target pinyin corresponding to a target word; performing speech recognition on the audio corresponding to the candidate video to obtain corresponding multiple pieces of candidate text information includes: performing speech recognition on the audio in the candidate video to obtain a pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate pinyin and obtaining the audio time period corresponding to the current candidate pinyin, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information further includes the tone corresponding to the target pinyin of the target word; performing speech recognition on the audio in the candidate video to obtain the candidate pinyins corresponding to multiple candidate words includes: performing speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words and the tone corresponding to each candidate pinyin. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate pinyin and its corresponding tone, and obtaining the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, the target information is a target object; performing image recognition on the image frames corresponding to the candidate video to obtain corresponding multiple candidate objects includes: recognizing the objects contained in the image frames corresponding to the candidate video to obtain the corresponding multiple candidate objects. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate object, obtaining the video frames in which the current candidate object appears in the candidate video, and taking multiple consecutive video frames containing the current candidate object as one video clip, the video clip including an identifier of a start video frame and an identifier of an end video frame; and using the identifier of the start video frame as the segment start identifier and the identifier of the end video frame as the segment end identifier.
In one embodiment, the target information is a speech utterance object; performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate utterance objects includes: performing speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects. Obtaining the candidate segment range corresponding to each piece of candidate information in the candidate video, the candidate segment range including a segment start identifier and a segment end identifier, includes: obtaining a current candidate speech utterance object and obtaining the audio time period corresponding to the current candidate speech utterance object, the audio time period including a pronunciation start time point and a pronunciation end time point; and using the pronunciation start time point as the segment start identifier and the pronunciation end time point as the segment end identifier.
In one embodiment, obtaining the target information includes: obtaining an uploaded video; performing speech recognition on the uploaded video to obtain a selection text and counting the frequency at which each word appears in the selection text; and determining a target word according to the frequency at which each word appears in the selection text, and using the target word as the target information.
In one embodiment, splicing the extracted target video segments to obtain the synthetic video includes: obtaining the duration of each extracted target video segment and calculating the total duration corresponding to the multiple target video segments; and, when the total duration is less than a preset duration, replicating the extracted target video segments to obtain replicated target video segments and splicing the replicated target video segments with the extracted original target video segments to obtain the synthetic video.
In one embodiment, looking up the target video identifier and target fragment range corresponding to the target information from the video analysis data includes: when the target information contains multiple target words, looking up, from the video analysis data, the target video identifier and corresponding target fragment range for each target word; and splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted target video segments corresponding to each target word to obtain the synthetic video.
In one embodiment a computer device is proposed, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the following steps: obtaining target information; obtaining a target video; determining, according to the target information, the target fragment range corresponding to the target information in the target video; extracting the target video segment corresponding to the target information from the target video according to the target fragment range; and splicing the extracted target video segments to obtain a synthetic video.
In one embodiment, determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: the target information being used to determine a corresponding target pronunciation, obtaining the target fragment range corresponding to the target pronunciation in the target video; or the target information being used to determine a corresponding target picture, obtaining the target fragment range corresponding to the target picture in the target video. Splicing the extracted target video segments to obtain the synthetic video includes: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthetic video that repeats the target pronunciation; or splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthetic video that repeats the target picture.
In one embodiment, the target information is target text information; determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: performing speech recognition on the audio in the target video and, when text information identical to the target text information is recognized, obtaining the start time point and end time point of the audio data of that text information within the target video; and determining the target fragment range corresponding to the target text information according to the start time point and the end time point.
In one embodiment, the target information is a target object; determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: performing image recognition on the video frames in the target video to identify the video frames containing the target object, and taking multiple consecutive video frames containing the target object as one video clip; and determining the target fragment range corresponding to the target object according to the start video frame identifier and the end video frame identifier of the video clip.
In one embodiment, the target information is a target speech utterance object; determining, according to the target information, the target fragment range corresponding to the target information in the target video includes: performing speech utterance object recognition on the audio in the target video to identify the audio fragments corresponding to the target utterance object, and determining the target fragment range according to the start time point and end time point corresponding to each audio fragment.
In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the following steps: obtaining target information; obtaining video analysis data, the video analysis data recording a matching relationship among candidate information, candidate segment ranges, and candidate video identifiers; searching the video analysis data for a target video identifier and a target segment range corresponding to the target information; obtaining a corresponding target video according to the target video identifier; extracting a corresponding target video segment from the corresponding target video according to the target segment range; and splicing each extracted target video segment to obtain a synthesized video.
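The following end-to-end sketch illustrates the lookup–extract–splice flow of these steps. It is not the patented implementation: the analysis-data layout is a plain dictionary, the video file names are hypothetical, and the clip operations assume the moviepy 1.x API (`VideoFileClip.subclip`, `concatenate_videoclips`).

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Hypothetical video analysis data: candidate information mapped to a list
# of (candidate video identifier, segment start second, segment end second).
analysis_data = {
    "你好": [("vid_001.mp4", 1.2, 1.8), ("vid_002.mp4", 7.0, 7.5)],
}

def synthesize(target_info, analysis_data, output_path="synth.mp4"):
    """Look up the target video identifiers and target segment ranges for
    the target information, extract the segments, and splice them."""
    matches = analysis_data.get(target_info, [])
    segments = [VideoFileClip(video_id).subclip(start, end)
                for video_id, start, end in matches]
    if segments:
        concatenate_videoclips(segments).write_videofile(output_path)

synthesize("你好", analysis_data)
```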
In one embodiment, searching the video analysis data for the target video identifier and the target segment range corresponding to the target information comprises: when the target information is used to determine a corresponding target pronunciation, searching the video analysis data for the identifiers of multiple target videos that contain the target pronunciation, and obtaining the target segment range corresponding to the target pronunciation in each target video; or, when the target information is used to determine a corresponding target picture, searching the video analysis data for the identifiers of multiple target videos that contain the target picture, and obtaining the target segment range corresponding to the target picture in each target video. Splicing each extracted target video segment to obtain a synthesized video comprises: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthesized video that repeats the target pronunciation; or splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthesized video that repeats the target picture.
In one embodiment, before obtaining the target information, the method further comprises: obtaining a candidate video; performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items, each candidate text information item serving as candidate information; and/or performing image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects, each candidate object serving as candidate information; and/or performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, each candidate speech utterance object serving as candidate information; obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark; and storing the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
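A sketch of how the matching relationship might be stored (illustrative; the speech, image, and speaker recognizers are stubbed out as an input list): each candidate information item maps to the candidate video identifier together with a segment range made of a segment start mark and a segment end mark.

```python
from collections import defaultdict

def index_candidate_video(video_id, recognized_items, index):
    """Associate each candidate information item with its segment range and
    the candidate video identifier.

    `recognized_items` is assumed to be the combined recognizer output:
    (candidate_info, start_mark, end_mark) tuples.
    """
    for candidate_info, start_mark, end_mark in recognized_items:
        index[candidate_info].append((video_id, start_mark, end_mark))

index = defaultdict(list)
index_candidate_video("vid_001.mp4", [("你好", 1.2, 1.8), ("世界", 2.0, 2.6)], index)
index_candidate_video("vid_002.mp4", [("你好", 7.0, 7.5)], index)
print(index["你好"])  # [('vid_001.mp4', 1.2, 1.8), ('vid_002.mp4', 7.0, 7.5)]
```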
In one embodiment, the target information is target text information. Performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items comprises: performing speech recognition on the audio in the candidate video to obtain a speech text, and performing word segmentation on the speech text to obtain the multiple candidate text information items. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate text information item and the audio time period corresponding to it, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
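For illustration of this word-segmentation step (the patent does not prescribe a library): jieba is a common Chinese word-segmentation package and is used here as a stand-in, and per-character timestamps from the recognizer are an assumption.

```python
import jieba  # common Chinese word-segmentation library, used as a stand-in

def candidate_texts_with_spans(char_times, text):
    """Cut the recognized speech text into candidate words and attach each
    word's pronunciation start / end time points.

    `char_times` is assumed to come from the recognizer: one
    (start_time, end_time) pair per character of `text`.
    """
    candidates, pos = [], 0
    for word in jieba.cut(text):
        start = char_times[pos][0]                # pronunciation start time
        end = char_times[pos + len(word) - 1][1]  # pronunciation end time
        candidates.append((word, start, end))
        pos += len(word)
    return candidates

times = [(1.2, 1.5), (1.5, 1.8), (2.0, 2.3), (2.3, 2.6)]
print(candidate_texts_with_spans(times, "你好世界"))
# e.g. [('你好', 1.2, 1.8), ('世界', 2.0, 2.6)]
```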
In one embodiment, the target information is the target pinyin corresponding to a target word. Performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items comprises: performing speech recognition on the audio in the candidate video to obtain a pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate pinyin and the audio time period corresponding to it, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
In one embodiment, the target information further includes the tone corresponding to the target pinyin of the target word. Performing speech recognition on the audio in the candidate video to obtain the candidate pinyin corresponding to the multiple candidate words comprises: performing speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of the multiple candidate words and the tone corresponding to each candidate pinyin. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate pinyin and its corresponding tone, and obtaining the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
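For illustration of the pinyin and tone variants of the target information (the patent does not name a library): the pypinyin package can produce both toneless pinyin and tone-numbered pinyin for a word, which could then be matched against the candidate pinyin and its corresponding tone.

```python
from pypinyin import lazy_pinyin, Style

word = "你好"
print(lazy_pinyin(word))                     # toneless pinyin: ['ni', 'hao']
print(lazy_pinyin(word, style=Style.TONE3))  # with tone digits: ['ni3', 'hao3']
```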
In one embodiment, the target information is a target object. Performing image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects comprises: recognizing the objects contained in the picture frames corresponding to the candidate video to obtain the multiple candidate objects. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate object and the video frames in which the current candidate object appears in the candidate video, and taking multiple consecutive video frames containing the current candidate object as one video clip, the video clip comprising the mark of a start video frame and the mark of an end video frame; and taking the mark of the start video frame as the segment start mark and the mark of the end video frame as the segment end mark.
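A small illustrative conversion (assuming a constant frame rate, which the patent does not require): the start frame mark and end frame mark of such a video clip translate into the time range needed when the segment is later extracted.

```python
def frame_marks_to_seconds(start_frame, end_frame, fps=25.0):
    """Convert a clip's start / end frame marks into a (start, end) time
    range in seconds, assuming a constant frame rate; end is inclusive."""
    return start_frame / fps, (end_frame + 1) / fps

print(frame_marks_to_seconds(75, 124))  # (3.0, 5.0) at 25 fps
```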
In one embodiment, the target information is a speech utterance object. Performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate utterance objects comprises: performing speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects. Obtaining the candidate segment range corresponding to each candidate information item in the candidate video, where the candidate segment range comprises a segment start mark and a segment end mark, comprises: obtaining a current candidate speech utterance object and the audio time period corresponding to it, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
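As an illustrative sketch (the speaker-recognition model is assumed, not specified by the patent): speaker-diarization output of the form (speaker label, start time, end time) can be filtered into the segment start / end marks of one candidate speech utterance object.

```python
def utterance_segment_ranges(diarized, target_speaker):
    """Collect the segment start / end marks for one candidate speech
    utterance object from assumed speaker-diarization output."""
    return [(start, end) for speaker, start, end in diarized
            if speaker == target_speaker]

diarized = [("spk_A", 0.0, 2.1), ("spk_B", 2.1, 4.0), ("spk_A", 4.0, 5.5)]
print(utterance_segment_ranges(diarized, "spk_A"))  # [(0.0, 2.1), (4.0, 5.5)]
```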
In one embodiment, obtaining the target information comprises: obtaining an uploaded video; performing speech recognition on the uploaded video to obtain a selection text, and counting the frequency with which each word appears in the selection text; and determining a target word according to the frequency with which each word appears in the selection text, the target word serving as the target information.
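A sketch of this frequency-based selection (illustrative only; jieba is again a stand-in for the word segmentation): the most frequent word in the recognized selection text is taken as the target information.

```python
from collections import Counter
import jieba

def pick_target_word(selection_text, top_n=1):
    """Count how often each word occurs in the selection text of the
    uploaded video and return the most frequent word(s)."""
    counts = Counter(w for w in jieba.cut(selection_text) if w.strip())
    return [word for word, _ in counts.most_common(top_n)]

print(pick_target_word("你好 世界 你好 你好"))  # e.g. ['你好']
```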
In one embodiment, splicing each extracted target video segment to obtain a synthesized video comprises: obtaining the duration of each extracted target video segment and calculating the total duration corresponding to the multiple target video segments; and when the total duration is less than a preset duration, duplicating the extracted target video segments to obtain duplicated target video segments, and splicing the duplicated target video segments with the originally extracted target video segments to obtain the synthesized video.
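A minimal sketch of this padding step (illustrative; the segment representation is an assumption): segments are duplicated in round-robin order until the total duration reaches the preset duration.

```python
def pad_segments_to_duration(segments, preset_duration):
    """If the total duration of the extracted target video segments is less
    than the preset duration, append duplicated segments until it is not.

    `segments` is a list of (segment, duration_in_seconds) pairs.
    """
    total = sum(d for _, d in segments)
    result = list(segments)
    i = 0
    while total < preset_duration and segments:
        duplicate = segments[i % len(segments)]  # duplicated target video segment
        result.append(duplicate)
        total += duplicate[1]
        i += 1
    return result

clips = [("clip_a", 0.6), ("clip_b", 0.9)]
print(pad_segments_to_duration(clips, 3.0))
# originals plus duplicates until total duration >= 3.0 s
```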
In one embodiment, searching the video analysis data for the target video identifier and the target segment range corresponding to the target information comprises: when the target information includes multiple target words, searching the video analysis data for the target video identifier and the corresponding target segment range for each target word. Splicing each extracted target video segment to obtain a synthesized video comprises: splicing the extracted target video segments corresponding to each target word to obtain the synthesized video.
In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the following steps: obtaining target information; obtaining a target video; determining, in the target video according to the target information, a target segment range corresponding to the target information; extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and splicing each extracted target video segment to obtain a synthesized video.
In one embodiment, determining, in the target video according to the target information, the target segment range corresponding to the target information comprises: when the target information is used to determine a corresponding target pronunciation, obtaining the target segment range corresponding to the target pronunciation in the target video; or, when the target information is used to determine a corresponding target picture, obtaining the target segment range corresponding to the target picture in the target video. Splicing each extracted target video segment to obtain a synthesized video comprises: splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthesized video that repeats the target pronunciation; or splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthesized video that repeats the target picture.
In one embodiment, the target information is target text information. Determining, in the target video according to the target information, the target segment range corresponding to the target information comprises: performing speech recognition on the audio in the target video; when text information identical to the target text information is recognized, obtaining the start time point and the end time point, in the target video, of the audio data corresponding to that text information; and determining the target segment range corresponding to the target text information according to the start time point and the end time point.
In one embodiment, the target information is a target object. Determining, in the target video according to the target information, the target segment range corresponding to the target information comprises: performing image recognition on the video frames in the target video to identify the video frames that contain the target object, and taking multiple consecutive video frames containing the target object as one video clip; and determining the target segment range corresponding to the target object according to the start frame identifier and the end frame identifier of the video clip.
In one embodiment, the target information is a target speech utterance object. Determining, in the target video according to the target information, the target segment range corresponding to the target information comprises: performing speech utterance object recognition on the audio in the target video to identify the audio fragment corresponding to the target utterance object, and determining the target segment range according to the start time point and end time point corresponding to the audio fragment.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above embodiment methods can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the patent scope of the application. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application patent shall be subject to the appended claims.
Claims (30)
1. A video synthesis method, the method comprising:
obtaining target information;
obtaining video analysis data, the video analysis data recording a matching relationship among candidate information, candidate segment ranges, and candidate video identifiers;
searching the video analysis data for a target video identifier and a target segment range corresponding to the target information;
obtaining a corresponding target video according to the target video identifier;
extracting a corresponding target video segment from the corresponding target video according to the target segment range; and
splicing each extracted target video segment to obtain a synthesized video.
2. The method according to claim 1, wherein searching the video analysis data for the target video identifier and the target segment range corresponding to the target information comprises:
when the target information is used to determine a corresponding target pronunciation, searching the video analysis data for the identifiers of multiple target videos containing the target pronunciation, and obtaining the target segment range corresponding to the target pronunciation in each target video; or
when the target information is used to determine a corresponding target picture, searching the video analysis data for the identifiers of multiple target videos containing the target picture, and obtaining the target segment range corresponding to the target picture in each target video;
and wherein splicing each extracted target video segment to obtain a synthesized video comprises:
splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthesized video that repeats the target pronunciation; or
splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthesized video that repeats the target picture.
3. The method according to claim 1, wherein, before obtaining the target information, the method further comprises:
obtaining a candidate video;
performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items, each candidate text information item serving as candidate information; and/or
performing image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects, each candidate object serving as candidate information; and/or
performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, each candidate speech utterance object serving as candidate information;
obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark; and
storing the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
4. The method according to claim 3, wherein the target information is target text information;
performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items comprises:
performing speech recognition on the audio in the candidate video to obtain a speech text, and performing word segmentation on the speech text to obtain the multiple candidate text information items;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate text information item and the audio time period corresponding to the current candidate text information item, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and
taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
5. The method according to claim 3, wherein the target information is the target pinyin corresponding to a target word;
performing speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items comprises:
performing speech recognition on the audio in the candidate video to obtain a pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate pinyin and the audio time period corresponding to the current candidate pinyin, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and
taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
6. The method according to claim 5, wherein the target information further includes the tone corresponding to the target pinyin of the target word;
performing speech recognition on the audio in the candidate video to obtain the candidate pinyin corresponding to the multiple candidate words comprises:
performing speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and performing word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of the multiple candidate words and the tone corresponding to each candidate pinyin;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate pinyin and the tone corresponding to the current candidate pinyin, and obtaining the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and
taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
7. The method according to claim 3, wherein the target information is a target object;
performing image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects comprises:
recognizing the objects contained in the picture frames corresponding to the candidate video to obtain the multiple candidate objects;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate object and the video frames in which the current candidate object appears in the candidate video, and taking multiple consecutive video frames containing the current candidate object as one video clip, the video clip comprising the mark of a start video frame and the mark of an end video frame; and
taking the mark of the start video frame as the segment start mark and the mark of the end video frame as the segment end mark.
8. The method according to claim 3, wherein the target information is a speech utterance object;
performing speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate utterance objects comprises:
performing speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects;
and obtaining the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark, comprises:
obtaining a current candidate speech utterance object and the audio time period corresponding to the current candidate speech utterance object, the audio time period comprising a pronunciation start time point and a pronunciation end time point; and
taking the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
9. The method according to claim 1, wherein obtaining the target information comprises:
obtaining an uploaded video;
performing speech recognition on the uploaded video to obtain a selection text, and counting the frequency with which each word appears in the selection text; and
determining a target word according to the frequency with which each word appears in the selection text, the target word serving as the target information.
10. The method according to claim 1, wherein splicing each extracted target video segment to obtain a synthesized video comprises:
obtaining the duration of each extracted target video segment and calculating the total duration corresponding to the multiple target video segments; and
when the total duration is less than a preset duration, duplicating the extracted target video segments to obtain duplicated target video segments, and splicing the duplicated target video segments with the originally extracted target video segments to obtain the synthesized video.
11. The method according to claim 1, wherein searching the video analysis data for the target video identifier and the target segment range corresponding to the target information comprises:
when the target information includes multiple target words, searching the video analysis data for the target video identifier and the corresponding target segment range for each target word;
and wherein splicing each extracted target video segment to obtain a synthesized video comprises:
splicing the extracted target video segments corresponding to each target word to obtain the synthesized video.
12. A video synthesis method, the method comprising:
obtaining target information;
obtaining a target video;
determining, in the target video according to the target information, a target segment range corresponding to the target information;
extracting, from the target video according to the target segment range, a target video segment corresponding to the target information; and
splicing each extracted target video segment to obtain a synthesized video.
13. The method according to claim 12, wherein determining, in the target video according to the target information, the target segment range corresponding to the target information comprises:
when the target information is used to determine a corresponding target pronunciation, obtaining the target segment range corresponding to the target pronunciation in the target video; or
when the target information is used to determine a corresponding target picture, obtaining the target segment range corresponding to the target picture in the target video;
and wherein splicing each extracted target video segment to obtain a synthesized video comprises:
splicing the extracted multiple target video segments corresponding to the target pronunciation to obtain a synthesized video that repeats the target pronunciation; or
splicing the extracted multiple target video segments corresponding to the target picture to obtain a synthesized video that repeats the target picture.
14. The method according to claim 12, wherein the target information is target text information;
and determining, in the target video according to the target information, the target segment range corresponding to the target information comprises:
performing speech recognition on the audio in the target video; when text information identical to the target text information is recognized, obtaining the start time point and the end time point, in the target video, of the audio data corresponding to that text information; and
determining the target segment range corresponding to the target text information according to the start time point and the end time point.
15. The method according to claim 12, wherein the target information is a target object;
and determining, in the target video according to the target information, the target segment range corresponding to the target information comprises:
performing image recognition on the video frames in the target video to identify the video frames containing the target object, and taking multiple consecutive video frames containing the target object as one video clip; and
determining the target segment range corresponding to the target object according to the start frame identifier and the end frame identifier of the video clip.
16. The method according to claim 12, wherein the target information is a target speech utterance object;
and determining, in the target video according to the target information, the target segment range corresponding to the target information comprises:
performing speech utterance object recognition on the audio in the target video to identify the audio fragment corresponding to the target utterance object, and determining the target segment range according to the start time point and end time point corresponding to the audio fragment.
17. A video synthesis apparatus, the apparatus comprising:
a first information obtaining module, configured to obtain target information;
a matching relationship obtaining module, configured to obtain video analysis data, the video analysis data recording a matching relationship among candidate information, candidate segment ranges, and candidate video identifiers;
a searching module, configured to search the video analysis data for a target video identifier and a target segment range corresponding to the target information;
a first video obtaining module, configured to obtain a corresponding target video according to the target video identifier;
a first extraction module, configured to extract a corresponding target video segment from the corresponding target video according to the target segment range; and
a first splicing module, configured to splice each extracted target video segment to obtain a synthesized video.
18. The apparatus according to claim 17, wherein the apparatus further comprises:
a candidate video obtaining module, configured to obtain a candidate video;
a first recognition module, configured to perform speech recognition on the audio corresponding to the candidate video to obtain multiple candidate text information items, each candidate text information item serving as candidate information; and/or
a second recognition module, configured to perform image recognition on the picture frames corresponding to the candidate video to obtain multiple candidate objects, each candidate object serving as candidate information; and/or
a third recognition module, configured to perform speech utterance object recognition on the audio corresponding to the candidate video to obtain multiple candidate speech utterance objects, each candidate speech utterance object serving as candidate information;
a segment range obtaining module, configured to obtain the candidate segment range corresponding to each candidate information item in the candidate video, the candidate segment range comprising a segment start mark and a segment end mark; and
a storage module, configured to store the candidate information, the candidate segment range, and the video identifier of the candidate video in association to obtain the matching relationship.
19. The apparatus according to claim 18, wherein the target information is target text information;
the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a speech text, and to perform word segmentation on the speech text to obtain the multiple candidate text information items; and
the segment range obtaining module is further configured to obtain a current candidate text information item and the audio time period corresponding to the current candidate text information item, the audio time period comprising a pronunciation start time point and a pronunciation end time point, and to take the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
20. The apparatus according to claim 18, wherein the target information is the target pinyin corresponding to a target word;
the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a pinyin text, and to perform word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words; and
the segment range obtaining module is further configured to obtain a current candidate pinyin and the audio time period corresponding to the current candidate pinyin, the audio time period comprising a pronunciation start time point and a pronunciation end time point, and to take the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
21. The apparatus according to claim 20, wherein the target information further includes the tone corresponding to the target pinyin of the target word;
the first recognition module is further configured to perform speech recognition on the audio in the candidate video to obtain a pinyin text and the tone of each pinyin in the pinyin text, and to perform word segmentation on the pinyin text to obtain the candidate pinyin corresponding to each of multiple candidate words and the tone corresponding to each candidate pinyin; and
the segment range obtaining module is further configured to obtain a current candidate pinyin and the tone corresponding to the current candidate pinyin, to obtain the corresponding audio time period according to the current candidate pinyin and its corresponding tone, the audio time period comprising a pronunciation start time point and a pronunciation end time point, and to take the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
22. The apparatus according to claim 18, wherein the target information is a target object;
the second recognition module is further configured to recognize the objects contained in the picture frames corresponding to the candidate video to obtain the multiple candidate objects; and
the segment range obtaining module is further configured to obtain a current candidate object and the video frames in which the current candidate object appears in the candidate video, to take multiple consecutive video frames containing the current candidate object as one video clip, the video clip comprising the mark of a start video frame and the mark of an end video frame, and to take the mark of the start video frame as the segment start mark and the mark of the end video frame as the segment end mark.
23. The apparatus according to claim 18, wherein the target information is a speech utterance object;
the third recognition module is further configured to perform speech utterance object recognition on the audio in the candidate video to obtain multiple candidate speech utterance objects; and
the segment range obtaining module is further configured to obtain a current candidate speech utterance object and the audio time period corresponding to the current candidate speech utterance object, the audio time period comprising a pronunciation start time point and a pronunciation end time point, and to take the pronunciation start time point as the segment start mark and the pronunciation end time point as the segment end mark.
24. The apparatus according to claim 17, wherein the first information obtaining module is further configured to obtain an uploaded video, to perform speech recognition on the uploaded video to obtain a selection text, to count the frequency with which each word appears in the selection text, and to determine a target word according to the frequency with which each word appears in the selection text, the target word serving as the target information.
25. A video synthesis apparatus, the apparatus comprising:
a second information obtaining module, configured to obtain target information;
a second video obtaining module, configured to obtain a target video;
a determining module, configured to determine, in the target video according to the target information, a target segment range corresponding to the target information;
a second extraction module, configured to extract, from the target video according to the target segment range, a target video segment corresponding to the target information; and
a second splicing module, configured to splice each extracted target video segment to obtain a synthesized video.
26. The apparatus according to claim 25, wherein the target information is target text information; and the determining module is further configured to perform speech recognition on the audio in the target video, to obtain, when text information identical to the target text information is recognized, the start time point and the end time point, in the target video, of the audio data corresponding to that text information, and to determine the target segment range corresponding to the target text information according to the start time point and the end time point.
27. The apparatus according to claim 25, wherein the target information is a target object; and the determining module is further configured to perform image recognition on the video frames in the target video to identify the video frames containing the target object, to take multiple consecutive video frames containing the target object as one video clip, and to determine the target segment range corresponding to the target object according to the start frame identifier and the end frame identifier of the video clip.
28. The apparatus according to claim 25, wherein the target information is a target speech utterance object; and the determining module is further configured to perform speech utterance object recognition on the audio in the target video to identify the audio fragment corresponding to the target utterance object, and to determine the target segment range according to the start time point and end time point corresponding to the audio fragment.
29. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 16.
30. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 16.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810359953.5A CN110392281B (en) | 2018-04-20 | 2018-04-20 | Video synthesis method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810359953.5A CN110392281B (en) | 2018-04-20 | 2018-04-20 | Video synthesis method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110392281A true CN110392281A (en) | 2019-10-29 |
CN110392281B CN110392281B (en) | 2022-03-18 |
Family
ID=68282737
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810359953.5A Active CN110392281B (en) | 2018-04-20 | 2018-04-20 | Video synthesis method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110392281B (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110913271A (en) * | 2019-11-29 | 2020-03-24 | Oppo广东移动通信有限公司 | Video processing method, mobile terminal and non-volatile computer-readable storage medium |
CN111031333A (en) * | 2019-12-02 | 2020-04-17 | 北京达佳互联信息技术有限公司 | Video processing method, device, system and storage medium |
CN111031394A (en) * | 2019-12-30 | 2020-04-17 | 广州酷狗计算机科技有限公司 | Video production method, device, equipment and storage medium |
CN111246224A (en) * | 2020-03-24 | 2020-06-05 | 成都忆光年文化传播有限公司 | Video live broadcast method and video live broadcast system |
CN111918146A (en) * | 2020-07-28 | 2020-11-10 | 广州筷子信息科技有限公司 | Video synthesis method and system |
WO2021093737A1 (en) * | 2019-11-15 | 2021-05-20 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating video, electronic device, and computer readable medium |
WO2021120685A1 (en) * | 2019-12-20 | 2021-06-24 | 苏宁云计算有限公司 | Video generation method and apparatus, and computer system |
CN113055612A (en) * | 2019-12-27 | 2021-06-29 | 浙江宇视科技有限公司 | Video playing method, device, electronic equipment, system and medium |
CN113259754A (en) * | 2020-02-12 | 2021-08-13 | 北京达佳互联信息技术有限公司 | Video generation method and device, electronic equipment and storage medium |
CN113268635A (en) * | 2021-05-19 | 2021-08-17 | 北京达佳互联信息技术有限公司 | Video processing method, device, server and computer readable storage medium |
CN113286173A (en) * | 2021-05-19 | 2021-08-20 | 北京沃东天骏信息技术有限公司 | Video editing method and device |
CN113572977A (en) * | 2021-07-06 | 2021-10-29 | 上海哔哩哔哩科技有限公司 | Video production method and device |
CN113613068A (en) * | 2021-08-03 | 2021-11-05 | 北京字跳网络技术有限公司 | Video processing method and device, electronic equipment and storage medium |
CN113676772A (en) * | 2021-08-16 | 2021-11-19 | 上海哔哩哔哩科技有限公司 | Video generation method and device |
WO2021259322A1 (en) * | 2020-06-23 | 2021-12-30 | 广州筷子信息科技有限公司 | System and method for generating video |
CN114257862A (en) * | 2020-09-24 | 2022-03-29 | 北京字跳网络技术有限公司 | Video generation method, device, equipment and storage medium |
CN117495757A (en) * | 2022-07-22 | 2024-02-02 | 数坤(深圳)智能网络科技有限公司 | Ultrasound image regularization method, device, equipment and computer-readable storage medium |
CN118042248A (en) * | 2024-04-11 | 2024-05-14 | 深圳市捷易科技有限公司 | Video generation method, device, equipment and readable storage medium |
Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070168864A1 (en) * | 2006-01-11 | 2007-07-19 | Koji Yamamoto | Video summarization apparatus and method |
CN101021855A (en) * | 2006-10-11 | 2007-08-22 | 鲍东山 | Video searching system based on content |
CN102427507A (en) * | 2011-09-30 | 2012-04-25 | 北京航空航天大学 | Football video highlight automatic synthesis method based on event model |
CN103260082A (en) * | 2013-05-21 | 2013-08-21 | 王强 | Video processing method and device |
CN103956166A (en) * | 2014-05-27 | 2014-07-30 | 华东理工大学 | Multimedia courseware retrieval system based on voice keyword recognition |
CN104244086A (en) * | 2014-09-03 | 2014-12-24 | 陈飞 | Video real-time splicing device and method based on real-time conversation semantic analysis |
CN104796781A (en) * | 2015-03-31 | 2015-07-22 | 小米科技有限责任公司 | Video clip extraction method and device |
CN105224925A (en) * | 2015-09-30 | 2016-01-06 | 努比亚技术有限公司 | Video process apparatus, method and mobile terminal |
CN105578222A (en) * | 2016-02-01 | 2016-05-11 | 百度在线网络技术(北京)有限公司 | Information push method and device |
CN106021496A (en) * | 2016-05-19 | 2016-10-12 | 海信集团有限公司 | Video search method and video search device |
CN106297799A (en) * | 2016-08-09 | 2017-01-04 | 乐视控股(北京)有限公司 | Voice recognition processing method and device |
CN106331479A (en) * | 2016-08-22 | 2017-01-11 | 北京金山安全软件有限公司 | Video processing method and device and electronic equipment |
CN106372246A (en) * | 2016-09-20 | 2017-02-01 | 深圳市同行者科技有限公司 | Audio playing method and device |
US20170062006A1 (en) * | 2015-08-26 | 2017-03-02 | Twitter, Inc. | Looping audio-visual file generation based on audio and video analysis |
CN106663099A (en) * | 2014-04-10 | 2017-05-10 | 谷歌公司 | Methods, systems, and media for searching for video content |
CN106921749A (en) * | 2017-03-31 | 2017-07-04 | 北京京东尚科信息技术有限公司 | For the method and apparatus of pushed information |
WO2017157276A1 (en) * | 2016-03-14 | 2017-09-21 | 腾讯科技(深圳)有限公司 | Method and device for stitching multimedia files |
CN107517405A (en) * | 2017-07-31 | 2017-12-26 | 努比亚技术有限公司 | The method, apparatus and computer-readable recording medium of a kind of Video processing |
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070168864A1 (en) * | 2006-01-11 | 2007-07-19 | Koji Yamamoto | Video summarization apparatus and method |
CN101021855A (en) * | 2006-10-11 | 2007-08-22 | 鲍东山 | Video searching system based on content |
CN102427507A (en) * | 2011-09-30 | 2012-04-25 | 北京航空航天大学 | Football video highlight automatic synthesis method based on event model |
CN103260082A (en) * | 2013-05-21 | 2013-08-21 | 王强 | Video processing method and device |
CN106663099A (en) * | 2014-04-10 | 2017-05-10 | 谷歌公司 | Methods, systems, and media for searching for video content |
CN103956166A (en) * | 2014-05-27 | 2014-07-30 | 华东理工大学 | Multimedia courseware retrieval system based on voice keyword recognition |
CN104244086A (en) * | 2014-09-03 | 2014-12-24 | 陈飞 | Video real-time splicing device and method based on real-time conversation semantic analysis |
CN104796781A (en) * | 2015-03-31 | 2015-07-22 | 小米科技有限责任公司 | Video clip extraction method and device |
US20170062006A1 (en) * | 2015-08-26 | 2017-03-02 | Twitter, Inc. | Looping audio-visual file generation based on audio and video analysis |
CN105224925A (en) * | 2015-09-30 | 2016-01-06 | 努比亚技术有限公司 | Video process apparatus, method and mobile terminal |
CN105578222A (en) * | 2016-02-01 | 2016-05-11 | 百度在线网络技术(北京)有限公司 | Information push method and device |
WO2017157276A1 (en) * | 2016-03-14 | 2017-09-21 | 腾讯科技(深圳)有限公司 | Method and device for stitching multimedia files |
CN106021496A (en) * | 2016-05-19 | 2016-10-12 | 海信集团有限公司 | Video search method and video search device |
CN106297799A (en) * | 2016-08-09 | 2017-01-04 | 乐视控股(北京)有限公司 | Voice recognition processing method and device |
CN106331479A (en) * | 2016-08-22 | 2017-01-11 | 北京金山安全软件有限公司 | Video processing method and device and electronic equipment |
CN106372246A (en) * | 2016-09-20 | 2017-02-01 | 深圳市同行者科技有限公司 | Audio playing method and device |
CN106921749A (en) * | 2017-03-31 | 2017-07-04 | 北京京东尚科信息技术有限公司 | For the method and apparatus of pushed information |
CN107517405A (en) * | 2017-07-31 | 2017-12-26 | 努比亚技术有限公司 | The method, apparatus and computer-readable recording medium of a kind of Video processing |
Non-Patent Citations (1)
Title |
---|
杨朝欢 (Yang Chaohuan): "Duplicate Video Detection Based on Deep Learning", China Masters' Theses Full-text Database (Basic Sciences) * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11818424B2 (en) | 2019-11-15 | 2023-11-14 | Beijing Bytedance Network Technology Co., Ltd. | Method and apparatus for generating video, electronic device, and computer readable medium |
WO2021093737A1 (en) * | 2019-11-15 | 2021-05-20 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating video, electronic device, and computer readable medium |
CN110913271B (en) * | 2019-11-29 | 2022-01-18 | Oppo广东移动通信有限公司 | Video processing method, mobile terminal and non-volatile computer-readable storage medium |
CN110913271A (en) * | 2019-11-29 | 2020-03-24 | Oppo广东移动通信有限公司 | Video processing method, mobile terminal and non-volatile computer-readable storage medium |
CN111031333A (en) * | 2019-12-02 | 2020-04-17 | 北京达佳互联信息技术有限公司 | Video processing method, device, system and storage medium |
CN111031333B (en) * | 2019-12-02 | 2022-04-22 | 北京达佳互联信息技术有限公司 | Video processing method, device, system and storage medium |
WO2021120685A1 (en) * | 2019-12-20 | 2021-06-24 | 苏宁云计算有限公司 | Video generation method and apparatus, and computer system |
CN113055612A (en) * | 2019-12-27 | 2021-06-29 | 浙江宇视科技有限公司 | Video playing method, device, electronic equipment, system and medium |
CN111031394A (en) * | 2019-12-30 | 2020-04-17 | 广州酷狗计算机科技有限公司 | Video production method, device, equipment and storage medium |
CN113259754B (en) * | 2020-02-12 | 2023-09-19 | 北京达佳互联信息技术有限公司 | Video generation method, device, electronic equipment and storage medium |
CN113259754A (en) * | 2020-02-12 | 2021-08-13 | 北京达佳互联信息技术有限公司 | Video generation method and device, electronic equipment and storage medium |
CN111246224A (en) * | 2020-03-24 | 2020-06-05 | 成都忆光年文化传播有限公司 | Video live broadcast method and video live broadcast system |
WO2021259322A1 (en) * | 2020-06-23 | 2021-12-30 | 广州筷子信息科技有限公司 | System and method for generating video |
CN111918146A (en) * | 2020-07-28 | 2020-11-10 | 广州筷子信息科技有限公司 | Video synthesis method and system |
CN111918146B (en) * | 2020-07-28 | 2021-06-01 | 广州筷子信息科技有限公司 | Video synthesis method and system |
CN114257862B (en) * | 2020-09-24 | 2024-05-14 | 北京字跳网络技术有限公司 | Video generation method, device, equipment and storage medium |
CN114257862A (en) * | 2020-09-24 | 2022-03-29 | 北京字跳网络技术有限公司 | Video generation method, device, equipment and storage medium |
CN113286173A (en) * | 2021-05-19 | 2021-08-20 | 北京沃东天骏信息技术有限公司 | Video editing method and device |
CN113268635A (en) * | 2021-05-19 | 2021-08-17 | 北京达佳互联信息技术有限公司 | Video processing method, device, server and computer readable storage medium |
CN113268635B (en) * | 2021-05-19 | 2024-01-02 | 北京达佳互联信息技术有限公司 | Video processing method, device, server and computer readable storage medium |
CN113286173B (en) * | 2021-05-19 | 2023-08-04 | 北京沃东天骏信息技术有限公司 | Video editing method and device |
CN113572977B (en) * | 2021-07-06 | 2024-02-27 | 上海哔哩哔哩科技有限公司 | Video production method and device |
CN113572977A (en) * | 2021-07-06 | 2021-10-29 | 上海哔哩哔哩科技有限公司 | Video production method and device |
CN113613068A (en) * | 2021-08-03 | 2021-11-05 | 北京字跳网络技术有限公司 | Video processing method and device, electronic equipment and storage medium |
CN113676772B (en) * | 2021-08-16 | 2023-08-08 | 上海哔哩哔哩科技有限公司 | Video generation method and device |
CN113676772A (en) * | 2021-08-16 | 2021-11-19 | 上海哔哩哔哩科技有限公司 | Video generation method and device |
CN117495757A (en) * | 2022-07-22 | 2024-02-02 | 数坤(深圳)智能网络科技有限公司 | Ultrasound image regularization method, device, equipment and computer-readable storage medium |
CN117495757B (en) * | 2022-07-22 | 2025-05-06 | 数坤(深圳)智能网络科技有限公司 | Ultrasound image normalization method, device, equipment and computer readable storage medium |
CN118042248A (en) * | 2024-04-11 | 2024-05-14 | 深圳市捷易科技有限公司 | Video generation method, device, equipment and readable storage medium |
CN118042248B (en) * | 2024-04-11 | 2024-07-05 | 深圳市捷易科技有限公司 | Video generation method, device, equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110392281B (en) | 2022-03-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110392281A (en) | Image synthesizing method, device, computer equipment and storage medium | |
CN114401438B (en) | Video generation method and device for virtual digital person, storage medium and terminal | |
CN109658916B (en) | Speech synthesis method, speech synthesis device, storage medium and computer equipment | |
CN104157285B (en) | Audio recognition method, device and electronic equipment | |
DE69514382T2 (en) | VOICE RECOGNITION | |
Maity et al. | IITKGP-MLILSC speech database for language identification | |
CN108711422A (en) | Audio recognition method, device, computer readable storage medium and computer equipment | |
DE602004012909T2 (en) | A method and apparatus for modeling a speech recognition system and estimating a word error rate based on a text | |
US20170133038A1 (en) | Method and apparatus for keyword speech recognition | |
CN108281139A (en) | Speech transcription method and apparatus, robot | |
CN109979440B (en) | Keyword sample determination method, voice recognition method, device, equipment and medium | |
CN111402865B (en) | Method for generating voice recognition training data and method for training voice recognition model | |
CN110121116A (en) | Video generation method and device | |
KR20150052600A (en) | System for grasping speech meaning of recording audio data based on keyword spotting, and indexing method and method thereof using the system | |
CN113593522B (en) | Voice data labeling method and device | |
DE10054583C2 (en) | Method and apparatus for recording, searching and playing back notes | |
CN110298463A (en) | Meeting room preordering method, device, equipment and storage medium based on speech recognition | |
CN110442855B (en) | Voice analysis method and system | |
CN112397053B (en) | Voice recognition method and device, electronic equipment and readable storage medium | |
CN106297766B (en) | Phoneme synthesizing method and system | |
CN108364655A (en) | Method of speech processing, medium, device and computing device | |
Lee et al. | Voice imitating text-to-speech neural networks | |
CN119479620A (en) | Streaming voice interaction method and related device, equipment and storage medium | |
CN110059174A (en) | Inquiry guidance method and device | |
CN108899016A (en) | Speech text regularization method, apparatus, device and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |