CN110769167A - Method for video dubbing based on text-to-speech technology

Info

Publication number: CN110769167A
Application number: CN201911042390.8A
Authority: CN
Inventors: 陈阳; 鲁永春; 王周
Original assignee: Hefei Mingyang Information Technology Co Ltd
Current assignee: Hefei Mingyang Information Technology Co Ltd
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2020-02-07

Abstract

The invention discloses a method for dubbing video based on text-to-speech technology, belonging to the technical field of text-to-speech and comprising the following steps: S1: selecting an original video file; S2: inserting dubbing text; S3: converting text to speech, wherein the text from step S2 is transmitted section by section to a text-to-speech server, the server generates a dubbing file for each section, and each dubbing file is returned to the position of its original text, forming audio intervals; S4: inserting a blank audio file into each audio interval formed in step S3; S5: synthesizing audio, namely combining the dubbing files from step S3 and the blank audio files from step S4 into a synthesized audio file; S6: decomposing the original video file into an original audio file and a new video file; S7: mixing, namely mixing the original audio file with the synthesized audio file to obtain a total audio file; S8: combining the total audio file with the new video file to obtain a synthesized video file. The scheme makes video dubbing simple to produce and easy to use, so that professional video dubbing can be produced without professional knowledge.

Description

Method for video dubbing based on text-to-speech technology

Technical Field

The invention relates to the technical field of text-to-speech, in particular to a method for dubbing video based on a text-to-speech technology.

Background

Text-to-speech conversion (TTS), also commonly referred to as continuous text-to-speech synthesis, allows an electronic device to receive an input text string and provide a converted representation of the text string in the form of synthesized speech.

In the field of digital multimedia processing, video dubbing is a post-production task. It generally requires dedicated software and a dedicated recording studio, where a dubber operates the software to complete the dubbing. First, the original audio of the video is stripped out and removed. Second, the duration of the video section to be dubbed and the starting time point of the dubbing are determined. The dubber then narrates while the voice is recorded synchronously. Once one section is finished, the next section to be dubbed is processed in the same way, and these steps are repeated until the whole video has been dubbed.

With the popularity of short-video platforms such as Douyin and Kuaishou, a large number of video-production enthusiasts have emerged. The traditional production method is too complex for them, because dubbing requires professional dubbing personnel or professional equipment. To meet the dubbing needs of mass video production and to lower the threshold and cost of video production, the invention provides a video dubbing method. The method addresses the problems that manual dubbing is laborious and expensive, places high demands on equipment, is prone to noise, is inconvenient to operate, and requires professional dubbing personnel. At the same time, the invention makes video dubbing simple to produce and easy to use, so that professional video dubbing can be produced without professional knowledge.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a method for dubbing video based on a text-to-speech technology, which solves the problems of complex dubbing process and high requirement on dubbing equipment in the prior art.

The purpose of the invention can be realized by the following technical scheme:

A method for video dubbing based on text-to-speech technology comprises the following steps:

S1: selecting an original video file, namely importing a video file from mobile phone storage;

S2: inserting dubbing text, namely inserting dubbing text at different positions of the video;

S3: converting text to speech, wherein the text from step S2 is transmitted section by section to a text-to-speech server, the server generates a dubbing file for each section, and each dubbing file is returned to the position of its original text, forming audio intervals;

S4: inserting a blank audio file into each audio interval formed in step S3;

S5: synthesizing audio, namely combining the dubbing files from step S3 and the blank audio files from step S4 into a synthesized audio file;

S6: decomposing the original video file into an original audio file and a new video file;

S7: mixing, namely mixing the original audio file with the synthesized audio file to obtain a total audio file;

S8: combining the total audio file with the new video file to obtain a synthesized video file.
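Steps S3 to S5 can be illustrated with a minimal sketch. It is not the patent's implementation: the `DubClip` type, the `build_synth_track` function, the 16 kHz sample rate, and the use of plain integer lists as PCM samples are all assumptions made for illustration. The sketch fills every interval between dubbing clips with silence so that the synthesized track is exactly as long as the video.

```python
from dataclasses import dataclass
from typing import List

SAMPLE_RATE = 16000  # assumed sample rate in Hz; the patent does not specify one


@dataclass
class DubClip:
    start: float        # position of the text on the video timeline, in seconds
    samples: List[int]  # PCM samples as returned by a (hypothetical) TTS server


def build_synth_track(clips: List[DubClip], total_seconds: float) -> List[int]:
    """Lay dubbing clips on a silent track as long as the video (S4 and S5)."""
    total = round(total_seconds * SAMPLE_RATE)
    track: List[int] = []
    for clip in sorted(clips, key=lambda c: c.start):
        begin = round(clip.start * SAMPLE_RATE)
        if begin < len(track):
            raise ValueError("dubbing clips overlap")
        track.extend([0] * (begin - len(track)))  # S4: blank audio fills the interval
        track.extend(clip.samples)                # S5: append the dubbing clip
    if len(track) > total:
        raise ValueError("dubbing runs past the end of the video")
    track.extend([0] * (total - len(track)))      # trailing silence up to the video end
    return track
```

Because the result has the same duration as the video, it can be mixed sample for sample with the original audio in step S7.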

As a preferred aspect of the present invention, in step S1, the method for selecting an original video file further includes shooting a video with a mobile phone camera.

As a preferred embodiment of the present invention, in step S2, the dubbing text is inserted by taking the time of the video as the coordinate, inserting the start time point and the end time point of each text in sequence, and then labeling the corresponding text on each entry of the resulting time array in sequence.

In a preferred embodiment of the present invention, in step S2, an identifier of the time interval duration is inserted into the dubbing text.

As a preferred embodiment of the present invention, in step S3, the speed and the pitch of the text-to-speech are set.

As a preferable aspect of the present invention, in step S7, before mixing, the sound volumes of the original audio file and the synthesized audio file are set.

As a preferable embodiment of the present invention, step S8 further includes converting the text in step S2 into a subtitle file, and incorporating the subtitle file into the composite video.

As a preferred embodiment of the present invention, when the text is converted into a subtitle file, the size, color, and background color of the text are set.

The invention has the beneficial effects that:

The technical scheme converts text into speech by means of text-to-speech technology and mixes the speech with the original sound of the video to generate a new video file. The method includes converting text into speech, setting the speech speed, generating blank audio files, selecting the position at which playback starts, separating the audio from the video, mixing the original audio of the video with the speech converted from the text, adjusting the volume, and combining the audio with the video. The method makes video dubbing simple to produce and easy to use, so that professional video dubbing can be produced without professional knowledge.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. It will be obvious to those skilled in the art that other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is a schematic diagram of text-to-speech according to the present invention;

FIG. 3 is a schematic diagram of the method of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

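As shown in FIG. 1 and FIG. 2, a method for video dubbing based on text-to-speech technology includes the following steps:

S1: selecting an original video file, namely importing a video file from mobile phone storage;

S2: inserting dubbing text, namely inserting dubbing text at different positions of the video;

S3: converting text to speech, wherein the text from step S2 is transmitted section by section to a text-to-speech server, the server generates a dubbing file for each section, and each dubbing file is returned to the position of its original text, forming audio intervals;

S4: inserting a blank audio file into each audio interval formed in step S3;

S5: synthesizing audio, namely combining the dubbing files from step S3 and the blank audio files from step S4 into a synthesized audio file;

S6: decomposing the original video file into an original audio file and a new video file;

S7: mixing, namely mixing the original audio file with the synthesized audio file to obtain a total audio file;

S8: combining the total audio file with the new video file to obtain a synthesized video file.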

In step S1, the method for selecting an original video file further includes shooting a video with a mobile phone camera.

In step S2, the dubbing text is inserted by taking the time of the video as the coordinate, inserting the start time point and the end time point of each text in sequence, and then labeling the corresponding text on each entry of the resulting time array in sequence.
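One possible shape for the time array of step S2 is sketched below. The `DubSegment` type and the `build_time_array` function are hypothetical names, and the checks (segments inside the video, no overlap) are assumptions about what a reasonable implementation would enforce rather than requirements stated in the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DubSegment:
    start: float  # start time point on the video timeline, in seconds
    end: float    # end time point, in seconds
    text: str     # dubbing text labeled on this interval


def build_time_array(
    entries: List[Tuple[float, float, str]], video_seconds: float
) -> List[DubSegment]:
    """Sort the (start, end, text) entries by time and validate them."""
    segments = sorted((DubSegment(*e) for e in entries), key=lambda s: s.start)
    previous_end = 0.0
    for seg in segments:
        if not (0.0 <= seg.start < seg.end <= video_seconds):
            raise ValueError(f"segment outside the video: {seg}")
        if seg.start < previous_end:
            raise ValueError(f"segment overlaps the previous one: {seg}")
        previous_end = seg.end
    return segments
```

Each entry of the returned list carries the section of text that would be sent to the text-to-speech server in step S3.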

In step S2, an identifier of a user-defined time interval duration is inserted into the dubbing text to implement a speech pause function: the text-to-speech server recognizes the identifier and inserts blank audio accordingly.
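The patent does not give the syntax of the pause identifier. The sketch below assumes a hypothetical marker of the form `[pause=1.5]` (seconds) and shows how a server might split the dubbing text into spoken parts and pauses; `PAUSE_MARK` and `split_pauses` are illustrative names only.

```python
import re
from typing import List, Tuple, Union

# Hypothetical identifier syntax: "[pause=<seconds>]". The patent does not define one.
PAUSE_MARK = re.compile(r"\[pause=(\d+(?:\.\d+)?)\]")


def split_pauses(text: str) -> List[Tuple[str, Union[str, float]]]:
    """Split dubbing text into ("speech", text) and ("pause", seconds) parts."""
    parts: List[Tuple[str, Union[str, float]]] = []
    pos = 0
    for match in PAUSE_MARK.finditer(text):
        spoken = text[pos:match.start()].strip()
        if spoken:
            parts.append(("speech", spoken))
        parts.append(("pause", float(match.group(1))))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        parts.append(("speech", tail))
    return parts
```

Each `("pause", seconds)` part would then be rendered as blank audio of that length between the neighbouring speech parts.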

In step S3, the speed and pitch of the text-to-speech are set, and the text-to-speech server adjusts the speed and pitch of the speech in the generated dubbing file according to the setting, which better meets the needs of the use scenario.
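How the speed and pitch settings reach the server is not specified in the patent. One common convention is the `prosody` element of SSML (the W3C Speech Synthesis Markup Language), and the sketch below builds a simplified SSML-style fragment under that assumption. The function name `to_ssml`, the percentage units, and the range limits are illustrative choices, and a real server may require additional attributes on the `speak` element.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str, rate_percent: int = 100, pitch_percent: int = 0) -> str:
    """Wrap dubbing text in a prosody element carrying speed and pitch settings."""
    if not 20 <= rate_percent <= 200:      # illustrative limits, not from the patent
        raise ValueError("rate_percent out of range")
    if not -50 <= pitch_percent <= 50:     # illustrative limits, not from the patent
        raise ValueError("pitch_percent out of range")
    return (
        f'<speak><prosody rate="{rate_percent}%" pitch="{pitch_percent:+d}%">'
        f"{escape(text)}</prosody></speak>"
    )
```

Escaping the text keeps characters such as `&` and `<` in the dubbing text from breaking the markup.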

In step S7, the volumes of the original audio file and the synthesized audio file are set before mixing. The original audio file is the background music of the original video file, and the synthesized audio file is the human voice. Adjusting the two volumes balances the voice against the background music to suit the usage scenario.
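The volume adjustment and mixing of step S7 can be sketched as a weighted sum of samples. This is a minimal illustration that assumes both tracks are 16-bit PCM at the same sample rate and of equal length; the function name `mix` and the default gain values are assumptions, not values from the patent.

```python
from typing import List

INT16_MIN, INT16_MAX = -32768, 32767  # assumed 16-bit PCM sample range


def mix(
    original: List[int],
    dubbing: List[int],
    original_gain: float = 0.3,
    dubbing_gain: float = 1.0,
) -> List[int]:
    """Scale each track by its volume setting, add them, and clip to 16 bits."""
    if len(original) != len(dubbing):
        raise ValueError("tracks must have the same length")
    mixed = []
    for background, voice in zip(original, dubbing):
        value = round(background * original_gain + voice * dubbing_gain)
        mixed.append(max(INT16_MIN, min(INT16_MAX, value)))  # avoid integer overflow
    return mixed
```

Lowering `original_gain` relative to `dubbing_gain` keeps the background music under the voice.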

In step S8, the method further includes converting the text from step S2 into a subtitle file, setting the size, color, and background color of the text, and incorporating the subtitle file into the synthesized video. This implements the subtitle function in the video file and allows subtitles to be displayed in different sizes and colors as needed.
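The patent does not name a subtitle format. As an illustration, the sketch below writes the timed texts of step S2 in the widely used SubRip (SRT) layout, which carries timing and text only; the size, color, and background color settings would need a styled format or a rendering step and are not covered here. `srt_time` and `to_srt` are hypothetical helper names.

```python
from typing import List, Tuple


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Tuple[float, float, str]]) -> str:
    """Turn (start, end, text) entries into numbered SRT subtitle blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)  # a blank line separates consecutive blocks
```

Reusing the same start and end time points for speech and subtitles keeps the displayed text aligned with the dubbing.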

As shown in FIG. 3, after the text-to-speech server converts the text into the dubbing file, the audio and video of the original video are separated to obtain an original audio file and a video file of the same total duration. The volumes of the original audio file and the dubbing file (the synthesized audio file) are then set according to the user's requirements. Next, the original audio file and the dubbing file (the synthesized audio file), which have the same duration, are mixed to obtain the total audio file. Finally, the total audio file and the video file are combined to generate the final video file (the synthesized video file).
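The patent does not name the tool used for separation, mixing, and combination. As one possible realization, the sketch below only builds command lines for the FFmpeg command-line tool, using its `-vn`, `-an`, `-map`, and `-c:v copy` options and its `volume` and `amix` filters. The function names, file names, and default volumes are assumptions; the commands are constructed, not executed.

```python
from typing import List


def split_commands(video: str, audio_out: str, video_out: str) -> List[List[str]]:
    """S6: extract the audio track and a copy of the video without audio."""
    return [
        ["ffmpeg", "-i", video, "-vn", audio_out],                  # drop video, keep audio
        ["ffmpeg", "-i", video, "-an", "-c:v", "copy", video_out],  # drop audio, keep video
    ]


def mix_command(
    original_audio: str,
    dubbing_audio: str,
    out: str,
    original_volume: float = 0.3,
    dubbing_volume: float = 1.0,
) -> List[str]:
    """S7: set both volumes, then mix the two tracks into one."""
    graph = (
        f"[0:a]volume={original_volume}[bg];"
        f"[1:a]volume={dubbing_volume}[fg];"
        "[bg][fg]amix=inputs=2:duration=first[out]"
    )
    return ["ffmpeg", "-i", original_audio, "-i", dubbing_audio,
            "-filter_complex", graph, "-map", "[out]", out]


def mux_command(video: str, audio: str, out: str) -> List[str]:
    """S8: combine the silent video with the total audio file."""
    return ["ffmpeg", "-i", video, "-i", audio,
            "-map", "0:v", "-map", "1:a", "-c:v", "copy", out]
```

Each returned list could be passed to `subprocess.run` on a machine where FFmpeg is installed.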

The technical scheme converts text into speech by means of text-to-speech technology and mixes the speech with the original sound of the video to generate a new video file. The method includes converting text into speech, setting the speech speed, generating blank audio files, selecting the position at which playback starts, separating the audio from the video, mixing the original audio of the video with the speech converted from the text, adjusting the volume, and combining the audio with the video. The scheme addresses the problems that manual dubbing is troublesome, labor-intensive, and expensive, places high demands on equipment, is prone to noise, is inconvenient to operate, and requires professional dubbing personnel. At the same time, the scheme makes video dubbing simple to produce and easy to use, so that professional video dubbing can be produced without professional knowledge.

In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. Those skilled in the art will understand that the present invention is not limited to the embodiments described above. The embodiments and the specification merely illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the invention as claimed.

Claims

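1. A method for video dubbing based on text-to-speech technology, characterized in that:

the method comprises the following steps:

S1: selecting an original video file, namely importing a video file from mobile phone storage;

S2: inserting dubbing text, namely inserting dubbing text at different positions of the video;

S3: converting text to speech, wherein the text from step S2 is transmitted section by section to a text-to-speech server, the server generates a dubbing file for each section, and each dubbing file is returned to the position of its original text, forming audio intervals;

S4: inserting a blank audio file into each audio interval formed in step S3;

S5: synthesizing audio, namely combining the dubbing files from step S3 and the blank audio files from step S4 into a synthesized audio file;

S6: decomposing the original video file into an original audio file and a new video file;

S7: mixing, namely mixing the original audio file with the synthesized audio file to obtain a total audio file;

S8: combining the total audio file with the new video file to obtain a synthesized video file.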

2. The method for video dubbing based on text-to-speech technology according to claim 1, characterized in that: in step S1, the method for selecting an original video file further includes shooting a video with a mobile phone camera.

3. The method for video dubbing based on text-to-speech technology according to claim 1, characterized in that: in step S2, the dubbing text is inserted by taking the time of the video as the coordinate, inserting the start time point and the end time point of each text in sequence, and then labeling the corresponding text on each entry of the resulting time array in sequence.

4. The method for video dubbing based on text-to-speech technology according to claim 1, characterized in that: in step S2, an identifier of a user-defined time interval duration is inserted into the dubbing text.

5. The method for video dubbing based on text-to-speech technology according to claim 1, characterized in that: in step S3, the speed and pitch of the text-to-speech are set.

6. The method for video dubbing based on text-to-speech technology according to claim 1, characterized in that: in step S7, before mixing, the volumes of the original audio file and the synthesized audio file are set.

7. The method for video dubbing based on text-to-speech technology according to claim 1, characterized in that: step S8 further includes converting the text from step S2 into a subtitle file and incorporating the subtitle file into the synthesized video.

8. The method for video dubbing based on text-to-speech technology according to claim 7, characterized in that: when the text is converted into a subtitle file, the size, color, and background color of the text are set.