
WO2018113535A1 - Method and apparatus for automatically generating dubbing characters, and electronic device - Google Patents


Info

Publication number
WO2018113535A1
WO2018113535A1 (PCT/CN2017/115194)
Authority
WO
WIPO (PCT)
Prior art keywords
text
semantic unit
basic semantic
information
basic
Prior art date
Application number
PCT/CN2017/115194
Other languages
French (fr)
Chinese (zh)
Inventor
阳鹤翔 (Yang Hexiang)
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Publication of WO2018113535A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 — Information retrieval of audio data
    • G06F 16/68 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 — Retrieval using metadata automatically derived from the content
    • G06F 16/685 — Retrieval using an automatically derived transcript of audio data, e.g. lyrics
    • G — PHYSICS
    • G11 — INFORMATION STORAGE
    • G11B — INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 — Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 — Indexing; Addressing; Timing or synchronising; Measuring tape travel

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method for automatically generating a voice-over text; the present application also relates to an apparatus for automatically generating a voice-over text and an electronic device.
  • The synchronized-lyrics display function lets listeners see the lyrics of an audio file while enjoying its melody, and it has become one of the essential features of audio playback applications and players.
  • At present, the lyric files used for synchronized display during audio playback are produced mainly by hand: the lyrics are marked with time stamps, a corresponding lyric file is generated for each audio file in the audio file database, and the generated lyric files are imported into the audio playback application, so that when an audio file is played, the corresponding lyric file is displayed in synchronization.
  • In view of this, the present application provides a method for automatically generating voice-over text to solve the above problems in the prior art; the present application also relates to an apparatus for automatically generating voice-over text and an electronic device.
  • the embodiment of the present application provides a method for automatically generating a voice-over text, and the method for automatically generating a voice-over text includes:
  • the text basic semantic units in which the start and end time information has been recorded are processed to generate the voice-over text corresponding to the audio information.
  • Optionally, processing the text basic semantic units that record the start and end time information and generating the voice-over text corresponding to the audio information includes: integrating the single sentences whose start and end time information has been determined to form voice-over text that corresponds to the audio information and carries the start and end time information of each single sentence.
  • Optionally, if at least two sets of start and end time information are recorded in a text basic semantic unit, text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information.
  • Optionally, the method further includes: filtering, according to a predetermined calculation method, all start and end time information of each text basic semantic unit in each text basic semantic unit group, and determining the text basic semantic unit group that constitutes the single sentence.
  • Optionally, filtering all start and end time information of each text basic semantic unit in each text basic semantic unit group and determining the text basic semantic unit group that constitutes the single sentence includes: filtering each text basic semantic unit group and retaining the text basic semantic unit groups whose error value is lower than a preset threshold.
  • Optionally, the method further includes: counting, for each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and selecting the text basic semantic unit group with the largest count.
  • Optionally, identifying the text information to obtain the text basic semantic units includes: obtaining the text basic semantic units in the text information by identifying each word of each sentence.
  • Optionally, when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the value of the text basic semantic unit corresponding to that audio basic semantic unit is set to null.
  • Optionally, the method further includes: estimating, according to a predetermined calculation manner, start and end time information for the text basic semantic units whose value is a null value.
  • Optionally, the predetermined calculation manner includes: using the end time recorded in the preceding text basic semantic unit as the start time of the null-valued text basic semantic unit, and adding the average duration of the text basic semantic units to that start time to obtain its end time.
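The estimation just described can be sketched as follows. This is a minimal illustration under assumed data shapes (a unit as a dict with `word`, `start`, `end` in milliseconds), not the patented implementation: a null-valued unit takes the preceding unit's end time as its start time and adds the average unit duration to obtain its end time.

```python
def fill_null_times(units):
    """Estimate missing start/end times for text basic semantic units.

    Each unit is a dict {"word": str, "start": int or None, "end": int or None},
    with times in milliseconds. A null-valued unit receives:
      start = end time of the preceding unit (0 if it is the first unit)
      end   = start + average duration of the units that do have times
    """
    timed = [u for u in units if u["start"] is not None and u["end"] is not None]
    if not timed:
        return units
    avg_duration = sum(u["end"] - u["start"] for u in timed) // len(timed)
    for i, u in enumerate(units):
        if u["start"] is None:
            # start time: end time recorded in the preceding unit
            u["start"] = units[i - 1]["end"] if i > 0 else 0
            # end time: start time plus the average unit duration
            u["end"] = u["start"] + avg_duration
    return units
```

Because the units are filled in order, a run of consecutive null-valued units chains off the most recently estimated end time.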
  • the embodiment of the present application further provides an apparatus for automatically generating a voice-over text
  • the apparatus for automatically generating a voice-over text includes:
  • An audio recognition unit configured to identify the audio information, and obtain the start and end time information of the identified basic audio semantic units of each audio
  • a text recognition unit configured to acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text
  • a time writing unit configured to record start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text
  • a voice-over text generating unit, configured to process the text basic semantic units in which the start and end time information is recorded and generate the voice-over text corresponding to the audio information.
  • the voice-over text generating unit includes:
  • a text semantic acquisition subunit configured to obtain, for each single sentence in the text information, a basic semantic unit of text constituting the single sentence
  • a time information determining subunit, configured to determine the start and end time information of the single sentence according to the start and end time information recorded in the acquired text basic semantic units;
  • a voice-over text generating subunit, configured to integrate the single sentences whose start and end time information has been determined to form voice-over text that corresponds to the audio information and carries the start and end time information of each single sentence.
  • Optionally, the text semantic acquisition subunit is specifically configured to: when obtaining, for each single sentence in the text information, the text basic semantic units constituting the single sentence, if at least two sets of start and end time information are recorded in a text basic semantic unit, form the text basic semantic unit groups constituting the single sentence according to the number of sets of start and end time information.
  • the device for automatically generating a voice-over text further includes:
  • a text semantic screening subunit, configured to: after the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information, filter, according to a predetermined calculation method, all start and end time information of each text basic semantic unit in each text basic semantic unit group, and determine the text basic semantic unit group that constitutes the single sentence.
  • the text semantic screening subunit includes:
  • an error calculation subunit, configured to calculate, for each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time intervals within the group, and use the sum of the time intervals as the error value of the text basic semantic unit group.
  • the text semantic screening subunit further includes:
  • the filtering subunit is configured to filter each of the text basic semantic unit groups, and retain a text basic semantic unit group whose error value is lower than a preset threshold.
  • the text semantic screening subunit further includes:
  • a time count calculation subunit, configured to: after the text basic semantic unit groups whose error value is lower than the preset threshold have been retained, count, for each retained group, the number of times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and select the text basic semantic unit group with the largest count.
  • the text recognition unit is specifically configured to: obtain, from the text information, the basic semantic unit of the text in the text information according to the order of each word in each sentence.
  • Optionally, the time writing unit is specifically configured to: when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, set the value of the corresponding text basic semantic unit to null if the start and end time information of the audio basic semantic unit is a null value.
  • the device for automatically generating a voice-over text further includes:
  • a time estimating unit, configured to: after the text basic semantic unit group constituting the single sentence is determined, estimate start and end time information for the text basic semantic units whose value is a null value according to a predetermined calculation manner.
  • the time estimating unit includes:
  • a start time writing subunit, configured to write the end time recorded in the preceding text basic semantic unit into a text basic semantic unit whose value is a null value, as its start time;
  • an end time writing subunit, configured to add the average duration to that start time and write the result into the text basic semantic unit whose value is a null value, as its end time.
  • an electronic device including:
  • a memory for storing a voice-over text generating program which, when read and executed by the processor, performs the following operations: identifying the audio information and acquiring the start and end time information of each identified audio basic semantic unit; acquiring the text information corresponding to the audio information and identifying it, thereby acquiring the text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded to generate the voice-over text corresponding to the audio information.
  • Compared with the prior art, the method, apparatus and electronic device for automatically generating voice-over text provided by the present application identify the audio information to obtain the start and end time information of each identified audio basic semantic unit; acquire and identify the text information corresponding to the audio information to obtain the text basic semantic units; record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and process the text basic semantic units in which the start and end time information is recorded to generate the voice-over text corresponding to the audio information.
  • By performing speech recognition on the audio information, the technical solution acquires the start and end time information of each audio basic semantic unit in the audio information; by identifying the text information corresponding to the audio information, it determines the number and glyphs of the text basic semantic units in each single sentence. The audio basic semantic units identified in the audio information are matched with the text basic semantic units identified in the text information, and once this correspondence is established, the time information of each single sentence in the text is determined from the start and end time information of the audio basic semantic units. As a result, every single sentence in the text carries time information, so dynamic lyric files no longer need to be produced manually, which improves production efficiency, reduces production cost, and simplifies the production process.
  • FIG. 1 shows a flowchart of a method for automatically generating voice-over text provided in accordance with an embodiment of the present application;
  • FIG. 2 is a flowchart showing the processing of the text basic semantic units in which the start and end time information is recorded and the generation of the voice-over text corresponding to the audio information, according to an embodiment of the present application;
  • FIG. 3 shows a schematic diagram of an apparatus for automatically generating voice-over text provided in accordance with an embodiment of the present application; and
  • FIG. 4 shows a schematic diagram of an electronic device provided in accordance with an embodiment of the present application.
  • the embodiment of the present application provides a method for automatically generating a voice-over text, and an embodiment of the present application simultaneously provides an apparatus for automatically generating a voice-over text and an electronic device. Detailed description will be made one by one in the following embodiments.
  • At present, the lyric files used for synchronized display during audio playback are produced mainly by hand.
  • While listening to the audio, an operator marks the lyrics with time stamps, generates a corresponding lyric file for each audio file in the audio file database, and imports the generated lyric files into the audio playback application, so that when an audio file is played, the corresponding lyric file is displayed in synchronization.
  • This manual process of generating lyric files is cumbersome, inefficient, and costly, and as audio databases grow, the drawbacks of the manual method become more and more serious.
  • The technical solution of the present application performs speech recognition on the audio information to obtain the start and end time information of each audio basic semantic unit in the audio information, and identifies the text information corresponding to the audio information to determine the number and glyphs of the text basic semantic units in each single sentence, so that the audio basic semantic units identified in the audio information can be matched with the text basic semantic units identified in the text information.
  • Dynamic lyrics are lyrics that an editor has sorted according to the time at which each line appears in the song, so that the lyrics can be displayed in synchronization when the song is played.
  • Commonly used dynamic lyrics files include: lrc, qrc, etc.
  • "lrc" is an abbreviation of the English word "lyric" and is used as the extension of dynamic lyric files.
  • the lyrics file with the extension lrc can be displayed synchronously in various digital players.
  • An lrc lyric file is plain text containing "tags" of the form "*:*:*", where "*" is a wildcard standing for one or more real characters: the tag records the time at which a lyric line appears (for example, "01:01:00" means 1 minute and 1 second), and ":" separates the minute, second, and millisecond fields.
  • Such a lyric file can be viewed and edited with ordinary word-processing software: after the content is written in this format with Notepad, the extension can be changed to lrc to produce a "filename.lrc" lyric file.
  • The standard format of a line in an lrc dynamic lyric file is [minutes:seconds:milliseconds] lyrics.
  • The lrc lyric text contains two types of tags:
  • The first is the identification tag, whose format is "[identifier:value]" and which mainly covers a set of predefined labels.
  • The second is the time tag, whose form is "[mm:ss]" or "[mm:ss.ff]"; a time tag must appear at the beginning of a lyric line, and a single line of lyrics may contain multiple time tags (for example, when the same line recurs at several points in the song).
  • An lrc dynamic lyric file must have the same file name as its song (that is, apart from the extensions such as .mp3, .wma, and .lrc, the text before the dot must be exactly the same) and be placed in the same directory (that is, in the same folder); the lyrics can then be displayed in synchronization when the song is played with a player that supports lyric display.
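The time-tag convention above can be illustrated with a short sketch that renders one timed lyric line in the [mm:ss.xx] form; the function name and input shape are illustrative assumptions, not part of the lrc convention itself.

```python
def to_lrc_line(start_ms, text):
    """Render one lrc lyric line as [mm:ss.xx]text, xx = hundredths of a second."""
    minutes, rem = divmod(start_ms, 60_000)   # whole minutes, remainder in ms
    seconds, ms = divmod(rem, 1_000)          # whole seconds, remainder in ms
    return f"[{minutes:02d}:{seconds:02d}.{ms // 10:02d}]{text}"
```

For example, a line that starts at 61 000 ms renders as `[01:01.00]…`, matching the "1 minute and 1 second" example above.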
  • An embodiment of the present application provides a method for generating a voice-over text, and the method for generating a voice-over text is implemented as follows:
  • FIG. 1 illustrates a flowchart of the method for automatically generating voice-over text provided in accordance with an embodiment of the present application.
  • the method for automatically generating a voice-over text includes:
  • Step S101 Identify the audio information, and obtain the start and end time information of the identified basic audio semantic units of each audio.
  • Identifying the audio information mainly means converting the speech signal of the audio information into recognizable text information; that is, the speech signal of the audio information is converted into recognizable audio basic semantic units, which are acquired in the form of text information.
  • the audio basic semantic units include: Chinese characters, Chinese words, pinyin, numbers, English characters, and/or English words.
  • the speech recognition process may adopt a speech recognition method such as a statistical pattern recognition technology.
  • the audio information may be voice-recognized by a CMU-Sphinx speech recognition system.
  • CMU Sphinx is a large-vocabulary speech recognition system modeled with continuous hidden Markov models (CHMM). It supports multiple modes of operation, including a high-precision flat decoder and a fast-search tree decoder.
  • the text information includes an audio basic semantic unit recognized from the audio information and start and end time information of the audio basic semantic unit in the audio information.
  • For example, the audio information may be a song file in mp3 or another music format.
  • Since an mp3 file directly records real sound over a period of time, when the mp3 file is recognized, the recognized audio basic semantic units are output in the form of text information, and the start and end time information of each recognized audio basic semantic unit, as played within the audio information, is recorded.
  • In the text information output after the audio information is identified, each recognized audio basic semantic unit and its time information are recorded in the form <word, TIMECLASS>.
  • Here "word" is the recognized audio basic semantic unit, and "TIMECLASS" is its time annotation, recorded as a start time and an end time {startTime, endTime}.
  • The time information is the offset, in milliseconds, relative to time 0 at which playback of the audio information starts.
  • For example, assume the audio information is an mp3 file whose playback length is 10 seconds, and the lyric "I think and think again" appears when the mp3 file has played for 1 second.
  • The recognized audio basic semantic units and their time information, recorded in the text information obtained by identifying the audio information, are:
  • Since the audio information in this example is in Chinese, each recognized audio basic semantic unit recorded in the output text information is a single Chinese character; similarly, if the audio information were in English, each recorded audio basic semantic unit would be a single English word.
  • The start and end time information of each audio basic semantic unit is recorded in milliseconds. Since the lyric "I think and think" appears when the mp3 file has played for 1 second, the audio basic semantic unit "I" appears while the mp3 file plays from 1 second to 1.1 seconds, so the recorded time information of the audio basic semantic unit "I" is {startTime: 1000, endTime: 1100}.
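The <word, TIMECLASS> record and the millisecond offsets above can be modeled directly; the class name below is an assumption for illustration only.

```python
from dataclasses import dataclass

@dataclass
class AudioUnit:
    """One recognized audio basic semantic unit: <word, {startTime, endTime}>."""
    word: str
    start_ms: int  # offset from playback time 0, in milliseconds
    end_ms: int

# The lyric "I" appears between 1.0 s and 1.1 s of playback:
unit = AudioUnit(word="I", start_ms=1000, end_ms=1100)
```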
  • Step S103 Acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text.
  • Acquiring the text information corresponding to the audio information and identifying it to acquire the text basic semantic units may be implemented as follows: the text information corresponding to the audio information is searched for through the Internet; after the text information is obtained, each basic semantic unit in it is identified, a text basic semantic unit whose time information is a null value is formed for each identified basic semantic unit, and the text basic semantic units are thereby acquired.
  • the basic semantic unit is single-word information in the text information, including: Chinese characters, Chinese words, pinyin, numbers, English characters, and/or English words.
  • Following the above example, when the audio information is an mp3 file, the lyric text corresponding to the mp3 file is searched for through the Internet; the specific content of the lyric text is "I think and think".
  • After the lyric text is obtained, each basic semantic unit in the text information is identified, and a text basic semantic unit whose time information is a null value is formed for each identified basic semantic unit:
  • Step S105 recording start and end time information of each of the audio basic semantic units into the corresponding basic semantic unit of the text.
  • Recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit may be implemented as follows: after the audio information is identified, each audio basic semantic unit is matched against the text basic semantic units formed by identifying each basic semantic unit in the text information corresponding to the audio information, and the start and end time information of the audio basic semantic unit is written into the text basic semantic unit that corresponds to it.
  • Following the above example, the recognized audio basic semantic units and their time information recorded in the text information obtained by identifying the audio information are:
  • The audio basic semantic units "I" and "think" have the same glyphs as the text basic semantic units "I" and "think" formed from the lyric text, so the start and end time information of the audio basic semantic units "I" and "think" is written into the text basic semantic units "I" and "think":
  • Since the number of occurrences of the same audio basic semantic unit in the audio information may not be unique (for example, a certain word may appear multiple times in a song), when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit in step S105, units with the same glyph may be handled as follows: the start and end time information of the audio basic semantic unit obtained from the audio information is written into every text basic semantic unit that has the same glyph as that audio basic semantic unit.
  • Following the above example, the recognized audio basic semantic units and their time information recorded in the text information obtained by identifying the audio information are:
  • Each basic semantic unit in the text information is identified, and a text basic semantic unit whose time information is a null value is formed for each identified basic semantic unit:
  • The audio basic semantic units identified from the audio information and the text basic semantic units formed by extracting the lyric text have the same glyphs, so the start and end time information of each audio basic semantic unit is written into the corresponding text basic semantic units.
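The matching in step S105, including the case where the same glyph occurs more than once, can be sketched as follows; the data shapes are assumptions, and an empty candidate list plays the role of the null value.

```python
def record_times(audio_units, text_units):
    """Write each audio unit's start/end times into every text unit with the same glyph.

    audio_units: list of (word, (start_ms, end_ms)) pairs from speech recognition.
    text_units:  list of dicts {"word": str, "times": list}, with "times" initially
                 empty (the null value).
    """
    for word, span in audio_units:
        for unit in text_units:
            if unit["word"] == word:
                # a recurring glyph accumulates several candidate time sets
                unit["times"].append(span)
    return text_units
```

A text unit whose glyph recurs in the audio therefore ends up holding several sets of start and end times, which is exactly the situation the grouping step below resolves.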
  • Step S107: The text basic semantic units in which the start and end time information is recorded are processed to generate the voice-over text corresponding to the audio information.
  • Processing the text basic semantic units that record the start and end time information and generating the voice-over text corresponding to the audio information may be implemented as follows: for each specific single sentence in the text information, the text basic semantic units constituting that single sentence are determined; the start and end time information of the single sentence is determined according to the start and end time information in those units; and all the single sentences, together with their start and end time information, are organized into the voice-over text corresponding to the audio information.
  • Each single sentence in the text can be distinguished by the newline character between one single sentence and the next.
  • FIG. 2 illustrates a flowchart for processing the text basic semantic unit in which the start and end time information is recorded, and generating a voice-over text corresponding to the audio information, according to an embodiment of the present application.
  • Step S107-1 for each single sentence in the text information, obtain a basic semantic unit of text constituting the single sentence.
  • Obtaining, for each single sentence in the text information, the text basic semantic units constituting the single sentence may be implemented as follows: each single sentence in the text information is distinguished according to the newline characters, and the text basic semantic units constituting a specific single sentence are obtained.
  • Following the above example, the specific single sentences in the text information are "I want" and "you".
  • The text basic semantic units constituting these single sentences are "I" and "want", and "you"; the text basic semantic units "I" and "want" are:
  • Step S107-2 determining start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been acquired.
  • Determining the start and end time information of the single sentence according to the start and end time information recorded in the acquired text basic semantic units may be implemented as follows: the earliest start time among the text basic semantic units constituting the single sentence is used as the start time of the single sentence, the latest end time among those units is used as the end time of the single sentence, and this start time and end time are taken as the start and end time information of the single sentence.
  • Following the above example, the time information of the single sentence "I want", determined from the time information of the above two text basic semantic units, is:
  • The time information of the single sentence "you", determined in the same way, is:
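The rule of step S107-2 (earliest start time becomes the sentence start, latest end time becomes the sentence end) can be sketched as:

```python
def sentence_span(unit_spans):
    """Determine a single sentence's start/end time from its units' (start, end) pairs:
    the earliest start time is the sentence start, the latest end time the sentence end."""
    starts = [s for s, _ in unit_spans]
    ends = [e for _, e in unit_spans]
    return min(starts), max(ends)
```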
  • Step S107-3: The single sentences whose start and end time information has been determined are integrated to form the voice-over text corresponding to the audio information and carrying the start and end time information of each single sentence.
  • When step S107-1 is performed to obtain, for each single sentence in the text information, the text basic semantic units constituting the single sentence, units with the same glyph may be handled as follows: if at least two sets of start and end time information are recorded in a text basic semantic unit, the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information.
  • Following the above example, the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information; the first group is:
  • the second group is:
  • the third group is:
  • the fourth group is:
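The formation of one candidate group per combination of recorded time sets can be sketched with a Cartesian product (an illustrative sketch; the data layout — one list of candidate `(start, end)` pairs per character — is an assumption):

```python
# When a character occurs several times in the audio, its text basic
# semantic unit carries several (start, end) candidates. One candidate
# group is formed per combination of the recorded time sets.
from itertools import product

def candidate_groups(sentence, times_per_char):
    """times_per_char: one list of candidate (start, end) pairs per character."""
    groups = []
    for combo in product(*times_per_char):
        groups.append(list(zip(sentence, combo)))
    return groups

# Hypothetical candidates: "I" occurs twice and "want" occurs twice,
# giving 2 x 2 = 4 candidate groups.
groups = candidate_groups(
    ["I", "want"],
    [[(1.0, 1.3), (7.0, 7.3)], [(1.4, 1.8), (7.4, 7.8)]],
)
print(len(groups))  # 4
```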
  • the method further includes the following steps:
  • all start and end time information of each text basic semantic unit in each text basic semantic unit group is filtered, and a text basic semantic unit group constituting the single sentence is determined.
  • The predetermined calculation method: within each text basic semantic unit group, compute the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time intervals over the group, and use the sum as the error value of that text basic semantic unit group.
  • Here the time interval refers to the gap between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit. Because the groups are formed from all recorded time sets, the start time of a unit may be smaller than the end time of the preceding unit; to prevent such negative intervals from distorting the error value, the calculation must use a positive value of the time interval. The positive value may be obtained, for example, by taking the absolute value or the square; the following description obtains it by squaring the difference, since the quantity of interest is the interval between the start time of each unit and the end time of the preceding unit.
  • the mathematical expression of the predetermined calculation method is: error = Σ_{i=2}^{n} (start_i − end_{i−1})², where start_i is the start time of the i-th text basic semantic unit in the group and end_{i−1} is the end time of the (i−1)-th unit.
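A sketch of the squared-difference error value (illustrative; the `(character, (start, end))` tuple layout is an assumption):

```python
# For each unit after the first, square the gap between its start time and
# the previous unit's end time; the sum of the squares is the group's error
# value. Squaring keeps negative gaps from cancelling out parts of the error.

def group_error(group):
    """group: list of (char, (start, end)) tuples in sentence order."""
    error = 0.0
    for (_, (start, _)), (_, (_, prev_end)) in zip(group[1:], group[:-1]):
        error += (start - prev_end) ** 2
    return error

# Adjacent candidates 0.1 s apart give an error of about 0.01:
print(group_error([("I", (1.0, 1.3)), ("want", (1.4, 1.8))]))
```

A group whose candidate times are far apart in the audio accumulates a large error and is filtered out by the threshold step described below.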
  • The preset threshold may be a reasonable value configured by a person skilled in the art according to experience, or it may simply be the smallest error value; after the error values are calculated, each text basic semantic unit group is filtered and the groups whose error value is below the preset threshold are retained.
  • Filtering each text basic semantic unit group and retaining the groups whose error value is below the preset threshold may be implemented as follows: keep the text basic semantic unit group with the smallest error value as the group making up the single sentence, and filter out the other text basic semantic unit groups.
  • An embodiment of the present application provides a preferred implementation. In this preferred manner, after the step of filtering each text basic semantic unit group and retaining the groups whose error value is below the preset threshold, it is further necessary to count, within each retained group, how many times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and to obtain the text basic semantic unit group with the largest count.
  • the basic semantic unit group of texts that make up the single sentence also includes the fifth group:
  • Suppose the groups with the smallest remaining error value are the second group and the fifth group. The two groups are then judged for reasonableness by the chronological order of their text basic semantic units, that is, by counting how many times the start time of a unit is greater than the end time of the preceding unit. In the second group, for example, the start time of "want" is greater than the end time of the preceding unit "I", and likewise for the following characters, giving the second group a reasonable count of 4; by the same reasoning the fifth group has a reasonable count of 3. The second group, having the larger count, is therefore taken as the text basic semantic unit group that makes up the single sentence.
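The reasonableness count used to break ties between low-error groups can be sketched as follows (illustrative; the tuple layout and example timings are assumptions):

```python
# Count how many units start after the preceding unit ends; when several
# groups survive the error filter, keep the group with the highest count.

def order_count(group):
    """group: list of (char, (start, end)) tuples in sentence order."""
    return sum(
        1
        for (_, (start, _)), (_, (_, prev_end)) in zip(group[1:], group[:-1])
        if start > prev_end
    )

def pick_group(groups):
    return max(groups, key=order_count)

# Hypothetical candidate groups for a three-character sentence:
group_a = [("I", (0.0, 1.0)), ("want", (1.1, 2.0)), ("you", (2.1, 3.0))]
group_b = [("I", (0.0, 1.0)), ("want", (0.5, 2.0)), ("you", (2.1, 3.0))]
print(order_count(group_a), order_count(group_b))  # 2 1
```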
  • the text basic semantic units in the text information are obtained by identifying each word of each sentence in order.
  • Since speech recognition has a limited recognition rate, the audio information may not be recognized completely; thus when the audio information is recognized in step S101, some audio basic semantic units may remain unrecognized. In step S103, because the text information is a character string recognizable by the computer, every basic semantic unit in the text information can be identified and formed into a text basic semantic unit. Therefore, when step S105 records the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, an unrecognized audio basic semantic unit has null start and end time information, and the corresponding text basic semantic unit is given a null value. In other words, when some audio basic semantic units are empty, the number of text basic semantic units formed is greater than the number of audio basic semantic units obtained by speech recognition, and the start and end time information of the unmatched text basic semantic units takes a null value.
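The null-value bookkeeping can be sketched as an in-order matching step (illustrative; the dictionary of recognized occurrences is an assumed data layout, not the patent's):

```python
# Recognized audio units are matched to the text characters in order; a
# character that was never recognized in the audio gets a null (None) time,
# so the text side may carry more entries than recognition produced.

def attach_times(text_chars, recognized):
    """recognized: dict mapping character -> list of (start, end) occurrences."""
    timed, used = [], {}
    for char in text_chars:
        occurrences = recognized.get(char, [])
        index = used.get(char, 0)
        if index < len(occurrences):
            timed.append((char, occurrences[index]))
            used[char] = index + 1
        else:
            timed.append((char, None))  # unrecognized -> null value
    return timed

timed = attach_times(["I", "want", "you"], {"I": [(0.0, 0.4)], "you": [(0.9, 1.3)]})
print(timed)  # [('I', (0.0, 0.4)), ('want', None), ('you', (0.9, 1.3))]
```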
  • the audio basic semantic units identified from the audio information, together with their time information, are:
  • the text basic semantic units of the lyrics in the lyric text, with the unmatched units taking null values, are:
  • When step S107-1 is performed and, for each single sentence in the text information, the text basic semantic units making up the single sentence are obtained, if a text basic semantic unit has a null value, then after the step of determining the text basic semantic unit group making up the single sentence, start and end time information is estimated for the null-valued text basic semantic units according to a predetermined estimation manner, so that every text basic semantic unit has start and end time information.
  • The predetermined estimation manner includes: calculating the average time information of the text basic semantic units in the text basic semantic unit group; putting the end time of the preceding text basic semantic unit into the start time of the null-valued unit; and adding the average time information to that end time to obtain the end time of the null-valued unit. Calculating the average time information may be implemented as follows: subtract the start time from the end time of each text basic semantic unit making up the single sentence to obtain the playing duration of each unit in the audio information, then divide the sum of these playing durations by the number of text basic semantic units in the single sentence to obtain the average time information of the text basic semantic units of the single sentence.
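The average-duration estimation can be sketched as follows (an illustrative sketch that assumes the null unit is not the first in the sentence, so a preceding end time exists; the tuple layout is also an assumption):

```python
# The average playing duration of the timed units gives a null unit its
# duration: it starts at the previous unit's end time and ends one average
# duration later.

def fill_null_by_average(group):
    """group: list of (char, (start, end) or None) tuples in sentence order."""
    durations = [span[1] - span[0] for _, span in group if span is not None]
    average = sum(durations) / len(durations)
    filled = []
    for char, span in group:
        if span is None:
            prev_end = filled[-1][1][1]  # end time of the preceding unit
            span = (prev_end, prev_end + average)
        filled.append((char, span))
    return filled

filled = fill_null_by_average([("I", (0.0, 0.4)), ("want", None), ("you", (0.9, 1.3))])
```

With both timed units lasting 0.4 s, the null unit "want" is assigned roughly the span (0.4, 0.8).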
  • The start time of a null-valued text basic semantic unit can thus be estimated from the end time in the time information of the preceding text basic semantic unit; that is, the end time of the text basic semantic unit immediately before the null-valued unit is taken as the start time of the null-valued unit.
  • When step S103 acquires the text information corresponding to the audio information and recognizes it, obtaining the text basic semantic units in the order of each word in each sentence, the start and end time information of a null-valued text basic semantic unit may also be estimated in another manner: since the text basic semantic units are formed in the order of the words in the sentence, a null-valued unit lies between its preceding and following text basic semantic units, so its time can be estimated from the end time in the time information of the preceding unit and the start time in the time information of the following unit.
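The neighbour-based alternative can be sketched as follows (illustrative; it assumes the null unit has timed units on both sides, as the description requires):

```python
# A null unit between two timed neighbours simply spans from the previous
# unit's end time to the next unit's start time.

def fill_null_by_neighbours(group):
    """group: list of (char, (start, end) or None) tuples in sentence order."""
    filled = list(group)
    for i, (char, span) in enumerate(filled):
        if span is None and 0 < i < len(filled) - 1:
            prev_end = filled[i - 1][1][1]
            next_start = filled[i + 1][1][0]
            filled[i] = (char, (prev_end, next_start))
    return filled

result = fill_null_by_neighbours([("I", (0.0, 0.4)), ("want", None), ("you", (0.9, 1.3))])
print(result)  # [('I', (0.0, 0.4)), ('want', (0.4, 0.9)), ('you', (0.9, 1.3))]
```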
  • a method for automatically generating a voice-over character is provided.
  • the present application also provides an apparatus for automatically generating dubbing text. Since the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief, and for relevant portions reference may be made to the description of the method embodiment.
  • the device embodiments described below are merely illustrative.
  • the device for automatically generating the voice-over text is implemented as follows:
  • FIG. 3 shows a schematic diagram of an apparatus for automatically generating voiceover characters according to an embodiment of the present application.
  • the device for automatically generating a voice-over character includes: an audio recognition unit 301, a text recognition unit 303, a time writing unit 305, and a voice-over character generating unit 307;
  • the audio recognition unit 301 is configured to identify the audio information and acquire the start and end time information of each identified audio basic semantic unit;
  • the text identification unit 303 is configured to acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text;
  • the time writing unit 305 is configured to record start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text;
  • the voice-over character generating unit 307 is configured to process the text basic semantic unit in which the start and end time information is recorded, and generate a voice-over character corresponding to the audio information.
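The cooperation of the four units can be sketched end to end as plain functions (an illustrative sketch: the recognizer is mocked, the data layouts are assumptions, and `dubbing_text` assumes each sentence has at least one timed character):

```python
def audio_recognition(audio):          # unit 301: mocked recognizer
    # Pretend the input is already a char -> [(start, end)] dict.
    return audio

def text_recognition(text):            # unit 303: split sentences into characters
    return [list(sentence) for sentence in text]

def time_writing(chars, recognized):   # unit 305: attach times per character
    return [[(c, recognized.get(c, [None])[0]) for c in sent] for sent in chars]

def dubbing_text(timed_sentences):     # unit 307: one timestamped entry per sentence
    out = []
    for sent in timed_sentences:
        spans = [t for _, t in sent if t]
        out.append(("%.2f" % spans[0][0], "".join(c for c, _ in sent)))
    return out

recognized = audio_recognition({"a": [(0.0, 0.5)], "b": [(0.6, 1.0)]})
output = dubbing_text(time_writing(text_recognition(["ab"]), recognized))
print(output)  # [('0.00', 'ab')]
```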
  • the voice-over character generating unit 307 includes: a text semantic acquisition subunit, a time information determination subunit, and a dubbing text generation subunit;
  • the text semantic acquisition subunit is configured to acquire, for each single sentence in the text information, the text basic semantic units that make up the single sentence;
  • the time information determining subunit is configured to determine start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been acquired;
  • the voice-over character generating sub-unit is configured to integrate the single sentences that determine the start and end time information to form a voice-over text corresponding to the audio information and having start and end time information of each single sentence.
  • the text semantic acquisition subunit is specifically configured to, when acquiring the text basic semantic units making up each single sentence in the text information, respectively form, according to the number of sets of start and end time information, the text basic semantic unit groups making up the single sentence, if at least two sets of start and end time information are recorded in a text basic semantic unit.
  • the device for automatically generating a voice-over text further includes: a text semantic screening sub-unit;
  • the text semantic screening subunit is configured to, after the text basic semantic unit groups making up the single sentence are respectively formed according to the number of sets of start and end time information, screen all the start and end time information of each text basic semantic unit in each group according to a predetermined calculation method, and determine the text basic semantic unit group making up the single sentence.
  • the text semantic screening subunit includes: an error calculation subunit;
  • the error calculation subunit is configured to calculate, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time intervals for each group, and use the sum as the error value of the text basic semantic unit group.
  • the text semantic screening subunit further includes: a filtering subunit;
  • the filtering subunit is configured to filter each of the text basic semantic unit groups, and retain a text basic semantic unit group whose error value is lower than a preset threshold.
  • the text semantic screening subunit further includes: a count calculation subunit;
  • the count calculation subunit is configured to count, after the text basic semantic unit groups whose error value is below the preset threshold are retained, how many times in each retained group the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and to obtain the text basic semantic unit group with the largest count.
  • the text identification unit 303 is specifically configured to: obtain, from the text information, the basic semantic unit of the text in the text information according to the order of each word in each sentence.
  • the time writing unit 305 is specifically configured so that, when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the corresponding text basic semantic unit is given a null value.
  • the device for automatically generating a voice-over text further includes:
  • a time estimating unit, configured to estimate, according to a predetermined estimation manner, the start and end time information of the text basic semantic units whose value is null, after the text basic semantic unit group making up the single sentence is determined;
  • the time estimating unit includes:
  • a start time writing subunit, configured to put the end time of the preceding text basic semantic unit into the start time of the text basic semantic unit whose value is null;
  • a termination time writing subunit, configured to add the average time information to that end time and put the result into the end time of the text basic semantic unit whose value is null.
  • a method for automatically generating a voice-over character and a device for automatically generating a voice-over text are provided.
  • the present application further provides an electronic device; the electronic device implementation is as follows:
  • FIG. 4 shows a schematic diagram of an electronic device provided in accordance with an embodiment of the present application.
  • the electronic device includes: a display 401; a processor 403; a memory 405;
  • the memory 405 is configured to store a voice-over character generating program.
  • When the program is read and executed by the processor, it performs the following operations: identifying the audio information and acquiring the start and end time information of each identified audio basic semantic unit; acquiring text information corresponding to the audio information and recognizing the text information to acquire text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded to generate dubbing text corresponding to the audio information.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media, including both persistent and non-persistent, removable and non-removable media, may store information by any method or technology. The information can be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer readable media does not include transitory computer readable media such as modulated data signals and carrier waves.
  • embodiments of the present application can be provided as a method, system, or computer program product.
  • the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

A method and apparatus for automatically generating dubbing characters, and an electronic device. The method for generating dubbing characters comprises: identifying audio information to acquire starting and ending time information about each identified audio basic semantic unit (S101); acquiring text information corresponding to the audio information, and identifying the text information, so as to acquire a text basic semantic unit (S103); recording the starting and ending time information about each of the audio basic semantic units in the corresponding text basic semantic unit (S105); and processing the text basic semantic unit into which the starting and ending time information is recorded, so as to generate dubbing characters corresponding to the audio information (S107). By means of the method, a dynamic lyrics file can be produced without using a manual method, thereby improving the production efficiency, reducing the production cost, and simplifying the production procedure.

Description

Method, apparatus, and electronic device for automatically generating dubbing text

The present application claims priority to Chinese Patent Application No. 201611196447.6, filed on December 22, 2016 and entitled "Method, Apparatus, and Electronic Device for Automatically Generating Dubbing Text", the entire contents of which are incorporated herein by reference.
Technical field

The present application relates to the field of computer technology, and in particular to a method for automatically generating dubbing text; the present application also relates to an apparatus for automatically generating dubbing text and to an electronic device.

Background

With the development of audio processing technology, users have come to expect more from the listening experience: an audio playback application should not only play audio files but also display the lyric file corresponding to the audio file in synchronization. The synchronized-lyrics function lets people see the lyrics of an audio file while listening to the melody, and it has become one of the essential functions of audio playback applications and players.

To meet this demand, the lyrics currently used for synchronized display during audio playback are mainly produced manually: a person listens to the audio while annotating the lyrics with times, generates a corresponding lyric file for each audio file in the audio file database, and imports the generated lyric files into the audio playback application, so that the corresponding lyric file is displayed in synchronization when an audio file is played.

It can thus be seen that under the existing production scheme for synchronized lyrics, manually generating lyric files is cumbersome, inefficient, and costly. As audio libraries keep expanding, the drawbacks of the manual approach become increasingly serious.
Summary

The present application provides a method for automatically generating dubbing text to solve the above problems in the prior art. The present application also relates to an apparatus for automatically generating dubbing text and to an electronic device.

An embodiment of the present application provides a method for automatically generating dubbing text, the method including:

identifying audio information, and acquiring start and end time information of each identified audio basic semantic unit;

acquiring text information corresponding to the audio information, and recognizing the text information to acquire text basic semantic units;

recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit;

processing the text basic semantic units in which the start and end time information is recorded, to generate dubbing text corresponding to the audio information.
Optionally, the processing of the text basic semantic units in which the start and end time information is recorded, to generate dubbing text corresponding to the audio information, includes:

for each single sentence in the text information, acquiring the text basic semantic units that make up the single sentence;

determining start and end time information of the single sentence according to the start and end time information recorded in the acquired text basic semantic units;

integrating the single sentences whose start and end time information has been determined, to form dubbing text that corresponds to the audio information and carries the start and end time information of each single sentence.

Optionally, when the text basic semantic units making up each single sentence in the text information are acquired, if at least two sets of start and end time information are recorded in a text basic semantic unit, text basic semantic unit groups making up the single sentence are respectively formed according to the number of sets of start and end time information.

Optionally, after the step of respectively forming, according to the number of sets of start and end time information, the text basic semantic unit groups making up the single sentence, the method includes:

screening, according to a predetermined calculation method, all the start and end time information of each text basic semantic unit in each text basic semantic unit group, and determining the text basic semantic unit group making up the single sentence.
Optionally, the predetermined calculation method includes:

calculating, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtaining the sum of these time intervals for each group, and using the sum as the error value of the text basic semantic unit group.

Optionally, the screening of all the start and end time information of each text basic semantic unit in each text basic semantic unit group, to determine the text basic semantic unit group making up the single sentence, includes:

filtering each text basic semantic unit group, and retaining the text basic semantic unit groups whose error value is below a preset threshold.

Optionally, after the step of retaining the text basic semantic unit groups whose error value is below the preset threshold, the method includes:

counting, within each retained text basic semantic unit group, how many times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and obtaining the text basic semantic unit group with the largest count.

Optionally, the recognizing of the text information to acquire text basic semantic units includes:

identifying, from the text information, each word of each sentence in order, to acquire the text basic semantic units in the text information.
Optionally, when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the corresponding text basic semantic unit is given a null value.

Optionally, after the step of determining the text basic semantic unit group making up the single sentence, the method includes:

estimating, according to a predetermined estimation manner, start and end time information for the text basic semantic units whose value is null.

Optionally, the predetermined estimation manner includes:

calculating the average time information of the text basic semantic units in the text basic semantic unit group;

putting the end time of the preceding text basic semantic unit into the start time of the null-valued text basic semantic unit;

adding the average time information to that end time, and putting the result into the end time of the null-valued text basic semantic unit.
Correspondingly, an embodiment of the present application further provides an apparatus for automatically generating dubbing text. The apparatus for automatically generating dubbing text includes:
an audio recognition unit, configured to recognize audio information and obtain the start and end time information of each recognized audio basic semantic unit;
a text recognition unit, configured to acquire text information corresponding to the audio information and recognize the text information, thereby obtaining text basic semantic units;
a time writing unit, configured to record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and
a dubbing text generating unit, configured to process the text basic semantic units in which the start and end time information is recorded, and generate dubbing text corresponding to the audio information.
Optionally, the dubbing text generating unit includes:
a text semantics acquiring subunit, configured to acquire, for each single sentence in the text information, the text basic semantic units constituting the single sentence;
a time information determining subunit, configured to determine the start and end time information of the single sentence according to the start and end time information recorded in the acquired text basic semantic units; and
a dubbing text generating subunit, configured to integrate the single sentences whose start and end time information has been determined, to form dubbing text that corresponds to the audio information and carries the start and end time information of each single sentence.
Optionally, the text semantics acquiring subunit is specifically configured to, when acquiring the text basic semantic units constituting each single sentence in the text information, if at least two sets of start and end time information are recorded in a text basic semantic unit, respectively form the text basic semantic unit groups constituting the single sentence according to the number of sets of start and end time information.
Optionally, the apparatus for automatically generating dubbing text further includes:
a text semantics screening subunit, configured to, after the text basic semantic unit groups constituting the single sentence are respectively formed according to the number of sets of start and end time information, screen all the start and end time information of each text basic semantic unit in each text basic semantic unit group according to a predetermined calculation method, to determine the text basic semantic unit group constituting the single sentence.
Optionally, the text semantics screening subunit includes:
an error calculating subunit, configured to calculate, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time intervals for each text basic semantic unit group, and take the sum as the error value of that text basic semantic unit group.
Optionally, the text semantics screening subunit further includes:
a filtering subunit, configured to filter the text basic semantic unit groups and retain the text basic semantic unit groups whose error values are below a preset threshold.
Optionally, the text semantics screening subunit further includes:
a time count calculating subunit, configured to, after the text basic semantic unit groups whose error values are below the preset threshold are retained, count, within each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and obtain the text basic semantic unit group with the largest count.
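The screening performed by these subunits can be sketched as follows. This is an illustrative reading, not the claimed implementation: a "group" is one combination of time pairs (one pair per text unit, chosen from that unit's recorded sets), times are in milliseconds, and the threshold value is an assumed example.

```python
from itertools import product

def error_value(group):
    """Sum of time gaps between each unit's start and the previous unit's end."""
    return sum(abs(start - prev_end)
               for (_, prev_end), (start, _) in zip(group, group[1:]))

def monotonic_count(group):
    """Number of units whose start time is greater than the previous unit's end."""
    return sum(1 for (_, prev_end), (start, _) in zip(group, group[1:])
               if start > prev_end)

def select_group(time_lists, threshold):
    """time_lists: one list of (start, end) candidates per text unit of a sentence.
    Assumes at least one candidate group survives the threshold filter."""
    candidates = [list(g) for g in product(*time_lists)]
    kept = [g for g in candidates if error_value(g) < threshold]
    # Among the retained groups, pick the one with the most strictly ordered pairs.
    return max(kept, key=monotonic_count)

time_lists = [
    [(1000, 1100)],                  # 我
    [(1200, 1300), (1800, 1900)],    # 想 (occurs twice, two recorded sets)
    [(1400, 1500)],                  # 了
    [(1600, 1700)],                  # 又
    [(1200, 1300), (1800, 1900)],    # 想
]
best = select_group(time_lists, threshold=1000)
print(best[1], best[4])  # (1200, 1300) (1800, 1900)
```

For the duplicated "想", the combination that keeps both occurrences in playback order has the lowest error value and the highest ordered-pair count, so it is selected.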
Optionally, the text recognition unit is specifically configured to identify and obtain, from the text information, the text basic semantic units in the order of each character within each sentence.
Optionally, the time writing unit is specifically configured to, when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, set the value of the corresponding text basic semantic unit to null if the start and end time information of the audio basic semantic unit is null.
Optionally, the apparatus for automatically generating dubbing text further includes:
a time estimating unit, configured to, after the text basic semantic unit group constituting the single sentence is determined, estimate start and end time information in a predetermined estimation manner for each text basic semantic unit whose value is null.
Optionally, the time estimating unit includes:
an average time calculating subunit, configured to calculate the average time information of the text basic semantic units in the text basic semantic unit group;
a start time writing subunit, configured to write the end time of the text basic semantic unit preceding the null-valued text basic semantic unit into the start time of the null-valued text basic semantic unit; and
an end time writing subunit, configured to add the average time information to that end time and write the result into the end time of the null-valued text basic semantic unit.
In addition, an embodiment of the present application further provides an electronic device, including:
a display;
a processor; and
a memory for storing a dubbing text generating program, wherein the program, when read and executed by the processor, performs the following operations: recognizing audio information and obtaining the start and end time information of each recognized audio basic semantic unit; acquiring text information corresponding to the audio information and recognizing the text information, thereby obtaining text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded, to generate dubbing text corresponding to the audio information.
Compared with the prior art, the present application has the following advantages:
The method, apparatus, and electronic device for automatically generating dubbing text provided by the present application recognize audio information to obtain the start and end time information of each recognized audio basic semantic unit; acquire text information corresponding to the audio information and recognize the text information to obtain text basic semantic units; record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and process the text basic semantic units in which the start and end time information is recorded, to generate dubbing text corresponding to the audio information. By performing speech recognition on the audio information, the technical solution obtains the start and end time information of each audio basic semantic unit in the audio information; by recognizing the text information corresponding to the audio information, it determines the number and character forms of the text basic semantic units in each single sentence of the text information, so that the audio basic semantic units recognized from the audio information correspond to the text basic semantic units recognized from the text information. After the correspondence is established, the time information of the corresponding single sentence in the text information is determined according to the start and end time information of each audio basic semantic unit in the audio information, so that each single sentence in the text carries time information. Dynamic lyrics files therefore no longer need to be produced manually, which improves production efficiency, reduces production cost, and simplifies the production process.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description are merely some embodiments described in the present application, and a person of ordinary skill in the art may further obtain other drawings from these accompanying drawings.
FIG. 1 is a flowchart of a method for automatically generating dubbing text according to an embodiment of the present application;
FIG. 2 is a flowchart of processing the text basic semantic units in which the start and end time information is recorded to generate dubbing text corresponding to the audio information, according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an apparatus for automatically generating dubbing text according to an embodiment of the present application; and
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present application.
DETAILED DESCRIPTION
Numerous specific details are set forth in the following description to facilitate a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and a person skilled in the art may make similar generalizations without departing from the essence of the present application; therefore, the present application is not limited by the specific implementations disclosed below.
To make the above objectives, features, and advantages of the present application clearer and easier to understand, the present application is further described in detail below with reference to the accompanying drawings and specific implementations. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
An embodiment of the present application provides a method for automatically generating dubbing text; embodiments of the present application also provide an apparatus for automatically generating dubbing text and an electronic device. Detailed descriptions are given one by one in the following embodiments.
At present, lyrics for synchronized display during audio playback are mainly produced manually: a person listens to the audio while annotating the lyrics with time stamps, generates a corresponding lyrics file for each audio file in the audio file database, and imports the generated lyrics files into the audio playback application, so that when an audio file is played, the corresponding lyrics file is displayed synchronously. It can be seen that, under the existing scheme for producing lyrics for synchronized display during audio playback, manually generating lyrics files is a cumbersome process that is both inefficient and costly. As the scale of audio libraries keeps expanding, the drawbacks of the manual approach become increasingly serious. To address this problem, the technical solution of the present application performs speech recognition on the audio information to obtain the start and end time information of each audio basic semantic unit in the audio information; by recognizing the text information corresponding to the audio information, it determines the number and character forms of the text basic semantic units in each single sentence of the text information, so that the audio basic semantic units recognized from the audio information correspond to the text basic semantic units recognized from the text information. After the correspondence is established, the time information of the corresponding single sentence in the text information is determined according to the start and end time information of each audio basic semantic unit in the audio information, so that the lyrics in the text carry time information, thereby realizing the function of automatically producing a dynamic lyrics file.
Before the specific steps of this embodiment are described in detail, the dynamic lyrics involved in the technical solution are briefly introduced.
Dynamic lyrics are produced by using an editor to align the lyrics with the times at which the song's lyrics occur, so that the lyrics are displayed in sequence and in synchronization when the song is played. Commonly used dynamic lyrics file formats include lrc, qrc, and the like.
"lrc" is an abbreviation of the English word "lyric" and is used as the extension of dynamic lyrics files. A lyrics file with the .lrc extension can be displayed synchronously in various digital players. lrc lyrics are a plain-text, lyrics-specific format containing "tags" of the form "*:*:*", where "*" is a wildcard standing for one or more actual characters; in an actual lyrics file, "*" denotes the time of the lyrics (i.e., the time content), for example, "01:01:00" denotes 1 minute 1 second, and ":" separates the minute, second, and millisecond fields. Such a lyrics file can be viewed and edited with word-processing software (after the file is written in the above format in Notepad, changing its extension to .lrc produces the lyrics file "filename.LRC"). The standard format of an lrc dynamic lyrics line is [minutes:seconds:milliseconds] lyrics.
The lrc lyrics text contains two types of tags:
The first is the identification tag, whose format is "[name:value]" and which mainly includes the following predefined tags:
[ar: artist name], [ti: song title], [al: album name], [by: editor (the producer of the lrc lyrics)].
The second is the time tag, in the form "[mm:ss]" or "[mm:ss.ff]". A time tag must be located at the beginning of a line of lyrics, and one line of lyrics may contain multiple time tags (for example, for a refrain repeated in the lyrics). When playback of the song reaches a certain point in time, the corresponding time tag is looked up and the lyrics text following the tag is displayed, thereby completing the "lyrics synchronization" function.
When an lrc dynamic lyrics file is used, the song file and the lrc dynamic lyrics file are required to have the same file name (i.e., apart from the different extensions such as .mp3, .wma, and .lrc, the text before the dot must be identical) and to be placed in the same directory (i.e., in the same folder); the lyrics can then be displayed synchronously when the song is played with a player capable of displaying lyrics.
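As an illustration of the format described above, a timed line of lrc text can be produced from a millisecond sentence offset as follows; the helper names are our own, and the [mm:ss.ff] variant of the time tag is used.

```python
def lrc_tag(ms):
    """Format a millisecond offset as an lrc time tag [mm:ss.ff]."""
    minutes, rest = divmod(ms, 60_000)
    seconds, millis = divmod(rest, 1000)
    # ff is hundredths of a second, so drop the last millisecond digit.
    return f"[{minutes:02d}:{seconds:02d}.{millis // 10:02d}]"

def lrc_line(start_ms, text):
    """One dynamic-lyrics line: time tag at the start, lyrics text after it."""
    return lrc_tag(start_ms) + text

print(lrc_line(61_000, "我想了又想"))  # [01:01.00]我想了又想
```

Writing one such line per timed single sentence, preceded by the identification tags ([ar:...], [ti:...], etc.), yields a complete lrc file.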
An embodiment of the present application provides a method for generating dubbing text. An embodiment of the method for generating dubbing text is as follows:
Please refer to FIG. 1, which is a flowchart of a method for automatically generating dubbing text according to an embodiment of the present application.
The method for automatically generating dubbing text includes:
Step S101: recognize audio information, and obtain the start and end time information of each recognized audio basic semantic unit.
In this embodiment, recognizing the audio information mainly means converting the speech signal of the audio information into recognizable text information, for example, obtaining, in the form of text information, the recognizable audio basic semantic units converted from the speech signal of the audio information. The audio basic semantic units include Chinese characters, Chinese words, pinyin, numbers, English characters, and/or English words. Specifically, the speech recognition process may adopt a speech recognition method such as statistical pattern recognition.
In a specific implementation, speech recognition may be performed on the audio information by the CMU Sphinx speech recognition system. CMU Sphinx is a large-vocabulary speech recognition system that performs modeling with continuous hidden Markov models (CHMM); it supports multiple modes of operation, including a high-accuracy flat decoder and a fast-search tree decoder.
It should be noted that the text information contains the audio basic semantic units recognized from the audio information and the start and end time information of those audio basic semantic units within the audio information. It can be understood that the audio information may be a song file in mp3 or another music format; an mp3 file is an audio file of a certain duration in which real sound is directly recorded, so when the mp3 file is recognized and the recognized audio basic semantic units are output in the form of text information, the start and end time information of each recognized audio basic semantic unit as played in the audio information is recorded.
In this embodiment, in the text information output after the audio information is recognized, each recognized audio basic semantic unit and its time information are recorded in the following format: <word, TIMECLASS>, where word is the recognized audio basic semantic unit, and TIMECLASS is the time annotation, which records, in the form of a start time and an end time {startTime, endTime}, the time at which the audio basic semantic unit occurs during playback of the audio information, i.e., the offset, in milliseconds, relative to time 0 at the start of playback of the audio information.
The method for generating dubbing text is described below with a specific example. Suppose the audio information is an mp3 file whose playback duration is 10 seconds, and the lyric "我想了又想" ("I thought it over and over") occurs when the mp3 file has been playing for 1 second. Then the recognized audio basic semantic units and their time information recorded in the text information obtained by recognizing the audio information are:
<word:“我”,{startTime:1000,endTime:1100}>;
<word:“想”,{startTime:1200,endTime:1300}>;
<word:“了”,{startTime:1400,endTime:1500}>;
<word:“又”,{startTime:1600,endTime:1700}>;
<word:“想”,{startTime:1800,endTime:1900}>.
It should be noted that, if the audio information is Chinese audio information, each recognized audio basic semantic unit recorded in the text information output after recognition is a single Chinese character; by the same token, if the audio information is English audio information, each recognized audio basic semantic unit recorded in the text information output after recognition is a single English word.
It can be understood that the start and end time information of an audio basic semantic unit is recorded in milliseconds. The lyric "我想了又想" occurs when the mp3 file has been playing for 1 second, and the audio basic semantic unit "我" occurs between 1 second and 1.1 seconds of playback, so the recorded time information of the audio basic semantic unit "我" is {startTime:1000, endTime:1100}.
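The <word, TIMECLASS> records above map naturally onto a small data structure; the following sketch (the class and field names are our own, not part of the embodiment) simply re-expresses the example in that form.

```python
from dataclasses import dataclass

@dataclass
class RecognizedWord:
    word: str
    start_time: int  # offset in ms relative to playback time 0
    end_time: int

# Recognition output for the lyric "我想了又想", as in the example above.
recognized = [
    RecognizedWord("我", 1000, 1100),
    RecognizedWord("想", 1200, 1300),
    RecognizedWord("了", 1400, 1500),
    RecognizedWord("又", 1600, 1700),
    RecognizedWord("想", 1800, 1900),
]
# "我" is sung between 1.0 s and 1.1 s of playback:
print(recognized[0].start_time, recognized[0].end_time)  # 1000 1100
```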
Step S103: acquire text information corresponding to the audio information, and recognize the text information, thereby obtaining text basic semantic units.
In this embodiment, acquiring the text information corresponding to the audio information and recognizing it to obtain the text basic semantic units may be implemented as follows: search the Internet for the text information corresponding to the audio information; after the text information is acquired, recognize each basic semantic unit in the text information, and for each recognized basic semantic unit form a text basic semantic unit whose time information is null, thereby obtaining the text basic semantic units.
It should be noted that a basic semantic unit is a single-character item of information in the text information, including Chinese characters, Chinese words, pinyin, numbers, English characters, and/or English words.
Continuing with the specific example above: the audio information is an mp3 file, and the lyric text corresponding to the mp3 file, whose content is "我想了又想", is found by searching the Internet. After the lyric text corresponding to the mp3 file is acquired, each basic semantic unit in the text information is recognized, and for each recognized basic semantic unit a text basic semantic unit whose time information is null is formed:
<word:“我”,timeList{}>;
<word:“想”,timeList{}>;
<word:“了”,timeList{}>;
<word:“又”,timeList{}>;
<word:“想”,timeList{}>.
Step S105: record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit.
In this embodiment, recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit may be implemented as follows: match each audio basic semantic unit recognized from the audio information against the text basic semantic units formed by recognizing each basic semantic unit in the text information corresponding to the audio information, and put the start and end time information of each audio basic semantic unit into the text basic semantic unit corresponding to that audio basic semantic unit.
For example, the recognized audio basic semantic units and their time information recorded in the text information obtained by recognizing the audio information are:
<word:“我”,{startTime:1000,endTime:1100}>;
<word:“想”,{startTime:1200,endTime:1300}>;
and the text basic semantic units formed with null time information for each basic semantic unit recognized in the text information are:
<word:“我”,timeList{}>;
<word:“想”,timeList{}>;
Since the audio basic semantic units "我" and "想" recognized from the audio information have the same character forms as the text basic semantic units "我" and "想" formed by recognizing the basic semantic units of the lyrics in the lyric text, the start and end time information of the audio basic semantic units "我" and "想" is put into the text basic semantic units "我" and "想":
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>.
It should be noted that the number of occurrences of the same audio basic semantic unit in the audio information may not be unique; for example, in a song, the same character may occur multiple times. Therefore, when step S105 is performed to record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, the case of identical audio basic semantic units may be handled as follows: the start and end time information of the audio basic semantic unit obtained from the audio information is put into every text basic semantic unit identical to that audio basic semantic unit.
Continuing with the specific example above: the recognized audio basic semantic units and their time information recorded in the text information obtained by recognizing the audio information are:
<word:“我”,{startTime:1000,endTime:1100}>;
<word:“想”,{startTime:1200,endTime:1300}>;
<word:“了”,{startTime:1400,endTime:1500}>;
<word:“又”,{startTime:1600,endTime:1700}>;
<word:“想”,{startTime:1800,endTime:1900}>.
After the text information is acquired, each basic semantic unit in it is recognized, and for each recognized basic semantic unit a text basic semantic unit whose time information is null is formed:
<word:“我”,timeList{}>;
<word:“想”,timeList{}>;
<word:“了”,timeList{}>;
<word:“又”,timeList{}>;
<word:“想”,timeList{}>.
Since the audio basic semantic units "我", "想", "了", "又", and "想" recognized from the audio information have the same character forms as the text basic semantic units "我", "想", "了", "又", and "想" formed by extracting the basic semantic units of the lyrics in the lyric text, the start and end time information of the above audio basic semantic units is put into the corresponding text basic semantic units:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>.
Understandably, in the example above, because the character “想” appears twice both in the audio information and in the text, the start/end time information of each occurrence of “想” obtained from the audio information is placed into every basic text semantic unit corresponding to “想”.
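The placement rule above can be sketched in Python as follows. This is a minimal illustration only: the dict-based `word`/`timeList` structures and the helper name `record_times` are assumptions for the example, not the patent's actual implementation.

```python
# Sketch of step S105 for repeated characters: every start/end time pair
# recognized for an audio unit is appended to every text unit with the same glyph.
def record_times(audio_units, text_units):
    for word, start, end in audio_units:
        for unit in text_units:
            if unit["word"] == word:
                unit["timeList"].append((start, end))
    return text_units

audio = [("我", 1000, 1100), ("想", 1200, 1300), ("了", 1400, 1500),
         ("又", 1600, 1700), ("想", 1800, 1900)]
text = [{"word": w, "timeList": []} for w in "我想了又想"]
record_times(audio, text)
print(text[1]["timeList"])  # [(1200, 1300), (1800, 1900)]
```

Both units for “想” end up carrying both time pairs, which is exactly the ambiguity that steps S107-1 to S107-3 later resolve.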
Step S107: process the basic text semantic units in which the start/end time information has been recorded, and generate the dubbing text corresponding to the audio information.
In this embodiment, this can be implemented as follows: for each specific single sentence in the text information, determine the basic text semantic units that compose that sentence; determine the sentence's start/end time information from the start/end time information recorded in those units; and collate the start/end time information of all single sentences to produce dubbing text that corresponds to the audio information and specifies the start/end time information of every single sentence.
It should be noted that, when identifying single sentences in the text information, the individual sentences can be distinguished by the newline characters between them.
Processing the basic text semantic units in which the start/end time information has been recorded to generate the dubbing text corresponding to the audio information specifically comprises steps S107-1 to S107-3, described below with reference to FIG. 2.
Please refer to FIG. 2, which shows a flowchart, provided according to an embodiment of the present application, of processing the basic text semantic units in which the start/end time information has been recorded and generating the dubbing text corresponding to the audio information.
The processing comprises:
Step S107-1: for each single sentence in the text information, obtain the basic text semantic units composing the sentence.
In this embodiment, this can be implemented as follows: separate the single sentences in the text information by newline characters, and for each specific sentence obtain the basic text semantic units composing it.
For example, if the specific single sentences in the text information are “我想” and “你了”, the basic text semantic units composing them are “我” and “想”, and “你” and “了”, respectively, where the units “我” and “想” are:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
and the basic text semantic units “你” and “了” are:
<word:“你”,timeList{startTime:1400,endTime:1500}>;
<word:“了”,timeList{startTime:1600,endTime:1700}>.
Step S107-2: determine the start/end time information of the single sentence from the start/end time information recorded in the obtained basic text semantic units.
In this embodiment, this can be implemented as follows: take the earliest start time among the basic text semantic units composing the sentence as the sentence's start time, take the latest end time among those units as the sentence's end time, and use this start time and end time as the sentence's start/end time information.
For example, the time information of the single sentence “我想”, determined from the time information of the two basic text semantic units above, is:
timeList{startTime:1000,endTime:1300},
and the time information of the single sentence “你了”, determined from the time information of the two basic text semantic units above, is:
timeList{startTime:1400,endTime:1700}.
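Step S107-2 amounts to a minimum over start times and a maximum over end times. A small sketch, under the same assumed dict representation used earlier (the helper name `sentence_span` is hypothetical):

```python
# Sketch of step S107-2: a sentence's start is the earliest start time among
# its units; its end is the latest end time.
def sentence_span(units):
    starts = [s for u in units for (s, e) in u["timeList"]]
    ends = [e for u in units for (s, e) in u["timeList"]]
    return min(starts), max(ends)

wo_xiang = [{"word": "我", "timeList": [(1000, 1100)]},
            {"word": "想", "timeList": [(1200, 1300)]}]
print(sentence_span(wo_xiang))  # (1000, 1300)
```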
Step S107-3: integrate the single sentences whose start/end time information has been determined into dubbing text that corresponds to the audio information and carries the start/end time information of every single sentence.
For example, after the time information of all the single sentences “我想” and “你了” in the text has been determined, the text carrying the time information of these two sentences (i.e., the dynamic lyrics, lrc) is output:
[00:01:00]我想
[00:01:40]你了
Understandably, when the audio information is played and the display time of each single sentence is reached, the corresponding sentence of the dubbing text is displayed.
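The output step can be sketched as follows. The `[mm:ss:xx]` tag layout is inferred from the example above (1000 ms → `[00:01:00]`, i.e., minutes, seconds, hundredths); the exact tag format of the lrc variant is an assumption.

```python
# Sketch of step S107-3: render sentences with resolved start times as
# dynamic-lyrics (lrc-style) lines.
def to_lrc(sentences):
    lines = []
    for text, start_ms in sentences:
        minutes, ms = divmod(start_ms, 60000)
        seconds, hundredths = divmod(ms // 10, 100)
        lines.append(f"[{minutes:02d}:{seconds:02d}:{hundredths:02d}]{text}")
    return "\n".join(lines)

print(to_lrc([("我想", 1000), ("你了", 1400)]))
# [00:01:00]我想
# [00:01:40]你了
```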
In this embodiment, because the same basic audio semantic unit may occur more than once in the audio information (in a song, the same character can appear several times), step S107-1 handles identical basic semantic units as follows when obtaining the basic text semantic units composing a sentence: if a basic text semantic unit records at least two sets of start/end time information, candidate groups of basic text semantic units composing the sentence are formed, one per combination of the recorded sets.
Continuing the earlier example: if the specific single sentence in the text is “我想了又想”, the basic text semantic units “我”, “想”, “了”, “又” and “想” composing it are:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>;
Because the two basic text semantic units “想” composing the single sentence “我想了又想” each carry two sets of time information, the candidate groups of basic text semantic units formed according to the number of sets of start/end time information are the following four. The first group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
The second group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1800,endTime:1900}>;
The third group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1800,endTime:1900}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
The fourth group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1800,endTime:1900}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1800,endTime:1900}>.
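The four groups above are the Cartesian product of each unit's recorded time sets. A sketch of the enumeration (the helper name `candidate_groups` and the dict representation are assumptions for illustration):

```python
# Sketch: when a unit carries several time pairs, form one candidate group per
# combination of time pairs, one pair chosen per unit.
from itertools import product

def candidate_groups(units):
    options = [u["timeList"] for u in units]
    return [list(combo) for combo in product(*options)]

units = [{"word": "我", "timeList": [(1000, 1100)]},
         {"word": "想", "timeList": [(1200, 1300), (1800, 1900)]},
         {"word": "了", "timeList": [(1400, 1500)]},
         {"word": "又", "timeList": [(1600, 1700)]},
         {"word": "想", "timeList": [(1200, 1300), (1800, 1900)]}]
groups = candidate_groups(units)
print(len(groups))  # 4
```

With two ambiguous units of two options each, 2 × 2 = 4 groups result, matching the four groups enumerated above.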
Because the true basic text semantic units of the single sentence should each carry only one set of time information, the candidate groups whose time information is implausible must be filtered out. Therefore, after the step of forming the candidate groups according to the number of sets of start/end time information, the method further comprises the following step:
according to a predetermined calculation method, screen all the start/end time information of the basic text semantic units in each candidate group, and determine the group of basic text semantic units composing the sentence.
In this embodiment, the predetermined calculation method computes as follows: within each candidate group, compute the time gap between the start time of each basic text semantic unit and the end time of the preceding unit; take the sum of these gaps over the group; and use that sum as the group's error value.
It should be noted that the time gap is the interval between a unit's start time and the end time of the preceding unit. When the candidate groups are formed, a unit's start time may be earlier than the preceding unit's end time; to prevent negative gaps from distorting the error value, the positive value of each gap must be taken.
Methods of obtaining the positive value of the gap include taking the absolute value, squaring, and so on; the description below uses squaring. Understandably, since the gap between each unit's start time and the preceding unit's end time is needed, the positive value is obtained by squaring the difference.
Specifically, the mathematical form of the predetermined calculation method is:
error value = (startTime2 - endTime1)² + (startTime3 - endTime2)² + … + (startTimeN - endTimeN-1)²
The calculations for the four groups of time sets above are detailed below. (For ease of illustration, the calculations are carried out in seconds.)
Group 1: (1.2-1.1)² + (1.4-1.3)² + (1.6-1.5)² + (1.2-1.7)² = 0.28
Group 2: (1.2-1.1)² + (1.4-1.3)² + (1.6-1.5)² + (1.8-1.7)² = 0.04
Group 3: (1.8-1.1)² + (1.4-1.9)² + (1.6-1.5)² + (1.2-1.7)² = 1
Group 4: (1.8-1.1)² + (1.4-1.9)² + (1.6-1.5)² + (1.8-1.7)² = 0.76
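The error formula can be sketched directly (the helper name `group_error` is hypothetical; a group is represented as a list of (start, end) pairs in seconds):

```python
# Sketch of the error value: sum over adjacent units of the squared gap
# between a unit's start time and the previous unit's end time.
def group_error(times):
    return sum((times[i][0] - times[i - 1][1]) ** 2 for i in range(1, len(times)))

g1 = [(1.0, 1.1), (1.2, 1.3), (1.4, 1.5), (1.6, 1.7), (1.2, 1.3)]
g2 = [(1.0, 1.1), (1.2, 1.3), (1.4, 1.5), (1.6, 1.7), (1.8, 1.9)]
print(round(group_error(g1), 2), round(group_error(g2), 2))  # 0.28 0.04
```

Running this reproduces the error values of groups 1 and 2 computed above.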
In this embodiment, the preset threshold may be a reasonable value configured empirically by those skilled in the art, or it may be the smallest error value. After the error values are computed, the candidate groups are filtered, and only the groups whose error value is below the preset threshold are kept.
When the preset threshold is the smallest error value, this filtering can be implemented as follows: keep the candidate group with the smallest error value and filter out the other candidate groups.
It should be noted that, when filtering the candidate groups, several groups composing the sentence may share the same error value, in which case filtering by error value alone still cannot yield a single group with only one set of time information. To solve this problem, an embodiment of the present application provides a preferred implementation: after filtering the candidate groups and keeping those whose error value is below the preset threshold, additionally count, within each kept group, how many times the start time of a basic text semantic unit exceeds the end time of the preceding unit, and select the group with the largest count.
This is illustrated with a specific example below.
Suppose the candidate groups composing the sentence also include a fifth group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1600,endTime:1700}>;
Then the error value of the fifth group is:
(1.2-1.1)² + (1.4-1.3)² + (1.6-1.5)² + (1.6-1.7)² = 0.04
After filtering by error value, the candidate groups with the smallest error value are the second group and the fifth group, so the plausibility of these two groups must additionally be judged by the chronological order of the units within the sentence, i.e., by counting how many times the start time of a basic text semantic unit in the sentence exceeds the end time of the preceding unit.
For example, in the second group: the start time of “想” exceeds the end time of the preceding unit “我”; the start time of “了” exceeds the end time of the preceding unit “想”; the start time of “又” exceeds the end time of the preceding unit “了”; and the start time of “想” exceeds the end time of the preceding unit “又”; so the count for the second group is 4. By the same reasoning, the count for the fifth group is 3, so the group of time sets composing the sentence with the count of 4 is selected.
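The tie-break can be sketched as follows (the helper name `ordered_count` is hypothetical; groups are lists of (start, end) pairs in seconds):

```python
# Sketch of the tie-break: count, per group, how often a unit's start time
# exceeds the previous unit's end time, and keep the group with the most.
def ordered_count(times):
    return sum(1 for i in range(1, len(times)) if times[i][0] > times[i - 1][1])

g2 = [(1.0, 1.1), (1.2, 1.3), (1.4, 1.5), (1.6, 1.7), (1.8, 1.9)]
g5 = [(1.0, 1.1), (1.2, 1.3), (1.4, 1.5), (1.6, 1.7), (1.6, 1.7)]
best = max([g2, g5], key=ordered_count)
print(ordered_count(g2), ordered_count(g5))  # 4 3
```

As in the text, the second group wins with a count of 4 against 3 for the fifth group.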
As a preferred embodiment, in the method for automatically generating dubbing text provided by the embodiments of the present application, when step S103 acquires the text information corresponding to the audio information and identifies it to obtain the basic text semantic units, the units are identified from the text information in the order of the characters within each sentence.
As a preferred embodiment, because speech recognition has a limited recognition rate (the audio information cannot always be recognized without error), some basic audio semantic units may go unrecognized when the audio information is identified in step S101. In step S103, by contrast, the text information consists of character strings that a computer can read, so every basic semantic unit in it can be identified and formed into a basic text semantic unit. Therefore, when step S105 records the start/end time information of each basic audio semantic unit into the corresponding basic text semantic unit, if the start/end time information of a basic audio semantic unit is null, the corresponding basic text semantic unit is given a null value.
Understandably, if some basic audio semantic units go unrecognized during recognition of the audio information (i.e., the unit is empty and its start/end time information is null), the number of basic text semantic units formed when step S105 records the start/end time information will exceed the number of basic audio semantic units recognized from the speech, and the start/end time information of the unmatched basic text semantic units is set to null.
For example, the basic audio semantic units recognized from the audio information, and their time information, are:
<word:“我”,{startTime:1000,endTime:1100}>;
<word:“想”,{startTime:1200,endTime:1300}>;
<word:“又”,{startTime:1600,endTime:1700}>;
and the basic text semantic units, formed with null time information for each basic text semantic unit of the lyrics in the lyric text, are:
<word:“我”,timeList{}>;
<word:“想”,timeList{}>;
<word:“了”,timeList{}>;
<word:“又”,timeList{}>;
Because only “我”, “想” and “又” were recognized from the audio information, while the basic text semantic units formed by identifying the lyrics in the lyric text are “我”, “想”, “了” and “又”, the time information of the recognized basic audio semantic units is placed into the corresponding basic text semantic units:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
<word:“了”,timeList{}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>.
As a preferred embodiment, in the method for automatically generating dubbing text provided by the embodiments of the present application, if step S107-1 encounters basic text semantic units whose value is null when obtaining the units composing a sentence, then after the step of determining the group of basic text semantic units composing the sentence, start/end time information is estimated for the null-valued units according to a predetermined estimation method, so that every basic text semantic unit carries start/end time information.
The predetermined estimation method comprises:
computing the average time of the basic text semantic units in the group;
placing the end time of the basic semantic unit preceding the null-valued unit into the null-valued unit's start time;
adding the average time to that end time and placing the result into the null-valued unit's end time.
In this embodiment, computing the average time of the basic text semantic units in the group can be implemented as follows: subtract each unit's start time from its end time to obtain that unit's playing time in the audio information, then divide the sum of the playing times of the sentence's units by the number of units in the sentence to obtain the average time of the basic text semantic units composing the sentence.
Understandably, because the basic text semantic units are formed in the order of the basic semantic units within each single sentence of the text information, a time estimate for a null-valued unit can be made from the end time recorded in the time information of the preceding unit: the end time of the basic text semantic unit preceding the null-valued unit is placed into the null-valued unit's start time, i.e., the end time of the unit adjacent to the null-valued unit is taken as the null-valued unit's start time.
Once the start time of the null-valued unit is determined, its end time is determined from the average playing time of the sentence's units in the audio information, i.e., the determined start time of the null-valued unit plus the average time is placed into the null-valued unit's end time.
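The average-duration fill-in can be sketched as follows (a minimal illustration under the dict representation assumed earlier; the helper name `fill_missing` is hypothetical):

```python
# Sketch of the fill-in rule: a unit with no recognized time takes the previous
# unit's end time as its start, plus the sentence's average unit duration as its end.
def fill_missing(units):
    timed = [u for u in units if u["timeList"]]
    avg = sum(e - s for u in timed for (s, e) in u["timeList"]) // len(timed)
    for i, u in enumerate(units):
        if not u["timeList"]:
            prev_end = units[i - 1]["timeList"][-1][1] if i > 0 else 0
            u["timeList"].append((prev_end, prev_end + avg))
    return units

units = [{"word": "我", "timeList": [(1000, 1100)]},
         {"word": "想", "timeList": [(1200, 1300)]},
         {"word": "了", "timeList": []},
         {"word": "又", "timeList": [(1600, 1700)]}]
fill_missing(units)
print(units[2]["timeList"])  # [(1300, 1400)]
```

Here each recognized unit lasts 100 ms, so the unrecognized “了” is assigned the 100 ms interval starting at the end of “想”.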
It should be noted that, because step S103 identifies the basic text semantic units from the text information in the order of the characters within each sentence, the start/end time information of a null-valued basic text semantic unit can also be estimated in another way: directly take the end time from the time information of the basic text semantic unit preceding the null-valued unit, and the start time from the time information of the basic text semantic unit following it, as the null-valued unit's start time and end time, respectively.
Understandably, because the basic text semantic units are formed in the order of the units within each single sentence of the text, a null-valued unit lies between its adjacent preceding and following units, so its time can be estimated from the end time in the preceding unit's time information and the start time in the following unit's time information.
The above embodiments provide a method for automatically generating dubbing text. Corresponding to that method, the present application also provides an apparatus for automatically generating dubbing text. Since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant details, refer to the description of the method embodiment. The apparatus embodiment described below is merely illustrative. The apparatus embodiment is as follows:
请参考图3,其示出了根据本申请的实施例提供的自动生成配音文字的装置的示意图。Please refer to FIG. 3, which shows a schematic diagram of an apparatus for automatically generating voiceover characters according to an embodiment of the present application.
所述自动生成配音文字的装置,包括:音频识别单元301、文本识别单元303、时间写入单元305以及配音文字生成单元307;The device for automatically generating a voice-over character includes: an audio recognition unit 301, a text recognition unit 303, a time writing unit 305, and a voice-over character generating unit 307;
所述音频识别单元301，用于对音频信息进行识别，获取识别出的各个音频基本语义单位的起止时间信息；The audio recognition unit 301 is configured to recognize the audio information and acquire the start and end time information of each recognized audio basic semantic unit;
所述文本识别单元303,用于获取与所述音频信息对应的文本信息,并识别所述文本信息,从而获取文本基本语义单位;The text identification unit 303 is configured to acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text;
所述时间写入单元305,用于将各个所述音频基本语义单位的起止时间信息,记录到相应的所述文本基本语义单位中;The time writing unit 305 is configured to record start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text;
所述配音文字生成单元307,用于对记录了所述起止时间信息的所述文本基本语义单位进行处理,生成对应所述音频信息的配音文字。The voice-over character generating unit 307 is configured to process the text basic semantic unit in which the start and end time information is recorded, and generate a voice-over character corresponding to the audio information.
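The cooperation of units 301 to 307 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each text basic semantic unit is a single character and that recognition has already produced one (start, end) pair per unit; the function and variable names are not from the patent.

```python
def generate_dubbing_text(sentences, unit_times):
    """End-to-end sketch: `unit_times` maps each text basic semantic
    unit (here a single character) to its recognized (start, end)
    time in the audio.  Each sentence's start/end is taken from its
    first and last timed unit, yielding one timed line per sentence."""
    lines = []
    for sentence in sentences:
        timed = [unit_times[ch] for ch in sentence if ch in unit_times]
        if not timed:
            continue  # no unit of this sentence was recognized
        start, end = timed[0][0], timed[-1][1]
        lines.append((start, end, sentence))
    return lines
```

A real implementation must also handle units recognized at several positions in the audio and units with no timing at all, which is exactly what the screening and estimation subunits described below address.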
可选的，所述配音文字生成单元，包括：文本语义获取子单元、时间信息确定子单元以及配音文字生成子单元；Optionally, the voice-over text generating unit includes: a text semantic acquisition subunit, a time information determination subunit, and a voice-over text generating subunit;
所述文本语义获取子单元，用于针对所述文本信息中每一单句，获取组成所述单句的文本基本语义单位；The text semantic acquisition subunit is configured to acquire, for each single sentence in the text information, the text basic semantic units that constitute the single sentence;
所述时间信息确定子单元,用于根据已获取的所述文本基本语义单位中记录的起止时间信息确定所述单句的起止时间信息;The time information determining subunit is configured to determine start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been acquired;
所述配音文字生成子单元,用于将确定了起止时间信息的所述单句进行整合,形成对应所述音频信息,且具有每一单句的起止时间信息的配音文字。The voice-over character generating sub-unit is configured to integrate the single sentences that determine the start and end time information to form a voice-over text corresponding to the audio information and having start and end time information of each single sentence.
可选的，所述文本语义获取子单元，具体用于针对所述文本信息中每一单句，获取组成所述单句的文本基本语义单位时，若所述文本基本语义单位中记录了至少两组起止时间信息，则按照起止时间信息的组数，分别形成组成所述单句的文本基本语义单位组。Optionally, the text semantic acquisition subunit is specifically configured such that, when acquiring the text basic semantic units constituting each single sentence in the text information, if at least two sets of start and end time information are recorded in a text basic semantic unit, text basic semantic unit groups constituting the single sentence are formed respectively according to the number of sets of start and end time information.
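Forming one text basic semantic unit group per combination of recorded start/end times can be sketched as a Cartesian product over the candidates of each unit. This is an illustrative sketch under an assumed data layout (a mapping from each unit to its list of candidate times), not the patented implementation.

```python
from itertools import product

def candidate_groups(sentence, unit_times):
    """When a text basic semantic unit was recognized at several
    places in the audio, form one candidate group per combination
    of its (start, end) records, as described above.

    `unit_times` maps each unit (here a character) to a list of
    (start, end) candidates; names are illustrative."""
    per_unit = [unit_times[ch] for ch in sentence]
    return [list(combo) for combo in product(*per_unit)]
```

A sentence whose second unit was recognized twice therefore yields two candidate groups, which the screening step below then ranks.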
可选的,所述的自动生成配音文字的装置,还包括:文本语义筛选子单元;Optionally, the device for automatically generating a voice-over text further includes: a text semantic screening sub-unit;
所述文本语义筛选子单元，用于在所述按照起止时间信息的组数，分别形成组成所述单句的文本基本语义单位组之后，根据预定的计算方法，对每一所述文本基本语义单位组中，各个文本基本语义单位的所有起止时间信息进行筛选，确定组成所述单句的文本基本语义单位组。The text semantic screening subunit is configured to, after the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information, screen all the start and end time information of the text basic semantic units in each text basic semantic unit group according to a predetermined calculation method, to determine the text basic semantic unit group that constitutes the single sentence.
可选的，所述文本语义筛选子单元，包括：误差计算子单元；Optionally, the text semantic screening subunit includes: an error calculation subunit;
所述误差计算子单元，用于计算各个所述文本基本语义单位组内，每一文本基本语义单位中的起始时间与所述文本基本语义单位的上一个文本基本语义单位的终止时间之间的时间间距，获取各个所述文本基本语义单位组中所述起始时间与所述终止时间的时间间距的和，将所述时间间距的和作为所述文本基本语义单位组的误差值。The error calculation subunit is configured to calculate, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the previous text basic semantic unit, obtain the sum of these time intervals for each text basic semantic unit group, and use the sum of the time intervals as the error value of that text basic semantic unit group.
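The error value of one candidate group can be sketched as follows. Taking the "time interval" as the absolute gap between each unit's start and the previous unit's end is an interpretation made for this example; the patent text does not fix the exact formula.

```python
def group_error(group):
    """Error value of one candidate group: the sum of the time
    intervals between each unit's start time and the previous
    unit's end time.  Units that follow one another closely in
    the audio yield a small error value."""
    return sum(abs(cur[0] - prev[1]) for prev, cur in zip(group, group[1:]))
```

A group whose units are contiguous in the audio thus has error 0, while a group mixing timings from distant parts of the recording gets a large error.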
可选的，所述文本语义筛选子单元，还包括：过滤子单元；Optionally, the text semantic screening subunit further includes: a filtering subunit;
所述过滤子单元,用于对各个所述文本基本语义单位组进行过滤,保留误差值低于预设的阈值的文本基本语义单位组。The filtering subunit is configured to filter each of the text basic semantic unit groups, and retain a text basic semantic unit group whose error value is lower than a preset threshold.
可选的，所述文本语义筛选子单元，还包括：时间次数计算子单元；Optionally, the text semantic screening subunit further includes: a time count calculation subunit;
所述时间次数计算子单元，用于在所述保留误差值低于预设的阈值的文本基本语义单位组之后，计算保留的所述文本基本语义单位组内，每一文本基本语义单位中的起始时间大于所述文本基本语义单位的上一个文本基本语义单位的终止时间的次数，获取该次数最大的文本基本语义单位组。The time count calculation subunit is configured to, after the text basic semantic unit groups whose error values are below the preset threshold are retained, count, within each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of its previous text basic semantic unit, and obtain the text basic semantic unit group with the largest count.
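The two-stage screening performed by the filtering subunit and the time count calculation subunit can be sketched together. The tie-breaking behavior of `max` (first group wins on equal counts) is an assumption of this sketch, not something the patent specifies.

```python
def select_group(groups, errors, threshold):
    """Screening sketch: keep groups whose error value is below the
    threshold, then among those pick the group with the largest
    number of units whose start time is strictly greater than the
    previous unit's end time (i.e. the most forward-ordered
    candidate).  Returns None if no group survives the filter."""
    kept = [g for g, e in zip(groups, errors) if e < threshold]

    def order_count(group):
        return sum(1 for prev, cur in zip(group, group[1:]) if cur[0] > prev[1])

    return max(kept, key=order_count) if kept else None
```

In the example below both groups pass the threshold, and the strictly forward-ordered group is chosen.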
可选的,所述文本识别单元303,具体用于从所述文本信息中,按照每句内的每个字的顺序进行识别获取所述文本信息中的文本基本语义单位。Optionally, the text identification unit 303 is specifically configured to: obtain, from the text information, the basic semantic unit of the text in the text information according to the order of each word in each sentence.
可选的，所述时间写入单元305，具体用于在将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中时，若所述音频基本语义单位的起止时间信息为空值，则使与所述音频基本语义单位相应的所述文本基本语义单位的取值为空值。Optionally, the time writing unit 305 is specifically configured such that, when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the value of the text basic semantic unit corresponding to that audio basic semantic unit is set to null.
可选的,所述的自动生成配音文字的装置,还包括:Optionally, the device for automatically generating a voice-over text further includes:
时间推算单元，用于在所述确定组成所述单句的文本基本语义单位组之后，按照预定的推算方式，对取值为空值的所述文本基本语义单位推算起止时间信息。A time estimating unit, configured to, after the text basic semantic unit group constituting the single sentence is determined, estimate start and end time information for the null-valued text basic semantic units according to a predetermined estimation manner.
可选的,所述时间推算单元,包括:Optionally, the time estimating unit includes:
平均时间计算子单元,用于计算所述文本基本语义单位组中的文本基本语义单位的平均时间信息;An average time calculation subunit for calculating average time information of a basic semantic unit of text in the text basic semantic unit group;
起始时间写入子单元，用于将取值为空值的所述文本基本语义单位的上一个文本基本语义单位中的终止时间，放入取值为空值的所述文本基本语义单位的起始时间中；A start time writing subunit, configured to put the end time of the previous text basic semantic unit into the start time of the null-valued text basic semantic unit;
终止时间写入子单元，用于将所述终止时间加上所述平均时间信息后，放入取值为空值的所述文本基本语义单位的终止时间中。An end time writing subunit, configured to add the average time information to said end time and put the result into the end time of the null-valued text basic semantic unit.
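The average-time estimation performed by the three subunits above can be sketched as follows. Reading "平均时间信息" as the average duration of the timed units in the group is an interpretation made for this example, and the sketch assumes the first unit of the group is never null (so a previous end time always exists).

```python
def estimate_missing(group):
    """Estimation sketch: the average duration of the timed units
    gives the missing unit's length; its start time is the previous
    unit's end time, and its end time is that start plus the
    average duration."""
    timed = [(s, e) for s, e in group if s is not None]
    avg = sum(e - s for s, e in timed) / len(timed)  # average duration
    filled = []
    for s, e in group:
        if s is None:
            s = filled[-1][1]  # previous unit's end time
            e = s + avg        # end = start + average duration
        filled.append((s, e))
    return filled
```

For a group whose timed units each last 0.4 s, a null-valued unit following one ending at 0.4 s is assigned the span 0.4 to 0.8 s.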
在上述的实施例中,提供了一种自动生成配音文字的方法以及一种自动生成配音文字的装置,此外,本申请还提供了一种电子设备;所述电子设备实施例如下:In the above embodiments, a method for automatically generating a voice-over character and a device for automatically generating a voice-over text are provided. In addition, the present application further provides an electronic device; the electronic device implementation is as follows:
请参考图4,其示出了根据本申请的实施例提供的电子设备的示意图。Please refer to FIG. 4, which shows a schematic diagram of an electronic device provided in accordance with an embodiment of the present application.
所述电子设备,包括:显示器401;处理器403;存储器405;The electronic device includes: a display 401; a processor 403; a memory 405;
所述存储器405，用于存储配音文字生成程序，所述程序在被所述处理器读取执行时，执行如下操作：对音频信息进行识别，获取识别出的各个音频基本语义单位的起止时间信息；获取与所述音频信息对应的文本信息，并识别所述文本信息，从而获取文本基本语义单位；将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中；对记录了所述起止时间信息的所述文本基本语义单位进行处理，生成对应所述音频信息的配音文字。The memory 405 is configured to store a voice-over text generating program which, when read and executed by the processor, performs the following operations: recognizing audio information and acquiring the start and end time information of each recognized audio basic semantic unit; acquiring the text information corresponding to the audio information and recognizing that text information to obtain text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded to generate voice-over text corresponding to the audio information.
在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory. Memory is an example of a computer readable medium.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体，可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带、磁带磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括非暂存电脑可读媒体(transitory media)，如调制的数据信号和载波。Computer readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory media, such as modulated data signals and carrier waves.
本领域技术人员应明白，本申请的实施例可提供为方法、系统或计算机程序产品。因此，本申请可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Thus, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
本申请虽然以较佳实施例公开如上,但其并不是用来限定本申请,任何本领域技术人员在不脱离本申请的精神和范围内,都可以做出可能的变动和修改,因此本申请的保护范围应当以本申请权利要求所界定的范围为准。 The present application is disclosed in the above preferred embodiments, but it is not intended to limit the present application, and any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of the present application. The scope of protection should be based on the scope defined by the claims of the present application.

Claims (23)

  1. 一种自动生成配音文字的方法,其特征在于,包括:A method for automatically generating a voice-over text, comprising:
    对音频信息进行识别,获取识别出的各个音频基本语义单位的起止时间信息;Identifying the audio information, and acquiring the start and end time information of the identified basic semantic units of each audio;
    获取与所述音频信息对应的文本信息,并识别所述文本信息,从而获取文本基本语义单位;Obtaining text information corresponding to the audio information, and identifying the text information, thereby acquiring a basic semantic unit of the text;
    将各个所述音频基本语义单位的起止时间信息,记录到相应的所述文本基本语义单位中;Recording start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text;
    对记录了所述起止时间信息的所述文本基本语义单位进行处理,生成对应所述音频信息的配音文字。The text basic semantic unit in which the start and end time information is recorded is processed to generate a voice-over character corresponding to the audio information.
  2. 根据权利要求1所述的自动生成配音文字的方法,其特征在于,所述对记录了所述起止时间信息的所述文本基本语义单位进行处理,生成对应所述音频信息的配音文字,包括:The method for automatically generating a voice-over character according to claim 1, wherein the processing the basic text semantic unit of the start and stop time information to generate the voice-over text corresponding to the audio information comprises:
    针对所述文本信息中每一单句,获取组成所述单句的文本基本语义单位;Obtaining a basic semantic unit of the text constituting the single sentence for each single sentence in the text information;
    根据已获取的所述文本基本语义单位中记录的起止时间信息,确定所述单句的起止时间信息;Determining start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been obtained;
    将确定了起止时间信息的所述单句进行整合,形成对应所述音频信息,且具有每一单句的起止时间信息的配音文字。The single sentence in which the start and end time information is determined is integrated to form a voice-over character corresponding to the audio information and having start and end time information of each single sentence.
  3. 根据权利要求2所述的自动生成配音文字的方法，其特征在于，所述针对所述文本信息中每一单句，获取组成所述单句的文本基本语义单位时，若所述文本基本语义单位中记录了至少两组起止时间信息，则按照起止时间信息的组数，分别形成组成所述单句的文本基本语义单位组。The method for automatically generating voice-over text according to claim 2, wherein, when acquiring the text basic semantic units constituting each single sentence in the text information, if at least two sets of start and end time information are recorded in a text basic semantic unit, text basic semantic unit groups constituting the single sentence are formed respectively according to the number of sets of start and end time information.
  4. 根据权利要求3所述的自动生成配音文字的方法,其特征在于,在所述按照起止时间信息的组数,分别形成组成所述单句的文本基本语义单位组的步骤之后,包括:The method for automatically generating a voice-over character according to claim 3, wherein after the step of forming the text basic semantic unit group constituting the single sentence in the group number according to the start and end time information, the method comprises:
    根据预定的计算方法,对每一所述文本基本语义单位组中,各个文本基本语义单位的所有起止时间信息进行筛选,确定组成所述单句的文本基本语义单位组。According to a predetermined calculation method, all start and end time information of each text basic semantic unit in each text basic semantic unit group is filtered, and a text basic semantic unit group constituting the single sentence is determined.
  5. 根据权利要求4所述的自动生成配音文字的方法,其特征在于,所述预定的计算方法,包括:The method of automatically generating a dubbed character according to claim 4, wherein the predetermined calculation method comprises:
    计算各个所述文本基本语义单位组内，每一文本基本语义单位中的起始时间与所述文本基本语义单位的上一个文本基本语义单位的终止时间之间的时间间距，获取各个所述文本基本语义单位组中所述起始时间与所述终止时间的时间间距的和，将所述时间间距的和作为所述文本基本语义单位组的误差值。Calculating, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the previous text basic semantic unit, obtaining the sum of these time intervals for each text basic semantic unit group, and using the sum of the time intervals as the error value of that text basic semantic unit group.
  6. 根据权利要求5所述的自动生成配音文字的方法，其特征在于，所述对每一所述文本基本语义单位组中，各个文本基本语义单位的所有起止时间信息进行筛选，确定组成所述单句的文本基本语义单位组，包括：The method for automatically generating voice-over text according to claim 5, wherein the screening of all the start and end time information of the text basic semantic units in each text basic semantic unit group to determine the text basic semantic unit group constituting the single sentence comprises:
    对各个所述文本基本语义单位组进行过滤,保留误差值低于预设的阈值的文本基本语义单位组。Each of the text basic semantic unit groups is filtered, and a text basic semantic unit group whose error value is lower than a preset threshold is retained.
  7. 根据权利要求6所述的自动生成配音文字的方法,其特征在于,在所述保留误差值低于预设的阈值的文本基本语义单位组的步骤之后,包括:The method of automatically generating a dubbed character according to claim 6, wherein after the step of retaining the textual semantic unit group whose retention error value is lower than a preset threshold, the method comprises:
    计算保留的所述文本基本语义单位组内，每一文本基本语义单位中的起始时间大于所述文本基本语义单位的上一个文本基本语义单位的终止时间的次数，获取该次数最大的文本基本语义单位组。Counting, within each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of its previous text basic semantic unit, and obtaining the text basic semantic unit group with the largest count.
  8. 根据权利要求4-7中任意一项所述的自动生成配音文字的方法,其特征在于,所述识别所述文本信息,从而获取文本基本语义单位包括:The method for automatically generating a dubbed character according to any one of claims 4-7, wherein the recognizing the text information to obtain a text basic semantic unit comprises:
    从所述文本信息中,按照每句内的每个字的顺序进行识别获取所述文本信息中的文本基本语义单位。From the text information, the basic semantic unit of the text in the text information is obtained by identifying each word in each sentence.
  9. 根据权利要求8所述的自动生成配音文字的方法，其特征在于，在将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中时，若所述音频基本语义单位的起止时间信息为空值，则使与所述音频基本语义单位相应的所述文本基本语义单位的取值为空值。The method for automatically generating voice-over text according to claim 8, wherein, when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the value of the text basic semantic unit corresponding to that audio basic semantic unit is set to null.
  10. 根据权利要求9所述的自动生成配音文字的方法,其特征在于,在所述确定组成所述单句的文本基本语义单位组的步骤之后,包括:The method of automatically generating a dubbed character according to claim 9, wherein after the step of determining a textual semantic unit group constituting the single sentence, the method comprises:
    按照预定的推算方式,对取值为空值的所述文本基本语义单位推算起止时间信息。The start and end time information is estimated for the basic semantic unit of the text whose value is a null value according to a predetermined calculation manner.
  11. 根据权利要求10所述的自动生成配音文字的方法,其特征在于,所述预定的推算方式,包括:The method for automatically generating a voice-over character according to claim 10, wherein the predetermined calculation method comprises:
    计算所述文本基本语义单位组中的文本基本语义单位的平均时间信息;Calculating average time information of basic semantic units of text in the basic semantic unit group of the text;
    将取值为空值的所述文本基本语义单位的上一个文本基本语义单位中的终止时间，放入取值为空值的所述文本基本语义单位的起始时间中；Putting the end time of the previous text basic semantic unit into the start time of the null-valued text basic semantic unit;
    将所述终止时间加上所述平均时间信息后,放入取值为空值的所述文本基本语义单位的终止时间中。After the termination time is added to the average time information, the end time of the basic semantic unit of the text whose value is null is placed.
  12. 一种自动生成配音文字的装置,其特征在于,包括: An apparatus for automatically generating a voice-over character, comprising:
    音频识别单元,用于对音频信息进行识别,获取识别出的各个音频基本语义单位的起止时间信息;An audio recognition unit, configured to identify the audio information, and obtain the start and end time information of the identified basic audio semantic units of each audio;
    文本识别单元,用于获取与所述音频信息对应的文本信息,并识别所述文本信息,从而获取文本基本语义单位;a text recognition unit, configured to acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text;
    时间写入单元,用于将各个所述音频基本语义单位的起止时间信息,记录到相应的所述文本基本语义单位中;a time writing unit, configured to record start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text;
    配音文字生成单元,用于对记录了所述起止时间信息的所述文本基本语义单位进行处理,生成对应所述音频信息的配音文字。The voice-over character generating unit is configured to process the text basic semantic unit in which the start and end time information is recorded, and generate a voice-over text corresponding to the audio information.
  13. 根据权利要求12所述的自动生成配音文字的装置,其特征在于,所述配音文字生成单元,包括:The apparatus for automatically generating a voice-sending character according to claim 12, wherein the voice-over character generating unit comprises:
    文本语义获取子单元,用于针对所述文本信息中每一单句,获取组成所述单句的文本基本语义单位;a text semantic acquisition subunit, configured to obtain, for each single sentence in the text information, a basic semantic unit of text constituting the single sentence;
    时间信息确定子单元,用于根据已获取的所述文本基本语义单位中记录的起止时间信息确定所述单句的起止时间信息;a time information determining subunit, configured to determine start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been acquired;
    配音文字生成子单元,用于将确定了起止时间信息的所述单句进行整合,形成对应所述音频信息,且具有每一单句的起止时间信息的配音文字。The voice-over character generating sub-unit is configured to integrate the single sentences that determine the start and end time information to form a voice-over text corresponding to the audio information and having start and end time information of each single sentence.
  14. 根据权利要求13所述的自动生成配音文字的装置,其特征在于,所述文本语义获取子单元,具体用于针对所述文本信息中每一单句,获取组成所述单句的文本基本语义单位时,若所述文本基本语义单位中记录了至少两组起止时间信息,则按照起止时间信息的组数,分别形成组成所述单句的文本基本语义单位组。The apparatus for automatically generating a voice-over character according to claim 13, wherein the text semantic acquisition sub-unit is specifically configured to acquire a basic semantic unit of a text constituting the single sentence for each single sentence in the text information. If at least two sets of start and end time information are recorded in the basic semantic unit of the text, the text basic semantic unit group constituting the single sentence is respectively formed according to the number of groups of the start and end time information.
  15. 根据权利要求14所述的自动生成配音文字的装置,其特征在于,还包括:The apparatus for automatically generating a voice-over character according to claim 14, further comprising:
    文本语义筛选子单元，用于在所述按照起止时间信息的组数，分别形成组成所述单句的文本基本语义单位组之后，根据预定的计算方法，对每一所述文本基本语义单位组中，各个文本基本语义单位的所有起止时间信息进行筛选，确定组成所述单句的文本基本语义单位组。A text semantic screening subunit, configured to, after the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information, screen all the start and end time information of the text basic semantic units in each text basic semantic unit group according to a predetermined calculation method, to determine the text basic semantic unit group that constitutes the single sentence.
  16. 根据权利要求15所述的自动生成配音文字的装置,其特征在于,所述文本语义筛选子单元,包括:The apparatus for automatically generating a voice-over character according to claim 15, wherein the text semantic screening sub-unit comprises:
    误差计算子单元，用于计算各个所述文本基本语义单位组内，每一文本基本语义单位中的起始时间与所述文本基本语义单位的上一个文本基本语义单位的终止时间之间的时间间距，获取各个所述文本基本语义单位组中所述起始时间与所述终止时间的时间间距的和，将所述时间间距的和作为所述文本基本语义单位组的误差值。An error calculation subunit, configured to calculate, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the previous text basic semantic unit, obtain the sum of these time intervals for each text basic semantic unit group, and use the sum of the time intervals as the error value of that text basic semantic unit group.
  17. 根据权利要求15所述的自动生成配音文字的装置,其特征在于,所述文本语义筛选子单元,还包括:The apparatus for automatically generating a voice-over character according to claim 15, wherein the text semantic screening sub-unit further comprises:
    过滤子单元,用于对各个所述文本基本语义单位组进行过滤,保留误差值低于预设的阈值的文本基本语义单位组。The filtering subunit is configured to filter each of the text basic semantic unit groups, and retain a text basic semantic unit group whose error value is lower than a preset threshold.
  18. 根据权利要求17所述的自动生成配音文字的装置,其特征在于,所述文本语义筛选子单元,还包括:The apparatus for automatically generating a voice-over character according to claim 17, wherein the text semantic screening sub-unit further comprises:
    时间次数计算子单元，用于在所述保留误差值低于预设的阈值的文本基本语义单位组之后，计算保留的所述文本基本语义单位组内，每一文本基本语义单位中的起始时间大于所述文本基本语义单位的上一个文本基本语义单位的终止时间的次数，获取该次数最大的文本基本语义单位组。A time count calculation subunit, configured to, after the text basic semantic unit groups whose error values are below the preset threshold are retained, count, within each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of its previous text basic semantic unit, and obtain the text basic semantic unit group with the largest count.
  19. 根据权利要求15-18中任意一项所述的自动生成配音文字的装置，其特征在于，所述文本识别单元，具体用于从所述文本信息中，按照每句内的每个字的顺序进行识别获取所述文本信息中的文本基本语义单位。The apparatus for automatically generating voice-over text according to any one of claims 15-18, wherein the text recognition unit is specifically configured to recognize, from the text information, the text basic semantic units in the text information in the order of each word within each sentence.
  20. 根据权利要求19所述的自动生成配音文字的装置，其特征在于，所述时间写入单元，具体用于在将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中时，若所述音频基本语义单位的起止时间信息为空值，则使与所述音频基本语义单位相应的所述文本基本语义单位的取值为空值。The apparatus for automatically generating voice-over text according to claim 19, wherein the time writing unit is specifically configured such that, when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the value of the text basic semantic unit corresponding to that audio basic semantic unit is set to null.
  21. 根据权利要求20所述的自动生成配音文字的装置,其特征在于,还包括:The apparatus for automatically generating a voice-over character according to claim 20, further comprising:
    时间推算单元,用于在所述确定组成所述单句的文本基本语义单位组之后,按照预定的推算方式,对取值为空值的所述文本基本语义单位推算起止时间信息。And a time estimating unit, configured to calculate start and end time information on the basic semantic unit of the text whose value is a null value, after determining the text basic semantic unit group constituting the single sentence, according to a predetermined calculation manner.
  22. 根据权利要求21所述的自动生成配音文字的装置,其特征在于,所述时间推算单元,包括:The apparatus for automatically generating a dubbed character according to claim 21, wherein the time estimating unit comprises:
    平均时间计算子单元,用于计算所述文本基本语义单位组中的文本基本语义单位的平均时间信息;An average time calculation subunit for calculating average time information of a basic semantic unit of text in the text basic semantic unit group;
    起始时间写入子单元，用于将取值为空值的所述文本基本语义单位的上一个文本基本语义单位中的终止时间，放入取值为空值的所述文本基本语义单位的起始时间中；A start time writing subunit, configured to put the end time of the previous text basic semantic unit into the start time of the null-valued text basic semantic unit;
    终止时间写入子单元，用于将所述终止时间加上所述平均时间信息后，放入取值为空值的所述文本基本语义单位的终止时间中。An end time writing subunit, configured to add the average time information to said end time and put the result into the end time of the null-valued text basic semantic unit.
  23. 一种电子设备,其特征在于,所述电子设备包括: An electronic device, comprising:
    显示器;monitor;
    处理器;processor;
    存储器，用于存储配音文字生成程序，所述程序在被所述处理器读取执行时，执行如下操作：对音频信息进行识别，获取识别出的各个音频基本语义单位的起止时间信息；获取与所述音频信息对应的文本信息，并识别所述文本信息，从而获取文本基本语义单位；将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中；对记录了所述起止时间信息的所述文本基本语义单位进行处理，生成对应所述音频信息的配音文字。A memory for storing a voice-over text generating program which, when read and executed by the processor, performs the following operations: recognizing audio information and acquiring the start and end time information of each recognized audio basic semantic unit; acquiring the text information corresponding to the audio information and recognizing that text information to obtain text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded to generate voice-over text corresponding to the audio information.
PCT/CN2017/115194 2016-12-22 2017-12-08 Method and apparatus for automatically generating dubbing characters, and electronic device WO2018113535A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611196447.6A CN108228658B (en) 2016-12-22 2016-12-22 Method and device for automatically generating dubbing characters and electronic equipment
CN201611196447.6 2016-12-22

Publications (1)

Publication Number Publication Date
WO2018113535A1 true WO2018113535A1 (en) 2018-06-28

Family

ID=62624697

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/115194 WO2018113535A1 (en) 2016-12-22 2017-12-08 Method and apparatus for automatically generating dubbing characters, and electronic device

Country Status (3)

Country Link
CN (1) CN108228658B (en)
TW (1) TWI749045B (en)
WO (1) WO2018113535A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110858492A (en) * 2018-08-23 2020-03-03 阿里巴巴集团控股有限公司 Audio editing method, device, equipment and system and data processing method
CN110728116B (en) * 2019-10-23 2023-12-26 深圳点猫科技有限公司 Method and device for generating video file dubbing manuscript
CN113571061B (en) * 2020-04-28 2024-12-13 阿里巴巴集团控股有限公司 Speech transcription text editing system, method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6345252B1 (en) * 1999-04-09 2002-02-05 International Business Machines Corporation Methods and apparatus for retrieving audio information using content and speaker information
CN1949227A (en) * 2006-10-24 2007-04-18 北京搜狗科技发展有限公司 Searching method, system and apparatus for playing media file
CN101616264A (en) * 2008-06-27 2009-12-30 中国科学院自动化研究所 News Video Cataloging Method and System
CN104599693A (en) * 2015-01-29 2015-05-06 语联网(武汉)信息技术有限公司 Preparation method of lines synchronized subtitles

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4026543B2 (en) * 2003-05-26 2007-12-26 日産自動車株式会社 Vehicle information providing method and vehicle information providing device
CN101615417B (en) * 2009-07-24 2011-01-26 北京海尔集成电路设计有限公司 Synchronous Chinese lyrics display method which is accurate to words
GB2502944A (en) * 2012-03-30 2013-12-18 Jpal Ltd Segmentation and transcription of speech
CN204559707U (en) * 2015-04-23 2015-08-12 南京信息工程大学 Teleprompter with speech recognition function
CN105788589B (en) * 2016-05-04 2021-07-06 腾讯科技(深圳)有限公司 Audio data processing method and device

Also Published As

Publication number Publication date
CN108228658B (en) 2022-06-03
TW201832222A (en) 2018-09-01
TWI749045B (en) 2021-12-11
CN108228658A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
EP1693829B1 (en) Voice-controlled data system
US8666727B2 (en) Voice-controlled data system
Rubin et al. Content-based tools for editing audio stories
TW202008349A (en) Speech labeling method and apparatus, and device
WO2017157142A1 (en) Song melody information processing method, server and storage medium
KR101292698B1 (en) Method and apparatus for attaching metadata
US7522967B2 (en) Audio summary based audio processing
WO2018229693A1 (en) Method and system for automatically generating lyrics of a song
US20080077869A1 (en) Conference supporting apparatus, method, and computer program product
WO2011146366A1 (en) Methods and systems for performing synchronization of audio with corresponding textual transcriptions and determining confidence values of the synchronization
WO2018113535A1 (en) Method and apparatus for automatically generating dubbing characters, and electronic device
US10140310B1 (en) Identifying and utilizing synchronized content
CN109213977A System for generating court trial records
US8706484B2 (en) Voice recognition dictionary generation apparatus and voice recognition dictionary generation method
CN102881282A (en) Method and system for obtaining prosodic boundary information
JP4697432B2 (en) Music playback apparatus, music playback method, and music playback program
US20100222905A1 (en) Electronic apparatus with an interactive audio file recording function and method thereof
EP1826686B1 (en) Voice-controlled multimedia retrieval system
CN114999464A (en) Voice data processing method and device
CN115329125A Song medley splicing method and device
CN110781651A Method for inserting pauses in text-to-speech conversion
JP2007272815A (en) Content server apparatus, genre setting method and computer program
CN101266790A Device and method for automatically time-stamping text files
CN108648733B (en) A method and system for generating Diqu
CN110895575A (en) Audio processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17883425

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17883425

Country of ref document: EP

Kind code of ref document: A1