
WO2018113535A1 - Method and apparatus for automatically generating dubbing characters, and electronic device - Google Patents


Info

Publication number
WO2018113535A1
WO2018113535A1 (PCT/CN2017/115194)
Authority
WO
WIPO (PCT)
Prior art keywords
text
semantic unit
basic semantic
information
basic
Prior art date
Application number
PCT/CN2017/115194
Other languages
French (fr)
Chinese (zh)
Inventor
阳鹤翔 (Yang Hexiang)
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Publication of WO2018113535A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 — Information retrieval of audio data
    • G06F 16/68 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 — Retrieval using metadata automatically derived from the content
    • G06F 16/685 — Retrieval using an automatically derived transcript of audio data, e.g. lyrics
    • G — PHYSICS
    • G11 — INFORMATION STORAGE
    • G11B — INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 — Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 — Indexing; Addressing; Timing or synchronising; Measuring tape travel

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method for automatically generating a voice-over text; the present application also relates to an apparatus for automatically generating a voice-over text and an electronic device.
  • The synchronized-lyrics display function lets listeners see the lyrics of an audio file while enjoying its melody, and it has become one of the essential features of audio playback applications and players.
  • At present, the lyric files used for synchronized display during audio playback are produced mainly by hand: the lyrics are marked with time stamps, a corresponding lyric file is generated for each audio file in the audio file database, and the generated lyric files are imported into the audio playback application, so that when an audio file is played, the corresponding lyric file is displayed in synchronization.
  • In view of this, the present application provides a method for automatically generating voice-over text to solve the above problems in the prior art; the present application also relates to an apparatus for automatically generating voice-over text and an electronic device.
  • the embodiment of the present application provides a method for automatically generating a voice-over text, and the method for automatically generating a voice-over text includes:
  • the text basic semantic units in which the start and end time information has been recorded are processed to generate the voice-over text corresponding to the audio information.
  • Optionally, processing the text basic semantic units that record the start and end time information and generating the voice-over text corresponding to the audio information includes: integrating the single sentences whose start and end time information has been determined to form voice-over text that corresponds to the audio information and carries the start and end time information of each single sentence.
  • Optionally, if at least two sets of start and end time information are recorded in a text basic semantic unit, text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information.
  • Optionally, the method further includes: filtering, according to a predetermined calculation method, all start and end time information of each text basic semantic unit in each text basic semantic unit group, and determining the text basic semantic unit group that constitutes the single sentence.
  • Optionally, filtering all start and end time information of each text basic semantic unit in each text basic semantic unit group and determining the text basic semantic unit group that constitutes the single sentence includes: filtering each text basic semantic unit group and retaining the text basic semantic unit groups whose error value is lower than a preset threshold.
  • Optionally, the method further includes: counting, for each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and selecting the text basic semantic unit group with the largest count.
  • Optionally, identifying the text information to obtain the text basic semantic units includes: obtaining the text basic semantic units in the text information by identifying each word of each sentence.
  • Optionally, when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the value of the text basic semantic unit corresponding to that audio basic semantic unit is set to null.
  • Optionally, the method further includes: estimating, according to a predetermined calculation manner, start and end time information for the text basic semantic units whose value is a null value.
  • Optionally, the predetermined calculation manner includes: using the end time recorded in the preceding text basic semantic unit as the start time of the null-valued text basic semantic unit, and adding the average duration of the text basic semantic units to that start time to obtain its end time.
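The estimation just described can be sketched as follows. This is a minimal illustration under assumed data shapes (a unit as a dict with `word`, `start`, `end` in milliseconds), not the patented implementation: a null-valued unit takes the preceding unit's end time as its start time and adds the average unit duration to obtain its end time.

```python
def fill_null_times(units):
    """Estimate missing start/end times for text basic semantic units.

    Each unit is a dict {"word": str, "start": int or None, "end": int or None},
    with times in milliseconds. A null-valued unit receives:
      start = end time of the preceding unit (0 if it is the first unit)
      end   = start + average duration of the units that do have times
    """
    timed = [u for u in units if u["start"] is not None and u["end"] is not None]
    if not timed:
        return units
    avg_duration = sum(u["end"] - u["start"] for u in timed) // len(timed)
    for i, u in enumerate(units):
        if u["start"] is None:
            # start time: end time recorded in the preceding unit
            u["start"] = units[i - 1]["end"] if i > 0 else 0
            # end time: start time plus the average unit duration
            u["end"] = u["start"] + avg_duration
    return units
```

Because the units are filled in order, a run of consecutive null-valued units chains off the most recently estimated end time.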
  • the embodiment of the present application further provides an apparatus for automatically generating a voice-over text
  • the apparatus for automatically generating a voice-over text includes:
  • An audio recognition unit configured to identify the audio information, and obtain the start and end time information of the identified basic audio semantic units of each audio
  • a text recognition unit configured to acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text
  • a time writing unit configured to record start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text
  • a voice-over text generating unit, configured to process the text basic semantic units in which the start and end time information is recorded and generate the voice-over text corresponding to the audio information.
  • the voice-over text generating unit includes:
  • a text semantic acquisition subunit configured to obtain, for each single sentence in the text information, a basic semantic unit of text constituting the single sentence
  • a time information determining subunit, configured to determine the start and end time information of the single sentence according to the start and end time information recorded in the acquired text basic semantic units;
  • a voice-over text generating subunit, configured to integrate the single sentences whose start and end time information has been determined to form voice-over text that corresponds to the audio information and carries the start and end time information of each single sentence.
  • Optionally, the text semantic acquisition subunit is specifically configured to: when obtaining, for each single sentence in the text information, the text basic semantic units constituting the single sentence, if at least two sets of start and end time information are recorded in a text basic semantic unit, form the text basic semantic unit groups constituting the single sentence according to the number of sets of start and end time information.
  • the device for automatically generating a voice-over text further includes:
  • a text semantic screening subunit, configured to: after the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information, filter, according to a predetermined calculation method, all start and end time information of each text basic semantic unit in each text basic semantic unit group, and determine the text basic semantic unit group that constitutes the single sentence.
  • the text semantic screening subunit includes:
  • an error calculation subunit, configured to calculate, for each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time intervals within the group, and use the sum of the time intervals as the error value of the text basic semantic unit group.
  • the text semantic screening subunit further includes:
  • the filtering subunit is configured to filter each of the text basic semantic unit groups, and retain a text basic semantic unit group whose error value is lower than a preset threshold.
  • the text semantic screening subunit further includes:
  • a time count calculation subunit, configured to: after the text basic semantic unit groups whose error value is lower than the preset threshold have been retained, count, for each retained group, the number of times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and select the text basic semantic unit group with the largest count.
  • the text recognition unit is specifically configured to: obtain, from the text information, the basic semantic unit of the text in the text information according to the order of each word in each sentence.
  • Optionally, the time writing unit is specifically configured to: when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, set the value of the corresponding text basic semantic unit to null if the start and end time information of the audio basic semantic unit is a null value.
  • the device for automatically generating a voice-over text further includes:
  • a time estimating unit, configured to: after the text basic semantic unit group constituting the single sentence is determined, estimate start and end time information for the text basic semantic units whose value is a null value according to a predetermined calculation manner.
  • the time estimating unit includes:
  • a start time writing subunit, configured to write the end time recorded in the preceding text basic semantic unit into a text basic semantic unit whose value is a null value, as its start time;
  • an end time writing subunit, configured to add the average duration to that start time and write the result into the text basic semantic unit whose value is a null value, as its end time.
  • an electronic device including:
  • a memory for storing a voice-over text generating program which, when read and executed by the processor, performs the following operations: identifying the audio information and acquiring the start and end time information of each identified audio basic semantic unit; acquiring the text information corresponding to the audio information and identifying it, thereby acquiring the text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded to generate the voice-over text corresponding to the audio information.
  • Compared with the prior art, the method, apparatus and electronic device for automatically generating voice-over text provided by the present application identify the audio information to obtain the start and end time information of each identified audio basic semantic unit; acquire and identify the text information corresponding to the audio information to obtain the text basic semantic units; record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and process the text basic semantic units in which the start and end time information is recorded to generate the voice-over text corresponding to the audio information.
  • By performing speech recognition on the audio information, the technical solution acquires the start and end time information of each audio basic semantic unit in the audio information; by identifying the text information corresponding to the audio information, it determines the number and glyphs of the text basic semantic units in each single sentence. The audio basic semantic units identified in the audio information are matched with the text basic semantic units identified in the text information, and once this correspondence is established, the time information of each single sentence in the text is determined from the start and end time information of the audio basic semantic units. As a result, every single sentence in the text carries time information, so dynamic lyric files no longer need to be produced manually, which improves production efficiency, reduces production cost, and simplifies the production process.
  • FIG. 1 shows a flowchart of a method for automatically generating voice-over text provided in accordance with an embodiment of the present application;
  • FIG. 2 is a flowchart showing the processing of the text basic semantic units in which the start and end time information is recorded and the generation of the voice-over text corresponding to the audio information, according to an embodiment of the present application;
  • FIG. 3 shows a schematic diagram of an apparatus for automatically generating voice-over text provided in accordance with an embodiment of the present application; and
  • FIG. 4 shows a schematic diagram of an electronic device provided in accordance with an embodiment of the present application.
  • the embodiment of the present application provides a method for automatically generating a voice-over text, and an embodiment of the present application simultaneously provides an apparatus for automatically generating a voice-over text and an electronic device. Detailed description will be made one by one in the following embodiments.
  • At present, the lyric files used for synchronized display during audio playback are produced mainly by hand.
  • While listening to the audio, an operator marks the lyrics with time stamps, generates a corresponding lyric file for each audio file in the audio file database, and imports the generated lyric files into the audio playback application, so that when an audio file is played, the corresponding lyric file is displayed in synchronization.
  • This manual process of generating lyric files is cumbersome, inefficient, and costly, and as audio databases grow, the drawbacks of the manual method become more and more serious.
  • The technical solution of the present application performs speech recognition on the audio information to obtain the start and end time information of each audio basic semantic unit in the audio information, and identifies the text information corresponding to the audio information to determine the number and glyphs of the text basic semantic units in each single sentence, so that the audio basic semantic units identified in the audio information can be matched with the text basic semantic units identified in the text information.
  • Dynamic lyrics are lyrics that an editor has sorted according to the time at which each line appears in the song, so that the lyrics can be displayed in synchronization when the song is played.
  • Commonly used dynamic lyrics files include: lrc, qrc, etc.
  • "lrc" is an abbreviation of the English word "lyric" and is used as the extension of dynamic lyric files.
  • the lyrics file with the extension lrc can be displayed synchronously in various digital players.
  • An lrc lyric file is plain text containing "tags" of the form "*:*:*", where "*" is a wildcard standing for one or more real characters: the tag records the time at which a lyric line appears (for example, "01:01:00" means 1 minute and 1 second), and ":" separates the minute, second, and millisecond fields.
  • Such a lyric file can be viewed and edited with ordinary word-processing software: after the content is written in this format with Notepad, the extension can be changed to lrc to produce a "filename.lrc" lyric file.
  • The standard format of a line in an lrc dynamic lyric file is [minutes:seconds:milliseconds] lyrics.
  • The lrc lyric text contains two types of tags:
  • The first is the identification tag, whose format is "[identifier:value]" and which mainly covers a set of predefined labels.
  • The second is the time tag, whose form is "[mm:ss]" or "[mm:ss.ff]"; a time tag must appear at the beginning of a lyric line, and a single line of lyrics may contain multiple time tags (for example, when the same line recurs at several points in the song).
  • An lrc dynamic lyric file must have the same file name as its song (that is, apart from the extensions such as .mp3, .wma, and .lrc, the text before the dot must be exactly the same) and be placed in the same directory (that is, in the same folder); the lyrics can then be displayed in synchronization when the song is played with a player that supports lyric display.
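The time-tag convention above can be illustrated with a short sketch that renders one timed lyric line in the [mm:ss.xx] form; the function name and input shape are illustrative assumptions, not part of the lrc convention itself.

```python
def to_lrc_line(start_ms, text):
    """Render one lrc lyric line as [mm:ss.xx]text, xx = hundredths of a second."""
    minutes, rem = divmod(start_ms, 60_000)   # whole minutes, remainder in ms
    seconds, ms = divmod(rem, 1_000)          # whole seconds, remainder in ms
    return f"[{minutes:02d}:{seconds:02d}.{ms // 10:02d}]{text}"
```

For example, a line that starts at 61 000 ms renders as `[01:01.00]…`, matching the "1 minute and 1 second" example above.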
  • An embodiment of the present application provides a method for generating a voice-over text, and the method for generating a voice-over text is implemented as follows:
  • FIG. 1 illustrates a flowchart of the method for automatically generating voice-over text provided in accordance with an embodiment of the present application.
  • the method for automatically generating a voice-over text includes:
  • Step S101 Identify the audio information, and obtain the start and end time information of the identified basic audio semantic units of each audio.
  • Identifying the audio information mainly means converting the speech signal of the audio information into recognizable text information; that is, the speech signal of the audio information is converted into recognizable audio basic semantic units, which are acquired in the form of text information.
  • the audio basic semantic units include: Chinese characters, Chinese words, pinyin, numbers, English characters, and/or English words.
  • the speech recognition process may adopt a speech recognition method such as a statistical pattern recognition technology.
  • the audio information may be voice-recognized by a CMU-Sphinx speech recognition system.
  • CMU Sphinx is a large-vocabulary speech recognition system modeled with continuous hidden Markov models (CHMM). It supports multiple modes of operation, including a high-precision flat decoder and a fast-search tree decoder.
  • the text information includes an audio basic semantic unit recognized from the audio information and start and end time information of the audio basic semantic unit in the audio information.
  • For example, the audio information may be a song file in mp3 or another music format.
  • Since an mp3 file directly records real sound over a period of time, when the mp3 file is recognized, the recognized audio basic semantic units are output in the form of text information, and the start and end time information of each recognized audio basic semantic unit, as played within the audio information, is recorded.
  • In the text information output after the audio information is identified, each recognized audio basic semantic unit and its time information are recorded in the form <word, TIMECLASS>.
  • Here "word" is the recognized audio basic semantic unit, and "TIMECLASS" is its time annotation, recorded as a start time and an end time {startTime, endTime}.
  • The time information is the offset, in milliseconds, relative to time 0 at which playback of the audio information starts.
  • For example, assume the audio information is an mp3 file whose playback length is 10 seconds, and the lyric "I think and think again" appears when the mp3 file has played for 1 second.
  • The recognized audio basic semantic units and their time information, recorded in the text information obtained by identifying the audio information, are:
  • Since the audio information in this example is in Chinese, each recognized audio basic semantic unit recorded in the output text information is a single Chinese character; similarly, if the audio information were in English, each recorded audio basic semantic unit would be a single English word.
  • The start and end time information of each audio basic semantic unit is recorded in milliseconds. Since the lyric "I think and think" appears when the mp3 file has played for 1 second, the audio basic semantic unit "I" appears while the mp3 file plays from 1 second to 1.1 seconds, so the recorded time information of the audio basic semantic unit "I" is {startTime: 1000, endTime: 1100}.
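The <word, TIMECLASS> record and the millisecond offsets above can be modeled directly; the class name below is an assumption for illustration only.

```python
from dataclasses import dataclass

@dataclass
class AudioUnit:
    """One recognized audio basic semantic unit: <word, {startTime, endTime}>."""
    word: str
    start_ms: int  # offset from playback time 0, in milliseconds
    end_ms: int

# The lyric "I" appears between 1.0 s and 1.1 s of playback:
unit = AudioUnit(word="I", start_ms=1000, end_ms=1100)
```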
  • Step S103 Acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text.
  • Acquiring the text information corresponding to the audio information and identifying it to acquire the text basic semantic units may be implemented as follows: the text information corresponding to the audio information is searched for through the Internet; after the text information is obtained, each basic semantic unit in it is identified, a text basic semantic unit whose time information is a null value is formed for each identified basic semantic unit, and the text basic semantic units are thereby acquired.
  • the basic semantic unit is single-word information in the text information, including: Chinese characters, Chinese words, pinyin, numbers, English characters, and/or English words.
  • Following the above example, when the audio information is an mp3 file, the lyric text corresponding to the mp3 file is searched for through the Internet; the specific content of the lyric text is "I think and think".
  • After the lyric text is obtained, each basic semantic unit in the text information is identified, and a text basic semantic unit whose time information is a null value is formed for each identified basic semantic unit:
  • Step S105 recording start and end time information of each of the audio basic semantic units into the corresponding basic semantic unit of the text.
  • Recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit may be implemented as follows: after the audio information is identified, each audio basic semantic unit is matched against the text basic semantic units formed by identifying each basic semantic unit in the text information corresponding to the audio information, and the start and end time information of the audio basic semantic unit is written into the text basic semantic unit that corresponds to it.
  • Following the above example, the recognized audio basic semantic units and their time information recorded in the text information obtained by identifying the audio information are:
  • The audio basic semantic units "I" and "think" have the same glyphs as the text basic semantic units "I" and "think" formed from the lyric text, so the start and end time information of the audio basic semantic units "I" and "think" is written into the text basic semantic units "I" and "think":
  • Since the number of occurrences of the same audio basic semantic unit in the audio information may not be unique (for example, a certain word may appear multiple times in a song), when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit in step S105, units with the same glyph may be handled as follows: the start and end time information of the audio basic semantic unit obtained from the audio information is written into every text basic semantic unit that has the same glyph as that audio basic semantic unit.
  • Following the above example, the recognized audio basic semantic units and their time information recorded in the text information obtained by identifying the audio information are:
  • Each basic semantic unit in the text information is identified, and a text basic semantic unit whose time information is a null value is formed for each identified basic semantic unit:
  • The audio basic semantic units identified from the audio information and the text basic semantic units formed by extracting the lyric text have the same glyphs, so the start and end time information of each audio basic semantic unit is written into the corresponding text basic semantic units.
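The matching in step S105, including the case where the same glyph occurs more than once, can be sketched as follows; the data shapes are assumptions, and an empty candidate list plays the role of the null value.

```python
def record_times(audio_units, text_units):
    """Write each audio unit's start/end times into every text unit with the same glyph.

    audio_units: list of (word, (start_ms, end_ms)) pairs from speech recognition.
    text_units:  list of dicts {"word": str, "times": list}, with "times" initially
                 empty (the null value).
    """
    for word, span in audio_units:
        for unit in text_units:
            if unit["word"] == word:
                # a recurring glyph accumulates several candidate time sets
                unit["times"].append(span)
    return text_units
```

A text unit whose glyph recurs in the audio therefore ends up holding several sets of start and end times, which is exactly the situation the grouping step below resolves.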
  • Step S107: The text basic semantic units in which the start and end time information is recorded are processed to generate the voice-over text corresponding to the audio information.
  • Processing the text basic semantic units that record the start and end time information and generating the voice-over text corresponding to the audio information may be implemented as follows: for each specific single sentence in the text information, the text basic semantic units constituting that single sentence are determined; the start and end time information of the single sentence is determined according to the start and end time information in those units; and all the single sentences, together with their start and end time information, are organized into the voice-over text corresponding to the audio information.
  • Each single sentence in the text can be distinguished by the newline character between one single sentence and the next.
  • FIG. 2 illustrates a flowchart for processing the text basic semantic unit in which the start and end time information is recorded, and generating a voice-over text corresponding to the audio information, according to an embodiment of the present application.
  • Step S107-1 for each single sentence in the text information, obtain a basic semantic unit of text constituting the single sentence.
  • Obtaining, for each single sentence in the text information, the text basic semantic units constituting the single sentence may be implemented as follows: each single sentence in the text information is distinguished according to the newline characters, and the text basic semantic units constituting a specific single sentence are obtained.
  • Following the above example, the specific single sentences in the text information are "I want" and "you".
  • The text basic semantic units constituting these single sentences are "I" and "want", and "you"; the text basic semantic units "I" and "want" are:
  • Step S107-2 determining start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been acquired.
  • Determining the start and end time information of the single sentence according to the start and end time information recorded in the acquired text basic semantic units may be implemented as follows: the earliest start time among the text basic semantic units constituting the single sentence is used as the start time of the single sentence, the latest end time among those units is used as the end time of the single sentence, and this start time and end time are taken as the start and end time information of the single sentence.
  • Following the above example, the time information of the single sentence "I want", determined from the time information of the above two text basic semantic units, is:
  • The time information of the single sentence "you", determined in the same way, is:
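The rule of step S107-2 (earliest start time becomes the sentence start, latest end time becomes the sentence end) can be sketched as:

```python
def sentence_span(unit_spans):
    """Determine a single sentence's start/end time from its units' (start, end) pairs:
    the earliest start time is the sentence start, the latest end time the sentence end."""
    starts = [s for s, _ in unit_spans]
    ends = [e for _, e in unit_spans]
    return min(starts), max(ends)
```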
  • Step S107-3: The single sentences whose start and end time information has been determined are integrated to form the voice-over text corresponding to the audio information and carrying the start and end time information of each single sentence.
  • When step S107-1 is performed to obtain, for each single sentence in the text information, the text basic semantic units constituting the single sentence, units with the same glyph may be handled as follows: if at least two sets of start and end time information are recorded in a text basic semantic unit, the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information.
  • Following the above example, the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information; the first group is:
  • the second group is:
  • the third group is:
  • the fourth group is:
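The formation of one candidate group per combination of recorded time sets can be sketched with a Cartesian product (an illustrative sketch; the data layout — one list of candidate `(start, end)` pairs per character — is an assumption):

```python
# When a character occurs several times in the audio, its text basic
# semantic unit carries several (start, end) candidates. One candidate
# group is formed per combination of the recorded time sets.
from itertools import product

def candidate_groups(sentence, times_per_char):
    """times_per_char: one list of candidate (start, end) pairs per character."""
    groups = []
    for combo in product(*times_per_char):
        groups.append(list(zip(sentence, combo)))
    return groups

# Hypothetical candidates: "I" occurs twice and "want" occurs twice,
# giving 2 x 2 = 4 candidate groups.
groups = candidate_groups(
    ["I", "want"],
    [[(1.0, 1.3), (7.0, 7.3)], [(1.4, 1.8), (7.4, 7.8)]],
)
print(len(groups))  # 4
```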
  • the method further includes the following steps:
  • all start and end time information of each text basic semantic unit in each text basic semantic unit group is filtered, and a text basic semantic unit group constituting the single sentence is determined.
  • The predetermined calculation method: within each text basic semantic unit group, compute the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time intervals over the group, and use the sum as the error value of that text basic semantic unit group.
  • Here the time interval refers to the gap between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit. Because the groups are formed from all recorded time sets, the start time of a unit may be smaller than the end time of the preceding unit; to prevent such negative intervals from distorting the error value, the calculation must use a positive value of the time interval. The positive value may be obtained, for example, by taking the absolute value or the square; the following description obtains it by squaring the difference, since the quantity of interest is the interval between the start time of each unit and the end time of the preceding unit.
  • the mathematical expression of the predetermined calculation method is: error = Σ_{i=2}^{n} (start_i − end_{i−1})², where start_i is the start time of the i-th text basic semantic unit in the group and end_{i−1} is the end time of the (i−1)-th unit.
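A sketch of the squared-difference error value (illustrative; the `(character, (start, end))` tuple layout is an assumption):

```python
# For each unit after the first, square the gap between its start time and
# the previous unit's end time; the sum of the squares is the group's error
# value. Squaring keeps negative gaps from cancelling out parts of the error.

def group_error(group):
    """group: list of (char, (start, end)) tuples in sentence order."""
    error = 0.0
    for (_, (start, _)), (_, (_, prev_end)) in zip(group[1:], group[:-1]):
        error += (start - prev_end) ** 2
    return error

# Adjacent candidates 0.1 s apart give an error of about 0.01:
print(group_error([("I", (1.0, 1.3)), ("want", (1.4, 1.8))]))
```

A group whose candidate times are far apart in the audio accumulates a large error and is filtered out by the threshold step described below.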
  • The preset threshold may be a reasonable value configured by a person skilled in the art according to experience, or it may simply be the smallest error value; after the error values are calculated, each text basic semantic unit group is filtered and the groups whose error value is below the preset threshold are retained.
  • Filtering each text basic semantic unit group and retaining the groups whose error value is below the preset threshold may be implemented as follows: keep the text basic semantic unit group with the smallest error value as the group making up the single sentence, and filter out the other text basic semantic unit groups.
  • An embodiment of the present application provides a preferred implementation. In this preferred manner, after the step of filtering each text basic semantic unit group and retaining the groups whose error value is below the preset threshold, it is further necessary to count, within each retained group, how many times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and to obtain the text basic semantic unit group with the largest count.
  • the basic semantic unit group of texts that make up the single sentence also includes the fifth group:
  • Suppose the groups with the smallest remaining error value are the second group and the fifth group. The two groups are then judged for reasonableness by the chronological order of their text basic semantic units, that is, by counting how many times the start time of a unit is greater than the end time of the preceding unit. In the second group, for example, the start time of "want" is greater than the end time of the preceding unit "I", and likewise for the following characters, giving the second group a reasonable count of 4; by the same reasoning the fifth group has a reasonable count of 3. The second group, having the larger count, is therefore taken as the text basic semantic unit group that makes up the single sentence.
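The reasonableness count used to break ties between low-error groups can be sketched as follows (illustrative; the tuple layout and example timings are assumptions):

```python
# Count how many units start after the preceding unit ends; when several
# groups survive the error filter, keep the group with the highest count.

def order_count(group):
    """group: list of (char, (start, end)) tuples in sentence order."""
    return sum(
        1
        for (_, (start, _)), (_, (_, prev_end)) in zip(group[1:], group[:-1])
        if start > prev_end
    )

def pick_group(groups):
    return max(groups, key=order_count)

# Hypothetical candidate groups for a three-character sentence:
group_a = [("I", (0.0, 1.0)), ("want", (1.1, 2.0)), ("you", (2.1, 3.0))]
group_b = [("I", (0.0, 1.0)), ("want", (0.5, 2.0)), ("you", (2.1, 3.0))]
print(order_count(group_a), order_count(group_b))  # 2 1
```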
  • the text basic semantic units in the text information are obtained by identifying each word of each sentence in order.
  • Since speech recognition has a limited recognition rate, the audio information may not be recognized completely; thus when the audio information is recognized in step S101, some audio basic semantic units may remain unrecognized. In step S103, because the text information is a character string recognizable by the computer, every basic semantic unit in the text information can be identified and formed into a text basic semantic unit. Therefore, when step S105 records the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, an unrecognized audio basic semantic unit has null start and end time information, and the corresponding text basic semantic unit is given a null value. In other words, when some audio basic semantic units are empty, the number of text basic semantic units formed is greater than the number of audio basic semantic units obtained by speech recognition, and the start and end time information of the unmatched text basic semantic units takes a null value.
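The null-value bookkeeping can be sketched as an in-order matching step (illustrative; the dictionary of recognized occurrences is an assumed data layout, not the patent's):

```python
# Recognized audio units are matched to the text characters in order; a
# character that was never recognized in the audio gets a null (None) time,
# so the text side may carry more entries than recognition produced.

def attach_times(text_chars, recognized):
    """recognized: dict mapping character -> list of (start, end) occurrences."""
    timed, used = [], {}
    for char in text_chars:
        occurrences = recognized.get(char, [])
        index = used.get(char, 0)
        if index < len(occurrences):
            timed.append((char, occurrences[index]))
            used[char] = index + 1
        else:
            timed.append((char, None))  # unrecognized -> null value
    return timed

timed = attach_times(["I", "want", "you"], {"I": [(0.0, 0.4)], "you": [(0.9, 1.3)]})
print(timed)  # [('I', (0.0, 0.4)), ('want', None), ('you', (0.9, 1.3))]
```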
  • the audio basic semantic units identified from the audio information, together with their time information, are:
  • the text basic semantic units of the lyrics in the lyric text, with the unmatched units taking null values, are:
  • When step S107-1 is performed and, for each single sentence in the text information, the text basic semantic units making up the single sentence are obtained, if a text basic semantic unit has a null value, then after the step of determining the text basic semantic unit group making up the single sentence, start and end time information is estimated for the null-valued text basic semantic units according to a predetermined estimation manner, so that every text basic semantic unit has start and end time information.
  • The predetermined estimation manner includes: calculating the average time information of the text basic semantic units in the text basic semantic unit group; putting the end time of the preceding text basic semantic unit into the start time of the null-valued unit; and adding the average time information to that end time to obtain the end time of the null-valued unit. Calculating the average time information may be implemented as follows: subtract the start time from the end time of each text basic semantic unit making up the single sentence to obtain the playing duration of each unit in the audio information, then divide the sum of these playing durations by the number of text basic semantic units in the single sentence to obtain the average time information of the text basic semantic units of the single sentence.
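The average-duration estimation can be sketched as follows (an illustrative sketch that assumes the null unit is not the first in the sentence, so a preceding end time exists; the tuple layout is also an assumption):

```python
# The average playing duration of the timed units gives a null unit its
# duration: it starts at the previous unit's end time and ends one average
# duration later.

def fill_null_by_average(group):
    """group: list of (char, (start, end) or None) tuples in sentence order."""
    durations = [span[1] - span[0] for _, span in group if span is not None]
    average = sum(durations) / len(durations)
    filled = []
    for char, span in group:
        if span is None:
            prev_end = filled[-1][1][1]  # end time of the preceding unit
            span = (prev_end, prev_end + average)
        filled.append((char, span))
    return filled

filled = fill_null_by_average([("I", (0.0, 0.4)), ("want", None), ("you", (0.9, 1.3))])
```

With both timed units lasting 0.4 s, the null unit "want" is assigned roughly the span (0.4, 0.8).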
  • The start time of a null-valued text basic semantic unit can thus be estimated from the end time in the time information of the preceding text basic semantic unit; that is, the end time of the text basic semantic unit immediately before the null-valued unit is taken as the start time of the null-valued unit.
  • When step S103 acquires the text information corresponding to the audio information and recognizes it, obtaining the text basic semantic units in the order of each word in each sentence, the start and end time information of a null-valued text basic semantic unit may also be estimated in another manner: since the text basic semantic units are formed in the order of the words in the sentence, a null-valued unit lies between its preceding and following text basic semantic units, so its time can be estimated from the end time in the time information of the preceding unit and the start time in the time information of the following unit.
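The neighbour-based alternative can be sketched as follows (illustrative; it assumes the null unit has timed units on both sides, as the description requires):

```python
# A null unit between two timed neighbours simply spans from the previous
# unit's end time to the next unit's start time.

def fill_null_by_neighbours(group):
    """group: list of (char, (start, end) or None) tuples in sentence order."""
    filled = list(group)
    for i, (char, span) in enumerate(filled):
        if span is None and 0 < i < len(filled) - 1:
            prev_end = filled[i - 1][1][1]
            next_start = filled[i + 1][1][0]
            filled[i] = (char, (prev_end, next_start))
    return filled

result = fill_null_by_neighbours([("I", (0.0, 0.4)), ("want", None), ("you", (0.9, 1.3))])
print(result)  # [('I', (0.0, 0.4)), ('want', (0.4, 0.9)), ('you', (0.9, 1.3))]
```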
  • a method for automatically generating a voice-over character is provided.
  • the present application also provides an apparatus for automatically generating dubbing text. Since the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief, and for relevant portions reference may be made to the description of the method embodiment.
  • the device embodiments described below are merely illustrative.
  • the device for automatically generating the voice-over text is implemented as follows:
  • FIG. 3 shows a schematic diagram of an apparatus for automatically generating voiceover characters according to an embodiment of the present application.
  • the device for automatically generating a voice-over character includes: an audio recognition unit 301, a text recognition unit 303, a time writing unit 305, and a voice-over character generating unit 307;
  • the audio recognition unit 301 is configured to identify the audio information and acquire the start and end time information of each identified audio basic semantic unit;
  • the text identification unit 303 is configured to acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text;
  • the time writing unit 305 is configured to record start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text;
  • the voice-over character generating unit 307 is configured to process the text basic semantic unit in which the start and end time information is recorded, and generate a voice-over character corresponding to the audio information.
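The cooperation of the four units can be sketched end to end as plain functions (an illustrative sketch: the recognizer is mocked, the data layouts are assumptions, and `dubbing_text` assumes each sentence has at least one timed character):

```python
def audio_recognition(audio):          # unit 301: mocked recognizer
    # Pretend the input is already a char -> [(start, end)] dict.
    return audio

def text_recognition(text):            # unit 303: split sentences into characters
    return [list(sentence) for sentence in text]

def time_writing(chars, recognized):   # unit 305: attach times per character
    return [[(c, recognized.get(c, [None])[0]) for c in sent] for sent in chars]

def dubbing_text(timed_sentences):     # unit 307: one timestamped entry per sentence
    out = []
    for sent in timed_sentences:
        spans = [t for _, t in sent if t]
        out.append(("%.2f" % spans[0][0], "".join(c for c, _ in sent)))
    return out

recognized = audio_recognition({"a": [(0.0, 0.5)], "b": [(0.6, 1.0)]})
output = dubbing_text(time_writing(text_recognition(["ab"]), recognized))
print(output)  # [('0.00', 'ab')]
```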
  • the voice-over character generating unit 307 includes: a text semantic acquisition subunit, a time information determination subunit, and a dubbing text generation subunit;
  • the text semantic acquisition subunit is configured to acquire, for each single sentence in the text information, the text basic semantic units that make up the single sentence;
  • the time information determining subunit is configured to determine start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been acquired;
  • the voice-over character generating sub-unit is configured to integrate the single sentences that determine the start and end time information to form a voice-over text corresponding to the audio information and having start and end time information of each single sentence.
  • the text semantic acquisition subunit is specifically configured to, when acquiring the text basic semantic units making up each single sentence in the text information, respectively form, according to the number of sets of start and end time information, the text basic semantic unit groups making up the single sentence, if at least two sets of start and end time information are recorded in a text basic semantic unit.
  • the device for automatically generating a voice-over text further includes: a text semantic screening sub-unit;
  • the text semantic screening subunit is configured to, after the text basic semantic unit groups making up the single sentence are respectively formed according to the number of sets of start and end time information, screen all the start and end time information of each text basic semantic unit in each group according to a predetermined calculation method, and determine the text basic semantic unit group making up the single sentence.
  • the text semantic screening subunit includes: an error calculation subunit;
  • the error calculation subunit is configured to calculate, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time intervals for each group, and use the sum as the error value of the text basic semantic unit group.
  • the text semantic screening subunit further includes: a filtering subunit;
  • the filtering subunit is configured to filter each of the text basic semantic unit groups, and retain a text basic semantic unit group whose error value is lower than a preset threshold.
  • the text semantic screening subunit further includes: a count calculation subunit;
  • the count calculation subunit is configured to count, after the text basic semantic unit groups whose error value is below the preset threshold are retained, how many times in each retained group the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and to obtain the text basic semantic unit group with the largest count.
  • the text identification unit 303 is specifically configured to: obtain, from the text information, the basic semantic unit of the text in the text information according to the order of each word in each sentence.
  • the time writing unit 305 is specifically configured so that, when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the corresponding text basic semantic unit is given a null value.
  • the device for automatically generating a voice-over text further includes:
  • a time estimating unit, configured to estimate, according to a predetermined estimation manner, the start and end time information of the text basic semantic units whose value is null, after the text basic semantic unit group making up the single sentence is determined;
  • the time estimating unit includes:
  • a start time writing subunit, configured to put the end time of the preceding text basic semantic unit into the start time of the text basic semantic unit whose value is null;
  • a termination time writing subunit, configured to add the average time information to that end time and put the result into the end time of the text basic semantic unit whose value is null.
  • a method for automatically generating a voice-over character and a device for automatically generating a voice-over text are provided.
  • the present application further provides an electronic device; the electronic device implementation is as follows:
  • FIG. 4 shows a schematic diagram of an electronic device provided in accordance with an embodiment of the present application.
  • the electronic device includes: a display 401; a processor 403; a memory 405;
  • the memory 405 is configured to store a voice-over character generating program.
  • When the program is read and executed by the processor, it performs the following operations: identifying the audio information and acquiring the start and end time information of each identified audio basic semantic unit; acquiring text information corresponding to the audio information and recognizing the text information to acquire text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded to generate dubbing text corresponding to the audio information.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media, including both persistent and non-persistent, removable and non-removable media, may store information by any method or technology. The information can be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer readable media does not include transitory computer readable media such as modulated data signals and carrier waves.
  • embodiments of the present application can be provided as a method, system, or computer program product.
  • the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

A method and apparatus for automatically generating dubbing characters, and an electronic device. The method for generating dubbing characters comprises: identifying audio information to acquire starting and ending time information about each identified audio basic semantic unit (S101); acquiring text information corresponding to the audio information, and identifying the text information, so as to acquire a text basic semantic unit (S103); recording the starting and ending time information about each of the audio basic semantic units in the corresponding text basic semantic unit (S105); and processing the text basic semantic unit into which the starting and ending time information is recorded, so as to generate dubbing characters corresponding to the audio information (S107). By means of the method, a dynamic lyrics file can be produced without using a manual method, thereby improving the production efficiency, reducing the production cost, and simplifying the production procedure.

Description

Method, apparatus, and electronic device for automatically generating dubbing text

The present application claims priority to Chinese Patent Application No. 201611196447.6, filed on December 22, 2016 and entitled "Method, Apparatus, and Electronic Device for Automatically Generating Dubbing Text", the entire contents of which are incorporated herein by reference.
Technical field

The present application relates to the field of computer technology, and in particular to a method for automatically generating dubbing text; the present application also relates to an apparatus for automatically generating dubbing text and to an electronic device.

Background

With the development of audio processing technology, users have come to expect more from the listening experience: an audio playback application should not only play audio files but also display the lyric file corresponding to the audio file in synchronization. The synchronized-lyrics function lets people see the lyrics of an audio file while listening to the melody, and it has become one of the essential functions of audio playback applications and players.

To meet this demand, the lyrics currently used for synchronized display during audio playback are mainly produced manually: a person listens to the audio while annotating the lyrics with times, generates a corresponding lyric file for each audio file in the audio file database, and imports the generated lyric files into the audio playback application, so that the corresponding lyric file is displayed in synchronization when an audio file is played.

It can thus be seen that under the existing production scheme for synchronized lyrics, manually generating lyric files is cumbersome, inefficient, and costly. As audio libraries keep expanding, the drawbacks of the manual approach become increasingly serious.
Summary

The present application provides a method for automatically generating dubbing text to solve the above problems in the prior art. The present application also relates to an apparatus for automatically generating dubbing text and to an electronic device.

An embodiment of the present application provides a method for automatically generating dubbing text, the method including:

identifying audio information, and acquiring start and end time information of each identified audio basic semantic unit;

acquiring text information corresponding to the audio information, and recognizing the text information to acquire text basic semantic units;

recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit;

processing the text basic semantic units in which the start and end time information is recorded, to generate dubbing text corresponding to the audio information.
Optionally, the processing of the text basic semantic units in which the start and end time information is recorded, to generate dubbing text corresponding to the audio information, includes:

for each single sentence in the text information, acquiring the text basic semantic units that make up the single sentence;

determining start and end time information of the single sentence according to the start and end time information recorded in the acquired text basic semantic units;

integrating the single sentences whose start and end time information has been determined, to form dubbing text that corresponds to the audio information and carries the start and end time information of each single sentence.

Optionally, when the text basic semantic units making up each single sentence in the text information are acquired, if at least two sets of start and end time information are recorded in a text basic semantic unit, text basic semantic unit groups making up the single sentence are respectively formed according to the number of sets of start and end time information.

Optionally, after the step of respectively forming, according to the number of sets of start and end time information, the text basic semantic unit groups making up the single sentence, the method includes:

screening, according to a predetermined calculation method, all the start and end time information of each text basic semantic unit in each text basic semantic unit group, and determining the text basic semantic unit group making up the single sentence.
Optionally, the predetermined calculation method includes:

calculating, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtaining the sum of these time intervals for each group, and using the sum as the error value of the text basic semantic unit group.

Optionally, the screening of all the start and end time information of each text basic semantic unit in each text basic semantic unit group, to determine the text basic semantic unit group making up the single sentence, includes:

filtering each text basic semantic unit group, and retaining the text basic semantic unit groups whose error value is below a preset threshold.

Optionally, after the step of retaining the text basic semantic unit groups whose error value is below the preset threshold, the method includes:

counting, within each retained text basic semantic unit group, how many times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and obtaining the text basic semantic unit group with the largest count.

Optionally, the recognizing of the text information to acquire text basic semantic units includes:

identifying, from the text information, each word of each sentence in order, to acquire the text basic semantic units in the text information.
Optionally, when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the corresponding text basic semantic unit is given a null value.

Optionally, after the step of determining the text basic semantic unit group making up the single sentence, the method includes:

estimating, according to a predetermined estimation manner, start and end time information for the text basic semantic units whose value is null.

Optionally, the predetermined estimation manner includes:

calculating the average time information of the text basic semantic units in the text basic semantic unit group;

putting the end time of the preceding text basic semantic unit into the start time of the null-valued text basic semantic unit;

adding the average time information to that end time, and putting the result into the end time of the null-valued text basic semantic unit.
Correspondingly, an embodiment of the present application further provides an apparatus for automatically generating dubbing text. The apparatus for automatically generating dubbing text includes:
an audio recognition unit, configured to recognize audio information and obtain the start and end time information of each recognized audio basic semantic unit;
a text recognition unit, configured to acquire text information corresponding to the audio information and recognize the text information, thereby obtaining text basic semantic units;
a time writing unit, configured to record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and
a dubbing text generating unit, configured to process the text basic semantic units in which the start and end time information is recorded, and generate dubbing text corresponding to the audio information.
Optionally, the dubbing text generating unit includes:
a text semantics acquiring subunit, configured to acquire, for each single sentence in the text information, the text basic semantic units constituting the single sentence;
a time information determining subunit, configured to determine the start and end time information of the single sentence according to the start and end time information recorded in the acquired text basic semantic units; and
a dubbing text generating subunit, configured to integrate the single sentences whose start and end time information has been determined, to form dubbing text that corresponds to the audio information and carries the start and end time information of each single sentence.
Optionally, the text semantics acquiring subunit is specifically configured to, when acquiring the text basic semantic units constituting each single sentence in the text information, if at least two sets of start and end time information are recorded in a text basic semantic unit, respectively form the text basic semantic unit groups constituting the single sentence according to the number of sets of start and end time information.
Optionally, the apparatus for automatically generating dubbing text further includes:
a text semantics screening subunit, configured to, after the text basic semantic unit groups constituting the single sentence are respectively formed according to the number of sets of start and end time information, screen all the start and end time information of each text basic semantic unit in each text basic semantic unit group according to a predetermined calculation method, to determine the text basic semantic unit group constituting the single sentence.
Optionally, the text semantics screening subunit includes:
an error calculating subunit, configured to calculate, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time intervals for each text basic semantic unit group, and take the sum as the error value of that text basic semantic unit group.
Optionally, the text semantics screening subunit further includes:
a filtering subunit, configured to filter the text basic semantic unit groups and retain the text basic semantic unit groups whose error values are below a preset threshold.
Optionally, the text semantics screening subunit further includes:
a time count calculating subunit, configured to, after the text basic semantic unit groups whose error values are below the preset threshold are retained, count, within each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and obtain the text basic semantic unit group with the largest count.
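The screening performed by these subunits can be sketched as follows. This is an illustrative reading, not the claimed implementation: a "group" is one combination of time pairs (one pair per text unit, chosen from that unit's recorded sets), times are in milliseconds, and the threshold value is an assumed example.

```python
from itertools import product

def error_value(group):
    """Sum of time gaps between each unit's start and the previous unit's end."""
    return sum(abs(start - prev_end)
               for (_, prev_end), (start, _) in zip(group, group[1:]))

def monotonic_count(group):
    """Number of units whose start time is greater than the previous unit's end."""
    return sum(1 for (_, prev_end), (start, _) in zip(group, group[1:])
               if start > prev_end)

def select_group(time_lists, threshold):
    """time_lists: one list of (start, end) candidates per text unit of a sentence.
    Assumes at least one candidate group survives the threshold filter."""
    candidates = [list(g) for g in product(*time_lists)]
    kept = [g for g in candidates if error_value(g) < threshold]
    # Among the retained groups, pick the one with the most strictly ordered pairs.
    return max(kept, key=monotonic_count)

time_lists = [
    [(1000, 1100)],                  # 我
    [(1200, 1300), (1800, 1900)],    # 想 (occurs twice, two recorded sets)
    [(1400, 1500)],                  # 了
    [(1600, 1700)],                  # 又
    [(1200, 1300), (1800, 1900)],    # 想
]
best = select_group(time_lists, threshold=1000)
print(best[1], best[4])  # (1200, 1300) (1800, 1900)
```

For the duplicated "想", the combination that keeps both occurrences in playback order has the lowest error value and the highest ordered-pair count, so it is selected.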
Optionally, the text recognition unit is specifically configured to identify and obtain, from the text information, the text basic semantic units in the order of each character within each sentence.
Optionally, the time writing unit is specifically configured to, when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, set the value of the corresponding text basic semantic unit to null if the start and end time information of the audio basic semantic unit is null.
Optionally, the apparatus for automatically generating dubbing text further includes:
a time estimating unit, configured to, after the text basic semantic unit group constituting the single sentence is determined, estimate start and end time information in a predetermined estimation manner for each text basic semantic unit whose value is null.
Optionally, the time estimating unit includes:
an average time calculating subunit, configured to calculate the average time information of the text basic semantic units in the text basic semantic unit group;
a start time writing subunit, configured to write the end time of the text basic semantic unit preceding the null-valued text basic semantic unit into the start time of the null-valued text basic semantic unit; and
an end time writing subunit, configured to add the average time information to that end time and write the result into the end time of the null-valued text basic semantic unit.
In addition, an embodiment of the present application further provides an electronic device, including:
a display;
a processor; and
a memory for storing a dubbing text generating program, wherein the program, when read and executed by the processor, performs the following operations: recognizing audio information and obtaining the start and end time information of each recognized audio basic semantic unit; acquiring text information corresponding to the audio information and recognizing the text information, thereby obtaining text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded, to generate dubbing text corresponding to the audio information.
Compared with the prior art, the present application has the following advantages:
The method, apparatus, and electronic device for automatically generating dubbing text provided by the present application recognize audio information to obtain the start and end time information of each recognized audio basic semantic unit; acquire text information corresponding to the audio information and recognize the text information to obtain text basic semantic units; record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and process the text basic semantic units in which the start and end time information is recorded, to generate dubbing text corresponding to the audio information. By performing speech recognition on the audio information, the technical solution obtains the start and end time information of each audio basic semantic unit in the audio information; by recognizing the text information corresponding to the audio information, it determines the number and character forms of the text basic semantic units in each single sentence of the text information, so that the audio basic semantic units recognized from the audio information correspond to the text basic semantic units recognized from the text information. After the correspondence is established, the time information of the corresponding single sentence in the text information is determined according to the start and end time information of each audio basic semantic unit in the audio information, so that each single sentence in the text carries time information. Dynamic lyrics files therefore no longer need to be produced manually, which improves production efficiency, reduces production cost, and simplifies the production process.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description are merely some embodiments described in the present application, and a person of ordinary skill in the art may further obtain other drawings from these accompanying drawings.
FIG. 1 is a flowchart of a method for automatically generating dubbing text according to an embodiment of the present application;
FIG. 2 is a flowchart of processing the text basic semantic units in which the start and end time information is recorded to generate dubbing text corresponding to the audio information, according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an apparatus for automatically generating dubbing text according to an embodiment of the present application; and
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present application.
DETAILED DESCRIPTION
Numerous specific details are set forth in the following description to facilitate a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and a person skilled in the art may make similar generalizations without departing from the essence of the present application; therefore, the present application is not limited by the specific implementations disclosed below.
To make the above objectives, features, and advantages of the present application clearer and easier to understand, the present application is further described in detail below with reference to the accompanying drawings and specific implementations. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
An embodiment of the present application provides a method for automatically generating dubbing text; embodiments of the present application also provide an apparatus for automatically generating dubbing text and an electronic device. Detailed descriptions are given one by one in the following embodiments.
At present, lyrics for synchronized display during audio playback are mainly produced manually: a person listens to the audio while annotating the lyrics with time stamps, generates a corresponding lyrics file for each audio file in the audio file database, and imports the generated lyrics files into the audio playback application, so that when an audio file is played, the corresponding lyrics file is displayed synchronously. It can be seen that, under the existing scheme for producing lyrics for synchronized display during audio playback, manually generating lyrics files is a cumbersome process that is both inefficient and costly. As the scale of audio libraries keeps expanding, the drawbacks of the manual approach become increasingly serious. To address this problem, the technical solution of the present application performs speech recognition on the audio information to obtain the start and end time information of each audio basic semantic unit in the audio information; by recognizing the text information corresponding to the audio information, it determines the number and character forms of the text basic semantic units in each single sentence of the text information, so that the audio basic semantic units recognized from the audio information correspond to the text basic semantic units recognized from the text information. After the correspondence is established, the time information of the corresponding single sentence in the text information is determined according to the start and end time information of each audio basic semantic unit in the audio information, so that the lyrics in the text carry time information, thereby realizing the function of automatically producing a dynamic lyrics file.
Before the specific steps of this embodiment are described in detail, the dynamic lyrics involved in the technical solution are briefly introduced.
Dynamic lyrics are produced by using an editor to align the lyrics with the times at which the song's lyrics occur, so that the lyrics are displayed in sequence and in synchronization when the song is played. Commonly used dynamic lyrics file formats include lrc, qrc, and the like.
"lrc" is an abbreviation of the English word "lyric" and is used as the extension of dynamic lyrics files. A lyrics file with the .lrc extension can be displayed synchronously in various digital players. lrc lyrics are a plain-text, lyrics-specific format containing "tags" of the form "*:*:*", where "*" is a wildcard standing for one or more actual characters; in an actual lyrics file, "*" denotes the time of the lyrics (i.e., the time content), for example, "01:01:00" denotes 1 minute 1 second, and ":" separates the minute, second, and millisecond fields. Such a lyrics file can be viewed and edited with word-processing software (after the file is written in the above format in Notepad, changing its extension to .lrc produces the lyrics file "filename.LRC"). The standard format of an lrc dynamic lyrics line is [minutes:seconds:milliseconds] lyrics.
The lrc lyrics text contains two types of tags:
The first is the identification tag, whose format is "[name:value]" and which mainly includes the following predefined tags:
[ar: artist name], [ti: song title], [al: album name], [by: editor (the producer of the lrc lyrics)].
The second is the time tag, in the form "[mm:ss]" or "[mm:ss.ff]". A time tag must be located at the beginning of a line of lyrics, and one line of lyrics may contain multiple time tags (for example, for a refrain repeated in the lyrics). When playback of the song reaches a certain point in time, the corresponding time tag is looked up and the lyrics text following the tag is displayed, thereby completing the "lyrics synchronization" function.
When an lrc dynamic lyrics file is used, the song file and the lrc dynamic lyrics file are required to have the same file name (i.e., apart from the different extensions such as .mp3, .wma, and .lrc, the text before the dot must be identical) and to be placed in the same directory (i.e., in the same folder); the lyrics can then be displayed synchronously when the song is played with a player capable of displaying lyrics.
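As an illustration of the format described above, a timed line of lrc text can be produced from a millisecond sentence offset as follows; the helper names are our own, and the [mm:ss.ff] variant of the time tag is used.

```python
def lrc_tag(ms):
    """Format a millisecond offset as an lrc time tag [mm:ss.ff]."""
    minutes, rest = divmod(ms, 60_000)
    seconds, millis = divmod(rest, 1000)
    # ff is hundredths of a second, so drop the last millisecond digit.
    return f"[{minutes:02d}:{seconds:02d}.{millis // 10:02d}]"

def lrc_line(start_ms, text):
    """One dynamic-lyrics line: time tag at the start, lyrics text after it."""
    return lrc_tag(start_ms) + text

print(lrc_line(61_000, "我想了又想"))  # [01:01.00]我想了又想
```

Writing one such line per timed single sentence, preceded by the identification tags ([ar:...], [ti:...], etc.), yields a complete lrc file.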
An embodiment of the present application provides a method for generating dubbing text. An embodiment of the method for generating dubbing text is as follows:
Please refer to FIG. 1, which is a flowchart of a method for automatically generating dubbing text according to an embodiment of the present application.
The method for automatically generating dubbing text includes:
Step S101: recognize audio information, and obtain the start and end time information of each recognized audio basic semantic unit.
In this embodiment, recognizing the audio information mainly means converting the speech signal of the audio information into recognizable text information, for example, obtaining, in the form of text information, the recognizable audio basic semantic units converted from the speech signal of the audio information. The audio basic semantic units include Chinese characters, Chinese words, pinyin, numbers, English characters, and/or English words. Specifically, the speech recognition process may adopt a speech recognition method such as statistical pattern recognition.
In a specific implementation, speech recognition may be performed on the audio information by the CMU Sphinx speech recognition system. CMU Sphinx is a large-vocabulary speech recognition system that performs modeling with continuous hidden Markov models (CHMM); it supports multiple modes of operation, including a high-accuracy flat decoder and a fast-search tree decoder.
It should be noted that the text information contains the audio basic semantic units recognized from the audio information and the start and end time information of those audio basic semantic units within the audio information. It can be understood that the audio information may be a song file in mp3 or another music format; an mp3 file is an audio file of a certain duration in which real sound is directly recorded, so when the mp3 file is recognized and the recognized audio basic semantic units are output in the form of text information, the start and end time information of each recognized audio basic semantic unit as played in the audio information is recorded.
In this embodiment, in the text information output after the audio information is recognized, each recognized audio basic semantic unit and its time information are recorded in the following format: <word, TIMECLASS>, where word is the recognized audio basic semantic unit, and TIMECLASS is the time annotation, which records, in the form of a start time and an end time {startTime, endTime}, the time at which the audio basic semantic unit occurs during playback of the audio information, i.e., the offset, in milliseconds, relative to time 0 at the start of playback of the audio information.
The method for generating dubbing text is described below with a specific example. Suppose the audio information is an mp3 file whose playback duration is 10 seconds, and the lyric "我想了又想" ("I thought it over and over") occurs when the mp3 file has been playing for 1 second. Then the recognized audio basic semantic units and their time information recorded in the text information obtained by recognizing the audio information are:
<word:“我”,{startTime:1000,endTime:1100}>;
<word:“想”,{startTime:1200,endTime:1300}>;
<word:“了”,{startTime:1400,endTime:1500}>;
<word:“又”,{startTime:1600,endTime:1700}>;
<word:“想”,{startTime:1800,endTime:1900}>.
It should be noted that, if the audio information is Chinese audio information, each recognized audio basic semantic unit recorded in the text information output after recognition is a single Chinese character; by the same token, if the audio information is English audio information, each recognized audio basic semantic unit recorded in the text information output after recognition is a single English word.
It can be understood that the start and end time information of an audio basic semantic unit is recorded in milliseconds. The lyric "我想了又想" occurs when the mp3 file has been playing for 1 second, and the audio basic semantic unit "我" occurs between 1 second and 1.1 seconds of playback, so the recorded time information of the audio basic semantic unit "我" is {startTime:1000, endTime:1100}.
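The <word, TIMECLASS> records above map naturally onto a small data structure; the following sketch (the class and field names are our own, not part of the embodiment) simply re-expresses the example in that form.

```python
from dataclasses import dataclass

@dataclass
class RecognizedWord:
    word: str
    start_time: int  # offset in ms relative to playback time 0
    end_time: int

# Recognition output for the lyric "我想了又想", as in the example above.
recognized = [
    RecognizedWord("我", 1000, 1100),
    RecognizedWord("想", 1200, 1300),
    RecognizedWord("了", 1400, 1500),
    RecognizedWord("又", 1600, 1700),
    RecognizedWord("想", 1800, 1900),
]
# "我" is sung between 1.0 s and 1.1 s of playback:
print(recognized[0].start_time, recognized[0].end_time)  # 1000 1100
```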
Step S103: acquire text information corresponding to the audio information, and recognize the text information, thereby obtaining text basic semantic units.
In this embodiment, acquiring the text information corresponding to the audio information and recognizing it to obtain the text basic semantic units may be implemented as follows: search the Internet for the text information corresponding to the audio information; after the text information is acquired, recognize each basic semantic unit in the text information, and for each recognized basic semantic unit form a text basic semantic unit whose time information is null, thereby obtaining the text basic semantic units.
It should be noted that a basic semantic unit is a single-character item of information in the text information, including Chinese characters, Chinese words, pinyin, numbers, English characters, and/or English words.
Continuing with the specific example above: the audio information is an mp3 file, and the lyric text corresponding to the mp3 file, whose content is "我想了又想", is found by searching the Internet. After the lyric text corresponding to the mp3 file is acquired, each basic semantic unit in the text information is recognized, and for each recognized basic semantic unit a text basic semantic unit whose time information is null is formed:
<word:“我”,timeList{}>;
<word:“想”,timeList{}>;
<word:“了”,timeList{}>;
<word:“又”,timeList{}>;
<word:“想”,timeList{}>.
Step S105: record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit.
In this embodiment, recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit may be implemented as follows: match each audio basic semantic unit recognized from the audio information against the text basic semantic units formed by recognizing each basic semantic unit in the text information corresponding to the audio information, and put the start and end time information of each audio basic semantic unit into the text basic semantic unit corresponding to that audio basic semantic unit.
For example, the recognized audio basic semantic units and their time information recorded in the text information obtained by recognizing the audio information are:
<word:“我”,{startTime:1000,endTime:1100}>;
<word:“想”,{startTime:1200,endTime:1300}>;
and the text basic semantic units formed with null time information for each basic semantic unit recognized in the text information are:
<word:“我”,timeList{}>;
<word:“想”,timeList{}>;
Since the audio basic semantic units "我" and "想" recognized from the audio information have the same character forms as the text basic semantic units "我" and "想" formed by recognizing the basic semantic units of the lyrics in the lyric text, the start and end time information of the audio basic semantic units "我" and "想" is put into the text basic semantic units "我" and "想":
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>.
It should be noted that the number of occurrences of the same audio basic semantic unit in the audio information may not be unique; for example, in a song, the same character may occur multiple times. Therefore, when step S105 is performed to record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, the case of identical audio basic semantic units may be handled as follows: the start and end time information of the audio basic semantic unit obtained from the audio information is put into every text basic semantic unit identical to that audio basic semantic unit.
Continuing with the specific example above: the recognized audio basic semantic units and their time information recorded in the text information obtained by recognizing the audio information are:
<word:“我”,{startTime:1000,endTime:1100}>;
<word:“想”,{startTime:1200,endTime:1300}>;
<word:“了”,{startTime:1400,endTime:1500}>;
<word:“又”,{startTime:1600,endTime:1700}>;
<word:“想”,{startTime:1800,endTime:1900}>.
After the text information is acquired, each basic semantic unit in it is recognized, and for each recognized basic semantic unit a text basic semantic unit whose time information is null is formed:
<word:“我”,timeList{}>;
<word:“想”,timeList{}>;
<word:“了”,timeList{}>;
<word:“又”,timeList{}>;
<word:“想”,timeList{}>.
Since the audio basic semantic units "我", "想", "了", "又", and "想" recognized from the audio information have the same character forms as the text basic semantic units "我", "想", "了", "又", and "想" formed by extracting the basic semantic units of the lyrics in the lyric text, the start and end time information of the above audio basic semantic units is put into the corresponding text basic semantic units:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>.
Understandably, in the example above, because the character “想” appears twice both in the audio information and in the text, the start/end time information of each occurrence of “想” obtained from the audio information is placed into every basic text semantic unit corresponding to “想”.
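The placement rule above can be sketched in Python as follows. This is a minimal illustration only: the dict-based `word`/`timeList` structures and the helper name `record_times` are assumptions for the example, not the patent's actual implementation.

```python
# Sketch of step S105 for repeated characters: every start/end time pair
# recognized for an audio unit is appended to every text unit with the same glyph.
def record_times(audio_units, text_units):
    for word, start, end in audio_units:
        for unit in text_units:
            if unit["word"] == word:
                unit["timeList"].append((start, end))
    return text_units

audio = [("我", 1000, 1100), ("想", 1200, 1300), ("了", 1400, 1500),
         ("又", 1600, 1700), ("想", 1800, 1900)]
text = [{"word": w, "timeList": []} for w in "我想了又想"]
record_times(audio, text)
print(text[1]["timeList"])  # [(1200, 1300), (1800, 1900)]
```

Both units for “想” end up carrying both time pairs, which is exactly the ambiguity that steps S107-1 to S107-3 later resolve.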
Step S107: process the basic text semantic units in which the start/end time information has been recorded, and generate the dubbing text corresponding to the audio information.
In this embodiment, this can be implemented as follows: for each specific single sentence in the text information, determine the basic text semantic units that compose that sentence; determine the sentence's start/end time information from the start/end time information recorded in those units; and collate the start/end time information of all single sentences to produce dubbing text that corresponds to the audio information and specifies the start/end time information of every single sentence.
It should be noted that, when identifying single sentences in the text information, the individual sentences can be distinguished by the newline characters between them.
Processing the basic text semantic units in which the start/end time information has been recorded to generate the dubbing text corresponding to the audio information specifically comprises steps S107-1 to S107-3, described below with reference to FIG. 2.
Please refer to FIG. 2, which shows a flowchart, provided according to an embodiment of the present application, of processing the basic text semantic units in which the start/end time information has been recorded and generating the dubbing text corresponding to the audio information.
The processing comprises:
Step S107-1: for each single sentence in the text information, obtain the basic text semantic units composing the sentence.
In this embodiment, this can be implemented as follows: separate the single sentences in the text information by newline characters, and for each specific sentence obtain the basic text semantic units composing it.
For example, if the specific single sentences in the text information are “我想” and “你了”, the basic text semantic units composing them are “我” and “想”, and “你” and “了”, respectively, where the units “我” and “想” are:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
and the basic text semantic units “你” and “了” are:
<word:“你”,timeList{startTime:1400,endTime:1500}>;
<word:“了”,timeList{startTime:1600,endTime:1700}>.
Step S107-2: determine the start/end time information of the single sentence from the start/end time information recorded in the obtained basic text semantic units.
In this embodiment, this can be implemented as follows: take the earliest start time among the basic text semantic units composing the sentence as the sentence's start time, take the latest end time among those units as the sentence's end time, and use this start time and end time as the sentence's start/end time information.
For example, the time information of the single sentence “我想”, determined from the time information of the two basic text semantic units above, is:
timeList{startTime:1000,endTime:1300},
and the time information of the single sentence “你了”, determined from the time information of the two basic text semantic units above, is:
timeList{startTime:1400,endTime:1700}.
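Step S107-2 amounts to a minimum over start times and a maximum over end times. A small sketch, under the same assumed dict representation used earlier (the helper name `sentence_span` is hypothetical):

```python
# Sketch of step S107-2: a sentence's start is the earliest start time among
# its units; its end is the latest end time.
def sentence_span(units):
    starts = [s for u in units for (s, e) in u["timeList"]]
    ends = [e for u in units for (s, e) in u["timeList"]]
    return min(starts), max(ends)

wo_xiang = [{"word": "我", "timeList": [(1000, 1100)]},
            {"word": "想", "timeList": [(1200, 1300)]}]
print(sentence_span(wo_xiang))  # (1000, 1300)
```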
Step S107-3: integrate the single sentences whose start/end time information has been determined into dubbing text that corresponds to the audio information and carries the start/end time information of every single sentence.
For example, after the time information of all the single sentences “我想” and “你了” in the text has been determined, the text carrying the time information of these two sentences (i.e., the dynamic lyrics, lrc) is output:
[00:01:00]我想
[00:01:40]你了
Understandably, when the audio information is played and the display time of each single sentence is reached, the corresponding sentence of the dubbing text is displayed.
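The output step can be sketched as follows. The `[mm:ss:xx]` tag layout is inferred from the example above (1000 ms → `[00:01:00]`, i.e., minutes, seconds, hundredths); the exact tag format of the lrc variant is an assumption.

```python
# Sketch of step S107-3: render sentences with resolved start times as
# dynamic-lyrics (lrc-style) lines.
def to_lrc(sentences):
    lines = []
    for text, start_ms in sentences:
        minutes, ms = divmod(start_ms, 60000)
        seconds, hundredths = divmod(ms // 10, 100)
        lines.append(f"[{minutes:02d}:{seconds:02d}:{hundredths:02d}]{text}")
    return "\n".join(lines)

print(to_lrc([("我想", 1000), ("你了", 1400)]))
# [00:01:00]我想
# [00:01:40]你了
```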
In this embodiment, because the same basic audio semantic unit may occur more than once in the audio information (in a song, the same character can appear several times), step S107-1 handles identical basic semantic units as follows when obtaining the basic text semantic units composing a sentence: if a basic text semantic unit records at least two sets of start/end time information, candidate groups of basic text semantic units composing the sentence are formed, one per combination of the recorded sets.
Continuing the earlier example: if the specific single sentence in the text is “我想了又想”, the basic text semantic units “我”, “想”, “了”, “又” and “想” composing it are:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>;
Because the two basic text semantic units “想” composing the single sentence “我想了又想” each carry two sets of time information, the candidate groups of basic text semantic units formed according to the number of sets of start/end time information are the following four. The first group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
The second group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1800,endTime:1900}>;
The third group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1800,endTime:1900}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
The fourth group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1800,endTime:1900}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1800,endTime:1900}>.
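The four groups above are the Cartesian product of each unit's recorded time sets. A sketch of the enumeration (the helper name `candidate_groups` and the dict representation are assumptions for illustration):

```python
# Sketch: when a unit carries several time pairs, form one candidate group per
# combination of time pairs, one pair chosen per unit.
from itertools import product

def candidate_groups(units):
    options = [u["timeList"] for u in units]
    return [list(combo) for combo in product(*options)]

units = [{"word": "我", "timeList": [(1000, 1100)]},
         {"word": "想", "timeList": [(1200, 1300), (1800, 1900)]},
         {"word": "了", "timeList": [(1400, 1500)]},
         {"word": "又", "timeList": [(1600, 1700)]},
         {"word": "想", "timeList": [(1200, 1300), (1800, 1900)]}]
groups = candidate_groups(units)
print(len(groups))  # 4
```

With two ambiguous units of two options each, 2 × 2 = 4 groups result, matching the four groups enumerated above.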
Because the true basic text semantic units of the single sentence should each carry only one set of time information, the candidate groups whose time information is implausible must be filtered out. Therefore, after the step of forming the candidate groups according to the number of sets of start/end time information, the method further comprises the following step:
according to a predetermined calculation method, screen all the start/end time information of the basic text semantic units in each candidate group, and determine the group of basic text semantic units composing the sentence.
In this embodiment, the predetermined calculation method computes as follows: within each candidate group, compute the time gap between the start time of each basic text semantic unit and the end time of the preceding unit; take the sum of these gaps over the group; and use that sum as the group's error value.
It should be noted that the time gap is the interval between a unit's start time and the end time of the preceding unit. When the candidate groups are formed, a unit's start time may be earlier than the preceding unit's end time; to prevent negative gaps from distorting the error value, the positive value of each gap must be taken.
Methods of obtaining the positive value of the gap include taking the absolute value, squaring, and so on; the description below uses squaring. Understandably, since the gap between each unit's start time and the preceding unit's end time is needed, the positive value is obtained by squaring the difference.
Specifically, the mathematical form of the predetermined calculation method is:
error value = (startTime2 - endTime1)² + (startTime3 - endTime2)² + … + (startTimeN - endTimeN-1)²
The calculations for the four groups of time sets above are detailed below. (For ease of illustration, the calculations are carried out in seconds.)
Group 1: (1.2-1.1)² + (1.4-1.3)² + (1.6-1.5)² + (1.2-1.7)² = 0.28
Group 2: (1.2-1.1)² + (1.4-1.3)² + (1.6-1.5)² + (1.8-1.7)² = 0.04
Group 3: (1.8-1.1)² + (1.4-1.9)² + (1.6-1.5)² + (1.2-1.7)² = 1
Group 4: (1.8-1.1)² + (1.4-1.9)² + (1.6-1.5)² + (1.8-1.7)² = 0.76
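The error formula can be sketched directly (the helper name `group_error` is hypothetical; a group is represented as a list of (start, end) pairs in seconds):

```python
# Sketch of the error value: sum over adjacent units of the squared gap
# between a unit's start time and the previous unit's end time.
def group_error(times):
    return sum((times[i][0] - times[i - 1][1]) ** 2 for i in range(1, len(times)))

g1 = [(1.0, 1.1), (1.2, 1.3), (1.4, 1.5), (1.6, 1.7), (1.2, 1.3)]
g2 = [(1.0, 1.1), (1.2, 1.3), (1.4, 1.5), (1.6, 1.7), (1.8, 1.9)]
print(round(group_error(g1), 2), round(group_error(g2), 2))  # 0.28 0.04
```

Running this reproduces the error values of groups 1 and 2 computed above.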
In this embodiment, the preset threshold may be a reasonable value configured empirically by those skilled in the art, or it may be the smallest error value. After the error values are computed, the candidate groups are filtered, and only the groups whose error value is below the preset threshold are kept.
When the preset threshold is the smallest error value, this filtering can be implemented as follows: keep the candidate group with the smallest error value and filter out the other candidate groups.
It should be noted that, when filtering the candidate groups, several groups composing the sentence may share the same error value, in which case filtering by error value alone still cannot yield a single group with only one set of time information. To solve this problem, an embodiment of the present application provides a preferred implementation: after filtering the candidate groups and keeping those whose error value is below the preset threshold, additionally count, within each kept group, how many times the start time of a basic text semantic unit exceeds the end time of the preceding unit, and select the group with the largest count.
This is illustrated with a specific example below.
Suppose the candidate groups composing the sentence also include a fifth group:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
<word:“了”,timeList{startTime:1400,endTime:1500}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>;
<word:“想”,timeList{startTime:1600,endTime:1700}>;
Then the error value of the fifth group is:
(1.2-1.1)² + (1.4-1.3)² + (1.6-1.5)² + (1.6-1.7)² = 0.04
After filtering by error value, the candidate groups with the smallest error value are the second group and the fifth group, so the plausibility of these two groups must additionally be judged by the chronological order of the units within the sentence, i.e., by counting how many times the start time of a basic text semantic unit in the sentence exceeds the end time of the preceding unit.
For example, in the second group: the start time of “想” exceeds the end time of the preceding unit “我”; the start time of “了” exceeds the end time of the preceding unit “想”; the start time of “又” exceeds the end time of the preceding unit “了”; and the start time of “想” exceeds the end time of the preceding unit “又”; so the count for the second group is 4. By the same reasoning, the count for the fifth group is 3, so the group of time sets composing the sentence with the count of 4 is selected.
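The tie-break can be sketched as follows (the helper name `ordered_count` is hypothetical; groups are lists of (start, end) pairs in seconds):

```python
# Sketch of the tie-break: count, per group, how often a unit's start time
# exceeds the previous unit's end time, and keep the group with the most.
def ordered_count(times):
    return sum(1 for i in range(1, len(times)) if times[i][0] > times[i - 1][1])

g2 = [(1.0, 1.1), (1.2, 1.3), (1.4, 1.5), (1.6, 1.7), (1.8, 1.9)]
g5 = [(1.0, 1.1), (1.2, 1.3), (1.4, 1.5), (1.6, 1.7), (1.6, 1.7)]
best = max([g2, g5], key=ordered_count)
print(ordered_count(g2), ordered_count(g5))  # 4 3
```

As in the text, the second group wins with a count of 4 against 3 for the fifth group.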
As a preferred embodiment, in the method for automatically generating dubbing text provided by the embodiments of the present application, when step S103 acquires the text information corresponding to the audio information and identifies it to obtain the basic text semantic units, the units are identified from the text information in the order of the characters within each sentence.
As a preferred embodiment, because speech recognition has a limited recognition rate (the audio information cannot always be recognized without error), some basic audio semantic units may go unrecognized when the audio information is identified in step S101. In step S103, by contrast, the text information consists of character strings that a computer can read, so every basic semantic unit in it can be identified and formed into a basic text semantic unit. Therefore, when step S105 records the start/end time information of each basic audio semantic unit into the corresponding basic text semantic unit, if the start/end time information of a basic audio semantic unit is null, the corresponding basic text semantic unit is given a null value.
Understandably, if some basic audio semantic units go unrecognized during recognition of the audio information (i.e., the unit is empty and its start/end time information is null), the number of basic text semantic units formed when step S105 records the start/end time information will exceed the number of basic audio semantic units recognized from the speech, and the start/end time information of the unmatched basic text semantic units is set to null.
For example, the basic audio semantic units recognized from the audio information, and their time information, are:
<word:“我”,{startTime:1000,endTime:1100}>;
<word:“想”,{startTime:1200,endTime:1300}>;
<word:“又”,{startTime:1600,endTime:1700}>;
and the basic text semantic units, formed with null time information for each basic text semantic unit of the lyrics in the lyric text, are:
<word:“我”,timeList{}>;
<word:“想”,timeList{}>;
<word:“了”,timeList{}>;
<word:“又”,timeList{}>;
Because only “我”, “想” and “又” were recognized from the audio information, while the basic text semantic units formed by identifying the lyrics in the lyric text are “我”, “想”, “了” and “又”, the time information of the recognized basic audio semantic units is placed into the corresponding basic text semantic units:
<word:“我”,timeList{startTime:1000,endTime:1100}>;
<word:“想”,timeList{startTime:1200,endTime:1300}>;
<word:“了”,timeList{}>;
<word:“又”,timeList{startTime:1600,endTime:1700}>.
As a preferred embodiment, in the method for automatically generating dubbing text provided by the embodiments of the present application, if step S107-1 encounters basic text semantic units whose value is null when obtaining the units composing a sentence, then after the step of determining the group of basic text semantic units composing the sentence, start/end time information is estimated for the null-valued units according to a predetermined estimation method, so that every basic text semantic unit carries start/end time information.
The predetermined estimation method comprises:
computing the average time of the basic text semantic units in the group;
placing the end time of the basic semantic unit preceding the null-valued unit into the null-valued unit's start time;
adding the average time to that end time and placing the result into the null-valued unit's end time.
In this embodiment, computing the average time of the basic text semantic units in the group can be implemented as follows: subtract each unit's start time from its end time to obtain that unit's playing time in the audio information, then divide the sum of the playing times of the sentence's units by the number of units in the sentence to obtain the average time of the basic text semantic units composing the sentence.
Understandably, because the basic text semantic units are formed in the order of the basic semantic units within each single sentence of the text information, a time estimate for a null-valued unit can be made from the end time recorded in the time information of the preceding unit: the end time of the basic text semantic unit preceding the null-valued unit is placed into the null-valued unit's start time, i.e., the end time of the unit adjacent to the null-valued unit is taken as the null-valued unit's start time.
Once the start time of the null-valued unit is determined, its end time is determined from the average playing time of the sentence's units in the audio information, i.e., the determined start time of the null-valued unit plus the average time is placed into the null-valued unit's end time.
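The average-duration fill-in can be sketched as follows (a minimal illustration under the dict representation assumed earlier; the helper name `fill_missing` is hypothetical):

```python
# Sketch of the fill-in rule: a unit with no recognized time takes the previous
# unit's end time as its start, plus the sentence's average unit duration as its end.
def fill_missing(units):
    timed = [u for u in units if u["timeList"]]
    avg = sum(e - s for u in timed for (s, e) in u["timeList"]) // len(timed)
    for i, u in enumerate(units):
        if not u["timeList"]:
            prev_end = units[i - 1]["timeList"][-1][1] if i > 0 else 0
            u["timeList"].append((prev_end, prev_end + avg))
    return units

units = [{"word": "我", "timeList": [(1000, 1100)]},
         {"word": "想", "timeList": [(1200, 1300)]},
         {"word": "了", "timeList": []},
         {"word": "又", "timeList": [(1600, 1700)]}]
fill_missing(units)
print(units[2]["timeList"])  # [(1300, 1400)]
```

Here each recognized unit lasts 100 ms, so the unrecognized “了” is assigned the 100 ms interval starting at the end of “想”.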
It should be noted that, because step S103 identifies the basic text semantic units from the text information in the order of the characters within each sentence, the start/end time information of a null-valued basic text semantic unit can also be estimated in another way: directly take the end time from the time information of the basic text semantic unit preceding the null-valued unit, and the start time from the time information of the basic text semantic unit following it, as the null-valued unit's start time and end time, respectively.
Understandably, because the basic text semantic units are formed in the order of the units within each single sentence of the text, a null-valued unit lies between its adjacent preceding and following units, so its time can be estimated from the end time in the preceding unit's time information and the start time in the following unit's time information.
The above embodiments provide a method for automatically generating dubbing text. Corresponding to that method, the present application also provides an apparatus for automatically generating dubbing text. Since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant details, refer to the description of the method embodiment. The apparatus embodiment described below is merely illustrative. The apparatus embodiment is as follows:
请参考图3,其示出了根据本申请的实施例提供的自动生成配音文字的装置的示意图。Please refer to FIG. 3, which shows a schematic diagram of an apparatus for automatically generating voiceover characters according to an embodiment of the present application.
所述自动生成配音文字的装置,包括:音频识别单元301、文本识别单元303、时间写入单元305以及配音文字生成单元307;The device for automatically generating a voice-over character includes: an audio recognition unit 301, a text recognition unit 303, a time writing unit 305, and a voice-over character generating unit 307;
所述音频识别单元301，用于对音频信息进行识别，获取识别出的各个音频基本语义单位的起止时间信息；The audio recognition unit 301 is configured to recognize the audio information and acquire the start and end time information of each recognized audio basic semantic unit;
所述文本识别单元303,用于获取与所述音频信息对应的文本信息,并识别所述文本信息,从而获取文本基本语义单位;The text identification unit 303 is configured to acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text;
所述时间写入单元305,用于将各个所述音频基本语义单位的起止时间信息,记录到相应的所述文本基本语义单位中;The time writing unit 305 is configured to record start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text;
所述配音文字生成单元307,用于对记录了所述起止时间信息的所述文本基本语义单位进行处理,生成对应所述音频信息的配音文字。The voice-over character generating unit 307 is configured to process the text basic semantic unit in which the start and end time information is recorded, and generate a voice-over character corresponding to the audio information.
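The cooperation of units 301 to 307 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each text basic semantic unit is a single character and that recognition has already produced one (start, end) pair per unit; the function and variable names are not from the patent.

```python
def generate_dubbing_text(sentences, unit_times):
    """End-to-end sketch: `unit_times` maps each text basic semantic
    unit (here a single character) to its recognized (start, end)
    time in the audio.  Each sentence's start/end is taken from its
    first and last timed unit, yielding one timed line per sentence."""
    lines = []
    for sentence in sentences:
        timed = [unit_times[ch] for ch in sentence if ch in unit_times]
        if not timed:
            continue  # no unit of this sentence was recognized
        start, end = timed[0][0], timed[-1][1]
        lines.append((start, end, sentence))
    return lines
```

A real implementation must also handle units recognized at several positions in the audio and units with no timing at all, which is exactly what the screening and estimation subunits described below address.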
可选的，所述配音文字生成单元，包括：文本语义获取子单元、时间信息确定子单元以及配音文字生成子单元；Optionally, the voice-over text generating unit includes: a text semantic acquisition subunit, a time information determination subunit, and a voice-over text generating subunit;
所述文本语义获取子单元，用于针对所述文本信息中每一单句，获取组成所述单句的文本基本语义单位；The text semantic acquisition subunit is configured to acquire, for each single sentence in the text information, the text basic semantic units that constitute the single sentence;
所述时间信息确定子单元,用于根据已获取的所述文本基本语义单位中记录的起止时间信息确定所述单句的起止时间信息;The time information determining subunit is configured to determine start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been acquired;
所述配音文字生成子单元,用于将确定了起止时间信息的所述单句进行整合,形成对应所述音频信息,且具有每一单句的起止时间信息的配音文字。The voice-over character generating sub-unit is configured to integrate the single sentences that determine the start and end time information to form a voice-over text corresponding to the audio information and having start and end time information of each single sentence.
可选的，所述文本语义获取子单元，具体用于针对所述文本信息中每一单句，获取组成所述单句的文本基本语义单位时，若所述文本基本语义单位中记录了至少两组起止时间信息，则按照起止时间信息的组数，分别形成组成所述单句的文本基本语义单位组。Optionally, the text semantic acquisition subunit is specifically configured such that, when acquiring the text basic semantic units constituting each single sentence in the text information, if at least two sets of start and end time information are recorded in a text basic semantic unit, text basic semantic unit groups constituting the single sentence are formed respectively according to the number of sets of start and end time information.
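Forming one text basic semantic unit group per combination of recorded start/end times can be sketched as a Cartesian product over the candidates of each unit. This is an illustrative sketch under an assumed data layout (a mapping from each unit to its list of candidate times), not the patented implementation.

```python
from itertools import product

def candidate_groups(sentence, unit_times):
    """When a text basic semantic unit was recognized at several
    places in the audio, form one candidate group per combination
    of its (start, end) records, as described above.

    `unit_times` maps each unit (here a character) to a list of
    (start, end) candidates; names are illustrative."""
    per_unit = [unit_times[ch] for ch in sentence]
    return [list(combo) for combo in product(*per_unit)]
```

A sentence whose second unit was recognized twice therefore yields two candidate groups, which the screening step below then ranks.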
可选的,所述的自动生成配音文字的装置,还包括:文本语义筛选子单元;Optionally, the device for automatically generating a voice-over text further includes: a text semantic screening sub-unit;
所述文本语义筛选子单元，用于在所述按照起止时间信息的组数，分别形成组成所述单句的文本基本语义单位组之后，根据预定的计算方法，对每一所述文本基本语义单位组中，各个文本基本语义单位的所有起止时间信息进行筛选，确定组成所述单句的文本基本语义单位组。The text semantic screening subunit is configured to, after the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information, screen all the start and end time information of the text basic semantic units in each text basic semantic unit group according to a predetermined calculation method, to determine the text basic semantic unit group that constitutes the single sentence.
可选的，所述文本语义筛选子单元，包括：误差计算子单元；Optionally, the text semantic screening subunit includes: an error calculation subunit;
所述误差计算子单元，用于计算各个所述文本基本语义单位组内，每一文本基本语义单位中的起始时间与所述文本基本语义单位的上一个文本基本语义单位的终止时间之间的时间间距，获取各个所述文本基本语义单位组中所述起始时间与所述终止时间的时间间距的和，将所述时间间距的和作为所述文本基本语义单位组的误差值。The error calculation subunit is configured to calculate, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the previous text basic semantic unit, obtain the sum of these time intervals for each text basic semantic unit group, and use the sum of the time intervals as the error value of that text basic semantic unit group.
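The error value of one candidate group can be sketched as follows. Taking the "time interval" as the absolute gap between each unit's start and the previous unit's end is an interpretation made for this example; the patent text does not fix the exact formula.

```python
def group_error(group):
    """Error value of one candidate group: the sum of the time
    intervals between each unit's start time and the previous
    unit's end time.  Units that follow one another closely in
    the audio yield a small error value."""
    return sum(abs(cur[0] - prev[1]) for prev, cur in zip(group, group[1:]))
```

A group whose units are contiguous in the audio thus has error 0, while a group mixing timings from distant parts of the recording gets a large error.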
可选的，所述文本语义筛选子单元，还包括：过滤子单元；Optionally, the text semantic screening subunit further includes: a filtering subunit;
所述过滤子单元,用于对各个所述文本基本语义单位组进行过滤,保留误差值低于预设的阈值的文本基本语义单位组。The filtering subunit is configured to filter each of the text basic semantic unit groups, and retain a text basic semantic unit group whose error value is lower than a preset threshold.
可选的，所述文本语义筛选子单元，还包括：时间次数计算子单元；Optionally, the text semantic screening subunit further includes: a time count calculation subunit;
所述时间次数计算子单元，用于在所述保留误差值低于预设的阈值的文本基本语义单位组之后，计算保留的所述文本基本语义单位组内，每一文本基本语义单位中的起始时间大于所述文本基本语义单位的上一个文本基本语义单位的终止时间的次数，获取该次数最大的文本基本语义单位组。The time count calculation subunit is configured to, after the text basic semantic unit groups whose error values are below the preset threshold are retained, count, within each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of its previous text basic semantic unit, and obtain the text basic semantic unit group with the largest count.
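The two-stage screening performed by the filtering subunit and the time count calculation subunit can be sketched together. The tie-breaking behavior of `max` (first group wins on equal counts) is an assumption of this sketch, not something the patent specifies.

```python
def select_group(groups, errors, threshold):
    """Screening sketch: keep groups whose error value is below the
    threshold, then among those pick the group with the largest
    number of units whose start time is strictly greater than the
    previous unit's end time (i.e. the most forward-ordered
    candidate).  Returns None if no group survives the filter."""
    kept = [g for g, e in zip(groups, errors) if e < threshold]

    def order_count(group):
        return sum(1 for prev, cur in zip(group, group[1:]) if cur[0] > prev[1])

    return max(kept, key=order_count) if kept else None
```

In the example below both groups pass the threshold, and the strictly forward-ordered group is chosen.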
可选的,所述文本识别单元303,具体用于从所述文本信息中,按照每句内的每个字的顺序进行识别获取所述文本信息中的文本基本语义单位。Optionally, the text identification unit 303 is specifically configured to: obtain, from the text information, the basic semantic unit of the text in the text information according to the order of each word in each sentence.
可选的，所述时间写入单元305，具体用于在将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中时，若所述音频基本语义单位的起止时间信息为空值，则使与所述音频基本语义单位相应的所述文本基本语义单位的取值为空值。Optionally, the time writing unit 305 is specifically configured such that, when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the value of the text basic semantic unit corresponding to that audio basic semantic unit is set to null.
可选的,所述的自动生成配音文字的装置,还包括:Optionally, the device for automatically generating a voice-over text further includes:
时间推算单元，用于在所述确定组成所述单句的文本基本语义单位组之后，按照预定的推算方式，对取值为空值的所述文本基本语义单位推算起止时间信息。A time estimating unit, configured to, after the text basic semantic unit group constituting the single sentence is determined, estimate start and end time information for the null-valued text basic semantic units according to a predetermined estimation manner.
可选的,所述时间推算单元,包括:Optionally, the time estimating unit includes:
平均时间计算子单元,用于计算所述文本基本语义单位组中的文本基本语义单位的平均时间信息;An average time calculation subunit for calculating average time information of a basic semantic unit of text in the text basic semantic unit group;
起始时间写入子单元，用于将取值为空值的所述文本基本语义单位的上一个文本基本语义单位中的终止时间，放入取值为空值的所述文本基本语义单位的起始时间中；A start time writing subunit, configured to put the end time of the previous text basic semantic unit into the start time of the null-valued text basic semantic unit;
终止时间写入子单元，用于将所述终止时间加上所述平均时间信息后，放入取值为空值的所述文本基本语义单位的终止时间中。An end time writing subunit, configured to add the average time information to said end time and put the result into the end time of the null-valued text basic semantic unit.
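The average-time estimation performed by the three subunits above can be sketched as follows. Reading "平均时间信息" as the average duration of the timed units in the group is an interpretation made for this example, and the sketch assumes the first unit of the group is never null (so a previous end time always exists).

```python
def estimate_missing(group):
    """Estimation sketch: the average duration of the timed units
    gives the missing unit's length; its start time is the previous
    unit's end time, and its end time is that start plus the
    average duration."""
    timed = [(s, e) for s, e in group if s is not None]
    avg = sum(e - s for s, e in timed) / len(timed)  # average duration
    filled = []
    for s, e in group:
        if s is None:
            s = filled[-1][1]  # previous unit's end time
            e = s + avg        # end = start + average duration
        filled.append((s, e))
    return filled
```

For a group whose timed units each last 0.4 s, a null-valued unit following one ending at 0.4 s is assigned the span 0.4 to 0.8 s.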
在上述的实施例中,提供了一种自动生成配音文字的方法以及一种自动生成配音文字的装置,此外,本申请还提供了一种电子设备;所述电子设备实施例如下:In the above embodiments, a method for automatically generating a voice-over character and a device for automatically generating a voice-over text are provided. In addition, the present application further provides an electronic device; the electronic device implementation is as follows:
请参考图4,其示出了根据本申请的实施例提供的电子设备的示意图。Please refer to FIG. 4, which shows a schematic diagram of an electronic device provided in accordance with an embodiment of the present application.
所述电子设备,包括:显示器401;处理器403;存储器405;The electronic device includes: a display 401; a processor 403; a memory 405;
所述存储器405，用于存储配音文字生成程序，所述程序在被所述处理器读取执行时，执行如下操作：对音频信息进行识别，获取识别出的各个音频基本语义单位的起止时间信息；获取与所述音频信息对应的文本信息，并识别所述文本信息，从而获取文本基本语义单位；将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中；对记录了所述起止时间信息的所述文本基本语义单位进行处理，生成对应所述音频信息的配音文字。The memory 405 is configured to store a voice-over text generating program which, when read and executed by the processor, performs the following operations: recognizing audio information and acquiring the start and end time information of each recognized audio basic semantic unit; acquiring the text information corresponding to the audio information and recognizing that text information to obtain text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded to generate voice-over text corresponding to the audio information.
在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory. Memory is an example of a computer readable medium.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体，可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带、磁带磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括非暂存电脑可读媒体(transitory media)，如调制的数据信号和载波。Computer readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory media, such as modulated data signals and carrier waves.
本领域技术人员应明白，本申请的实施例可提供为方法、系统或计算机程序产品。因此，本申请可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Thus, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
本申请虽然以较佳实施例公开如上,但其并不是用来限定本申请,任何本领域技术人员在不脱离本申请的精神和范围内,都可以做出可能的变动和修改,因此本申请的保护范围应当以本申请权利要求所界定的范围为准。 The present application is disclosed in the above preferred embodiments, but it is not intended to limit the present application, and any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of the present application. The scope of protection should be based on the scope defined by the claims of the present application.

Claims (23)

  1. 一种自动生成配音文字的方法,其特征在于,包括:A method for automatically generating a voice-over text, comprising:
    对音频信息进行识别,获取识别出的各个音频基本语义单位的起止时间信息;Identifying the audio information, and acquiring the start and end time information of the identified basic semantic units of each audio;
    获取与所述音频信息对应的文本信息,并识别所述文本信息,从而获取文本基本语义单位;Obtaining text information corresponding to the audio information, and identifying the text information, thereby acquiring a basic semantic unit of the text;
    将各个所述音频基本语义单位的起止时间信息,记录到相应的所述文本基本语义单位中;Recording start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text;
    对记录了所述起止时间信息的所述文本基本语义单位进行处理,生成对应所述音频信息的配音文字。The text basic semantic unit in which the start and end time information is recorded is processed to generate a voice-over character corresponding to the audio information.
  2. 根据权利要求1所述的自动生成配音文字的方法,其特征在于,所述对记录了所述起止时间信息的所述文本基本语义单位进行处理,生成对应所述音频信息的配音文字,包括:The method for automatically generating a voice-over character according to claim 1, wherein the processing the basic text semantic unit of the start and stop time information to generate the voice-over text corresponding to the audio information comprises:
    针对所述文本信息中每一单句,获取组成所述单句的文本基本语义单位;Obtaining a basic semantic unit of the text constituting the single sentence for each single sentence in the text information;
    根据已获取的所述文本基本语义单位中记录的起止时间信息,确定所述单句的起止时间信息;Determining start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been obtained;
    将确定了起止时间信息的所述单句进行整合,形成对应所述音频信息,且具有每一单句的起止时间信息的配音文字。The single sentence in which the start and end time information is determined is integrated to form a voice-over character corresponding to the audio information and having start and end time information of each single sentence.
  3. 根据权利要求2所述的自动生成配音文字的方法，其特征在于，所述针对所述文本信息中每一单句，获取组成所述单句的文本基本语义单位时，若所述文本基本语义单位中记录了至少两组起止时间信息，则按照起止时间信息的组数，分别形成组成所述单句的文本基本语义单位组。The method for automatically generating voice-over text according to claim 2, wherein, when acquiring the text basic semantic units constituting each single sentence in the text information, if at least two sets of start and end time information are recorded in a text basic semantic unit, text basic semantic unit groups constituting the single sentence are formed respectively according to the number of sets of start and end time information.
  4. 根据权利要求3所述的自动生成配音文字的方法,其特征在于,在所述按照起止时间信息的组数,分别形成组成所述单句的文本基本语义单位组的步骤之后,包括:The method for automatically generating a voice-over character according to claim 3, wherein after the step of forming the text basic semantic unit group constituting the single sentence in the group number according to the start and end time information, the method comprises:
    根据预定的计算方法,对每一所述文本基本语义单位组中,各个文本基本语义单位的所有起止时间信息进行筛选,确定组成所述单句的文本基本语义单位组。According to a predetermined calculation method, all start and end time information of each text basic semantic unit in each text basic semantic unit group is filtered, and a text basic semantic unit group constituting the single sentence is determined.
  5. 根据权利要求4所述的自动生成配音文字的方法,其特征在于,所述预定的计算方法,包括:The method of automatically generating a dubbed character according to claim 4, wherein the predetermined calculation method comprises:
    计算各个所述文本基本语义单位组内，每一文本基本语义单位中的起始时间与所述文本基本语义单位的上一个文本基本语义单位的终止时间之间的时间间距，获取各个所述文本基本语义单位组中所述起始时间与所述终止时间的时间间距的和，将所述时间间距的和作为所述文本基本语义单位组的误差值。Calculating, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the previous text basic semantic unit, obtaining the sum of these time intervals for each text basic semantic unit group, and using the sum of the time intervals as the error value of that text basic semantic unit group.
  6. 根据权利要求5所述的自动生成配音文字的方法，其特征在于，所述对每一所述文本基本语义单位组中，各个文本基本语义单位的所有起止时间信息进行筛选，确定组成所述单句的文本基本语义单位组，包括：The method for automatically generating voice-over text according to claim 5, wherein the screening of all the start and end time information of the text basic semantic units in each text basic semantic unit group to determine the text basic semantic unit group constituting the single sentence comprises:
    对各个所述文本基本语义单位组进行过滤,保留误差值低于预设的阈值的文本基本语义单位组。Each of the text basic semantic unit groups is filtered, and a text basic semantic unit group whose error value is lower than a preset threshold is retained.
  7. 根据权利要求6所述的自动生成配音文字的方法,其特征在于,在所述保留误差值低于预设的阈值的文本基本语义单位组的步骤之后,包括:The method of automatically generating a dubbed character according to claim 6, wherein after the step of retaining the textual semantic unit group whose retention error value is lower than a preset threshold, the method comprises:
    计算保留的所述文本基本语义单位组内，每一文本基本语义单位中的起始时间大于所述文本基本语义单位的上一个文本基本语义单位的终止时间的次数，获取该次数最大的文本基本语义单位组。Counting, within each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of its previous text basic semantic unit, and obtaining the text basic semantic unit group with the largest count.
  8. 根据权利要求4-7中任意一项所述的自动生成配音文字的方法,其特征在于,所述识别所述文本信息,从而获取文本基本语义单位包括:The method for automatically generating a dubbed character according to any one of claims 4-7, wherein the recognizing the text information to obtain a text basic semantic unit comprises:
    从所述文本信息中,按照每句内的每个字的顺序进行识别获取所述文本信息中的文本基本语义单位。From the text information, the basic semantic unit of the text in the text information is obtained by identifying each word in each sentence.
  9. 根据权利要求8所述的自动生成配音文字的方法，其特征在于，在将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中时，若所述音频基本语义单位的起止时间信息为空值，则使与所述音频基本语义单位相应的所述文本基本语义单位的取值为空值。The method for automatically generating voice-over text according to claim 8, wherein, when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the value of the text basic semantic unit corresponding to that audio basic semantic unit is set to null.
  10. 根据权利要求9所述的自动生成配音文字的方法,其特征在于,在所述确定组成所述单句的文本基本语义单位组的步骤之后,包括:The method of automatically generating a dubbed character according to claim 9, wherein after the step of determining a textual semantic unit group constituting the single sentence, the method comprises:
    按照预定的推算方式,对取值为空值的所述文本基本语义单位推算起止时间信息。The start and end time information is estimated for the basic semantic unit of the text whose value is a null value according to a predetermined calculation manner.
  11. 根据权利要求10所述的自动生成配音文字的方法,其特征在于,所述预定的推算方式,包括:The method for automatically generating a voice-over character according to claim 10, wherein the predetermined calculation method comprises:
    计算所述文本基本语义单位组中的文本基本语义单位的平均时间信息;Calculating average time information of basic semantic units of text in the basic semantic unit group of the text;
    将取值为空值的所述文本基本语义单位的上一个文本基本语义单位中的终止时间，放入取值为空值的所述文本基本语义单位的起始时间中；Putting the end time of the previous text basic semantic unit into the start time of the null-valued text basic semantic unit;
    将所述终止时间加上所述平均时间信息后,放入取值为空值的所述文本基本语义单位的终止时间中。After the termination time is added to the average time information, the end time of the basic semantic unit of the text whose value is null is placed.
  12. 一种自动生成配音文字的装置,其特征在于,包括: An apparatus for automatically generating a voice-over character, comprising:
    音频识别单元,用于对音频信息进行识别,获取识别出的各个音频基本语义单位的起止时间信息;An audio recognition unit, configured to identify the audio information, and obtain the start and end time information of the identified basic audio semantic units of each audio;
    文本识别单元,用于获取与所述音频信息对应的文本信息,并识别所述文本信息,从而获取文本基本语义单位;a text recognition unit, configured to acquire text information corresponding to the audio information, and identify the text information, thereby acquiring a basic semantic unit of the text;
    时间写入单元,用于将各个所述音频基本语义单位的起止时间信息,记录到相应的所述文本基本语义单位中;a time writing unit, configured to record start and end time information of each of the audio basic semantic units into a corresponding basic semantic unit of the text;
    配音文字生成单元,用于对记录了所述起止时间信息的所述文本基本语义单位进行处理,生成对应所述音频信息的配音文字。The voice-over character generating unit is configured to process the text basic semantic unit in which the start and end time information is recorded, and generate a voice-over text corresponding to the audio information.
  13. 根据权利要求12所述的自动生成配音文字的装置,其特征在于,所述配音文字生成单元,包括:The apparatus for automatically generating a voice-sending character according to claim 12, wherein the voice-over character generating unit comprises:
    文本语义获取子单元,用于针对所述文本信息中每一单句,获取组成所述单句的文本基本语义单位;a text semantic acquisition subunit, configured to obtain, for each single sentence in the text information, a basic semantic unit of text constituting the single sentence;
    时间信息确定子单元,用于根据已获取的所述文本基本语义单位中记录的起止时间信息确定所述单句的起止时间信息;a time information determining subunit, configured to determine start and end time information of the single sentence according to the start and end time information recorded in the basic semantic unit of the text that has been acquired;
    配音文字生成子单元,用于将确定了起止时间信息的所述单句进行整合,形成对应所述音频信息,且具有每一单句的起止时间信息的配音文字。The voice-over character generating sub-unit is configured to integrate the single sentences that determine the start and end time information to form a voice-over text corresponding to the audio information and having start and end time information of each single sentence.
  14. 根据权利要求13所述的自动生成配音文字的装置,其特征在于,所述文本语义获取子单元,具体用于针对所述文本信息中每一单句,获取组成所述单句的文本基本语义单位时,若所述文本基本语义单位中记录了至少两组起止时间信息,则按照起止时间信息的组数,分别形成组成所述单句的文本基本语义单位组。The apparatus for automatically generating a voice-over character according to claim 13, wherein the text semantic acquisition sub-unit is specifically configured to acquire a basic semantic unit of a text constituting the single sentence for each single sentence in the text information. If at least two sets of start and end time information are recorded in the basic semantic unit of the text, the text basic semantic unit group constituting the single sentence is respectively formed according to the number of groups of the start and end time information.
  15. 根据权利要求14所述的自动生成配音文字的装置,其特征在于,还包括:The apparatus for automatically generating a voice-over character according to claim 14, further comprising:
    文本语义筛选子单元，用于在所述按照起止时间信息的组数，分别形成组成所述单句的文本基本语义单位组之后，根据预定的计算方法，对每一所述文本基本语义单位组中，各个文本基本语义单位的所有起止时间信息进行筛选，确定组成所述单句的文本基本语义单位组。A text semantic screening subunit, configured to, after the text basic semantic unit groups constituting the single sentence are formed according to the number of sets of start and end time information, screen all the start and end time information of the text basic semantic units in each text basic semantic unit group according to a predetermined calculation method, to determine the text basic semantic unit group that constitutes the single sentence.
  16. 根据权利要求15所述的自动生成配音文字的装置,其特征在于,所述文本语义筛选子单元,包括:The apparatus for automatically generating a voice-over character according to claim 15, wherein the text semantic screening sub-unit comprises:
    误差计算子单元，用于计算各个所述文本基本语义单位组内，每一文本基本语义单位中的起始时间与所述文本基本语义单位的上一个文本基本语义单位的终止时间之间的时间间距，获取各个所述文本基本语义单位组中所述起始时间与所述终止时间的时间间距的和，将所述时间间距的和作为所述文本基本语义单位组的误差值。An error calculation subunit, configured to calculate, within each text basic semantic unit group, the time interval between the start time of each text basic semantic unit and the end time of the previous text basic semantic unit, obtain the sum of these time intervals for each text basic semantic unit group, and use the sum of the time intervals as the error value of that text basic semantic unit group.
  17. 根据权利要求15所述的自动生成配音文字的装置,其特征在于,所述文本语义筛选子单元,还包括:The apparatus for automatically generating a voice-over character according to claim 15, wherein the text semantic screening sub-unit further comprises:
    过滤子单元,用于对各个所述文本基本语义单位组进行过滤,保留误差值低于预设的阈值的文本基本语义单位组。The filtering subunit is configured to filter each of the text basic semantic unit groups, and retain a text basic semantic unit group whose error value is lower than a preset threshold.
  18. 根据权利要求17所述的自动生成配音文字的装置,其特征在于,所述文本语义筛选子单元,还包括:The apparatus for automatically generating a voice-over character according to claim 17, wherein the text semantic screening sub-unit further comprises:
    时间次数计算子单元，用于在所述保留误差值低于预设的阈值的文本基本语义单位组之后，计算保留的所述文本基本语义单位组内，每一文本基本语义单位中的起始时间大于所述文本基本语义单位的上一个文本基本语义单位的终止时间的次数，获取该次数最大的文本基本语义单位组。A time count calculation subunit, configured to, after the text basic semantic unit groups whose error values are below the preset threshold are retained, count, within each retained text basic semantic unit group, the number of times the start time of a text basic semantic unit is greater than the end time of its previous text basic semantic unit, and obtain the text basic semantic unit group with the largest count.
  19. 根据权利要求15-18中任意一项所述的自动生成配音文字的装置，其特征在于，所述文本识别单元，具体用于从所述文本信息中，按照每句内的每个字的顺序进行识别获取所述文本信息中的文本基本语义单位。The apparatus for automatically generating voice-over text according to any one of claims 15-18, wherein the text recognition unit is specifically configured to recognize, from the text information, the text basic semantic units in the text information in the order of each word within each sentence.
  20. 根据权利要求19所述的自动生成配音文字的装置，其特征在于，所述时间写入单元，具体用于在将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中时，若所述音频基本语义单位的起止时间信息为空值，则使与所述音频基本语义单位相应的所述文本基本语义单位的取值为空值。The apparatus for automatically generating voice-over text according to claim 19, wherein the time writing unit is specifically configured such that, when recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is a null value, the value of the text basic semantic unit corresponding to that audio basic semantic unit is set to null.
  21. 根据权利要求20所述的自动生成配音文字的装置,其特征在于,还包括:The apparatus for automatically generating a voice-over character according to claim 20, further comprising:
    时间推算单元,用于在所述确定组成所述单句的文本基本语义单位组之后,按照预定的推算方式,对取值为空值的所述文本基本语义单位推算起止时间信息。And a time estimating unit, configured to calculate start and end time information on the basic semantic unit of the text whose value is a null value, after determining the text basic semantic unit group constituting the single sentence, according to a predetermined calculation manner.
  22. 根据权利要求21所述的自动生成配音文字的装置,其特征在于,所述时间推算单元,包括:The apparatus for automatically generating a dubbed character according to claim 21, wherein the time estimating unit comprises:
    平均时间计算子单元,用于计算所述文本基本语义单位组中的文本基本语义单位的平均时间信息;An average time calculation subunit for calculating average time information of a basic semantic unit of text in the text basic semantic unit group;
    起始时间写入子单元，用于将取值为空值的所述文本基本语义单位的上一个文本基本语义单位中的终止时间，放入取值为空值的所述文本基本语义单位的起始时间中；A start time writing subunit, configured to put the end time of the previous text basic semantic unit into the start time of the null-valued text basic semantic unit;
    终止时间写入子单元，用于将所述终止时间加上所述平均时间信息后，放入取值为空值的所述文本基本语义单位的终止时间中。An end time writing subunit, configured to add the average time information to said end time and put the result into the end time of the null-valued text basic semantic unit.
  23. 一种电子设备,其特征在于,所述电子设备包括: An electronic device, comprising:
    显示器;monitor;
    处理器;processor;
    存储器，用于存储配音文字生成程序，所述程序在被所述处理器读取执行时，执行如下操作：对音频信息进行识别，获取识别出的各个音频基本语义单位的起止时间信息；获取与所述音频信息对应的文本信息，并识别所述文本信息，从而获取文本基本语义单位；将各个所述音频基本语义单位的起止时间信息，记录到相应的所述文本基本语义单位中；对记录了所述起止时间信息的所述文本基本语义单位进行处理，生成对应所述音频信息的配音文字。A memory for storing a voice-over text generating program which, when read and executed by the processor, performs the following operations: recognizing audio information and acquiring the start and end time information of each recognized audio basic semantic unit; acquiring the text information corresponding to the audio information and recognizing that text information to obtain text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information is recorded to generate voice-over text corresponding to the audio information.
PCT/CN2017/115194 2016-12-22 2017-12-08 Method and apparatus for automatically generating dubbing characters, and electronic device WO2018113535A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611196447.6A CN108228658B (en) 2016-12-22 2016-12-22 Method and device for automatically generating dubbing characters and electronic equipment
CN201611196447.6 2016-12-22

Publications (1)

Publication Number Publication Date
WO2018113535A1 true WO2018113535A1 (en) 2018-06-28

Family

ID=62624697

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/115194 WO2018113535A1 (en) 2016-12-22 2017-12-08 Method and apparatus for automatically generating dubbing characters, and electronic device

Country Status (3)

Country Link
CN (1) CN108228658B (en)
TW (1) TWI749045B (en)
WO (1) WO2018113535A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110858492A (en) * 2018-08-23 2020-03-03 阿里巴巴集团控股有限公司 Audio editing method, device, equipment and system and data processing method
CN110728116B (en) * 2019-10-23 2023-12-26 深圳点猫科技有限公司 Method and device for generating video file dubbing manuscript
CN113571061B (en) * 2020-04-28 2024-12-13 阿里巴巴集团控股有限公司 Speech transcription text editing system, method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6345252B1 (en) * 1999-04-09 2002-02-05 International Business Machines Corporation Methods and apparatus for retrieving audio information using content and speaker information
CN1949227A (en) * 2006-10-24 2007-04-18 北京搜狗科技发展有限公司 Searching method, system and apparatus for playing media file
CN101616264A (en) * 2008-06-27 2009-12-30 中国科学院自动化研究所 News Video Cataloging Method and System
CN104599693A (en) * 2015-01-29 2015-05-06 语联网(武汉)信息技术有限公司 Preparation method of lines synchronized subtitles

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4026543B2 (en) * 2003-05-26 2007-12-26 日産自動車株式会社 Vehicle information providing method and vehicle information providing device
CN101615417B (en) * 2009-07-24 2011-01-26 北京海尔集成电路设计有限公司 Synchronous Chinese lyrics display method which is accurate to words
GB2502944A (en) * 2012-03-30 2013-12-18 Jpal Ltd Segmentation and transcription of speech
CN204559707U (en) * 2015-04-23 2015-08-12 南京信息工程大学 Teleprompter with speech recognition function
CN105788589B (en) * 2016-05-04 2021-07-06 腾讯科技(深圳)有限公司 Audio data processing method and device

Also Published As

Publication number Publication date
CN108228658B (en) 2022-06-03
TW201832222A (en) 2018-09-01
TWI749045B (en) 2021-12-11
CN108228658A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
EP1693829B1 (en) Voice-controlled data system
US8666727B2 (en) Voice-controlled data system
Rubin et al. Content-based tools for editing audio stories
TW202008349A (en) Speech labeling method and apparatus, and device
WO2017157142A1 (en) Song melody information processing method, server and storage medium
KR101292698B1 (en) Method and apparatus for attaching metadata
US7522967B2 (en) Audio summary based audio processing
WO2018229693A1 (en) Method and system for automatically generating lyrics of a song
US20080077869A1 (en) Conference supporting apparatus, method, and computer program product
WO2011146366A1 (en) Methods and systems for performing synchronization of audio with corresponding textual transcriptions and determining confidence values of the synchronization
WO2018113535A1 (en) Method and apparatus for automatically generating dubbing characters, and electronic device
US10140310B1 (en) Identifying and utilizing synchronized content
CN109213977A System for generating court trial records
US8706484B2 (en) Voice recognition dictionary generation apparatus and voice recognition dictionary generation method
CN102881282A (en) Method and system for obtaining prosodic boundary information
JP4697432B2 (en) Music playback apparatus, music playback method, and music playback program
US20100222905A1 (en) Electronic apparatus with an interactive audio file recording function and method thereof
EP1826686B1 (en) Voice-controlled multimedia retrieval system
CN114999464A (en) Voice data processing method and device
CN115329125A Song medley splicing method and device
CN110781651A Method for inserting pauses in text-to-speech conversion
JP2007272815A (en) Content server apparatus, genre setting method and computer program
CN101266790A Device and method for automatically time-stamping text files
CN108648733B (en) A method and system for generating Diqu
CN110895575A (en) Audio processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17883425

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17883425

Country of ref document: EP

Kind code of ref document: A1