CN114882904A - Two-stage voice alignment method, electronic device, and storage medium
- Publication number: CN114882904A (application CN202210274508.5A)
- Authority: CN (China)
- Prior art keywords: level, word, phoneme, data, frame data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/27 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00, characterised by the analysis technique
- G10L25/48 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00, specially adapted for particular use
Description
TECHNICAL FIELD
Embodiments of the present invention relate to the field of computer technology, and in particular to a two-level speech alignment method, an electronic device, and a storage medium.
BACKGROUND ART
With the rapid development of artificial intelligence, technologies in the speech field have matured and are applied ever more widely. As an important link in speech processing, speech alignment provides essential support for speech recognition, speech synthesis, speech evaluation, and other technologies. The accuracy of speech alignment boundaries determines the accuracy of recognition and evaluation results as well as the quality of synthesized audio.
In the current art, phoneme-level alignment models are highly sensitive, so an alignment result may be displaced at its start or trail off at its end. As a result, the transcript finally obtained for the speech does not fully correspond to the speech itself and may contain errors.
SUMMARY OF THE INVENTION
The present application provides a two-level speech alignment method, an electronic device, and a storage medium, to solve some or all of the above technical problems in the prior art.
In a first aspect, the present application provides a two-level speech alignment method, the method comprising:
acquiring speech data and a word-level transcript corresponding to the speech data;
inputting the speech data and the word-level transcript corresponding to the speech data into a pre-built word-level alignment model to obtain an initial word-level alignment result corresponding to the speech data, wherein the initial word-level alignment result comprises multiple frames of word-level frame data, each frame of word-level frame data corresponds to one word in the word-level transcript or to one first-type placeholder, and one word in the word-level transcript corresponds to at least one frame of word-level frame data;
traversing each frame of word-level frame data, merging adjacent frames that belong to the same word to obtain first merged frame data, and recording the start time and end time of the first merged frame data;
merging, according to a first preset rule, a first-type placeholder with the word-level frame data of the word immediately adjacent to it, to obtain second merged frame data;
updating the start time and end time of the second merged frame data, finally obtaining a word-level aligned transcript corresponding to the speech data;
acquiring a first spectral feature sequence corresponding to the speech data and an information vector corresponding to the speech data;
inputting the word-level aligned transcript, the first spectral feature sequence, and the information vector into a pre-built phoneme-level alignment model to obtain an initial phoneme-level alignment result corresponding to the speech data, the initial phoneme-level alignment result comprising multiple frames of phoneme-level frame data, each frame of which corresponds to one phoneme unit or to one second-type placeholder;
traversing each frame of phoneme-level frame data, and merging at least one second-type placeholder adjacent to the phoneme-level frame data of a phoneme unit in the initial phoneme-level alignment result with the phoneme-level frame data of that phoneme unit, to obtain third merged frame data;
recording the start time and end time of the third merged frame data as the start time and end time of the phoneme unit, finally obtaining a phoneme-level aligned transcript corresponding to the speech data.
Optionally, merging, according to the first preset rule, a first-type placeholder with the word-level frame data of the word immediately adjacent to it to obtain the second merged frame data specifically comprises:
when word-level frame data of a word lies to the left of the first-type placeholder, merging the first-type placeholder with that word-level frame data, taking the end time of the first-type placeholder as the end time of the second merged frame, and taking the start time of the word-level frame data of the word on the left as the start time of the second merged frame;
or, when word-level frame data of a word lies to the right of the first-type placeholder, merging the first-type placeholder with that word-level frame data, taking the start time of the first-type placeholder as the start time of the second merged frame, and taking the end time of the word-level frame data of the word on the right as the end time of the second merged frame.
In a second aspect, an electronic device is provided, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program;
the processor is configured to, when executing the program stored in the memory, implement the steps of the two-level speech alignment method of any embodiment of the first aspect.
In a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the two-level speech alignment method of any embodiment of the first aspect are implemented.
Compared with the prior art, the above technical solutions provided in the embodiments of the present application have the following advantages:
In the method provided by the embodiments of the present application, speech data and a word-level transcript corresponding to the speech data are acquired. The speech data and the word-level transcript are then input into a pre-built word-level alignment model to obtain an initial word-level alignment result. Each frame of word-level frame data in the initial word-level alignment result is traversed, and the frame data corresponding to words and the first-type placeholders are merged respectively. A placeholder represents non-speech data. The time a placeholder occupies in the speech data is allocated to the adjacent word, which guarantees that the speech of every word in the speech data corresponds to a word in the transcript, achieving word-level transcript alignment.
On this basis, each word of the word-level aligned transcript serves as the phoneme transcript, and it is input into a pre-built phoneme-level alignment model together with the first spectral feature sequence and the information vector corresponding to the speech data, to obtain an initial phoneme-level alignment result. Each frame of phoneme-level frame data is then traversed and a merging process is performed, in which second-type placeholders are merged with the frame data of adjacent phonemes to obtain merged frames; a second-type placeholder is likewise a phoneme-level placeholder corresponding to non-speech data. The start and end times of a merged frame become the start and end times of the phoneme, thereby achieving phoneme-level transcript alignment. Because the time of all non-speech data has been reasonably allocated to the phoneme-level frame data of the adjacent phonemes, the resulting phoneme-level transcript is neither displaced at the start of the alignment result nor trailing at its end; the phoneme-level transcript finally obtained essentially corresponds to the speech itself, eliminating errors as far as possible.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of a two-level speech alignment method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the effect of a word-level aligned transcript provided by an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a method for obtaining third merged frame data provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the effect of obtaining a phoneme-level aligned transcript provided by an embodiment of the present invention;
FIG. 5 is a schematic overall flowchart of the method for constructing a word-level alignment model provided by the present invention;
FIG. 6 is a schematic overall flowchart of the method for constructing a phoneme-level alignment model provided by the present invention;
FIG. 7 is a schematic structural diagram of a two-level speech alignment apparatus provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To facilitate understanding, the embodiments of the present invention are further explained below with specific embodiments in conjunction with the accompanying drawings; the embodiments do not constitute limitations on the embodiments of the present invention.
In view of the technical problems mentioned in the background art, an embodiment of the present application provides a two-level speech alignment method. Referring to FIG. 1, a schematic flowchart of a two-level speech alignment method provided by an embodiment of the present invention, the method comprises the following steps:
Step 110: acquire speech data and a word-level transcript corresponding to the speech data.
Specifically, the speech data is audio collected by a sound pickup device. The word-level transcript may be preset or annotated afterwards. A preset transcript may be, for example, a book or a passage of lines; any transcript will do. Suitable speakers then record the book, lines, or other transcript to produce the speech data.
Step 120: input the speech data and the corresponding word-level transcript into a pre-built word-level alignment model to obtain an initial word-level alignment result corresponding to the speech data.
Specifically, the word-level alignment model is a pre-built model, for example a pre-built end-to-end neural network model. Its construction is described in detail below and not elaborated here.
The initial word-level alignment result produced by the word-level alignment model may comprise multiple frames of word-level frame data, where each frame corresponds to one word in the word-level transcript or to one first-type placeholder, and each word in the transcript corresponds to at least one frame of word-level frame data.
As an example, suppose the speech data is "non-speech data + 你好". The non-speech data, for example noise, occupies frames 1 through 5 of the corresponding word-level transcript and is represented by first-type placeholders. The word "你" corresponds to frames 6 and 7, and "好" to frames 8 and 9. To distinguish them from the phoneme-level frame data below, the frames output by the word-level alignment model are called word-level frame data; in essence each is simply one frame of data. In addition, in the initial word-level alignment result, a word-level placeholder may be inserted at the beginning of the transcript, at its end, and between words. Word-level placeholders are added at these positions because, in the initial word-level alignment result, not every frame of data necessarily maps to a word; some frames may not correspond to any voiced position and must be represented by word-level placeholders.
Step 130: traverse each frame of word-level frame data, merge adjacent frames belonging to the same word to obtain first merged frame data, and record the start time and end time of the first merged frame data.
Step 140: according to the first preset rule, merge each first-type placeholder with the word-level frame data of the word immediately adjacent to it, to obtain second merged frame data.
Step 150: update the start time and end time of the second merged frame data, finally obtaining the word-level aligned transcript corresponding to the speech data.
Specifically, as introduced above, the initial word-level alignment result comprises multiple frames of word-level frame data, each of which maps to a word or to a word-level placeholder. Under the decoding rule of CTC (Connectionist Temporal Classification), several input frames correspond to one output; for example, the word "你" corresponds to frames 6 and 7. Therefore each frame of word-level frame data must be traversed and adjacent frames belonging to the same word merged, to obtain the first merged frame data.
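As a minimal illustration of this merge (not the patent's implementation), the sketch below collapses runs of identical frame labels into segments with start and end times. The placeholder tag "&lt;ph&gt;" and the 10 ms frame shift are assumptions; the embodiment does not fix either.

```python
from itertools import groupby

FRAME_SHIFT = 0.01  # assumed frame shift in seconds; not fixed by this embodiment

def collapse_frames(labels):
    """Merge adjacent frames with the same label into (label, start, end) segments."""
    segments, t = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append((label, t * FRAME_SHIFT, (t + n) * FRAME_SHIFT))
        t += n
    return segments

# frames 1-5 are first-type placeholders, frames 6-7 are "你", frames 8-9 are "好"
print(collapse_frames(["<ph>"] * 5 + ["你"] * 2 + ["好"] * 2))
# [('<ph>', 0.0, 0.05), ('你', 0.05, 0.07), ('好', 0.07, 0.09)]
```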
At the same time, according to the first preset rule, each first-type placeholder must also be merged with the word-level frame data of the word immediately adjacent to it, to obtain the second merged frame data.
Specifically, merging a first-type placeholder with the word-level frame data of its neighboring word according to the first preset rule to obtain the second merged frame data may be implemented as follows:
when word-level frame data of a word lies to the left of the first-type placeholder, the first-type placeholder is merged with that word-level frame data, the end time of the first-type placeholder is taken as the end time of the second merged frame, and the start time of the word-level frame data on the left is taken as the start time of the second merged frame;
or, when word-level frame data of a word lies to the right of the first-type placeholder, the first-type placeholder is merged with that word-level frame data, the start time of the first-type placeholder is taken as the start time of the second merged frame, and the end time of the word-level frame data on the right is taken as the end time of the second merged frame.
For example, as noted above, frames 1 through 5 are all first-type placeholders and frame 6 is "你" (assuming the two frames of word-level frame data for "你" have already been merged into a first merged frame, which is currently the frame 6 data).
In one optional embodiment, the first-type placeholder of frame 5 may first be merged with the word-level frame data of "你" in the first merged frame; then the first-type placeholder of frame 4 is merged with the current frame data of "你", and so on, until all the first-type placeholders have been merged into "你".
In a second optional embodiment, the word-level placeholders of the first five frames may first be merged into one frame of word-level frame data, which is then merged with the word-level frame data of "你" in the first merged frame.
In a third optional example, suppose frames 6 and 7 of the initial word-level alignment result have not yet been merged, i.e. "你" still corresponds to two frames of word-level frame data, while the first five frames are all first-type placeholders.
Then the first five placeholder frames may be merged in either of the two ways introduced above, merged with the word-level frame data of "你" in frame 6, and the resulting merged frame then merged with the word-level frame data of frame 7, giving the final word-level frame data for "你".
Consider also that there is a first-type placeholder between the word-level frame data of "你" and that of "好".
That placeholder may be configured to merge with the word-level frame data of the word on its left or with that of the word on its right; the specific merging direction can be set according to the actual situation. Similarly, suppose "好" is also followed by a first-type placeholder; since no word follows "好", that placeholder is merged with the word-level frame data of "好".
In addition, while merging word-level frame data, the start time and end time of the merged frame data must be determined. For example, the start time of "你" mentioned above is the start time of its first frame of word-level frame data, and its end time is the end time of its last frame of word-level frame data (assuming here that the first-type placeholder following the frame data of "你" is merged with the frame data of "好").
It should be noted that the execution order of step 140 and step 150 is not fixed; it can be switched freely according to the actual situation. For how to switch, see the cases introduced above.
In short, once all the word-level frame data have been merged in the above manner, the word-level aligned transcript corresponding to the speech data is obtained.
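The sketch below illustrates the first preset rule under the same assumptions as the previous snippet: placeholder segments are absorbed into a neighboring word segment and the boundary times are extended accordingly. Merging leading placeholders rightward and all later ones leftward is one possible configuration, not mandated by the embodiment.

```python
def absorb_placeholders(segments, ph="<ph>"):
    """Merge placeholder segments into an adjacent word segment, extending its times.

    Leading placeholders merge into the word on their right; all later
    placeholders merge into the word on their left (one possible configuration).
    """
    merged = []
    pending_start = None  # start time of leading placeholders awaiting a word
    for label, start, end in segments:
        if label == ph:
            if merged:                    # placeholder after a word: extend that word
                word, word_start, _ = merged[-1]
                merged[-1] = (word, word_start, end)
            elif pending_start is None:   # leading placeholder: remember its start
                pending_start = start
        else:
            if pending_start is not None:  # word absorbs the leading placeholders
                start, pending_start = pending_start, None
            merged.append((label, start, end))
    return merged

segs = [("<ph>", 0.0, 0.05), ("你", 0.05, 0.07), ("好", 0.07, 0.09), ("<ph>", 0.09, 0.12)]
print(absorb_placeholders(segs))
# [('你', 0.0, 0.07), ('好', 0.07, 0.12)]
```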
FIG. 2 shows a word-level aligned transcript obtained in the above manner. FIG. 2 uses "noise + 制冷模式 (cooling mode) + noise" as the example: the first-type word-level placeholders of the leading noise are merged with the word-level frame data of "制", giving the start and end times of "制". For the rest, see the description above, which is not repeated here.
Step 160: acquire the first spectral feature sequence corresponding to the speech data and the information vector corresponding to the speech data.
Specifically, after the speech data is acquired, feature extraction may be performed on it to obtain the first spectral feature sequence. The extraction process itself is prior art: briefly, the time-domain speech signal is first converted into a frequency-domain signal, and features are then extracted from the frequency-domain signal to obtain the first spectral feature sequence; details are omitted here. The information vector corresponding to the speech data (the speaker's information vector, see below) is likewise obtained through the prior art and is not elaborated here.
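The embodiment leaves the exact feature type to the prior art. As one common choice, the hedged sketch below computes a log-magnitude STFT feature sequence with NumPy; the 16 kHz rate, 25 ms frames, 10 ms shift, and FFT size are illustrative assumptions, not requirements of the embodiment.

```python
import numpy as np

def spectral_feature_sequence(wave, sr=16000, frame_ms=25, shift_ms=10, n_fft=512):
    """Return a (num_frames, n_fft // 2 + 1) log-magnitude spectral feature sequence."""
    frame = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    window = np.hanning(frame)
    feats = []
    for start in range(0, len(wave) - frame + 1, shift):
        spectrum = np.fft.rfft(wave[start:start + frame] * window, n=n_fft)
        feats.append(np.log(np.abs(spectrum) + 1e-10))  # log-magnitude spectrum
    return np.array(feats)

wave = np.random.randn(16000)  # one second of dummy audio at 16 kHz
print(spectral_feature_sequence(wave).shape)  # (98, 257)
```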
Step 170: input the word-level aligned transcript, the first spectral feature sequence, and the information vector into a pre-built phoneme-level alignment model to obtain an initial phoneme-level alignment result corresponding to the speech data.
Specifically, similar to the initial word-level alignment result, the initial phoneme-level alignment result comprises multiple frames of phoneme-level frame data, each frame corresponding to one phoneme unit or to one second-type placeholder. The construction of the pre-built phoneme-level alignment model is likewise detailed below and not elaborated here; for now only its function is introduced. The phoneme-level alignment model may be, for example, a Gaussian mixture model.
Step 180: traverse each frame of phoneme-level frame data, and merge at least one second-type placeholder adjacent to the phoneme-level frame data of a phoneme unit in the initial phoneme-level alignment result with the phoneme-level frame data of that phoneme unit, to obtain third merged frame data.
Step 190: record the start time and end time of the third merged frame data as the start time and end time of the phoneme unit, finally obtaining the phoneme-level aligned transcript corresponding to the speech data.
Specifically, in one optional example, traversing each frame of phoneme-level frame data and merging at least one adjacent second-type placeholder with the phoneme-level frame data of the phoneme unit to obtain the third merged frame data may be implemented as shown in FIG. 3, whose steps include:
Step 310: when the phoneme-level frame data is adjacent to at least one second-type placeholder, merge the at least one second-type placeholder to obtain fourth merged frame data, and determine the start time and end time of the fourth merged frame data.
Step 320: merge the phoneme-level frame data with the fourth merged frame data to obtain the third merged frame data.
Specifically, a second-type placeholder is, like a first-type placeholder, an ordinary placeholder. Its purpose is to distinguish the phoneme-level (or word-level) frame data of speech data from that of non-speech data.
In one optional example, the second-type placeholders may include phoneme-level placeholders and/or silence markers.
Then, when the phoneme-level frame data is adjacent to at least one second-type placeholder, merging the at least one second-type placeholder to obtain the fourth merged frame data may include the following cases.
First, when the second-type placeholders include only phoneme-level placeholders, each phoneme-level placeholder is converted into a silence marker, after which the silence markers are merged to obtain the fourth merged frame data.
Second, when the second-type placeholders include, in order, a phoneme-level placeholder and a silence placeholder (the two being adjacent), the phoneme-level placeholder is merged with its adjacent silence placeholder, then all silence placeholders are merged, to obtain the fourth merged frame data.
Further optionally, recording the start time and end time of the third merged frame data as the start time and end time of the phoneme unit specifically includes the following steps:
when the fourth merged frame data lies to the left of the phoneme-level frame data, the start time of the fourth merged frame data is taken as the start time of the third merged frame data, and the end time of the phoneme-level frame data as the end time of the third merged frame data;
or, when the fourth merged frame data lies to the right of the phoneme-level frame data, the start time of the phoneme-level frame data is taken as the start time of the third merged frame data, and the end time of the fourth merged frame data as the end time of the third merged frame data.
The following examples take the case where the second-type placeholders include both phoneme-level placeholders and silence markers.
In one specific example, when the second-type placeholders include a phoneme-level placeholder followed by a silence marker, the phoneme-level placeholder is first converted into a silence marker and then merged with the silence on its right into a single silence marker, the boundary times being merged at the same time.
In another specific example, when the order is a silence marker followed by a phoneme-level placeholder, the phoneme-level placeholder is first converted into a silence marker and then merged with the silence on its left into a single silence marker, the boundary times being merged at the same time.
In another specific example, when the order is silence marker, phoneme-level placeholder, silence marker, the phoneme-level placeholder is first converted into a silence marker and then merged with the silence markers on its left and right into a single silence marker, the boundary times being merged at the same time.
Although the above steps all involve converting phoneme-level placeholders into silence markers, in practice this is not strictly necessary. For example, if the second-type placeholders include only phoneme-level placeholders, they can be merged directly and then merged (times included) with the phoneme-level frame data of the adjacent phoneme unit, without conversion to silence markers. Only when both phoneme-level placeholders and silence markers are present is it preferable to unify them into one kind before merging with the phoneme-level frame data of the phoneme unit, which is more efficient computationally.
Of course, the conversion need not be from phoneme-level placeholders to silence markers; the reverse is equally possible. The purpose of these operations is simply to make the merging of frame data more efficient. Thus one may first convert phoneme-level placeholders into silence markers and then merge all silence markers; or directly merge the phoneme-level placeholders with the phoneme-level frame data and then merge that with the silence markers; or merge in order. For example, if the phoneme-level alignment contains, in order: phoneme-level placeholder, phoneme-level placeholder, phoneme-level frame data, silence marker, phoneme-level placeholder, then the first two placeholder frames are merged, then merged with the phoneme-level frame data, then with the silence marker, and finally with the last phoneme-level placeholder.
All of the above fall within the implementable scope of this embodiment; the specific implementations are not enumerated one by one here.
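Building on the two sketches above (same (label, start, end) segment representation; absorb_placeholders is the helper defined in the word-level sketch), the snippet below first unifies second-type placeholders into silence markers and merges adjacent runs (step 310), then absorbs each run into a neighboring phoneme unit with its boundary times (step 320). The tags "&lt;ph2&gt;" and "&lt;sil&gt;" and the merge directions are assumptions, mirroring the word-level rule.

```python
SECOND_TYPE = {"<ph2>", "<sil>"}  # phoneme-level placeholder and silence marker (assumed tags)

def align_phonemes(segments):
    """Absorb second-type placeholders into adjacent phoneme segments, merging times."""
    # Step 310: unify placeholders as silence and merge adjacent runs of them.
    unified = []
    for label, start, end in segments:
        label = "<sil>" if label in SECOND_TYPE else label
        if unified and label == "<sil>" and unified[-1][0] == "<sil>":
            unified[-1] = ("<sil>", unified[-1][1], end)  # fourth merged frame data
        else:
            unified.append((label, start, end))
    # Step 320: merge each silence run into a neighboring phoneme unit.
    return absorb_placeholders(unified, ph="<sil>")  # reuses the word-level helper

segs = [("<ph2>", 0.0, 0.02), ("<sil>", 0.02, 0.05),
        ("zh", 0.05, 0.08), ("<ph2>", 0.08, 0.10), ("i4", 0.10, 0.14)]
print(align_phonemes(segs))
# [('zh', 0.0, 0.1), ('i4', 0.1, 0.14)]
```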
FIG. 4 illustrates the effect of obtaining a phoneme-level aligned transcript using the above method steps. Referring to FIG. 4, "non-speech data + 制冷模式 (cooling mode) + non-speech data" again serves as the speech data, and the figure shows the result obtained after the above operation steps. The whole process shows that phoneme-level alignment is fully achievable.
Further optionally, after the phoneme-level aligned transcript corresponding to the speech data is obtained, the method may further include:
optimizing the word-level aligned transcript corresponding to the speech data by using the phoneme-level aligned transcript corresponding to the speech data.
Specifically, once the phoneme-level aligned transcript is obtained, the frame data of the pinyin of every word in the speech data is composed of the frame data of its phonemes. Alignment has in effect already been completed at a finer granularity, so aligning the pinyin from the phoneme combination is more accurate; this constitutes the optimization of the word-level aligned transcript. The start time of the first phoneme of each pinyin is taken as the start time of that pinyin, and the end time of its last phoneme as the end time of that pinyin, thereby completing the optimization of the word-level alignment.
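A minimal sketch of this refinement, assuming each word is given with its ordered phoneme segments (the word-to-phoneme mapping would come from the pronunciation dictionary described further below):

```python
def refine_word_boundaries(word_to_phonemes):
    """Set each word's start/end to its first phoneme's start and last phoneme's end."""
    refined = []
    for word, phone_segs in word_to_phonemes:
        refined.append((word, phone_segs[0][1], phone_segs[-1][2]))
    return refined

pairs = [("制", [("zh", 0.00, 0.10), ("i4", 0.10, 0.14)]),
         ("冷", [("l", 0.14, 0.18), ("eng3", 0.18, 0.25)])]
print(refine_word_boundaries(pairs))
# [('制', 0.0, 0.14), ('冷', 0.14, 0.25)]
```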
Further optionally, as described above, acquiring the word-level aligned transcript requires inputting the speech data and its word-level transcript into the pre-built word-level alignment model, while acquiring the phoneme-level aligned transcript requires inputting the word-level aligned transcript, the first spectral feature sequence, and the information vector into the pre-built phoneme-level alignment model.
The following explains in detail how the word-level alignment model and the phoneme-level alignment model are constructed, with reference to FIG. 5 and FIG. 6: FIG. 5 shows the process of constructing the word-level alignment model, and FIG. 6 the process of constructing the phoneme-level alignment model.
First, referring to FIG. 5, the method steps include:
Step 510: acquire multiple pieces of speech sample data, and acquire the word-level transcript corresponding to each piece of speech sample data.
Specifically, the multiple pieces of speech sample data are speech samples of multiple persons, each person corresponding to at least two pieces of speech sample data. For the acquisition process, see the process of acquiring speech data and the corresponding word-level transcript described above, which is not repeated here.
Step 520: extract feature vectors from the multiple pieces of speech sample data of the i-th person to obtain the i-th information vector.
Step 530: perform frequency-domain feature extraction on the multiple pieces of speech sample data of the i-th person to obtain the second spectral feature sequence corresponding to each piece of speech sample data.
Specifically, feature vector extraction and frequency-domain feature extraction can both be performed with the prior art and are not elaborated here.
Step 540: iteratively train the word-level alignment model on each second spectral feature sequence, the word-level transcript corresponding to each piece of speech sample data, and the i-th information vector, until the word-level alignment model reaches a first preset standard, at which point the finally obtained word-level alignment model is acquired.
Specifically, the model can be trained following a conventional iterative neural network training procedure. However, to reduce the influence of the channel and of additive noise, before the iterative training each second spectral feature sequence may first be normalized to obtain the corresponding third spectral feature sequence that follows a normal distribution;
the third spectral feature sequences are then sorted by sequence length; finally, in sorted order, at the j-th training pass the third spectral feature sequence ranked j-th, the word-level transcripts corresponding to each piece of speech data, and the i-th information vector are input into the word-level alignment model, completing the j-th training of the word-level alignment model.
In one specific example, the third spectral feature sequences are ordered from long to short. Sorting them by length and training in that order gives the model a step-by-step process and improves its robustness; the speaker's information vector is added to improve the model's generalization.
In one optional example, whether the word-level alignment model has reached the first preset standard can be determined from a preconfigured loss function: the loss decreases as time and iterations increase, and the standard is reached when the loss falls below a preset threshold. Alternatively, training stops once the number of iterations reaches a preset count, the model obtained at that point being taken by default as the final word-level alignment model; here i and j are both positive integers.
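A hedged sketch of this data preparation: per-utterance mean/variance normalization is one common way to push features toward a standard normal distribution (the embodiment does not name a specific normalization method), followed by the long-to-short ordering described above. The sequence lengths and feature dimension are illustrative.

```python
import numpy as np

def normalize(feature_seq):
    """Per-utterance mean/variance normalization toward a standard normal distribution."""
    mean = feature_seq.mean(axis=0)
    std = feature_seq.std(axis=0) + 1e-10
    return (feature_seq - mean) / std

# hypothetical second spectral feature sequences of different lengths
seqs = [np.random.randn(n, 257) for n in (120, 340, 80)]
third = [normalize(s) for s in seqs]                   # third spectral feature sequences
training_order = sorted(third, key=len, reverse=True)  # longest first, per the example
print([len(s) for s in training_order])                # [340, 120, 80]
```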
FIG. 6 shows the process of constructing the phoneme-level alignment model, as follows:
Step 610: acquire multiple pieces of speech sample data, and acquire the word-level transcript and phoneme-level transcript corresponding to each piece of speech sample data.
The word-level transcripts are acquired as in the method steps corresponding to FIG. 5. The multiple pieces of speech sample data are speech samples of multiple persons, each person corresponding to at least two pieces of speech sample data.
Step 620: perform frequency-domain feature extraction on each piece of speech sample data to obtain the fourth spectral feature sequence corresponding to each piece of speech sample data.
Specifically, the feature extraction process is prior art and is not repeated here.
Step 630: normalize each fourth spectral feature sequence to obtain the corresponding fifth spectral feature sequence that follows a normal distribution.
Specifically, the purpose of normalizing the fourth spectral feature sequences is likewise to reduce the influence of the channel and of additive noise.
Step 640: construct a data group from each fifth spectral feature sequence and the word-level transcript and phoneme-level transcript corresponding to it.
For example, the three kinds of data are bound together and unique ID information is generated.
Step 650: randomly split all the data groups into multiple subsets, each subset including data groups corresponding to each of the multiple persons.
Specifically, having every subset include data groups of each of the multiple persons ensures that the model is more robust.
Step 660: iteratively train the phoneme-level alignment model on the subsets in turn, ending when the phoneme-level alignment model reaches a second preset standard, to obtain the final phoneme-level alignment model, i.e. the pre-built phoneme-level alignment model.
Specifically, the model produced by training on one subset serves as the initialization for training on the next subset, while the model of the first step is randomly initialized; the last step trains iteratively on all the training data. The subsets enter the phoneme-level alignment model in order of increasing size for alignment training, so that the model follows a step-by-step process of increasing complexity.
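A schematic sketch of this staged schedule; the model representation and the train(model, data) routine are hypothetical stand-ins for whatever alignment-training procedure is used:

```python
def staged_training(model_init, subsets, train):
    """Train on subsets from small to large; each stage starts from the previous model."""
    model = model_init()                     # first stage: random initialization
    for subset in sorted(subsets, key=len):  # subsets in order of increasing size
        model = train(model, subset)
    all_data = [group for subset in subsets for group in subset]
    return train(model, all_data)            # final stage: all training data

# usage with stand-in components
subsets = [[("seq", "words", "phones")] * n for n in (50, 10, 30)]
model = staged_training(lambda: {"trained_on": 0},
                        subsets,
                        lambda m, d: {"trained_on": m["trained_on"] + len(d)})
print(model)  # {'trained_on': 180}
```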
Besides the training process, the testing process of a conventional neural network model is also included, which is not repeated here.
Further optionally, besides the above operations, the method may also include constructing a word-level pronunciation dictionary and a phoneme-level pronunciation dictionary. When constructing the word-level pronunciation dictionary, the transcripts in the speech sample data can be split word by word to generate word-level transcripts; de-duplicating the words of all transcripts then yields the word-level pronunciation dictionary. This process requires no word segmentation.
When constructing the phoneme pronunciation dictionary, since each word may have several pronunciations, each word may map to several phoneme combinations. From the word-level transcript corresponding to the speech sample data, a phoneme-level transcript can be generated according to semantics and pronunciation rules. In a specific scenario each word essentially has a fixed pronunciation and corresponding meaning, so each word corresponds to a unique phoneme combination, in which the phonemes are separated by spaces for convenient program processing.
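A toy sketch of the two dictionaries; the character-to-phoneme entries are illustrative examples in the style of the "n i3 h ao3" notation used below, not a real lexicon:

```python
def build_word_dictionary(transcripts):
    """De-duplicate all characters of all transcripts; no word segmentation needed."""
    return sorted({char for text in transcripts for char in text})

# hypothetical phoneme-level pronunciation dictionary: one space-separated
# phoneme combination per character in this scenario, tones as trailing digits
phoneme_dict = {"你": "n i3", "好": "h ao3"}

def to_phoneme_transcript(word_transcript):
    """Map a word-level transcript to its phoneme-level transcript."""
    return " ".join(phoneme_dict[char] for char in word_transcript)

print(build_word_dictionary(["你好", "好冷"]))  # ['你', '冷', '好']
print(to_phoneme_transcript("你好"))            # n i3 h ao3
```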
When the word-level alignment model generates the initial word-level alignment result above, the word-level pronunciation dictionary can assist; by the same reasoning, when the phoneme-level alignment model generates the initial phoneme-level alignment result, the phoneme-level pronunciation dictionary can assist.
Furthermore, in the above initial word-level and phoneme-level alignment results, word-level or phoneme-level frame data involving tones also carries the corresponding tone annotation. For example, in the phoneme-level alignment result "你好" finally becomes "n i3 h ao3", where the 3 is the tone.
In the two-level speech alignment method provided by the embodiments of the present invention, speech data and a word-level transcript corresponding to the speech data are acquired. The speech data and the word-level transcript are then input into the pre-built word-level alignment model to obtain the initial word-level alignment result. Each frame of word-level frame data in the initial word-level alignment result is traversed, and the frame data corresponding to words and the first-type placeholders are merged respectively. A placeholder represents non-speech data. The time a placeholder occupies in the speech data is allocated to the adjacent word, which guarantees that the speech of every word in the speech data corresponds to a word in the transcript, achieving word-level transcript alignment.
On this basis, each word of the word-level aligned transcript serves as the phoneme transcript, and it is input into the pre-built phoneme-level alignment model together with the first spectral feature sequence and the information vector corresponding to the speech data, to obtain the initial phoneme-level alignment result. Each frame of phoneme-level frame data is then traversed and a merging process is performed, in which second-type placeholders are merged with the frame data of adjacent phonemes to obtain merged frames; a second-type placeholder is likewise a phoneme-level placeholder corresponding to non-speech data. The start and end times of a merged frame become the start and end times of the phoneme, thereby achieving phoneme-level transcript alignment. Because the time of all non-speech data has been reasonably allocated to the phoneme-level frame data of the adjacent phonemes, the resulting phoneme-level transcript is neither displaced at the start of the alignment result nor trailing at its end; the phoneme-level transcript finally obtained essentially corresponds to the speech itself, eliminating errors as far as possible.
The above are several method embodiments of the two-level speech alignment provided by the present application; other embodiments of the two-level speech alignment provided by the present application are described below.
FIG. 7 shows a two-level speech alignment apparatus provided by an embodiment of the present invention. The apparatus includes: an acquisition module 701, a processing module 702, a traversal module 703, a recording module 704, and an update module 705.
The acquisition module 701 is configured to acquire speech data and the word-level transcript corresponding to the speech data.
The processing module 702 is configured to input the speech data and the corresponding word-level transcript into the pre-built word-level alignment model to obtain the initial word-level alignment result corresponding to the speech data, wherein the initial word-level alignment result comprises multiple frames of word-level frame data, each frame corresponding to one word in the word-level transcript or to one first-type placeholder, and one word in the transcript corresponding to at least one frame of word-level frame data;
the traversal module 703 is configured to traverse each frame of word-level frame data;
the processing module 702 is further configured to merge adjacent frames belonging to the same word to obtain the first merged frame data;
the recording module 704 is configured to record the start time and end time of the first merged frame data;
the processing module 702 is further configured to merge, according to the first preset rule, a first-type placeholder with the word-level frame data of the word immediately adjacent to it, to obtain the second merged frame data;
the update module 705 is configured to update the start time and end time of the second merged frame data, finally obtaining the word-level aligned transcript corresponding to the speech data;
the acquisition module 701 is further configured to acquire the first spectral feature sequence corresponding to the speech data and the information vector corresponding to the speech data;
the processing module 702 is further configured to input the word-level aligned transcript, the first spectral feature sequence, and the information vector into the pre-built phoneme-level alignment model to obtain the initial phoneme-level alignment result corresponding to the speech data, the initial phoneme-level alignment result comprising multiple frames of phoneme-level frame data, each frame corresponding to one phoneme unit or to one second-type placeholder;
the traversal module 703 is further configured to traverse each frame of phoneme-level frame data;
the processing module 702 is further configured to merge at least one second-type placeholder adjacent to the phoneme-level frame data of a phoneme unit in the initial phoneme-level alignment result with the phoneme-level frame data of that phoneme unit, to obtain the third merged frame data;
the recording module 704 is further configured to record the start time and end time of the third merged frame data as the start time and end time of the phoneme unit, finally obtaining the phoneme-level aligned transcript corresponding to the speech data.
Optionally, the apparatus further includes an optimization module 706 configured to optimize the word-level aligned transcript corresponding to the speech data by using the phoneme-level aligned transcript corresponding to the speech data.
Optionally, the processing module 702 is specifically configured to: when the phoneme-level frame data is adjacent to at least one second-type placeholder, merge the at least one second-type placeholder to obtain the fourth merged frame data, and determine the start time and end time of the fourth merged frame data;
and merge the phoneme-level frame data with the fourth merged frame data to obtain the third merged frame data.
Optionally, the recording module 704 is specifically configured to: when the fourth merged frame data lies to the left of the phoneme-level frame data, take the start time of the fourth merged frame data as the start time of the third merged frame data and the end time of the phoneme-level frame data as the end time of the third merged frame data;
or, when the fourth merged frame data lies to the right of the phoneme-level frame data, take the start time of the phoneme-level frame data as the start time of the third merged frame data and the end time of the fourth merged frame data as the end time of the third merged frame data.
Optionally, the second-type placeholders include phoneme-level placeholders and/or silence markers; the processing module 702 is specifically configured to:
when the second-type placeholders include only phoneme-level placeholders, convert each phoneme-level placeholder into a silence marker and then merge the silence markers to obtain the fourth merged frame data;
or, when the second-type placeholders include phoneme-level placeholders and silence placeholders, merge each phoneme-level placeholder with its adjacent silence placeholder and then merge all silence placeholders to obtain the fourth merged frame data.
Optionally, the acquiring module 701 is further configured to acquire multiple pieces of speech sample data, and to acquire the word-level transcript corresponding to each piece of speech sample data, where the multiple pieces of speech sample data are speech sample data of multiple persons, and each person corresponds to at least two pieces of speech sample data.
The processing module 702 is further configured to perform feature vector extraction on the multiple pieces of speech sample data of the i-th person to obtain the i-th information vector;
perform frequency-domain feature extraction on the multiple pieces of speech sample data of the i-th person to obtain the second spectral feature sequence corresponding to each piece of speech sample data;
and iteratively train the word-level alignment model according to each second spectral feature sequence, the word-level transcript corresponding to each piece of speech sample data, and the i-th information vector, until the word-level alignment model reaches a first preset standard, at which point the finally obtained word-level alignment model is the pre-built word-level alignment model, where i is a positive integer.
Optionally, the processing module 702 is further configured to normalize each second spectral feature sequence to obtain, for each second spectral feature sequence, a corresponding third spectral feature sequence that follows a normal distribution;
sort the third spectral feature sequences by sequence length;
and, in sorted order, input the third spectral feature sequence ranked j-th, the word-level transcript corresponding to each piece of speech data, and the i-th information vector into the word-level alignment model during the j-th round of training, to complete the j-th round of training of the word-level alignment model, where j is a positive integer.
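This normalize-sort-and-feed order amounts to a length-based curriculum. A hedged sketch follows; the per-sequence standardization, the `train_step` method and the `reached_preset_standard` check are stand-ins, since the patent specifies neither the model architecture nor the exact stopping criterion.

```python
# Hedged sketch of the training order described above: standardize each
# spectral sequence, sort by length, and feed the j-th ranked sequence
# in the j-th training step. `model` is a stand-in for the word-level
# alignment model.
import numpy as np

def normalize(seq):
    """Per-sequence standardization toward zero mean, unit variance."""
    return (seq - seq.mean(axis=0)) / (seq.std(axis=0) + 1e-8)

def train_word_level(model, spectral_seqs, transcripts, info_vector):
    normalized = [normalize(s) for s in spectral_seqs]
    order = sorted(range(len(normalized)), key=lambda k: len(normalized[k]))
    for j, k in enumerate(order):         # j-th step uses the j-th ranked sequence
        model.train_step(normalized[k], transcripts[k], info_vector)
        if model.reached_preset_standard():   # "first preset standard"
            break
    return model
```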
Optionally, the acquiring module 701 is further configured to acquire multiple pieces of speech sample data, and to acquire the word-level transcript and phoneme-level transcript corresponding to each piece of speech sample data, where the multiple pieces of speech sample data are speech sample data of multiple persons, and each person corresponds to at least two pieces of speech sample data.
The processing module 702 is further configured to perform frequency-domain feature extraction on each piece of speech sample data to obtain the fourth spectral feature sequence corresponding to each piece of speech sample data;
normalize each fourth spectral feature sequence to obtain, for each fourth spectral feature sequence, a corresponding fifth spectral feature sequence that follows a normal distribution;
build a data group from each fifth spectral feature sequence together with the word-level transcript and phoneme-level transcript corresponding to that fifth spectral feature sequence;
randomly split all data groups into multiple subsets, each subset including a data group corresponding to each of the multiple persons;
and iteratively train the phoneme-level alignment model with the multiple subsets in turn, ending when the phoneme-level alignment model reaches a second preset standard, at which point the final phoneme-level alignment model obtained is the pre-built phoneme-level alignment model.
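A hedged sketch of the data-group construction and the speaker-balanced split is given below. The dictionary field names are illustrative, and the round-robin deal only guarantees that every subset contains a data group from every speaker when each speaker has at least as many data groups as there are subsets, which matches the stated requirement of at least two samples per person for a two-way split.

```python
# Hedged sketch of the speaker-balanced subset split described above.
# Assumes each speaker has at least `n_subsets` data groups.
import random
from collections import defaultdict

def build_subsets(groups, n_subsets, seed=0):
    """groups: list of dicts with keys 'speaker', 'features',
    'word_transcript', 'phoneme_transcript'. Returns subsets in which
    each speaker contributes at least one data group."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for g in groups:
        by_speaker[g["speaker"]].append(g)
    subsets = [[] for _ in range(n_subsets)]
    for speaker_groups in by_speaker.values():
        rng.shuffle(speaker_groups)           # random assignment per speaker
        for j, g in enumerate(speaker_groups):
            subsets[j % n_subsets].append(g)  # deal round-robin across subsets
    return subsets
```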
The functions performed by the components of the two-level speech alignment apparatus provided by this embodiment of the present invention have been described in detail in the foregoing method embodiments and are therefore not repeated here.
An embodiment of the present invention provides a two-level speech alignment apparatus that acquires speech data and the word-level transcript corresponding to the speech data, inputs them into the pre-built word-level alignment model, and obtains the initial word-level alignment result. It then traverses each frame of word-level frame data in the initial word-level alignment result and merges, separately, the frame data corresponding to words and the first-type placeholders. A placeholder represents non-speech data; the time it occupies in the speech data is allocated to the adjacent word, ensuring that the speech of every word in the speech data corresponds to a word in the transcript, thereby achieving word-level transcript alignment.
On this basis, each word in the word-level transcript serves as the phoneme transcript, and it is input, together with the first spectral feature sequence and the information vector corresponding to the speech data, into the pre-built phoneme-level alignment model to obtain the initial phoneme-level alignment result. Each frame of phoneme-level frame data is then traversed and a merging pass is performed, combining the second-type placeholders with the frame data of the adjacent phonemes to obtain merged frames; the second-type placeholders are likewise placeholders for non-speech data, at the phoneme level. The start time and end time of a merged frame are taken as the start time and end time of the phoneme, which achieves phoneme-level transcript alignment. Because the time of all non-speech data has been reasonably allocated to the phoneme-level frame data of the adjacent phonemes, the resulting phoneme-level transcript neither starts misaligned nor trails off at the end of the alignment result; the final phoneme-level transcript corresponds closely to the speech itself, eliminating error to the greatest extent possible.
As shown in FIG. 8, an embodiment of the present application provides an electronic device, including a processor 111, a communication interface 112, a memory 113 and a communication bus 114, where the processor 111, the communication interface 112 and the memory 113 communicate with one another through the communication bus 114.
The memory 113 is configured to store a computer program.
In an embodiment of the present application, the processor 111 is configured to, when executing the program stored in the memory 113, implement the two-level speech alignment method provided by any one of the foregoing method embodiments, including:
acquiring speech data and the word-level transcript corresponding to the speech data;
inputting the speech data and the corresponding word-level transcript into a pre-built word-level alignment model to obtain an initial word-level alignment result corresponding to the speech data, where the initial word-level alignment result includes multiple frames of word-level frame data, each frame of word-level frame data corresponds to one word in the word-level transcript or to one first-type placeholder, and one word in the word-level transcript corresponds to at least one frame of word-level frame data;
traversing each frame of word-level frame data, merging adjacent frames belonging to the same word to obtain first merged frame data, and recording the start time and end time of the first merged frame data;
merging, according to a first preset rule, a first-type placeholder with the word-level frame data of the word immediately adjacent to that placeholder, to obtain second merged frame data;
updating the start time and end time of the second merged frame data, finally obtaining the word-level aligned transcript corresponding to the speech data;
acquiring the first spectral feature sequence corresponding to the speech data and the information vector corresponding to the speech data;
inputting the word-level aligned transcript, the first spectral feature sequence and the information vector into a pre-built phoneme-level alignment model to obtain an initial phoneme-level alignment result corresponding to the speech data, where the initial phoneme-level alignment result includes multiple frames of phoneme-level frame data, and each frame of phoneme-level frame data corresponds to one phoneme unit or to one second-type placeholder;
traversing each frame of phoneme-level frame data, and merging at least one second-type placeholder adjacent to the phoneme-level frame data of a phoneme unit in the initial phoneme-level alignment result with the phoneme-level frame data of that phoneme unit, to obtain third merged frame data;
recording the start time and end time of the third merged frame data as the start time and end time of the corresponding phoneme unit, finally obtaining the phoneme-level aligned transcript corresponding to the speech data.
Optionally, after the phoneme-level aligned transcript corresponding to the speech data is obtained, the method further includes:
optimizing the word-level aligned transcript corresponding to the speech data by using the phoneme-level aligned transcript corresponding to the speech data.
Optionally, merging, according to the first preset rule, a first-type placeholder with the word-level frame data of the word immediately adjacent to that placeholder to obtain the second merged frame data specifically includes:
when the word-level frame data of a word lies on the left of the first-type placeholder, merging the first-type placeholder with that word-level frame data, taking the end time of the first-type placeholder as the end time of the second merged frame and the start time of the word-level frame data on the left as the start time of the second merged frame;
or, when the word-level frame data of a word lies on the right of the first-type placeholder, merging the first-type placeholder with that word-level frame data, taking the start time of the first-type placeholder as the start time of the second merged frame and the end time of the word-level frame data on the right as the end time of the second merged frame.
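The first preset rule therefore always keeps the outer boundary times of the pair being merged. A minimal sketch, under the same assumed (label, start, end) frame convention used earlier:

```python
# Minimal sketch of the first preset rule; frames are assumed to be
# (label, start, end) tuples and "<pad>" a first-type placeholder.

def merge_with_neighbor(pad, word, word_on_left):
    """Merge a first-type placeholder with the word frame next to it,
    producing the second merged frame with the outer start/end times."""
    w_label, w_start, w_end = word
    _, p_start, p_end = pad
    if word_on_left:                      # word | placeholder
        return (w_label, w_start, p_end)
    return (w_label, p_start, w_end)      # placeholder | word

print(merge_with_neighbor(("<pad>", 0.5, 0.7), ("hi", 0.2, 0.5), True))
# ('hi', 0.2, 0.7)
```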
Optionally, traversing each frame of phoneme-level frame data and merging at least one second-type placeholder adjacent to the phoneme-level frame data of a phoneme unit in the initial phoneme-level alignment result with the phoneme-level frame data of that phoneme unit to obtain the third merged frame data specifically includes:
when the phoneme-level frame data is adjacent to at least one second-type placeholder, merging the at least one second-type placeholder to obtain fourth merged frame data, and determining the start time and end time of the fourth merged frame data;
merging the phoneme-level frame data with the fourth merged frame data to obtain the third merged frame data.
Optionally, recording the start time and end time of the third merged frame data as the start time and end time of the corresponding phoneme unit specifically includes:
when the fourth merged frame data is on the left of the phoneme-level frame data, taking the start time of the fourth merged frame data as the start time of the third merged frame data, and taking the end time of the phoneme-level frame data as the end time of the third merged frame data;
or, when the fourth merged frame data is on the right of the phoneme-level frame data, taking the start time of the phoneme-level frame data as the start time of the third merged frame data, and taking the end time of the fourth merged frame data as the end time of the third merged frame data.
Optionally, the second-type placeholders include phoneme-level placeholders and/or silence markers; when the phoneme-level frame data is adjacent to at least one second-type placeholder, merging the at least one second-type placeholder to obtain the fourth merged frame data specifically includes: when the second-type placeholders all are phoneme-level placeholders, converting each phoneme-level placeholder into a silence marker and then merging the silence markers to obtain the fourth merged frame data;
or, when the second-type placeholders include both phoneme-level placeholders and silence placeholders, merging each phoneme-level placeholder with its adjacent silence placeholder, then merging all silence placeholders to obtain the fourth merged frame data.
Optionally, before the speech data and the corresponding word-level transcript are input into the pre-built word-level alignment model, the method further includes:
acquiring multiple pieces of speech sample data, and acquiring the word-level transcript corresponding to each piece of speech sample data, where the multiple pieces of speech sample data are speech sample data of multiple persons, and each person corresponds to at least two pieces of speech sample data;
performing feature vector extraction on the multiple pieces of speech sample data of the i-th person to obtain the i-th information vector;
performing frequency-domain feature extraction on the multiple pieces of speech sample data of the i-th person to obtain the second spectral feature sequence corresponding to each piece of speech sample data;
iteratively training the word-level alignment model according to each second spectral feature sequence, the word-level transcript corresponding to each piece of speech sample data, and the i-th information vector, until the word-level alignment model reaches a first preset standard, at which point the finally obtained word-level alignment model is the pre-built word-level alignment model, where i is a positive integer.
Optionally, iteratively training the word-level alignment model according to each second spectral feature sequence, the word-level transcript corresponding to each piece of speech sample data, and the i-th information vector specifically includes:
normalizing each second spectral feature sequence to obtain, for each second spectral feature sequence, a corresponding third spectral feature sequence that follows a normal distribution;
sorting the third spectral feature sequences by sequence length;
in sorted order, inputting the third spectral feature sequence ranked j-th, the word-level transcript corresponding to each piece of speech data, and the i-th information vector into the word-level alignment model during the j-th round of training, to complete the j-th round of training of the word-level alignment model, where j is a positive integer.
Optionally, before the word-level aligned transcript, the first spectral feature sequence and the information vector are input into the pre-built phoneme-level alignment model, the method further includes:
acquiring multiple pieces of speech sample data, and acquiring the word-level transcript and phoneme-level transcript corresponding to each piece of speech sample data, where the multiple pieces of speech sample data are speech sample data of multiple persons, and each person corresponds to at least two pieces of speech sample data;
performing frequency-domain feature extraction on each piece of speech sample data to obtain the fourth spectral feature sequence corresponding to each piece of speech sample data;
normalizing each fourth spectral feature sequence to obtain, for each fourth spectral feature sequence, a corresponding fifth spectral feature sequence that follows a normal distribution;
building a data group from each fifth spectral feature sequence together with the word-level transcript and phoneme-level transcript corresponding to that fifth spectral feature sequence;
randomly splitting all data groups into multiple subsets, each subset including a data group corresponding to each of the multiple persons;
iteratively training the phoneme-level alignment model with the multiple subsets in turn, ending when the phoneme-level alignment model reaches a second preset standard, at which point the final phoneme-level alignment model obtained is the pre-built phoneme-level alignment model.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the two-level speech alignment method provided by any one of the foregoing method embodiments.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article or device that includes the element.
The above are only specific embodiments of the present invention, provided so that those skilled in the art can understand or implement the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.