CN106683667A - Automatic rhythm extracting method, system and application thereof in natural language processing - Google Patents
- Publication number
- CN106683667A CN106683667A CN201710023633.8A CN201710023633A CN106683667A CN 106683667 A CN106683667 A CN 106683667A CN 201710023633 A CN201710023633 A CN 201710023633A CN 106683667 A CN106683667 A CN 106683667A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications (all under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING)
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/04: Segmentation; word boundary detection
- G10L15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/142: Hidden Markov Models [HMMs]
- G10L15/148: Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/1807: Natural language modelling using prosody or stress
- G10L15/1815: Semantic context, e.g. disambiguation of recognition hypotheses based on word meaning
- G10L25/30: Speech or voice analysis techniques characterised by the use of neural networks
Abstract
The invention relates to an automatic prosody extraction method and system and their application in natural language processing. The method applies an automatic text-speech alignment technique to generate a large-scale prosody data set, models the prosody of a sentence with a recurrent neural network extended with a bidirectional mechanism, and applies the automatically constructed text prosody data to recurrent-neural-network-based natural language processing tasks. The method fully exploits the isomorphism between text prosody sequences and the sequence data common in natural language processing tasks; through alternating training under multi-task learning, a natural language processing task is improved without the assistance of manually and explicitly labeled semantic information. The method overcomes the low efficiency, inconsistent standards, and poor scalability of manual prosody labeling, and can transfer the semantics and pragmatics present in massive speech data to other tasks.
Description
Technical Field
The present invention relates to a method for extracting prosody of speech, and more particularly, to a method and a system for automatically extracting prosody and an application thereof in a natural language processing task.
Background
Prosody in speech reflects the intention of a speaker by giving different prominence to different words in a sentence, so prosodic prominence is considered indicative for understanding the semantics and pragmatics of speech; the prosody of speech mainly comprises information such as liaison, sense-group pauses, stress, and intonation rises and falls. Besides speech, text is another form capable of expressing semantics and pragmatics, and the prosodic characteristics contained in a text can be understood and learned by different readers; that is, text contains prosodic characteristics, these characteristics can be learned and predicted, and the prosody contained in text can provide semantic and pragmatic guidance for other natural language processing tasks and thereby improve their performance. The prosody implicit in text data cannot be observed directly; however, the prosody of the corresponding text can be obtained and labeled from speech data, after which an algorithm can learn to perceive and predict prosody from plain text, thereby providing guidance for other natural language processing tasks beyond supervised grammatical information.
Most current natural language processing architectures use words and their representations (word vectors) as basic units, while prosodic features in speech appear as continuous feature sequences without obvious word segmentation points, and accurate word-level prosody extraction based on speech recognition technology alone cannot yield large-scale, high-quality labeled corpora for training. Consequently, most current methods for extracting and utilizing speech prosody require people with expert experience to manually segment speech, align speech with text, and label the prosodic features of words, so the generation of supervised data is inefficient.
The following documents are relevant in the prior art:
1) Brenier, J. M.; Cer, D. M.; and Jurafsky, D. 2005. The detection of emphatic words using acoustic and lexical features. In INTERSPEECH, 3297-3300.
2) Brenier, J. M. 2008. The Automatic Prediction of Prosodic Prominence from Text. ProQuest.
These documents provide a method for predicting prosody from plain text, along with corresponding evaluation indexes. They use the ToBI tool set to perform manual segmentation and prosodic-saliency labeling of speech and its corresponding text according to the speech features of different words, for example judging whether a word is prominent from its pronunciation duration, pronunciation intensity, and the maximum and minimum of its fundamental frequency, and thereby generate a text prosody data set. They also use a maximum entropy classifier to learn and predict text prosody; the classifier achieves about 79% prediction accuracy using text features alone. These documents do not apply the generated prosodic data set to assist other natural language processing tasks.
Another related document:
3) Hovy, D.; Anumanchipalli, G. K.; Parlikar, A.; Vaughn, C.; Lammert, A.; Hovy, E.; and Black, A. W. 2013. Analysis and Modeling of "Focus" in Context. In INTERSPEECH, 402-406.
This document provides a method for predicting prosody from context using plain text. It uses context to assist text prosody prediction and uses crowdsourcing to perform manual prosody annotation at a certain scale on the basis of the related work above.
In all three documents listed above, the prosodic attributes of words must be labeled manually, and the speech must be segmented and aligned with the text before labeling. This limits the efficiency of data set generation, so these methods cannot obtain a large amount of labeled data in a short time and cannot be applied in actual production. Moreover, the sample size of the data sets generated in this way is not enough to cover the whole problem space of prosody prediction, so the resulting algorithms scale poorly and underperform in application.
Therefore, the prior art contains no method capable of automatically extracting the prosodic characteristics corresponding to words from speech (all such extraction is manual), and the related documents contain no record or practical application of using the text prosodic characteristics derived from speech to assist a natural language processing task.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art.
Therefore, the invention aims to provide an efficient automatic prosody extraction method and its application to natural language processing tasks. The method overcomes the low efficiency, inconsistent standards, and poor scalability of traditional manual labeling; as an unsupervised mode of label generation, it can migrate the semantic and pragmatic characteristics present in large amounts of speech data to other tasks; and it can effectively utilize the prosodic patterns in speech to improve the performance of other natural language processing tasks.
In order to achieve the above object, the present invention provides an automatic speech prosody extraction and labeling method, which comprises the following steps:
step 1, receiving voice data to be marked, and acquiring a corresponding text of the voice data;
step 2, aligning the collected voice data and the corresponding text on a time axis by using a text-voice alignment technology to form an aligned text;
step 3, sentence segmentation is carried out on the aligned text, and therefore a sample taking a sentence as a unit is generated;
and step 4, applying an automatic prosody saliency labeling algorithm to each sentence in the sample so as to construct an automatically labeled text prosody data set, wherein the prosody saliency labels of a sentence (or the prosody labels of the sentence) refer to a numerical sequence corresponding to the sentence, whose values reflect the prosodic saliency strength of the different parts (or basic units) of the sentence.
More specifically, aligning the voice data and the corresponding text on the time axis in step 2 specifically means: making the basic units of the text correspond to the time axis of the voice data, thereby obtaining the voice data fragment corresponding to each basic unit of the text, wherein a basic unit refers to a Chinese character or word, or an English word.
More specifically, the step 4 further includes: if the original voice data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers need to be normalized separately, and the prosodic features of the voice data need to be discretized as needed.
According to another aspect of the present invention, there is also provided an application of an automatic prosody extraction method to a natural language processing task, the method including:
the prosody of text data is treated as a sequence labeling task, and a long short-term memory (LSTM) artificial neural network is adopted to model the prosody sequence; the input of the LSTM model is the word vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosody saliency label of the basic unit at the current position.
More specifically, the LSTM model may be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or other recurrent neural networks and their derived types and structures.
More specifically, the method further comprises:
using the text prosody data set for a recurrent neural network (RNN) based sentence compression task: with text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternating training mode under multi-task learning is adopted; in each time period a portion of the text prosody data or the sentence compression data is input into the model, the other task is input in the next time period, and the two tasks alternate until the model converges.
More specifically, the method further comprises:
using the text prosody data set to assist natural language processing tasks based on recurrent neural networks and their extended and improved structures: with text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternating training mode under multi-task learning is adopted; in each time period a portion of the text prosody data or the sentence compression data is input into the model, the other task is input in the next time period, and the two tasks alternate, optimizing the model parameters until the model converges.
According to another aspect of the present invention, there is also provided an automatic speech prosody extraction labeling system, including:
the acquisition module is used for receiving voice data to be marked and acquiring a corresponding text of the voice data;
the alignment module is used for aligning the collected voice data and the text thereof on a time axis by using a text-voice alignment technology to form an aligned text;
the segmentation module is used for carrying out sentence segmentation on the aligned text to generate a sample taking a sentence as a unit;
and the automatic prosody labeling module is used for applying an automatic prosody saliency labeling algorithm to each sentence in the sample so as to construct an automatically labeled text prosody data set, wherein the prosody saliency labels of a sentence (or the prosody labels of the sentence) refer to a numerical sequence corresponding to the sentence, whose values reflect the prosodic saliency strength of the different parts (or basic units) of the sentence.
More specifically, the aligning the voice data and the corresponding text in the aligning module on the time axis specifically means: enabling the basic units in each text to correspond to a time axis on the voice data, and thus obtaining the voice data fragments corresponding to each basic unit in the text, wherein the basic units refer to Chinese characters or words and English words.
More specifically, the segmentation module is further configured to:
if the original voice data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers are normalized separately, and the prosodic features are discretized as needed.
The invention has the following beneficial technical effects:
1) The method uses an automatic text-speech alignment technique to generate a large-scale prosody data set, using the aligned speech segments as prosody indicators. It can control the labeling quality of prosodic prominence to a certain degree and constructs a text prosody data set with weak-supervision characteristics. Compared with traditional manual labeling, it is more efficient and markedly more scalable; prior knowledge can be added at any time to adjust the actual labeling result and performance of the data set; and processing is fast and cheap, so a huge amount of data can be constructed (more than two orders of magnitude more data generated in the same time than with traditional methods) while saving a large amount of human resources.
2) The method models the prosody of a sentence with a recurrent neural network. After a bidirectional extension mechanism is added, the recurrent neural network can effectively take the context of each word into account, and the prediction accuracy of word prosody saliency labeling can exceed 90%, significantly better than the traditional maximum entropy method. At the same time, no expert knowledge is required for feature extraction, which reduces feature engineering, and the process better matches human cognition.
3) The invention uses an automatically constructed text prosody data set on a natural language processing task based on a recurrent neural network.
The method makes full use of isomorphic characteristics of a text prosody sequence and common sequence data in the natural language processing task, and promotes the natural language processing task without the assistance of explicitly labeled semantic information through an alternate training mode under multi-task learning. In the example of sentence compression task, the method of the present invention has significant performance improvement (performance improvement of more than 10%) compared with the prior art.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 shows a flow diagram of an automatic speech prosody extraction labeling method according to the invention;
FIG. 2 illustrates a multitasking LSTM model processing scheme in accordance with the present invention;
FIG. 3 is a diagram illustrating a multitasking two-way LSTM model processing scheme according to the present invention;
FIG. 4 is a system block diagram of an automatic phonetic prosody extraction labeling system according to the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
FIG. 1 is a flow chart illustrating an automatic phonetic prosody extraction labeling method according to the present invention.
As shown in fig. 1, an automatic speech prosody extraction and labeling method according to the present invention includes the following steps:
step 1, receiving voice data to be marked, and acquiring a corresponding text of the voice data.
Step 2, aligning the collected voice data and the corresponding text on a time axis by using a text-voice alignment technology to form an aligned text;
specifically, the basic units of each text are made to correspond to the time axis of the voice data, thereby obtaining the voice data clip corresponding to each basic unit of the text. Here a basic unit refers to a Chinese character or word, or an English word.
In addition, the text-speech alignment technique includes, but is not limited to, obtaining, for each basic unit, the time at which its pronunciation starts and ends in the voice data, thereby obtaining the time span of each basic unit on the time axis and the time intervals between basic units.
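The per-unit time spans described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the aligner-output format (a list of `(basic_unit, start_sec, end_sec)` triples, as a forced aligner would produce) and all names are assumptions.

```python
def align_segments(aligned_units):
    """Turn hypothetical aligner output into per-unit segments plus pauses.

    aligned_units: list of (basic_unit, start_sec, end_sec) triples,
    sorted by start time.
    """
    segments = []
    for i, (unit, start, end) in enumerate(aligned_units):
        duration = end - start
        # pause between this unit and the next one (0.0 for the last unit)
        if i + 1 < len(aligned_units):
            pause = aligned_units[i + 1][1] - end
        else:
            pause = 0.0
        segments.append({"unit": unit, "start": start,
                         "duration": duration, "pause_after": pause})
    return segments

# toy example: a 60 ms silence before the prominent word "win"
aligned = [("we", 0.00, 0.18), ("will", 0.18, 0.35), ("win", 0.41, 0.80)]
segs = align_segments(aligned)
```

The `pause_after` and `duration` values are exactly the kind of per-unit timing information that step 4 later consumes as prosodic-feature input.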
And 3, carrying out sentence segmentation on the aligned text to generate a sample taking a sentence as a unit.
For example, the aligned text may be sentence-segmented, but not limited to, according to punctuation characteristics of the sentence, such that each sentence is composed of basic units accompanied by corresponding segments of speech data.
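The punctuation-based sentence segmentation above can be sketched as follows; the punctuation set and the segment dictionary format are illustrative assumptions, not part of the patent.

```python
# sentence-final punctuation, for both English and Chinese text (assumed set)
SENTENCE_END = {".", "!", "?", "。", "！", "？"}

def split_sentences(units):
    """Split an aligned unit stream into sentence samples.

    units: list of dicts each carrying a "unit" key (token or punctuation);
    alignment fields such as "duration" travel along untouched.
    """
    sentences, current = [], []
    for u in units:
        if u["unit"] in SENTENCE_END:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(u)
    if current:           # trailing sentence without final punctuation
        sentences.append(current)
    return sentences

stream = [{"unit": "we"}, {"unit": "win"}, {"unit": "."},
          {"unit": "go"}, {"unit": "!"}]
samples = split_sentences(stream)
```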
And 4, applying an automatic prosody saliency labeling algorithm to each sentence in the text after the sentence segmentation, thereby constructing and obtaining an automatically labeled text prosody data set.
Specifically, the method further comprises: if the original voice data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers are first normalized separately to eliminate their influence, and the prosodic features of the voice data are discretized as needed. Here the prosodic features refer to the pronunciation duration, the pronunciation intensity, and the maximum and minimum of the fundamental frequency of a basic unit.
When the automatic prosody saliency labeling algorithm is applied to each sentence of the segmented text, some or all of the three prosodic features above can be selected as its input. The prosody saliency labels of a sentence (or the prosody labels of the sentence) refer to a numerical sequence corresponding to the sentence, whose values reflect the prosodic saliency strength of the different parts (or basic units) of the sentence.
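One possible shape of such a labeling algorithm is sketched below: z-score each prosodic feature per speaker (removing individual pronunciation habits), average the normalized features, and discretize into a small number of saliency levels. The equal weighting and the level thresholds are illustrative assumptions; the patent does not fix them.

```python
import statistics

def zscore(values):
    """Per-speaker normalization of one prosodic feature."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0   # guard against constant input
    return [(v - mean) / sd for v in values]

def saliency_labels(durations, intensities, f0_ranges, levels=3):
    """Return one integer saliency label in [0, levels) per basic unit."""
    combined = [sum(t) / 3 for t in zip(zscore(durations),
                                        zscore(intensities),
                                        zscore(f0_ranges))]
    lo, hi = min(combined), max(combined)
    span = (hi - lo) or 1.0
    return [min(levels - 1, int((c - lo) / span * levels)) for c in combined]

# toy sentence of three units; the third is longer, louder, higher-pitched
labels = saliency_labels([0.10, 0.12, 0.50],
                         [60.0, 61.0, 75.0],
                         [20.0, 22.0, 60.0])
```

The resulting integer sequence is exactly the numerical prosody-label sequence the data set stores per sentence.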
According to the second aspect of the present invention, there is also provided an application method of automatic prosody extraction in a natural language processing task, the application method including:
the prosody of text data is treated as a sequence labeling task, and a long short-term memory (LSTM) artificial neural network is adopted to model the prosody sequence; the input of the LSTM model is the word vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosody saliency label of the basic unit at the current position.
More specifically, the LSTM model may be extended to a bidirectional LSTM network, a multi-layer bidirectional LSTM network, or other recurrent neural networks and their derived types and structures, such as gated recurrent units (GRU).
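The sequence-labeling setup can be illustrated with a deliberately tiny, pure-Python LSTM: one "word vector" (here a scalar) per time step, one saliency label emitted per position. The scalar dimensions, the all-equal weights, and the zero threshold are toy assumptions; a real system would use a trained multi-dimensional LSTM from a deep-learning framework.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step with scalar input and hidden state (toy dimensions)."""
    i = sigmoid(W["wi"] * x + W["ui"] * h + W["bi"])    # input gate
    f = sigmoid(W["wf"] * x + W["uf"] * h + W["bf"])    # forget gate
    o = sigmoid(W["wo"] * x + W["uo"] * h + W["bo"])    # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h + W["bg"])  # candidate cell
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c

def label_sequence(xs, W, thresh=0.0):
    """Emit one binary saliency label per position from the hidden state."""
    h = c = 0.0
    labels = []
    for x in xs:
        h, c = lstm_step(x, h, c, W)
        labels.append(1 if h > thresh else 0)
    return labels

# toy weights: every parameter set to 0.5
W = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
labels = label_sequence([1.0, -1.0, 2.0], W)
```

A bidirectional extension would run a second such cell right-to-left and combine both hidden states before thresholding, which is what lets the model use context on both sides of a word.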
More specifically, the application method further includes:
using the text prosody data set for a recurrent neural network (RNN) based sentence compression task: with text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternating training mode under multi-task learning is adopted; in each time period a portion of the text prosody data or the sentence compression data is input into the model, the other task is input in the next time period, and the two tasks alternate until the model converges. FIG. 2 illustrates the multi-task LSTM model according to the present invention, with text prosody saliency labeling as the auxiliary task corresponding to the output of the A-series nodes and sentence compression as the main task corresponding to the output of the Y-series nodes. In the alternating training mode, a portion of the prosody saliency labeling data or the sentence compression data is input into the model in each time period, the other task is input in the next time period, and the two tasks alternate until the model converges. The multi-task bidirectional LSTM model according to the present invention is shown in FIG. 3.
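The alternating schedule can be sketched as a plain training loop; `train_step`, `converged`, the batch format, and `max_periods` are illustrative placeholders for whatever the shared model provides, not APIs from the patent.

```python
import itertools

def alternate_train(prosody_batches, compression_batches,
                    train_step, converged, max_periods=100):
    """Alternate periods of auxiliary (prosody) and main (compression) training.

    Each period feeds all batches of exactly one task to the shared model,
    then the tasks swap, until the convergence test fires.
    """
    tasks = itertools.cycle([("prosody", prosody_batches),
                             ("compression", compression_batches)])
    history = []
    for _ in range(max_periods):
        name, batches = next(tasks)
        for batch in batches:
            train_step(name, batch)   # shared parameters, task-specific head
        history.append(name)
        if converged(history):
            break
    return history

# toy run: record which task each step belonged to
steps = []
history = alternate_train([("s1",), ("s2",)], [("c1",)],
                          train_step=lambda name, batch: steps.append(name),
                          converged=lambda h: len(h) >= 4)
```

The point of the design is that only `train_step` is task-specific; the recurrent body is shared, which is how the prosody signal reaches the compression task.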
More specifically, the application method further includes:
using the text prosody data set for a recurrent-neural-network-based natural language processing task: with text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternating training mode under multi-task learning is adopted; in each time period a portion of the text prosody data or the sentence compression data is input into the model, the other task is input in the next time period, and the two tasks alternate, optimizing the model parameters until the model converges. Here the recurrent neural networks include, but are not limited to, LSTM, GRU, and their extensions in depth.
The above can be described in a formal language. Let X be the input text sequence, A the prosodic saliency sequence corresponding to the text sequence, and Y the compression label sequence corresponding to the text; the three sequences have the form:

X = (x_1, ..., x_N), A = (a_1, ..., a_N), Y = (y_1, ..., y_N).
The task is then to optimize the following problem:

θ* = argmax_θ Σ log p(A | X; θ)

For the (unidirectional) LSTM model, p can be expressed as:

p(A | X; θ) = ∏_{n=1}^{N} p(a_n | x_1, ..., x_n; θ)

For the bidirectional LSTM model, p can be expressed as:

p(A | X; θ) = ∏_{n=1}^{N} p(a_n | x_1, ..., x_N; θ)

wherein each factor is given by a softmax over the (bidirectional) LSTM hidden state at position n. Using the optimized parameter θ*, the prosodic saliency prediction output of the model is expressed as:

Â = argmax_A p(A | X; θ*)
similarly, for the main prediction task Y of the model, an isomorphic expression can be obtained, and details are not repeated here.
FIG. 4 is a system block diagram of an automatic phonetic prosody extraction labeling system according to the present invention.
As shown in fig. 4, the system includes:
the acquisition module is used for receiving voice data to be marked and acquiring a corresponding text of the voice data;
the alignment module is used for aligning the collected voice data and the text thereof on a time axis by using a text-voice alignment technology to form an aligned text;
the segmentation module is used for carrying out sentence segmentation on the aligned text to generate a sample taking a sentence as a unit;
and the automatic prosody labeling module is used for applying an automatic prosody saliency labeling algorithm to each sentence in the sample so as to construct and obtain an automatically labeled text prosody data set, wherein the prosody saliency labels (or the prosody labels of the sentences) of the sentences refer to numerical sequences corresponding to the sentences, and the numerical sequences reflect the prosody saliency strength of different parts (or basic units) of the sentences through numerical values.
More specifically, aligning the voice data and the corresponding text on the time axis in the alignment module specifically means: making the basic units of the text correspond to the time axis of the voice data, thereby obtaining the voice data fragment corresponding to each basic unit of the text, wherein a basic unit refers to a Chinese character or word, or an English word.
More specifically, the segmentation module is further configured to:
if the original voice data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers are normalized separately, and the prosodic features of the voice data are discretized as needed.
The method aligns speech segments with the corresponding words in the text by an automatic text-speech alignment technique and uses the speech segments as indicators of the prosodic prominence of the words, thereby obtaining a large amount of automatically generated labeled text prosody data and constructing a text prosody data set.
Meanwhile, the invention utilizes the weak supervision characteristic, and the text prosody data set is alternately trained with other natural language processing tasks in a multi-task learning mode under the model structure of the recurrent neural network, thereby achieving the purpose of improving the performance of other tasks.
In the description of the present specification, the description of the terms "one embodiment," "a specific embodiment," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. An automatic speech prosody extraction and labeling method, characterized by comprising the following steps:
step 1, receiving voice data to be labeled, and acquiring the text corresponding to the voice data;
step 2, aligning the collected voice data and the corresponding text on a time axis by using a text-speech alignment technique to form an aligned text;
step 3, performing sentence segmentation on the aligned text, thereby generating samples in units of sentences;
and step 4, applying an automatic prosody saliency labeling algorithm to each sentence in the samples, so as to construct an automatically labeled text prosody data set.
2. The method for automatic speech prosody extraction and labeling according to claim 1, wherein aligning the voice data and the corresponding text on the time axis in step 2 specifically comprises: mapping each basic unit of the text to an interval on the time axis of the voice data, thereby obtaining the voice-data fragment corresponding to each basic unit of the text, wherein a basic unit is a Chinese character or word, or an English word.
3. The method for automatic speech prosody extraction and labeling according to claim 1, wherein step 4 further comprises: if the original voice data contains several readers or several different reading environments, normalizing the pronunciation habits of each reader separately, and discretizing the prosodic features of the voice data as needed.
4. Use of the automatic prosody extraction method according to any one of claims 1 to 3 in a natural language processing task, comprising:
treating the prosody of text data as a sequence labeling task and adopting a long short-term memory (LSTM) neural network to model the prosody sequence, wherein the input of the LSTM model is the word-vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosody saliency label of the basic unit at the current position.
5. The use of the automatic prosody extraction method in a natural language processing task according to claim 4, wherein the LSTM model is extensible to a bidirectional LSTM network, a multi-layer bidirectional LSTM network, or a recurrent neural network and its derived types and structures.
6. The use of an automatic prosody extraction method in a natural language processing task according to claim 5, further comprising:
using the text prosody data set for a recurrent neural network (RNN) based sentence compression task: taking text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternate training mode under multi-task learning is adopted, in which a batch of text prosody data or sentence compression data is input to the model in each time period, data of the other task is input in the next time period, and the two tasks alternate until the model converges.
7. The use of an automatic prosody extraction method in a natural language processing task according to claim 5, further comprising:
using the text prosody data set for a natural language processing task based on a recurrent neural network and its associated extended and improved structures: taking text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternate training mode under multi-task learning is adopted, in which a batch of text prosody data or sentence compression data is input to the model in each time period, data of the other task is input in the next time period, and the two tasks alternate, optimizing the model parameters, until the model converges.
8. An automatic speech prosody extraction and labeling system, comprising:
the acquisition module, which is used for receiving voice data to be labeled and acquiring the text corresponding to the voice data;
the alignment module, which is used for aligning the collected voice data and its corresponding text on a time axis by using a text-speech alignment technique to form an aligned text;
the segmentation module, which is used for performing sentence segmentation on the aligned text to generate samples in units of sentences;
and the automatic prosody labeling module, which is used for applying an automatic prosody saliency labeling algorithm to each sentence in the samples so as to construct an automatically labeled text prosody data set.
9. The system for automatic speech prosody extraction and labeling according to claim 8, wherein aligning the voice data and the corresponding text on the time axis in the alignment module specifically comprises: mapping each basic unit of the text to an interval on the time axis of the voice data, thereby obtaining the voice-data fragment corresponding to each basic unit of the text, wherein a basic unit is a Chinese character or word, or an English word.
10. The system of claim 8, wherein the segmentation module is further configured to:
if the original voice data contains several readers or several different reading environments, normalizing the pronunciation habits of each reader separately, and discretizing the prosodic features of the voice data as needed.
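The alternating multi-task training of claims 6 and 7 can be sketched as a schedule over two batch streams. The model is stubbed out here as a callback; in practice it would be a shared recurrent network with one output layer per task, and the batch contents and convergence test below are illustrative assumptions.

```python
import itertools

def alternate_training(prosody_batches, compression_batches, train_step,
                       max_epochs=10, converged=lambda epoch: False):
    """Alternate between the auxiliary prosody-labeling task and the
    main sentence-compression task, one period per task, updating the
    same underlying model via the supplied train_step(task, batch)."""
    schedule = itertools.cycle([("prosody", prosody_batches),
                                ("compression", compression_batches)])
    log = []
    for epoch, (task, batches) in zip(range(max_epochs), schedule):
        for batch in batches:
            train_step(task, batch)
        log.append(task)
        if converged(epoch):
            break
    return log

# Stub train_step: a real implementation would run a forward/backward
# pass through a shared recurrent network with task-specific heads.
seen = []
log = alternate_training([["p1"], ["p2"]], [["c1"]],
                         lambda task, batch: seen.append((task, batch)),
                         max_epochs=4)
print(log)
```

Feeding the auxiliary prosody data through the same shared parameters as the main task is what lets the weakly supervised prosody labels act as a regularizer for sentence compression.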
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710023633.8A CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710023633.8A CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106683667A true CN106683667A (en) | 2017-05-17 |
Family
ID=58858838
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710023633.8A Pending CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106683667A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112136141A (en) * | 2018-03-23 | 2020-12-25 | 谷歌有限责任公司 | Control robots based on free-form natural language input |
US12327169B2 (en) | 2018-03-23 | 2025-06-10 | Google Llc | Controlling a robot based on free-form natural language input |
US11972339B2 (en) | 2018-03-23 | 2024-04-30 | Google Llc | Controlling a robot based on free-form natural language input |
US12020164B2 (en) | 2018-04-18 | 2024-06-25 | Deepmind Technologies Limited | Neural networks for scalable continual learning in domains with sequentially learned tasks |
CN111989696A (en) * | 2018-04-18 | 2020-11-24 | 渊慧科技有限公司 | Scalable Continuous Learning Neural Networks in Domains with Sequential Learning Tasks |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
WO2020024582A1 (en) * | 2018-07-28 | 2020-02-06 | 华为技术有限公司 | Speech synthesis method and related device |
CN112307236A (en) * | 2019-07-24 | 2021-02-02 | 阿里巴巴集团控股有限公司 | Data labeling method and device |
CN111105785A (en) * | 2019-12-17 | 2020-05-05 | 广州多益网络股份有限公司 | Text prosodic boundary identification method and device |
CN111507104A (en) * | 2020-03-19 | 2020-08-07 | 北京百度网讯科技有限公司 | Method and device for establishing label labeling model, electronic equipment and readable storage medium |
US11531813B2 (en) | 2020-03-19 | 2022-12-20 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, electronic device and readable storage medium for creating a label marking model |
CN112183086A (en) * | 2020-09-23 | 2021-01-05 | 北京先声智能科技有限公司 | English pronunciation continuous reading mark model based on sense group labeling |
CN117012178A (en) * | 2023-07-31 | 2023-11-07 | 支付宝(杭州)信息技术有限公司 | Prosody annotation data generation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11194972B1 (en) | Semantic sentiment analysis method fusing in-depth features and time sequence models | |
CN106683667A (en) | Automatic rhythm extracting method, system and application thereof in natural language processing | |
Zhong et al. | Deep learning-based extraction of construction procedural constraints from construction regulations | |
CN110083710B (en) | A Word Definition Generation Method Based on Recurrent Neural Network and Latent Variable Structure | |
CN113987147B (en) | Sample processing method and device | |
CN110309514A (en) | A kind of method for recognizing semantics and device | |
CN112541356B (en) | Method and system for recognizing biomedical named entities | |
CN109543181B (en) | Named entity model and system based on combination of active learning and deep learning | |
CN110929030A (en) | A joint training method for text summarization and sentiment classification | |
CN111339750B (en) | Spoken language text processing method for removing stop words and predicting sentence boundaries | |
CA3039280A1 (en) | Method for recognizing network text named entity based on neural network probability disambiguation | |
CN112183058B (en) | Poetry generation method and device based on BERT sentence vector input | |
CN111341293B (en) | Text voice front-end conversion method, device, equipment and storage medium | |
CN110489750A (en) | Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF | |
CN113035311A (en) | Medical image report automatic generation method based on multi-mode attention mechanism | |
CN116151256A (en) | A Few-Shot Named Entity Recognition Method Based on Multi-task and Hint Learning | |
CN111738007A (en) | A Data Augmentation Algorithm for Chinese Named Entity Recognition Based on Sequence Generative Adversarial Networks | |
CN112183064A (en) | Text emotion reason recognition system based on multi-task joint learning | |
CN112216267B (en) | Prosody prediction method, device, equipment and storage medium | |
CN104538025A (en) | Method and device for converting gestures to Chinese and Tibetan bilingual voices | |
Li et al. | Multi-level gated recurrent neural network for dialog act classification | |
CN113408287A (en) | Entity identification method and device, electronic equipment and storage medium | |
CN114611529A (en) | Intention recognition method and device, electronic equipment and storage medium | |
CN112949284B (en) | A Text Semantic Similarity Prediction Method Based on Transformer Model | |
CN115359323A (en) | Image text information generation method and deep learning model training method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20170517 |