CN106683667A - Automatic rhythm extracting method, system and application thereof in natural language processing - Google Patents
- Publication number
- CN106683667A CN106683667A CN201710023633.8A CN201710023633A CN106683667A CN 106683667 A CN106683667 A CN 106683667A CN 201710023633 A CN201710023633 A CN 201710023633A CN 106683667 A CN106683667 A CN 106683667A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications (all under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING)
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/04: Segmentation; word boundary detection
- G10L15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/142: Hidden Markov Models [HMMs]
- G10L15/148: Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/1807: Natural language modelling using prosody or stress
- G10L15/1815: Semantic context, e.g. disambiguation of recognition hypotheses based on word meaning
- G10L25/30: Speech or voice analysis techniques characterised by the use of neural networks
Abstract
The invention relates to an automatic prosody extraction method and system and their application in natural language processing. The method applies an automatic text-speech alignment technique to generate a large-scale prosody data set, models the prosody of a sentence with a recurrent neural network extended with a bidirectional mechanism, and applies the automatically constructed text prosody data to recurrent-neural-network-based natural language processing tasks. The method fully exploits the isomorphism between text prosody sequences and the sequence data common in natural language processing tasks; through alternating training under multi-task learning, a natural language processing task is improved without the assistance of manually and explicitly labeled semantic information. The method overcomes the low efficiency, inconsistent standards, and poor scalability of manual prosody labeling, and can transfer the semantics and pragmatics present in massive speech data to other tasks.
Description
Technical Field
The present invention relates to a method for extracting prosody of speech, and more particularly, to a method and a system for automatically extracting prosody and an application thereof in a natural language processing task.
Background
Prosody in speech reflects the intention of a speaker by giving different prominence to different words in a sentence, so prosodic prominence is considered indicative for understanding the semantics and pragmatics of speech; the prosody of speech mainly comprises information such as liaison, sense-group pauses, stress, and intonation rises and falls. Besides speech, text is another form capable of expressing semantics and pragmatics, and the prosodic characteristics contained in a text can be understood and learned by different readers; that is, text contains prosodic characteristics, these characteristics can be learned and predicted, and the prosody contained in text can provide semantic and pragmatic guidance for other natural language processing tasks and thereby improve their performance. The prosody implicit in text data cannot be observed directly; however, the prosody of the corresponding text can be obtained and labeled from speech data, after which an algorithm can learn to perceive and predict prosody from plain text, thereby providing guidance for other natural language processing tasks beyond supervised grammatical information.
Most current natural language processing architectures use words and their representations (word vectors) as basic units, while prosodic features in speech appear as continuous feature sequences without obvious word segmentation points, and accurate word-level prosody extraction based on speech recognition technology alone cannot yield large-scale, high-quality labeled corpora for training. Consequently, most current methods for extracting and utilizing speech prosody require people with expert experience to manually segment speech, align speech with text, and label the prosodic features of words, so the generation of supervised data is inefficient.
The following documents are relevant in the prior art:
1) Brenier, J. M.; Cer, D. M.; and Jurafsky, D. 2005. The detection of emphatic words using acoustic and lexical features. In INTERSPEECH, 3297-3300.
2) Brenier, J. M. 2008. The Automatic Prediction of Prosodic Prominence from Text. ProQuest.
These documents provide a method for predicting prosody from plain text, along with corresponding evaluation indexes. They use the ToBI tool set to perform manual segmentation and prosodic-saliency labeling of speech and its corresponding text according to the speech features of different words, for example judging whether a word is prominent from its pronunciation duration, pronunciation intensity, and the maximum and minimum of its fundamental frequency, and thereby generate a text prosody data set. They also use a maximum entropy classifier to learn and predict text prosody; the classifier achieves about 79% prediction accuracy using text features alone. These documents do not apply the generated prosodic data set to assist other natural language processing tasks.
Another related document:
3) Hovy, D.; Anumanchipalli, G. K.; Parlikar, A.; Vaughn, C.; Lammert, A.; Hovy, E.; and Black, A. W. 2013. Analysis and Modeling of "Focus" in Context. In INTERSPEECH, 402-406.
This document provides a method for predicting prosody from context using plain text. It uses context to assist text prosody prediction and uses crowdsourcing to perform manual prosody annotation at a certain scale on the basis of the related work above.
In all three documents listed above, the prosodic attributes of words must be labeled manually, and the speech must be segmented and aligned with the text before labeling. This limits the efficiency of data set generation, so these methods cannot obtain a large amount of labeled data in a short time and cannot be applied in actual production. Moreover, the sample size of the data sets generated in this way is not enough to cover the whole problem space of prosody prediction, so the resulting algorithms scale poorly and underperform in application.
Therefore, the prior art contains no method capable of automatically extracting the prosodic characteristics corresponding to words from speech (all such extraction is manual), and the related documents contain no record or practical application of using the text prosodic characteristics derived from speech to assist a natural language processing task.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art.
Therefore, the invention aims to provide an efficient automatic prosody extraction method and its application to natural language processing tasks. The method overcomes the low efficiency, inconsistent standards, and poor scalability of traditional manual labeling; as an unsupervised mode of label generation, it can migrate the semantic and pragmatic characteristics present in large amounts of speech data to other tasks; and it can effectively utilize the prosodic patterns in speech to improve the performance of other natural language processing tasks.
In order to achieve the above object, the present invention provides an automatic speech prosody extraction and labeling method, which comprises the following steps:
step 1, receiving voice data to be marked, and acquiring a corresponding text of the voice data;
step 2, aligning the collected voice data and the corresponding text on a time axis by using a text-voice alignment technology to form an aligned text;
step 3, sentence segmentation is carried out on the aligned text, and therefore a sample taking a sentence as a unit is generated;
and step 4, applying an automatic prosody saliency labeling algorithm to each sentence in the sample so as to construct an automatically labeled text prosody data set, wherein the prosody saliency labels of a sentence (or the prosody labels of the sentence) refer to a numerical sequence corresponding to the sentence, whose values reflect the prosodic saliency strength of the different parts (or basic units) of the sentence.
More specifically, aligning the voice data and the corresponding text on the time axis in step 2 specifically means: making the basic units of the text correspond to the time axis of the voice data, thereby obtaining the voice data fragment corresponding to each basic unit of the text, wherein a basic unit refers to a Chinese character or word, or an English word.
More specifically, the step 4 further includes: if the original voice data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers need to be normalized separately, and the prosodic features of the voice data need to be discretized as needed.
According to another aspect of the present invention, there is also provided an application of an automatic prosody extraction method to a natural language processing task, the method including:
the prosody of text data is treated as a sequence labeling task, and a long short-term memory (LSTM) artificial neural network is adopted to model the prosody sequence; the input of the LSTM model is the word vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosody saliency label of the basic unit at the current position.
More specifically, the LSTM model may be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or other recurrent neural networks and their derived types and structures.
More specifically, the method further comprises:
using the text prosody data set for a recurrent neural network (RNN) based sentence compression task: with text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternating training mode under multi-task learning is adopted; in each time period a portion of the text prosody data or the sentence compression data is input into the model, the other task is input in the next time period, and the two tasks alternate until the model converges.
More specifically, the method further comprises:
using the text prosody data set to assist natural language processing tasks based on recurrent neural networks and their extended and improved structures: with text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternating training mode under multi-task learning is adopted; in each time period a portion of the text prosody data or the sentence compression data is input into the model, the other task is input in the next time period, and the two tasks alternate, optimizing the model parameters until the model converges.
According to another aspect of the present invention, there is also provided an automatic speech prosody extraction labeling system, including:
the acquisition module is used for receiving voice data to be marked and acquiring a corresponding text of the voice data;
the alignment module is used for aligning the collected voice data and the text thereof on a time axis by using a text-voice alignment technology to form an aligned text;
the segmentation module is used for carrying out sentence segmentation on the aligned text to generate a sample taking a sentence as a unit;
and the automatic prosody labeling module is used for applying an automatic prosody saliency labeling algorithm to each sentence in the sample so as to construct an automatically labeled text prosody data set, wherein the prosody saliency labels of a sentence (or the prosody labels of the sentence) refer to a numerical sequence corresponding to the sentence, whose values reflect the prosodic saliency strength of the different parts (or basic units) of the sentence.
More specifically, the aligning the voice data and the corresponding text in the aligning module on the time axis specifically means: enabling the basic units in each text to correspond to a time axis on the voice data, and thus obtaining the voice data fragments corresponding to each basic unit in the text, wherein the basic units refer to Chinese characters or words and English words.
More specifically, the segmentation module is further configured to:
if the original voice data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers are normalized separately, and the prosodic features are discretized as needed.
The invention has the following beneficial technical effects:
1) The method uses an automatic text-speech alignment technique to generate a large-scale prosody data set, using the aligned speech segments as prosody indicators. It can control the labeling quality of prosodic prominence to a certain degree and constructs a text prosody data set with weak-supervision characteristics. Compared with traditional manual labeling, it is more efficient and markedly more scalable; prior knowledge can be added at any time to adjust the actual labeling result and performance of the data set; and processing is fast and cheap, so a huge amount of data can be constructed (more than two orders of magnitude more data generated in the same time than with traditional methods) while saving a large amount of human resources.
2) The method models the prosody of a sentence with a recurrent neural network. After a bidirectional extension mechanism is added, the recurrent neural network can effectively take the context of each word into account, and the prediction accuracy of word prosody saliency labeling can exceed 90%, significantly better than the traditional maximum entropy method. At the same time, no expert knowledge is required for feature extraction, which reduces feature engineering, and the process better matches human cognition.
3) The invention uses an automatically constructed text prosody data set on a natural language processing task based on a recurrent neural network.
The method makes full use of isomorphic characteristics of a text prosody sequence and common sequence data in the natural language processing task, and promotes the natural language processing task without the assistance of explicitly labeled semantic information through an alternate training mode under multi-task learning. In the example of sentence compression task, the method of the present invention has significant performance improvement (performance improvement of more than 10%) compared with the prior art.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 shows a flow diagram of an automatic speech prosody extraction labeling method according to the invention;
FIG. 2 illustrates a multitasking LSTM model processing scheme in accordance with the present invention;
FIG. 3 is a diagram illustrating a multitasking two-way LSTM model processing scheme according to the present invention;
FIG. 4 is a system block diagram of an automatic phonetic prosody extraction labeling system according to the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
FIG. 1 is a flow chart illustrating an automatic phonetic prosody extraction labeling method according to the present invention.
As shown in fig. 1, an automatic speech prosody extraction and labeling method according to the present invention includes the following steps:
step 1, receiving voice data to be marked, and acquiring a corresponding text of the voice data.
Step 2, aligning the collected voice data and the corresponding text on a time axis by using a text-voice alignment technology to form an aligned text;
specifically, the basic units of each text are made to correspond to the time axis of the voice data, thereby obtaining the voice data clip corresponding to each basic unit of the text. Here a basic unit refers to a Chinese character or word, or an English word.
In addition, the text-speech alignment technique includes, but is not limited to, obtaining, for each basic unit, the time at which its pronunciation starts and ends in the voice data, thereby obtaining the time span of each basic unit on the time axis and the time intervals between basic units.
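The per-unit time spans described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the aligner-output format (a list of `(basic_unit, start_sec, end_sec)` triples, as a forced aligner would produce) and all names are assumptions.

```python
def align_segments(aligned_units):
    """Turn hypothetical aligner output into per-unit segments plus pauses.

    aligned_units: list of (basic_unit, start_sec, end_sec) triples,
    sorted by start time.
    """
    segments = []
    for i, (unit, start, end) in enumerate(aligned_units):
        duration = end - start
        # pause between this unit and the next one (0.0 for the last unit)
        if i + 1 < len(aligned_units):
            pause = aligned_units[i + 1][1] - end
        else:
            pause = 0.0
        segments.append({"unit": unit, "start": start,
                         "duration": duration, "pause_after": pause})
    return segments

# toy example: a 60 ms silence before the prominent word "win"
aligned = [("we", 0.00, 0.18), ("will", 0.18, 0.35), ("win", 0.41, 0.80)]
segs = align_segments(aligned)
```

The `pause_after` and `duration` values are exactly the kind of per-unit timing information that step 4 later consumes as prosodic-feature input.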
And 3, carrying out sentence segmentation on the aligned text to generate a sample taking a sentence as a unit.
For example, the aligned text may be sentence-segmented, but not limited to, according to punctuation characteristics of the sentence, such that each sentence is composed of basic units accompanied by corresponding segments of speech data.
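The punctuation-based sentence segmentation above can be sketched as follows; the punctuation set and the segment dictionary format are illustrative assumptions, not part of the patent.

```python
# sentence-final punctuation, for both English and Chinese text (assumed set)
SENTENCE_END = {".", "!", "?", "。", "！", "？"}

def split_sentences(units):
    """Split an aligned unit stream into sentence samples.

    units: list of dicts each carrying a "unit" key (token or punctuation);
    alignment fields such as "duration" travel along untouched.
    """
    sentences, current = [], []
    for u in units:
        if u["unit"] in SENTENCE_END:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(u)
    if current:           # trailing sentence without final punctuation
        sentences.append(current)
    return sentences

stream = [{"unit": "we"}, {"unit": "win"}, {"unit": "."},
          {"unit": "go"}, {"unit": "!"}]
samples = split_sentences(stream)
```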
And 4, applying an automatic prosody saliency labeling algorithm to each sentence in the text after the sentence segmentation, thereby constructing and obtaining an automatically labeled text prosody data set.
Specifically, the method further comprises: if the original voice data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers are first normalized separately to eliminate their influence, and the prosodic features of the voice data are discretized as needed. Here the prosodic features refer to the pronunciation duration, the pronunciation intensity, and the maximum and minimum of the fundamental frequency of a basic unit.
When the automatic prosody saliency labeling algorithm is applied to each sentence of the segmented text, some or all of the three prosodic features above can be selected as its input. The prosody saliency labels of a sentence (or the prosody labels of the sentence) refer to a numerical sequence corresponding to the sentence, whose values reflect the prosodic saliency strength of the different parts (or basic units) of the sentence.
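One possible shape of such a labeling algorithm is sketched below: z-score each prosodic feature per speaker (removing individual pronunciation habits), average the normalized features, and discretize into a small number of saliency levels. The equal weighting and the level thresholds are illustrative assumptions; the patent does not fix them.

```python
import statistics

def zscore(values):
    """Per-speaker normalization of one prosodic feature."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0   # guard against constant input
    return [(v - mean) / sd for v in values]

def saliency_labels(durations, intensities, f0_ranges, levels=3):
    """Return one integer saliency label in [0, levels) per basic unit."""
    combined = [sum(t) / 3 for t in zip(zscore(durations),
                                        zscore(intensities),
                                        zscore(f0_ranges))]
    lo, hi = min(combined), max(combined)
    span = (hi - lo) or 1.0
    return [min(levels - 1, int((c - lo) / span * levels)) for c in combined]

# toy sentence of three units; the third is longer, louder, higher-pitched
labels = saliency_labels([0.10, 0.12, 0.50],
                         [60.0, 61.0, 75.0],
                         [20.0, 22.0, 60.0])
```

The resulting integer sequence is exactly the numerical prosody-label sequence the data set stores per sentence.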
According to the second aspect of the present invention, there is also provided an application method of automatic prosody extraction in a natural language processing task, the application method including:
the prosody of text data is treated as a sequence labeling task, and a long short-term memory (LSTM) artificial neural network is adopted to model the prosody sequence; the input of the LSTM model is the word vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosody saliency label of the basic unit at the current position.
More specifically, the LSTM model may be extended to a bidirectional LSTM network, a multi-layer bidirectional LSTM network, or other recurrent neural networks and their derived types and structures, such as gated recurrent units (GRU).
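The sequence-labeling setup can be illustrated with a deliberately tiny, pure-Python LSTM: one "word vector" (here a scalar) per time step, one saliency label emitted per position. The scalar dimensions, the all-equal weights, and the zero threshold are toy assumptions; a real system would use a trained multi-dimensional LSTM from a deep-learning framework.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step with scalar input and hidden state (toy dimensions)."""
    i = sigmoid(W["wi"] * x + W["ui"] * h + W["bi"])    # input gate
    f = sigmoid(W["wf"] * x + W["uf"] * h + W["bf"])    # forget gate
    o = sigmoid(W["wo"] * x + W["uo"] * h + W["bo"])    # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h + W["bg"])  # candidate cell
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c

def label_sequence(xs, W, thresh=0.0):
    """Emit one binary saliency label per position from the hidden state."""
    h = c = 0.0
    labels = []
    for x in xs:
        h, c = lstm_step(x, h, c, W)
        labels.append(1 if h > thresh else 0)
    return labels

# toy weights: every parameter set to 0.5
W = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
labels = label_sequence([1.0, -1.0, 2.0], W)
```

A bidirectional extension would run a second such cell right-to-left and combine both hidden states before thresholding, which is what lets the model use context on both sides of a word.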
More specifically, the application method further includes:
using the text prosody data set for a recurrent neural network (RNN) based sentence compression task: with text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternating training mode under multi-task learning is adopted; in each time period a portion of the text prosody data or the sentence compression data is input into the model, the other task is input in the next time period, and the two tasks alternate until the model converges. FIG. 2 illustrates the multi-task LSTM model according to the present invention, with text prosody saliency labeling as the auxiliary task corresponding to the output of the A-series nodes and sentence compression as the main task corresponding to the output of the Y-series nodes. In the alternating training mode, a portion of the prosody saliency labeling data or the sentence compression data is input into the model in each time period, the other task is input in the next time period, and the two tasks alternate until the model converges. The multi-task bidirectional LSTM model according to the present invention is shown in FIG. 3.
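The alternating schedule can be sketched as a plain training loop; `train_step`, `converged`, the batch format, and `max_periods` are illustrative placeholders for whatever the shared model provides, not APIs from the patent.

```python
import itertools

def alternate_train(prosody_batches, compression_batches,
                    train_step, converged, max_periods=100):
    """Alternate periods of auxiliary (prosody) and main (compression) training.

    Each period feeds all batches of exactly one task to the shared model,
    then the tasks swap, until the convergence test fires.
    """
    tasks = itertools.cycle([("prosody", prosody_batches),
                             ("compression", compression_batches)])
    history = []
    for _ in range(max_periods):
        name, batches = next(tasks)
        for batch in batches:
            train_step(name, batch)   # shared parameters, task-specific head
        history.append(name)
        if converged(history):
            break
    return history

# toy run: record which task each step belonged to
steps = []
history = alternate_train([("s1",), ("s2",)], [("c1",)],
                          train_step=lambda name, batch: steps.append(name),
                          converged=lambda h: len(h) >= 4)
```

The point of the design is that only `train_step` is task-specific; the recurrent body is shared, which is how the prosody signal reaches the compression task.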
More specifically, the application method further includes:
using the text prosody data set for a recurrent-neural-network-based natural language processing task: with text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternating training mode under multi-task learning is adopted; in each time period a portion of the text prosody data or the sentence compression data is input into the model, the other task is input in the next time period, and the two tasks alternate, optimizing the model parameters until the model converges. Here the recurrent neural networks include, but are not limited to, LSTM, GRU, and their extensions in depth.
The above can be described in a formal language. Let X be the input text sequence, A the prosodic saliency sequence corresponding to the text sequence, and Y the compression label sequence corresponding to the text; the three sequences have the form:

X = (x_1, ..., x_N), A = (a_1, ..., a_N), Y = (y_1, ..., y_N).
The task is then to optimize the following problem:

θ* = argmax_θ Σ log p(A | X; θ)

For the (unidirectional) LSTM model, p can be expressed as:

p(A | X; θ) = ∏_{n=1}^{N} p(a_n | x_1, ..., x_n; θ)

For the bidirectional LSTM model, p can be expressed as:

p(A | X; θ) = ∏_{n=1}^{N} p(a_n | x_1, ..., x_N; θ)

wherein each factor is given by a softmax over the (bidirectional) LSTM hidden state at position n. Using the optimized parameter θ*, the prosodic saliency prediction output of the model is expressed as:

Â = argmax_A p(A | X; θ*)
similarly, for the main prediction task Y of the model, an isomorphic expression can be obtained, and details are not repeated here.
FIG. 4 is a system block diagram of an automatic phonetic prosody extraction labeling system according to the present invention.
As shown in fig. 4, the system includes:
the acquisition module is used for receiving voice data to be marked and acquiring a corresponding text of the voice data;
the alignment module is used for aligning the collected voice data and the text thereof on a time axis by using a text-voice alignment technology to form an aligned text;
the segmentation module is used for carrying out sentence segmentation on the aligned text to generate a sample taking a sentence as a unit;
and the automatic prosody labeling module is used for applying an automatic prosody saliency labeling algorithm to each sentence in the sample so as to construct and obtain an automatically labeled text prosody data set, wherein the prosody saliency labels (or the prosody labels of the sentences) of the sentences refer to numerical sequences corresponding to the sentences, and the numerical sequences reflect the prosody saliency strength of different parts (or basic units) of the sentences through numerical values.
More specifically, aligning the voice data and the corresponding text on the time axis in the alignment module specifically means: making the basic units of the text correspond to the time axis of the voice data, thereby obtaining the voice data fragment corresponding to each basic unit of the text, wherein a basic unit refers to a Chinese character or word, or an English word.
More specifically, the segmentation module is further configured to:
if the original voice data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers are normalized separately, and the prosodic features of the voice data are discretized as needed.
The method aligns speech segments with the corresponding words in the text by an automatic text-speech alignment technique and uses the speech segments as indicators of the prosodic prominence of the words, thereby obtaining a large amount of automatically generated labeled text prosody data and constructing a text prosody data set.
Meanwhile, the invention utilizes the weak supervision characteristic, and the text prosody data set is alternately trained with other natural language processing tasks in a multi-task learning mode under the model structure of the recurrent neural network, thereby achieving the purpose of improving the performance of other tasks.
In the description of the present specification, the description of the terms "one embodiment," "a specific embodiment," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. An automatic speech prosody extraction and labeling method, characterized by comprising the following steps:
step 1, receiving voice data to be labeled, and acquiring the text corresponding to the voice data;
step 2, aligning the collected voice data and the corresponding text on a time axis by using a text-speech alignment technique to form an aligned text;
step 3, performing sentence segmentation on the aligned text, thereby generating samples in units of sentences;
and step 4, applying an automatic prosody saliency labeling algorithm to each sentence in the samples, so as to construct an automatically labeled text prosody data set.
2. The method for automatic speech prosody extraction and labeling according to claim 1, wherein aligning the voice data and the corresponding text on the time axis in step 2 specifically comprises: mapping each basic unit of the text to an interval on the time axis of the voice data, thereby obtaining the voice-data fragment corresponding to each basic unit of the text, wherein a basic unit is a Chinese character or word, or an English word.
3. The method for automatic speech prosody extraction and labeling according to claim 1, wherein step 4 further comprises: if the original voice data contains several readers or several different reading environments, normalizing the pronunciation habits of each reader separately, and discretizing the prosodic features of the voice data as needed.
4. Use of the automatic prosody extraction method according to any one of claims 1 to 3 in a natural language processing task, comprising:
treating the prosody of text data as a sequence labeling task and adopting a long short-term memory (LSTM) neural network to model the prosody sequence, wherein the input of the LSTM model is the word-vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosody saliency label of the basic unit at the current position.
5. The use of the automatic prosody extraction method in a natural language processing task according to claim 4, wherein the LSTM model is extensible to a bidirectional LSTM network, a multi-layer bidirectional LSTM network, or a recurrent neural network and its derived types and structures.
6. The use of an automatic prosody extraction method in a natural language processing task according to claim 5, further comprising:
using the text prosody data set for a recurrent neural network (RNN) based sentence compression task: taking text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternate training mode under multi-task learning is adopted, in which a batch of text prosody data or sentence compression data is input to the model in each time period, data of the other task is input in the next time period, and the two tasks alternate until the model converges.
7. The use of an automatic prosody extraction method in a natural language processing task according to claim 5, further comprising:
using the text prosody data set for a natural language processing task based on a recurrent neural network and its associated extended and improved structures: taking text prosody saliency labeling as the auxiliary task and sentence compression as the main task, an alternate training mode under multi-task learning is adopted, in which a batch of text prosody data or sentence compression data is input to the model in each time period, data of the other task is input in the next time period, and the two tasks alternate, optimizing the model parameters, until the model converges.
8. An automatic speech prosody extraction and labeling system, comprising:
the acquisition module, which is used for receiving voice data to be labeled and acquiring the text corresponding to the voice data;
the alignment module, which is used for aligning the collected voice data and its corresponding text on a time axis by using a text-speech alignment technique to form an aligned text;
the segmentation module, which is used for performing sentence segmentation on the aligned text to generate samples in units of sentences;
and the automatic prosody labeling module, which is used for applying an automatic prosody saliency labeling algorithm to each sentence in the samples so as to construct an automatically labeled text prosody data set.
9. The system for automatic speech prosody extraction and labeling according to claim 8, wherein aligning the voice data and the corresponding text on the time axis in the alignment module specifically comprises: mapping each basic unit of the text to an interval on the time axis of the voice data, thereby obtaining the voice-data fragment corresponding to each basic unit of the text, wherein a basic unit is a Chinese character or word, or an English word.
10. The system of claim 8, wherein the segmentation module is further configured to:
if the original voice data contains several readers or several different reading environments, normalizing the pronunciation habits of each reader separately, and discretizing the prosodic features of the voice data as needed.
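The alternating multi-task training of claims 6 and 7 can be sketched as a schedule over two batch streams. The model is stubbed out here as a callback; in practice it would be a shared recurrent network with one output layer per task, and the batch contents and convergence test below are illustrative assumptions.

```python
import itertools

def alternate_training(prosody_batches, compression_batches, train_step,
                       max_epochs=10, converged=lambda epoch: False):
    """Alternate between the auxiliary prosody-labeling task and the
    main sentence-compression task, one period per task, updating the
    same underlying model via the supplied train_step(task, batch)."""
    schedule = itertools.cycle([("prosody", prosody_batches),
                                ("compression", compression_batches)])
    log = []
    for epoch, (task, batches) in zip(range(max_epochs), schedule):
        for batch in batches:
            train_step(task, batch)
        log.append(task)
        if converged(epoch):
            break
    return log

# Stub train_step: a real implementation would run a forward/backward
# pass through a shared recurrent network with task-specific heads.
seen = []
log = alternate_training([["p1"], ["p2"]], [["c1"]],
                         lambda task, batch: seen.append((task, batch)),
                         max_epochs=4)
print(log)
```

Feeding the auxiliary prosody data through the same shared parameters as the main task is what lets the weakly supervised prosody labels act as a regularizer for sentence compression.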
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710023633.8A CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710023633.8A CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106683667A true CN106683667A (en) | 2017-05-17 |
Family
ID=58858838
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710023633.8A Pending CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106683667A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112136141A (en) * | 2018-03-23 | 2020-12-25 | 谷歌有限责任公司 | Control robots based on free-form natural language input |
US12327169B2 (en) | 2018-03-23 | 2025-06-10 | Google Llc | Controlling a robot based on free-form natural language input |
US11972339B2 (en) | 2018-03-23 | 2024-04-30 | Google Llc | Controlling a robot based on free-form natural language input |
US12020164B2 (en) | 2018-04-18 | 2024-06-25 | Deepmind Technologies Limited | Neural networks for scalable continual learning in domains with sequentially learned tasks |
CN111989696A (en) * | 2018-04-18 | 2020-11-24 | 渊慧科技有限公司 | Scalable Continuous Learning Neural Networks in Domains with Sequential Learning Tasks |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
WO2020024582A1 (en) * | 2018-07-28 | 2020-02-06 | 华为技术有限公司 | Speech synthesis method and related device |
CN112307236A (en) * | 2019-07-24 | 2021-02-02 | 阿里巴巴集团控股有限公司 | Data labeling method and device |
CN111105785A (en) * | 2019-12-17 | 2020-05-05 | 广州多益网络股份有限公司 | Text prosodic boundary identification method and device |
CN111507104A (en) * | 2020-03-19 | 2020-08-07 | 北京百度网讯科技有限公司 | Method and device for establishing label labeling model, electronic equipment and readable storage medium |
US11531813B2 (en) | 2020-03-19 | 2022-12-20 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, electronic device and readable storage medium for creating a label marking model |
CN112183086A (en) * | 2020-09-23 | 2021-01-05 | 北京先声智能科技有限公司 | English pronunciation continuous reading mark model based on sense group labeling |
CN117012178A (en) * | 2023-07-31 | 2023-11-07 | 支付宝(杭州)信息技术有限公司 | Prosody annotation data generation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11194972B1 (en) | Semantic sentiment analysis method fusing in-depth features and time sequence models | |
CN106683667A (en) | Automatic rhythm extracting method, system and application thereof in natural language processing | |
Zhong et al. | Deep learning-based extraction of construction procedural constraints from construction regulations | |
CN110083710B (en) | A Word Definition Generation Method Based on Recurrent Neural Network and Latent Variable Structure | |
CN113987147B (en) | Sample processing method and device | |
CN110309514A (en) | A kind of method for recognizing semantics and device | |
CN112541356B (en) | Method and system for recognizing biomedical named entities | |
CN109543181B (en) | Named entity model and system based on combination of active learning and deep learning | |
CN110929030A (en) | A joint training method for text summarization and sentiment classification | |
CN111339750B (en) | Spoken language text processing method for removing stop words and predicting sentence boundaries | |
CA3039280A1 (en) | Method for recognizing network text named entity based on neural network probability disambiguation | |
CN112183058B (en) | Poetry generation method and device based on BERT sentence vector input | |
CN111341293B (en) | Text voice front-end conversion method, device, equipment and storage medium | |
CN110489750A (en) | Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF | |
CN113035311A (en) | Medical image report automatic generation method based on multi-mode attention mechanism | |
CN116151256A (en) | A Few-Shot Named Entity Recognition Method Based on Multi-task and Hint Learning | |
CN111738007A (en) | A Data Augmentation Algorithm for Chinese Named Entity Recognition Based on Sequence Generative Adversarial Networks | |
CN112183064A (en) | Text emotion reason recognition system based on multi-task joint learning | |
CN112216267B (en) | Prosody prediction method, device, equipment and storage medium | |
CN104538025A (en) | Method and device for converting gestures to Chinese and Tibetan bilingual voices | |
Li et al. | Multi-level gated recurrent neural network for dialog act classification | |
CN113408287A (en) | Entity identification method and device, electronic equipment and storage medium | |
CN114611529A (en) | Intention recognition method and device, electronic equipment and storage medium | |
CN112949284B (en) | A Text Semantic Similarity Prediction Method Based on Transformer Model | |
CN115359323A (en) | Image text information generation method and deep learning model training method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20170517 |