CN111508522A - Statement analysis processing method and system - Google Patents
- Publication number: CN111508522A
- Application number: CN201910094372.8A
- Authority
- CN
- China
- Prior art keywords
- sentence
- chunk
- exercise
- word
- prosodic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/04—Electrically-operated educational appliances with audible presentation of the material to be studied
Abstract
The invention discloses a sentence analysis and processing method and system. The method includes: performing prosodic hierarchy analysis on a practice sentence to determine the chunk time boundary of each prosodic chunk in the sentence; setting intonation marks for the practice sentence; setting stress marks for the practice sentence; and taking the practice sentence, annotated with the determined chunk time boundaries, intonation marks, and stress marks, as a standard prosody-level sentence. The method performs prosodic hierarchy analysis on the text of an input sentence, converting the linear word sequence of the whole sentence into a prosodic hierarchy, so that a user can learn how to analyze the prosodic structure of a text and apply that analysis in pronunciation. With this method, the user can master the use of intonation and stress when reading a sentence aloud.
Description
Technical Field
The present application relates to the field of data analysis technologies, and in particular, to a method and a system for analyzing and processing sentences.
Background
Reading aloud is an important method in language learning: it can improve the accuracy and fluency of the learner's pronunciation and the learner's comprehension of sentences and even passages, thereby reinforcing the correct use of prosodic features such as stress and intonation.
In reading aloud, the learner may exhibit the following errors or inaccuracies: mispronunciation or inaccurate pronunciation of words (including vowels, consonants, syllable boundaries, stress, linking, elision, etc.); intra-word and inter-word disfluency (including inappropriate durations and pauses); prosodic problems such as omission or misuse of stress and insufficient pitch control; lack of the intonation changes required by grammar and semantics (e.g., a rising or falling tone at the end of a sentence); and an inability to correctly understand a sentence and control the rhythm of speech output through phrasing (Phrasing).
Currently, traditional schemes support reading-aloud practice in three ways:
Method 1: talking dictionary
A standalone electronic dictionary device, desktop software, or software running on a mobile device (including WeChat applets, web pages, etc.). After the user looks up a word, the talking dictionary provides the traditional definition of the word along with playable audio of its pronunciation (recorded human speech or computer-synthesized speech). The learner learns the pronunciation of the word by playing the audio and may imitate it orally. The talking dictionary may also provide a number of example sentences for the word, which may likewise be accompanied by playable audio.
Method 2: talking book
This can be an independently distributed audio file (mp3, etc.), a companion optical disc for a book, an older cassette recording, or a program on a content platform such as a podcast, Ximalaya FM, or a WeChat public account. Learners usually use audiobooks by listening, and may also imitate on their own.
Method 3: pronunciation evaluation software
This includes software running on desktop systems, software running on mobile devices (mobile applications, WeChat applets, web programs, etc.), and other smart devices running an operating system (smart televisions, smart speakers, etc.). Such software typically provides demonstration audio and compares the learner's spoken speech with the demonstration speech to produce an overall score; it usually also provides scores on individual dimensions, including pronunciation accuracy, completeness, and fluency.
Although these schemes can guide the user in reading aloud, the first and second methods cannot evaluate the user's reading level, so the learner gets no immediate feedback.
The third method can score the learner's reading, but it only provides sentence-level scores and cannot give the learner targeted training on structural segments. Moreover, it only provides recorded demonstration audio and offers no teaching function, which limits the user's mastery of reading skills.
Disclosure of Invention
The invention provides a sentence analysis and processing method and system, which solve the problem in the prior art that whole-sentence evaluation of the user's reading data prevents targeted training.
The specific technical scheme is as follows:
a method of statement analysis processing, the method comprising:
performing prosodic hierarchy analysis on the exercise sentences to determine chunk time boundaries of each prosodic chunk in each sentence, wherein the prosodic chunk comprises at least one word, and the time boundaries represent pause positions of the sentences;
setting intonation marks for the exercise sentence according to the determined chunk time boundaries;
setting stress marks for the exercise sentence according to the determined chunk time boundaries;
and taking the exercise sentence with the determined chunk time boundaries, the intonation marks, and the stress marks as a standard prosody-level sentence.
Optionally, performing prosody hierarchy analysis on the exercise sentences to determine chunk time boundaries of each prosody chunk in each sentence, includes:
performing prosodic hierarchy analysis on the practice sentences to determine word time boundaries corresponding to all words in the practice sentences;
determining the chunk time boundaries for each prosodic chunk based on the word time boundaries for each word.
Optionally, determining the chunk time boundary of each prosodic chunk according to the word time boundary of each word includes:
determining a sentence layer in the practice sentence according to the word time boundary of each word;
determining an intonation phrase layer in the sentence layer;
determining a prosodic phrase layer in the intonation phrase layer;
determining the chunk time boundary of each prosodic chunk according to the sentence layer, the intonation phrase layer, and the prosodic phrase layer.
Optionally, setting intonation marks for the exercise sentence according to the determined chunk time boundary includes:
acquiring data in the exercise sentence and acquiring an intonation labeling set, wherein the data comprises each line of text and the speech corresponding to each line of text, and the labeling set comprises each intonation type;
and setting an intonation mark for each word based on the data in the exercise sentence and the labeling set, according to the determined word time boundaries.
Optionally, setting stress marks for the exercise sentence according to the determined chunk time boundary includes:
acquiring data in the exercise sentence and acquiring a stress labeling set;
and setting a stress label for each word based on the data in the exercise sentence and the acquired stress labeling set, according to the determined word time boundaries.
Optionally, after taking the exercise sentence with the determined chunk time boundaries, the intonation marks, and the stress marks as a standard prosody-level sentence, the method further includes:
acquiring an exercise sentence of the user based on the standard prosody level sentence;
determining, based on the prosody hierarchy, the prosodic chunks that are erroneous in the exercise sentence;
and outputting prompt information prompting the user to repeatedly practice those prosodic chunks.
Optionally, after outputting the prompt information prompting the user to repeatedly practice a prosodic chunk, the method further includes:
detecting whether the prosodic chunk currently practiced by the user passes the evaluation;
if not, prompting the user to continue practicing the current prosodic chunk;
if yes, switching from the current prosodic chunk to the next erroneous prosodic chunk so that the user can practice it.
A system of statement analysis processing, the system comprising:
the analysis module is used for performing prosodic hierarchy analysis on the exercise sentence, determining the chunk time boundary of each prosodic chunk in each sentence, and setting intonation marks for the exercise sentence according to the determined chunk time boundaries; and for setting stress marks for the exercise sentence according to the determined chunk time boundaries, wherein a prosodic chunk comprises at least one word and the time boundary represents a pause position in the sentence;
and the processing module is used for taking the exercise sentence with the determined chunk time boundaries, the intonation marks, and the stress marks as a standard prosody-level sentence.
Optionally, the analysis module is specifically configured to perform prosody hierarchy analysis on the exercise sentence, and determine a word time boundary corresponding to each word in the exercise sentence; determining the chunk time boundaries for each prosodic chunk based on the word time boundaries for each word.
Optionally, the analysis module is specifically configured to determine a sentence layer in the practice sentence according to the word time boundary of each word; determine an intonation phrase layer in the sentence layer; determine a prosodic phrase layer in the intonation phrase layer; and determine the chunk time boundary of each prosodic chunk according to the sentence layer, the intonation phrase layer, and the prosodic phrase layer.
Optionally, the analysis module is specifically configured to acquire data in the exercise sentence and acquire an intonation labeling set, where the data includes each line of text and the speech corresponding to each line of text, and the labeling set includes each intonation type; and to set an intonation mark for each word based on the data in the exercise sentence and the labeling set, according to the determined word time boundaries.
Optionally, the analysis module is specifically configured to acquire data in the exercise sentence and acquire a stress labeling set; and to set a stress label for each word based on the data in the exercise sentence and the acquired stress labeling set, according to the determined word time boundaries.
Optionally, the processing module is further configured to acquire the user's exercise sentence based on the standard prosody-level sentence; determine, based on the prosody hierarchy, the prosodic chunks that are erroneous in the exercise sentence; and output prompt information prompting the user to repeatedly practice those prosodic chunks.
Optionally, the processing module is further configured to detect whether the prosodic chunk currently practiced by the user passes the evaluation; if not, prompt the user to continue practicing the current prosodic chunk; if yes, switch from the current prosodic chunk to the next erroneous prosodic chunk so that the user can practice it.
With the method provided by the invention, prosodic hierarchy analysis is performed on the text of an input sentence, converting the linear word sequence of the whole sentence into a prosodic hierarchy, so that the user can learn how to analyze the prosodic structure of a text and apply that analysis in pronunciation. In this way, the user can master the use of intonation and stress when reading a sentence aloud.
In addition, the user's sentences can be decomposed and analyzed by prosodic chunk, and the user's errors in each prosodic chunk can be determined, so that the user can do partial exercises targeting each prosodic chunk, or even a single word, improving the focus and efficiency of reading-aloud learning.
Drawings
FIG. 1 is a flowchart of a method for analyzing and processing a statement according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a prosodic hierarchy in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a statement analysis processing system according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments and the specific technical features therein are merely illustrative of the technical solutions of the present invention, not restrictive, and that the embodiments and their specific technical features may be combined with each other where there is no conflict.
First, the terms to which the present invention relates are explained:
the sentence: sequences of words and punctuation, organized according to grammatical rules and semantic requirements, express specific meanings, usually ending with punctuation.
And (3) voice: human vocal organs (vocal cords, vocal tract, tongue, mouth, lips, teeth) naturally convert a specific sentence into sound in the form of a sequence of phonemes under the coordination of the brain.
Rhythm: for the expression needs in human natural language, specific phones/syllables are assigned different prosodic parameters: duration (Duration), Pitch (Pitch), Energy (Energy), and Pause (Pause) to produce a "twitch-down, twitch-down" effect. Humans can perceive whether prosodic parameters match text.
Tone: refers to the trend of the pitch trajectory of the pronunciation of a sentence or a segment of a sentence. In general, statement sentences and special interrogations use down-tones, while general interrogations use up-tones.
Semantic rereading: unlike the repeated reading of symbols (stress) in english words, speakers often make the prosody of some words more prominent in peripheral words according to the semantic and expression requirements of sentences, such as increased pitch, increased energy (volume), increased duration, additional pause, and so on.
The grammar structure is as follows: the process of parsing the sentence described by the natural language text into a syntax tree describing the aforementioned components according to the linguistic criteria, such as main, predicate, object, predicate, shape, complement. The syntax tree is usually represented as a nested structure, e.g., S ═ NP + VP, meaning that a sentence S is composed of a name phrase (as subject) plus a verb phrase (as predicate).
The rhythm structure is as follows: the prosodic structure is a process of reorganizing a text sequence into an 'interconnected block structure' according to the communication requirement during the speaking process of a speaker. The correct and proper prosodic structure can reduce the communication cost of a speaker and a listener. The prosodic structure affects the prosodic features of the text after it is read. This block-like structure also has nested (hierarchical) junctions
However, the structure is much shallower than the grammar structure, and generally has only 2-3 layers. For example: S-IP 1+ IP2, IP 1-PP 1+ PP2 indicate that a sentence S is composed of two intonation phrases IP1 and IP2, where IP1 is composed of a prosodic phrase PP1 and a prosodic phrase PP 2.
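As a concrete illustration of this shallow nesting, the prosodic structure can be represented as a small tree. The sketch below is illustrative only (the class name and dummy tokens are invented, not part of the invention); it models the example S = IP1 + IP2 with IP1 = PP1 + PP2:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:
    """One node of the 2-3 layer prosodic hierarchy: S, IP, or PP."""
    label: str
    words: List[str] = field(default_factory=list)      # leaf chunks hold words
    children: List["Chunk"] = field(default_factory=list)

    def flat_words(self) -> List[str]:
        """Recover the original linear word sequence from the tree."""
        if not self.children:
            return list(self.words)
        out: List[str] = []
        for child in self.children:
            out.extend(child.flat_words())
        return out

# S = IP1 + IP2, IP1 = PP1 + PP2 (the example from the text, with dummy tokens)
pp1 = Chunk("PP", words=["w0", "w1"])
pp2 = Chunk("PP", words=["w2", "w3"])
ip1 = Chunk("IP", children=[pp1, pp2])
ip2 = Chunk("IP", words=["w4", "w5"])
s = Chunk("S", children=[ip1, ip2])
print(s.flat_words())
```

Flattening the tree recovers the linear word sequence; prosodic hierarchy analysis is the inverse direction, from the linear sequence to the tree.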
Fig. 1 shows a statement analysis processing method according to an embodiment of the present invention, where the method includes:
s1, performing prosody hierarchy analysis on the exercise sentences to determine the chunk time boundary of each prosody chunk in each sentence;
first, the prosodic hierarchy needs to be analyzed, and in the embodiment of the present invention, the prosodic hierarchy can be divided into 3 layers: sentence layer (S), intonation phrase layer (IP), prosodic phrase layer (PP). One S may be composed of one to several IPs, one IP may also be composed of one to several PPs, and the marks between the chunks are chunk time boundaries.
Specifically, for S = [w0, w1, w2, ..., wi, ..., wn], the result of the hierarchy division is: S = [[w0, w1, w2], [[w3, w4], [w5, ..., wn]]], where the sentence S includes two IPs: IP1 = [w0, w1, w2] and IP2 = [w3, ..., wn], and IP2 in turn contains two PPs: PP1 = [w3, w4] and PP2 = [w5, ..., wn].
Before performing prosodic hierarchy analysis on an exercise sentence, the word time boundary of each word in the sentence must first be determined. In the embodiment of the present invention, the word boundary types are: IP_Boundary, PP_Boundary, and None_Boundary.
For example, for the sentence "This is a serious issue and something we will discuss with Moscow", the word time boundaries of each word are shown in Table 1:
TABLE 1
The time boundaries of the individual prosodic chunks in the sentence can then be determined from the word time boundaries. For example, when a learner reads the sentence "This is a serious issue and something we will discuss with Moscow" (12 words, 17 syllables), the learner should not try to complete the pronunciation in one breath, but should analyze and plan the prosodic structure appropriately according to the characteristics of the sentence.
Thus, after prosodic hierarchy analysis, the results are as shown in fig. 2, where the complete exercise sentence is divided into the corresponding hierarchy levels.
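The step from per-word boundary labels to prosodic chunks can be sketched as follows. The boundary placements below are invented for illustration and do not reproduce the patent's Table 1:

```python
def chunks_from_boundaries(words, labels):
    """Split a word sequence into prosodic chunks: a chunk ends after any
    word labeled PP_Boundary or IP_Boundary (None_Boundary means no cut)."""
    chunks, current = [], []
    for word, label in zip(words, labels):
        current.append(word)
        if label in ("PP_Boundary", "IP_Boundary"):
            chunks.append(current)
            current = []
    if current:                      # trailing words without a final boundary
        chunks.append(current)
    return chunks

words = ["This", "is", "a", "serious", "issue", "and",
         "something", "we", "will", "discuss", "with", "Moscow"]
# Illustrative labels: IP boundary after "issue", PP boundary after "discuss".
labels = ["None_Boundary"] * 4 + ["IP_Boundary"] + ["None_Boundary"] * 4 + \
         ["PP_Boundary", "None_Boundary", "IP_Boundary"]
print(chunks_from_boundaries(words, labels))
```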
It should be noted that in the embodiment of the present invention, the prosodic hierarchy analysis may be computed using a machine-learning model such as a conditional random field, a hidden Markov model, or a recurrent neural network.
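As a minimal sketch of what such a sequence model does, the pure-Python Viterbi decoder below finds the most likely boundary-label sequence for a run of coarse part-of-speech observations. The log-score parameters are hand-set toys, not trained values, and stand in for the hidden Markov model mentioned above:

```python
def viterbi(obs, states, start, trans, emit, floor=-9.0):
    """Most likely state sequence under log-score HMM-style parameters.
    Unseen observations receive the floor emission score."""
    V = [{s: start[s] + emit[s].get(obs[0], floor) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + trans[(p, s)])
            V[t][s] = V[t - 1][prev] + trans[(prev, s)] + emit[s].get(obs[t], floor)
            back[t][s] = prev
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy parameters: prosodic boundaries tend to follow nouns.
states = ["None_Boundary", "PP_Boundary"]
start = {"None_Boundary": -0.1, "PP_Boundary": -2.3}
trans = {("None_Boundary", "None_Boundary"): -0.3,
         ("None_Boundary", "PP_Boundary"): -1.4,
         ("PP_Boundary", "None_Boundary"): -0.2,
         ("PP_Boundary", "PP_Boundary"): -2.0}
emit = {"None_Boundary": {"DET": -0.2, "VERB": -0.5, "NOUN": -1.5},
        "PP_Boundary": {"NOUN": -0.2, "PUNCT": -0.3}}
print(viterbi(["DET", "NOUN", "VERB", "DET", "NOUN"], states, start, trans, emit))
```

With these toy scores the decoder places a boundary after each noun, which is the behavior the parameters encode.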
Through prosodic hierarchy analysis, the prosodic levels in the speech can be annotated; these annotations can be used to evaluate the user's speech practice and serve as a scoring basis for subsequent practice.
S2, setting intonation marks for the exercise sentence according to the determined chunk time boundaries;
first, in the embodiment of the present invention, the intonation types, the applicable cases, and the intonation trends are shown in table 2:
TABLE 2
Based on the contents of Table 2, the intonation label set is I = {None, Low, High, Low_Low, Low_High, High_Low, High_High};
the training data set is D = {D0, D1, ..., Di, ..., Dk}, where Di = (Si, Ti), Si = [w0, w1, ..., wi, ..., wn], and Ti = [ti0, ti1, ..., tij, ..., tin], with each tij ∈ I.
Further, in the embodiment of the present invention, intonation is labeled based on an unsupervised clustering algorithm; the specific steps are as follows:
1. specifying standard documents to determine record formats, decision bases, arbitration schemes, and the like;
2. preparing data to be marked, including each line of text and corresponding voice;
3. running the prosodic hierarchy labeling process to determine the prosodic hierarchy boundaries;
4. calculating the word time boundary of each word in the speech using a forced-alignment algorithm;
5. extracting acoustic features from the sentence, generating Ai = [ai0, ai1, ..., aij, ..., ain];
6. performing unsupervised clustering, similar to K-Means, on the set of aij over all Ai:
6-1) for None-Boundary, skip;
6-2) for PP _ Boundary, the clustering target is 2 types;
6-3) for IP _ Boundary, the clustering target is 4 types;
based on the method, model training is established, and the steps of establishing a machine learning model are as follows:
1. processing the large-scale data set according to the labeling method;
2. extracting text features and constructing pairs between text feature representations and intonation types;
3. training a model of the relationship between the text feature representation and the intonation type using a learning algorithm.
The steps for using the model to classify intonation are as follows:
a. initializing the classification computation and loading the learned model;
b. extracting the text feature representation;
c. inputting the text feature representation to the classification algorithm and generating the output target intonation type.
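Steps a-c can be sketched as follows. The nearest-centroid "model" and the pitch-slope feature are simplified stand-ins for the trained classifier and feature extraction described above, and the two labels used are a subset of the intonation label set:

```python
# Hypothetical "loaded model": one centroid per intonation label (step a).
CENTROIDS = {"Low": -2.0, "High": 2.0}

def extract_feature(pitch_track):
    """Step b (simplified): overall pitch slope across the segment."""
    return pitch_track[-1] - pitch_track[0]

def classify_intonation(pitch_track):
    """Step c: map the feature to the label of the nearest centroid."""
    f = extract_feature(pitch_track)
    return min(CENTROIDS, key=lambda label: abs(f - CENTROIDS[label]))

print(classify_intonation([110.0, 108.0, 107.5]))  # falling pitch
print(classify_intonation([100.0, 101.5, 102.5]))  # rising pitch
```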
The intonation output results are shown in table 3:
TABLE 3
S3, setting stress marks for the exercise sentence according to the determined chunk time boundaries;
and S4, taking the exercise sentence with the determined chunk time boundaries, intonation marks, and stress marks as a standard prosody-level sentence.
After the intonation analysis of the exercise sentence is completed, the sentence must also be analyzed for stress; that is, each word segment in the speech is given a stress mark, where the stress types are shown in Table 4:
type of rereading | Is suitable for | Rereading situation |
None | Common to the null word, or weakened real word | Weak reading |
Normal | Common to real words | Is normal |
Emphasized | The real word is required for highlighting the semantic meaning | Rereading |
TABLE 4
Based on the contents of Table 4, the stress label set is E = {None, Normal, Emphasized}. Each word in the training data is labeled with one label from the set E. The training data set is D = {D0, D1, ..., Di, ..., Dk}, where Di = (Si, Ai, Ti) represents one document (sentence) in the training set;
Si = [wi0, wi1, ..., wij, ..., win] is the word (Token) sequence of the document (sentence), of length n+1; Ai = [ai0, ai1, ..., aij, ..., ain] is the sequence of acoustic features corresponding to each word (Token); and Ti = [ti0, ti1, ..., tij, ..., tin] is the label sequence corresponding to each word (Token), with each tij ∈ E.
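The training-data structures Si, Ai, and Ti can be sketched as a simple container. The field names, the example tokens, and the single scalar acoustic feature per token are simplifications for illustration:

```python
from dataclasses import dataclass
from typing import List

E = ("None", "Normal", "Emphasized")   # the stress label set

@dataclass
class Document:
    words: List[str]        # Si: the token sequence
    acoustics: List[float]  # Ai: one acoustic feature per token (simplified)
    labels: List[str]       # Ti: one stress label per token, each from E

d = Document(words=["This", "is", "a", "serious", "issue"],
             acoustics=[0.2, 0.1, 0.05, 0.9, 0.6],
             labels=["Normal", "None", "None", "Emphasized", "Normal"])

# All three sequences must be aligned token-by-token.
assert len(d.words) == len(d.acoustics) == len(d.labels)
assert all(label in E for label in d.labels)
print(d.words[d.labels.index("Emphasized")])
```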
Further, stress labeling is performed using unsupervised clustering; the labeling method is as follows:
1. specifying standard documents to determine record formats, decision bases, arbitration schemes, and the like;
2. preparing data to be marked, including each line of text and corresponding voice;
3. running the prosodic hierarchy labeling process to determine the prosodic hierarchy boundaries;
4. calculating the word time boundary of each word in the speech using a forced-alignment algorithm;
5. extracting acoustic features from the sentence, generating Ai = [ai0, ai1, ..., aij, ..., ain];
6. performing unsupervised clustering, similar to K-Means, on the set of aij over all Ai, with 3 target classes.
On the basis of this labeling method, a machine learning model is built as follows:
1. processing the large-scale data set according to the labeling method;
2. extracting text features and constructing pairs between text feature representations and stress types;
3. training a model of the relationship between the text feature representation and the stress type using a learning algorithm.
The steps for using the model to classify stress are as follows:
a. initializing classification calculation and loading the learning model;
b. extracting text feature representation;
c. the text feature representation is input to the classification algorithm, and the output target stress type is generated.
The stress output results are shown in Table 5:
TABLE 5
After the prosodic hierarchy analysis is completed, the user's exercise sentence is acquired based on the standard prosody-level sentence, erroneous prosodic chunks in the exercise sentence are determined based on the prosody hierarchy, and prompt information is output prompting the user to practice those prosodic chunks. In brief, a sentence contains several prosodic chunks; the system analyzes each prosodic chunk to determine whether the user's exercise sentence contains errors, and if so, prompts the user and indicates the position of the error.
When errors exist, the system enters a repeated-practice stage for the erroneous prosodic chunks and detects whether the chunk the user is currently practicing passes the evaluation (the evaluation is based on the method above). If the evaluation fails, the system prompts the user to continue practicing the current chunk; if it passes, the system switches from the current chunk to the next erroneous chunk for practice.
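The chunk-by-chunk practice loop just described can be sketched as follows; the `evaluate` callback stands in for the evaluation method described above, and the chunk texts are illustrative:

```python
def practice_session(error_chunks, evaluate):
    """Drill each erroneous prosodic chunk until it passes evaluation,
    then move on to the next erroneous chunk."""
    log = []
    for chunk in error_chunks:
        attempts = 0
        while True:
            attempts += 1
            if evaluate(chunk, attempts):   # passed: switch to the next chunk
                log.append((chunk, attempts))
                break
            # failed: the user is prompted to practice this chunk again
    return log

# Stub evaluator: every chunk passes on the second attempt.
result = practice_session(["a serious issue", "with Moscow"],
                          lambda chunk, attempt: attempt >= 2)
print(result)
```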
With the method provided by the invention, prosodic hierarchy analysis is performed on the text of an input sentence, converting the linear word sequence of the whole sentence into a prosodic hierarchy, so that the user can learn how to analyze the prosodic structure of a text and apply that analysis in pronunciation. In this way, the user can master the use of intonation and stress when reading a sentence aloud.
In addition, the user's sentences can be decomposed and analyzed by prosodic chunk, and the user's errors in each prosodic chunk can be determined, so that the user can do partial exercises targeting each prosodic chunk, or even a single word, improving the focus and efficiency of reading-aloud learning.
Corresponding to the method provided by the embodiment of the present invention, an embodiment of the present invention further provides a statement analysis processing system, and as shown in fig. 3, the present invention is a schematic structural diagram of a statement analysis processing system in the embodiment of the present invention, where the system includes:
the analysis module 301 is configured to perform prosodic hierarchy analysis on the exercise sentence, determine the chunk time boundary of each prosodic chunk in each sentence, and set intonation marks for the exercise sentence according to the determined chunk time boundaries; and to set stress marks for the exercise sentence according to the determined chunk time boundaries, wherein a prosodic chunk comprises at least one word and the time boundary represents a pause position in the sentence;
and the processing module 302 is configured to take the exercise sentence with the determined chunk time boundaries, the intonation marks, and the stress marks as a standard prosody-level sentence.
Further, in this embodiment of the present invention, the analysis module 301 is specifically configured to perform prosody level analysis on the practice sentence, and determine a word time boundary corresponding to each word in the practice sentence; determining the chunk time boundaries for each prosodic chunk based on the word time boundaries for each word.
Further, in the embodiment of the present invention, the analysis module 301 is specifically configured to determine a sentence layer in the practice sentence according to the word time boundary of each word; determine an intonation phrase layer in the sentence layer; determine a prosodic phrase layer in the intonation phrase layer; and determine the chunk time boundary of each prosodic chunk according to the sentence layer, the intonation phrase layer, and the prosodic phrase layer.
Further, in this embodiment of the present invention, the analysis module 301 is specifically configured to acquire data in the exercise sentence and acquire an intonation labeling set, where the data includes each line of text and the speech corresponding to each line of text, and the labeling set includes each intonation type; and to set an intonation mark for each word based on the data in the exercise sentence and the labeling set, according to the determined word time boundaries.
Further, in this embodiment of the present invention, the analysis module 301 is specifically configured to acquire data in the exercise sentence and acquire a stress annotation set; and to perform stress labeling on each word based on the data in the exercise sentence and the obtained stress annotation set, according to the determined word time boundaries.
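A minimal sketch of attaching the two kinds of marks to words at their time boundaries follows. The annotation-set format (labels keyed by a word's start time) and the label values are assumptions made for illustration only.

```python
# Sketch: set intonation and stress marks per word of an exercise sentence,
# given its word time boundaries and two annotation sets. Formats are assumed.

def label_words(word_boundaries, intonation_set, stress_set):
    """word_boundaries: list of (word, start_sec, end_sec).
    intonation_set: {start_sec: label}, e.g. "rise" / "fall" (assumed labels).
    stress_set: set of start_sec values of stressed words.
    Returns one dict per word combining the word with its marks."""
    return [
        {
            "word": word,
            "start": start,
            "end": end,
            "intonation": intonation_set.get(start),  # None if unmarked
            "stressed": start in stress_set,
        }
        for word, start, end in word_boundaries
    ]

words = [("really", 0.0, 0.4), ("good", 0.5, 0.9)]
marks = label_words(words, intonation_set={0.5: "fall"}, stress_set={0.0})
print(marks)
```

The marked-up word list, together with the chunk time boundaries, is what the processing module would then package as the standard sentence.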
Further, in this embodiment of the present invention, the processing module 302 is further configured to acquire an exercise sentence of a user based on the standard prosodic-hierarchy sentence; determine, based on the prosodic hierarchy, that an erroneous prosodic chunk exists in the exercise sentence; and output prompt information prompting the user to repeatedly practice the erroneous prosodic chunk.
Further, in this embodiment of the present invention, the processing module 302 is further configured to detect whether the prosodic chunk currently trained by the user passes evaluation; if not, prompt the user to continue training the current prosodic chunk; and if so, switch from the current prosodic chunk to the next erroneous prosodic chunk so that the user practices the next prosodic chunk.
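The chunk-by-chunk practice loop just described can be sketched as below. The pass threshold, attempt cap, and scoring callback stand in for the real prosody evaluator and are assumptions, not part of the disclosure.

```python
# Sketch of the practice loop: the learner repeats each erroneous prosodic
# chunk until it passes evaluation, then switches to the next erroneous chunk.

PASS_THRESHOLD = 0.8  # assumed minimum evaluation score to pass a chunk

def practice(error_chunks, score_fn, max_attempts=10):
    """error_chunks: ids of the prosodic chunks the learner got wrong.
    score_fn(chunk, attempt) -> float score for that attempt (stands in for
    the real evaluator). Returns the number of attempts used per chunk."""
    attempts_used = {}
    for chunk in error_chunks:
        for attempt in range(1, max_attempts + 1):
            if score_fn(chunk, attempt) >= PASS_THRESHOLD:
                break  # passed: switch to the next erroneous chunk
            # otherwise: prompt the user to keep training the current chunk
        attempts_used[chunk] = attempt
    return attempts_used

# Toy scorer: every chunk fails its first attempt and passes the second.
print(practice(["chunk1", "chunk2"], lambda c, a: 0.5 if a == 1 else 0.9))
```

Keeping the loop per chunk, rather than per sentence, mirrors the claim language: evaluation gates progress from one erroneous chunk to the next.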
While the preferred embodiments of the present application have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (14)
1. A sentence analysis processing method, the method comprising:
performing prosodic hierarchy analysis on an exercise sentence to determine the chunk time boundary of each prosodic chunk in each sentence, wherein a prosodic chunk comprises at least one word and a time boundary represents a pause position in the sentence;
setting intonation marks for the exercise sentence according to the determined chunk time boundaries;
setting stress marks for the exercise sentence according to the determined chunk time boundaries;
and taking the exercise sentence with the determined chunk time boundaries, the intonation marks, and the stress marks as a standard prosodic-hierarchy sentence.
2. The method of claim 1, wherein performing prosodic hierarchy analysis on the exercise sentence to determine the chunk time boundary of each prosodic chunk in each sentence comprises:
performing prosodic hierarchy analysis on the exercise sentence to determine the word time boundary corresponding to each word in the exercise sentence;
and determining the chunk time boundary of each prosodic chunk based on the word time boundaries of the words.
3. The method of claim 2, wherein determining the chunk time boundary of each prosodic chunk based on the word time boundaries of the words comprises:
determining a sentence layer in the exercise sentence according to the word time boundary of each word;
determining an intonation phrase layer in the sentence layer;
determining a prosodic phrase layer in the intonation phrase layer;
and determining the chunk time boundary of each prosodic chunk according to the sentence layer, the intonation phrase layer, and the prosodic phrase layer.
4. The method of claim 2, wherein setting intonation marks for the exercise sentence according to the determined chunk time boundary comprises:
acquiring data in the exercise sentence and acquiring an intonation annotation set, wherein the data comprises each line of text and the speech corresponding to each line of text, and the annotation set comprises each intonation;
and setting an intonation mark for each word based on the data in the exercise sentence and the annotation set, according to the determined word time boundaries.
5. The method of claim 2, wherein setting stress marks for the exercise sentence according to the determined chunk time boundary comprises:
acquiring data in the exercise sentence and acquiring a stress annotation set;
and performing stress labeling on each word based on the data in the exercise sentence and the obtained stress annotation set, according to the determined word time boundaries.
6. The method of claim 1, wherein after taking the exercise sentence with the determined chunk time boundaries, the intonation marks, and the stress marks as a standard prosodic-hierarchy sentence, the method further comprises:
acquiring an exercise sentence of a user based on the standard prosodic-hierarchy sentence;
determining, based on the prosodic hierarchy, that an erroneous prosodic chunk exists in the exercise sentence;
and outputting prompt information prompting the user to repeatedly practice the erroneous prosodic chunk.
7. The method of claim 6, wherein after outputting the prompt information prompting the user to repeatedly practice the prosodic chunk, the method further comprises:
detecting whether the prosodic chunk currently trained by the user passes evaluation;
if not, prompting the user to continue training the current prosodic chunk;
if so, switching from the current prosodic chunk to the next erroneous prosodic chunk so that the user practices the next prosodic chunk.
8. A sentence analysis processing system, the system comprising:
an analysis module, configured to perform prosodic hierarchy analysis on an exercise sentence, determine the chunk time boundary of each prosodic chunk in each sentence, set intonation marks for the exercise sentence according to the determined chunk time boundaries, and set stress marks for the exercise sentence according to the determined chunk time boundaries, wherein a prosodic chunk comprises at least one word and a time boundary represents a pause position in the sentence;
and a processing module, configured to take the exercise sentence with the determined chunk time boundaries, the intonation marks, and the stress marks as a standard prosodic-hierarchy sentence.
9. The system of claim 8, wherein the analysis module is specifically configured to perform prosodic hierarchy analysis on the exercise sentence to determine the word time boundary corresponding to each word in the exercise sentence, and to determine the chunk time boundary of each prosodic chunk based on the word time boundaries of the words.
10. The system of claim 9, wherein the analysis module is specifically configured to determine a sentence layer in the exercise sentence according to the word time boundary of each word; determine an intonation phrase layer in the sentence layer; determine a prosodic phrase layer in the intonation phrase layer; and determine the chunk time boundary of each prosodic chunk according to the sentence layer, the intonation phrase layer, and the prosodic phrase layer.
11. The system of claim 9, wherein the analysis module is specifically configured to acquire data in the exercise sentence and acquire an intonation annotation set, wherein the data comprises each line of text and the speech corresponding to each line of text, and the annotation set comprises each intonation; and to set an intonation mark for each word based on the data in the exercise sentence and the annotation set, according to the determined word time boundaries.
12. The system of claim 9, wherein the analysis module is specifically configured to acquire data in the exercise sentence and acquire a stress annotation set; and to perform stress labeling on each word based on the data in the exercise sentence and the obtained stress annotation set, according to the determined word time boundaries.
13. The system of claim 8, wherein the processing module is further configured to acquire an exercise sentence of a user based on the standard prosodic-hierarchy sentence; determine, based on the prosodic hierarchy, that an erroneous prosodic chunk exists in the exercise sentence; and output prompt information prompting the user to repeatedly practice the erroneous prosodic chunk.
14. The system of claim 13, wherein the processing module is further configured to detect whether the prosodic chunk currently trained by the user passes evaluation; if not, prompt the user to continue training the current prosodic chunk; and if so, switch from the current prosodic chunk to the next erroneous prosodic chunk so that the user practices the next prosodic chunk.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910094372.8A CN111508522A (en) | 2019-01-30 | 2019-01-30 | Statement analysis processing method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111508522A true CN111508522A (en) | 2020-08-07 |
Family
ID=71868946
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910094372.8A Pending CN111508522A (en) | 2019-01-30 | 2019-01-30 | Statement analysis processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111508522A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112686018A (en) * | 2020-12-23 | 2021-04-20 | 科大讯飞股份有限公司 | Text segmentation method, device, equipment and storage medium |
CN113327615A (en) * | 2021-08-02 | 2021-08-31 | 北京世纪好未来教育科技有限公司 | Voice evaluation method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1333501A (en) * | 2001-07-20 | 2002-01-30 | 北京捷通华声语音技术有限公司 | Dynamic Chinese speech synthesizing method |
US20030149558A1 (en) * | 2000-04-12 | 2003-08-07 | Martin Holsapfel | Method and device for determination of prosodic markers |
CN101000764A (en) * | 2006-12-18 | 2007-07-18 | 黑龙江大学 | Speech synthetic text processing method based on rhythm structure |
CN104464751A (en) * | 2014-11-21 | 2015-03-25 | 科大讯飞股份有限公司 | Method and device for detecting pronunciation rhythm problem |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101551947A (en) | Computer system for assisting spoken language learning | |
Gao et al. | A study on robust detection of pronunciation erroneous tendency based on deep neural network. | |
Duan et al. | A preliminary study on ASR-based detection of Chinese mispronunciation by Japanese learners. | |
Cahill et al. | Natural language processing for writing and speaking | |
Tseng | ILAS Chinese spoken language resources | |
CN111508522A (en) | Statement analysis processing method and system | |
CN113452871A (en) | System and method for automatically generating lessons from videos | |
Dai | [Retracted] An Automatic Pronunciation Error Detection and Correction Mechanism in English Teaching Based on an Improved Random Forest Model | |
Delmonte | Exploring speech technologies for language learning | |
Fata | Is my stress right or wrong? Studying the production of stress by non-native speaking teachers of English | |
CN101727764A (en) | Method and device for assisting in correcting pronunciation | |
Xu et al. | Application of multimodal NLP instruction combined with speech recognition in oral english practice | |
CN115440193A (en) | Pronunciation evaluation scoring method based on deep learning | |
Bang et al. | An automatic feedback system for English speaking integrating pronunciation and prosody assessments. | |
Ibejih et al. | EDUSTT: In-domain speech recognition for Nigerian accented educational contents in English | |
Pellegrini et al. | Extension of the lectra corpus: classroom lecture transcriptions in european portuguese | |
CN114783412B (en) | Spanish spoken language pronunciation training correction method and system | |
Holaj et al. | L2 Czech Annotation for Automatic Feedback on Pronunciation | |
Liu et al. | Speech disorders classification in phonetic exams with MFCC and DTW | |
Ling et al. | A research on guangzhou dialect's negative transfer on british english pronunciation by speech analyzer software Praat and ear recognition method | |
CN118821737B (en) | Intelligent training system based on end-to-end multi-mode teaching big model | |
Dassanayake | Production of Mandarin Chinese Tones by Sri Lankan CFL Learners: An Acoustic Analysis | |
CN114373454B (en) | Oral language assessment method, device, electronic device and computer-readable storage medium | |
TWI731493B (en) | Multi-lingual speech recognition and theme-semanteme analysis method and device | |
이담 | Revisiting Characteristics of Korean-accented English Using Large-scale Corpus and Automatic Phonetic Transcription |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2020-08-07