Method for establishing and analyzing a speech library for motor dysarthria in a big data context
Technical Field
The invention relates to a method for establishing and analyzing a speech library for motor dysarthria in a big data context.
Background
(1) Current research on motor dysarthria:
Motor dysarthria refers to a group of speech disorders caused by disturbed muscular control resulting from damage to the central or peripheral nervous system. It typically manifests as slowed, weakened, imprecise and uncoordinated movement of the speech musculature, and may also affect respiration, resonance, laryngeal phonation, articulation and prosody; clinically it is often referred to simply as dysarthria. Common causes include brain trauma, cerebral palsy, amyotrophic lateral sclerosis, multiple sclerosis, stroke, Parkinson's disease, spinocerebellar ataxia and the like. According to neuroanatomy and speech acoustics, dysarthria can be classified into flaccid, spastic, ataxic, hyperkinetic and mixed types. Among the communication disorders associated with brain damage, dysarthria has an incidence of up to 54%. Clinically, the acoustic characteristics of dysarthric speech can currently be assessed both subjectively and objectively through examination of voice, resonance, prosody and related aspects, which helps provide targeted treatment and clarify the acoustic and pathological mechanisms of dysarthria comprehensively and scientifically.
Few domestic or foreign studies have reported the overall incidence of motor dysarthria. In a study of 125 Parkinson's disease patients, Miller et al found that 69.6% had mean speech intelligibility below that of a normal control group, with 51.2% more than one standard deviation below it, indicating a high incidence of dysarthria in Parkinson's patients. Bogousslavsky et al screened 1,000 patients with first stroke and found speech impairment in up to 46%, of whom 12.4% were diagnosed with dysarthria. Hartelius et al likewise found a 51% prevalence of dysarthria in patients with multiple sclerosis. These findings indicate that dysarthria is common. At present there is no unified assessment method for dysarthria in China and no dedicated assessment standard for motor dysarthria; the dysarthria assessment method (or its modified version) and the dysarthria examination table of the China Rehabilitation Research Center are mostly used, with the degree and type of dysarthria examined, scored, recorded and evaluated by clinicians or rehabilitation physicians.
(2) Current status of domestic speech corpus research:
With the development of information technology and computer science, speech technology has made interaction between machines and human natural language possible, and research on speech synthesis and speech recognition necessarily depends on the construction of a high-quality back-end speech corpus. At present, foreign speech corpora are relatively mature, research on Chinese speech corpora has advanced rapidly over the last decade, and speech corpora have been researched and established in different languages and cultural contexts. However, the construction of speech corpora for dysarthria is still at an exploratory stage.
Domestic research on the evaluation of articulation and phonation focuses mainly on subjective assessment, and only a few researchers distinguish the concepts of articulation and phonation. Huang Zhaying et al proposed a "Chinese articulation ability test word list" containing 50 words; by evaluating a subject's articulation of these 50 test words, a speech rehabilitation therapist can comprehensively assess the articulation of 21 initial consonants and 4 tones, while phoneme contrast ability is evaluated through 18 phoneme contrasts and 37 minimal pairs. Chen Sanding et al evaluated the Mandarin initials, finals and tones of 50 deaf children, revealed developmental patterns in deaf children's Mandarin articulation, and further proposed rehabilitation education principles of early intervention, sequencing, error tolerance and consolidation. Dr. Zhang Jing, of a university in East China, studied the main error tendencies of hearing-impaired children in consonant articulation, analyzed their causes, and proposed a corresponding consonant phoneme treatment framework for hearing-impaired children.
(3) Current status of big data research in the medical field:
A currently popular definition of big data is: data that exceeds the capture, storage, processing and analysis capabilities of typical database software tools. Big data differs from traditional concepts such as ultra-large-scale data and massive data, and has four basic characteristics: volume, variety, timeliness and value. Kayyali B et al studied the impact of big data on the U.S. medical industry and indicated that its value to the industry will become more and more significant over time. At present, big data in the medical field mainly comes from pharmaceutical enterprises, clinical diagnostic data, patient treatment data, health management and social network data. For example, drug development is a data-intensive process; even for small and medium-sized enterprises, drug development data is at the terabyte level or above. Hospital data also grow rapidly every day: a single dual-source CT examination produces about 3,000 images per patient, generating roughly 1.5 GB of image data, a standard pathology examination image is about 5 GB, and together with treatment records, electronic medical records and other patient data, the volume increases quickly every day. Research methods based on the analysis of massive data have prompted reflection on scientific methodology: research need not involve direct contact with the research subject, and new findings can be obtained by directly analyzing and mining massive data, which may give rise to a new mode of scientific research.
Establishing a speech corpus is a complicated undertaking, and issues remain to be addressed as the corpus is refined later, for example making full use of existing inter-word tone sandhi rules and reflecting tone sandhi and neutral tones as faithfully as possible. Deficiencies of the corpus can be compensated in the preprocessing stage by improving the utilization of the existing material. For these reasons, the speech corpus should be an open database that can be added to and modified at any time so as to complete it. Because speech conditions differ, building any specific speech corpus will encounter various difficulties; the approach discussed herein is only one way of establishing a speech corpus, and it is hoped that it can provide data support for speech research and play a role in the further development and improvement of speech corpora.
In addition, while large data volume is undoubtedly a major advantage of network big data analysis, guaranteeing the quality of massive data and cleaning, managing and analyzing it are also major technical difficulties of this subject. Massive network big data is multi-source and heterogeneous, interactive, time-sensitive, bursty and noisy, so it is of great value but also noisy and of low value density. This poses a significant challenge to ensuring data quality in network big data analysis research.
Disclosure of Invention
The invention provides a method for establishing and analyzing a speech library for motor dysarthria in a big data context, addressing the technical problem that, although large data volume is a major advantage of network big data analysis, ensuring the quality of massive data and cleaning, managing and analyzing it remain major technical difficulties.
In order to solve the technical problems, the invention adopts the following scheme:
A method for establishing and analyzing a speech library for motor dysarthria in a big data context comprises the following steps: step 1, designing pronunciation texts;
step 2, recording speech;
step 3, labeling the speech files;
step 4, analyzing acoustic parameters of the speech files;
step 5, establishing a database management system;
and step 6, data analysis using big data technology.
Preferably, the big data analysis in step 6 is based on a speech classification mechanism on a Hadoop platform and specifically includes the following sub-steps:
step 61, collecting speech files from multiple patients, segmenting and labeling the speech segments, constructing a speech database, analyzing the extracted acoustic parameters, and obtaining effective features for speech classification;
step 62, on the Hadoop platform, subdividing the big data speech classification problem with the Map function and solving the sub-problems in parallel and distributed across multiple nodes to obtain the corresponding speech classification results;
and step 63, finally combining the classification results of the sub-problems with the Reduce function, so as to meet the online requirements of big data speech classification.
Preferably, the design of the pronunciation texts in step 1 includes selection of the pronunciation texts, and the selection principles of the pronunciation-text corpus include one or more of the following:
a. the single characters in the corpus should cover all phonological phenomena as far as possible, so as to better reflect the phonological characteristics of different patients' speech;
b. the vocabulary in the corpus is based on the standard Chinese survey word list, so that it can be conveniently compared with Standard Chinese;
c. sentences in the corpus are mainly obtained through dialogue with the patient on several related topics, which better matches the real situations faced by speech recognition; "several related topics" include daily-life topics or medical-history topics, such as questions about the time of first onset and the medical history;
d. sentences in the corpus are complete in content and meaning, so that the prosodic information of a whole sentence can be reflected as far as possible;
e. triphones are selected without classification, which effectively alleviates the problem of sparse training data.
Preferably, the design of the pronunciation texts in step 1 further includes compilation of the pronunciation texts, and the compilation principles include one or more of the following:
a. single-character part: commonly used characters covering the initials, finals and tones listed in the survey word list are taken as the main corpus recorded for the speech library;
b. vocabulary part: based on, but not limited to, a four-thousand-word list, relevant words are recorded according to existing conclusions about the sound system concerned, so that phonetic features, including segmental and suprasegmental characteristics, can be comprehensively reflected; example words may be added to reflect particularly characteristic phonetic phenomena. "Recording relevant words according to conclusions about the sound system" refers to a general vocabulary summarized from the sounds, combination rules, rhythm and intonation used within the same language.
A characteristic phonetic phenomenon refers to sounds that are easily misread in a dialect, such as difficulty distinguishing dental sibilants from retroflex sibilants, or failure to distinguish f from h.
c. sentence material part: the amount of material is determined by the language proficiency of different speakers, and the material is chosen to be representative while covering as wide a range as possible; "representative" here refers to typical sentences that characterize dysarthric speech.
d. natural conversation part: 20-40 minutes of speech material is recorded from the speaker in the form of question answering and free conversation, covering everyday spoken words that differ from Standard Chinese, with the speaker required to speak in dialect.
Preferably, the speech recording in step 2 includes determination of the speaker. The speaker is selected to be a native speaker with clear articulation, a moderate speech rate ("moderate" meaning a rate controlled at about 150 words per minute) and proficient command of the local language, who is willing to cooperate actively with the investigation, whose language environment is relatively stable, and who has a certain level of education; and/or the speech recording further includes speech acquisition with a recording device, in two modes: one is reading aloud with a prompt text, the prompt being written Chinese material that the speaker converts into his or her own native language and reads aloud; the other is natural speech, in which the speaker, given prompts, tells folk stories, describes local folk life, or hums local folk songs.
Preferably, the analysis of acoustic parameters of the speech files in step 4 includes phonetic labeling of the speech library. The basic labeling includes segmentation and alignment of the initial and final of each syllable and labeling of the initials and finals, and comprises two parts. The first part is character labeling, where "Chinese character + pinyin" transcribes the pronunciation; recording the speech information in Chinese characters both serves the recognition system and provides material for linguistic research. Character labeling must mark the basic character information and any paralinguistic phenomena, which can be represented in the basic labels by general paralinguistic symbols. The second part is syllable labeling; Mandarin syllables are labeled with standard Mandarin syllable labels, and the syllable labels carry tone marks: 0 indicates the neutral tone, 1 the yin-ping (first) tone, 2 the yang-ping (second) tone, 3 the shang (third) tone, and 4 the qu (falling, fourth) tone.
Preferably, the analysis of acoustic parameters of the speech files in step 4 further includes extraction of acoustic parameters: first, the recorded speech is segmented and silent sections are removed, so that the analyzed objects are single characters, phrases, sentences and conversations; then the start and end points of the speech signal are determined in the waveform data and the speech is labeled; finally, the corresponding fundamental frequency and formant acoustic parameters are obtained using an autocorrelation algorithm.
Preferably, the establishment of the database management system in step 5 includes selection of a database; an SQL database management system, which is easier to implement, is selected.
A big data speech classification method based on a Hadoop platform comprises the following steps: the establishment method described above is used to build a speech library; on the basis of the speech library, the Map function is used on the Hadoop platform to subdivide the big data speech classification problem, and the sub-problems are solved in parallel and distributed across multiple nodes to obtain the corresponding speech classification results; finally, the Reduce function combines the classification results of the sub-problems so as to meet the online requirements of big data speech classification.
The specific steps are as follows:
(1) the Client submits a speech classification job to the JobTracker of the Hadoop platform, and the JobTracker copies the speech feature data to the distributed file system;
(2) the speech classification job is initialized and placed in a task queue, and the JobTracker assigns tasks to the corresponding nodes (TaskTrackers) according to the processing capacity of the different nodes;
(3) according to its assigned tasks, each TaskTracker uses a support vector machine to fit the relation between the speech features to be classified and the speech feature library, obtaining the class of each speech sample;
(4) the class of each speech sample is taken as a Key/Value pair and stored on the local disk;
(5) intermediate results with the same Key are merged and handed to Reduce for processing to obtain the speech classification results, which are written to the distributed file system;
(6) the JobTracker clears the task state, and the user obtains the speech classification results from the distributed file system.
The method for establishing and analyzing a speech library for motor dysarthria in a big data context has the following beneficial effects:
(1) the invention studies the speech characteristics of patients with motor dysarthria caused by nervous system diseases; relying on the advantages of an open network platform, it enables measurement of large-scale groups and collection of related information, establishes speech libraries covering Mandarin, dialects, healthy speakers and patients, and on this basis builds a word library suitable for diagnosing the condition of patients with motor dysarthria.
(2) As the speech library is continuously expanded, a rich data resource center is ultimately established from information such as Mandarin, dialects, different medical histories and different disease conditions, providing a means of autonomous online diagnosis for patients with nervous system diseases, assisting doctors in clinical diagnosis and treatment, and providing a rich and accurate data platform for quantifying the conditions of nervous system diseases.
(3) On the basis of the speech library, the Map function is used on the Hadoop platform to subdivide the big data speech classification problem, and the sub-problems are solved in parallel and distributed across multiple nodes to obtain the corresponding speech classification results; finally, the Reduce function combines the classification results of the sub-problems so as to meet the online requirements of big data speech classification.
Drawings
FIG. 1: an example of the phonetic annotation of the syllable "bao" in an embodiment of the invention.
FIG. 2: formant data of the syllable "bao" in an embodiment of the invention.
FIG. 3: the basic framework of the Hadoop platform in an embodiment of the invention.
FIG. 4: the big data speech classification flow based on the Hadoop platform in an embodiment of the invention.
Detailed Description
The invention is further illustrated below with reference to fig. 1 to 4:
the voice library is composed of an unvoiced sound library, a voiced sound library, a tone library, a voice synthesis program and a Chinese-pinyin conversion program.
1. Establishing the unvoiced sound library:
The quality of the synthesized speech is improved by exploiting the characteristics of unvoiced sounds. The unvoiced sound library is built by direct sampling: the unvoiced portions preceding the voiced segments of the various pinyin combinations are sampled to form the unvoiced library. Since the unvoiced part of a syllable actually occupies only a small fraction of it, an unvoiced library built from the unvoiced sounds extracted from 400 syllables occupies little storage space.
2. Establishing the voiced sound library:
Voiced sounds are synthesized by a voiced-sound synthesis program that calls the VTFR of each voiced sound. The voiced sound library therefore actually consists of the VTFRs of the various voiced sounds: a VTFR extraction program extracts the VTFR of each voiced sound in turn, and the VTFRs are stored together with the voiced-sound synthesis program in one data packet, forming the voiced sound library. Each extracted VTFR is only a single curve, so the voiced sound library it forms occupies very little space.
The establishment of the speech corpus mainly comprises the following five processes: designing pronunciation texts; recording speech; analyzing parameters of the speech files; establishing a database management system; and data analysis with big data technology.
1. Designing pronunciation texts
1.1 Selection of pronunciation texts:
How to select the corpus is the key to corpus construction. To ensure that the database-building work is orderly and effective and that the corpus is of high quality, selection principles were formulated before the corpus was built. The selection principles of the speech corpus are as follows: first, the single characters in the corpus should cover all phonological phenomena as far as possible, so as to better reflect the phonological characteristics of the dialect speech; second, the vocabulary in the corpus is based on the standard Chinese survey word list, so that it can be conveniently compared with Standard Chinese; third, sentences in the corpus are mainly selected from spoken-language material, which better matches the real situations faced by speech recognition; fourth, sentences in the corpus are complete in content and meaning, so that the prosodic information of a whole sentence can be reflected as far as possible; and fifth, triphones are selected without classification, which effectively alleviates the problem of sparse training data.
1.2 Compilation of pronunciation texts:
The compilation of pronunciation texts is one of the key links in establishing a speech database. When determining the pronunciation material, the principles cover four parts. The first is the single-character part: commonly used characters covering the initials, finals and tones listed in the survey word list are taken as the main corpus recorded for the speech library. The second is the vocabulary part: based on, but not limited to, a four-thousand-word list, relevant words are recorded according to existing conclusions about the sound system concerned, so that phonetic features, including segmental and suprasegmental characteristics, can be comprehensively reflected; example words may be added to reflect particularly characteristic phonetic phenomena. The third is the sentence material part: the amount of material is determined by the language proficiency of different speakers, and the material is chosen to be representative while covering as wide a range as possible. The fourth is the natural conversation part: on topics of daily life, about half an hour of speech material is recorded from the speaker in the form of question answering and free conversation, covering everyday spoken words that differ from Standard Chinese, with the speaker required to speak in dialect.
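Purely as an illustration of how the four parts of the pronunciation text could be organized as corpus records, a minimal sketch is given below; the record layout and field names are assumptions, not part of the claimed method.

```python
# Hypothetical record layout for the four parts of the pronunciation text
# (single characters, vocabulary, sentences, natural conversation).
# Field names and the example items are illustrative only.
from dataclasses import dataclass, field

@dataclass
class CorpusItem:
    item_id: str                 # e.g. "C0001"
    part: str                    # "char" | "word" | "sentence" | "conversation"
    text: str                    # prompt text shown to the speaker
    pinyin: str                  # Mandarin pinyin with tone digits
    target_features: list = field(default_factory=list)

corpus = [
    CorpusItem("C0001", "char", "宝", "bao3",
               ["initial b", "final ao", "tone 3"]),
    CorpusItem("S0001", "sentence", "我去医院了。",
               "wo3 qu4 yi1 yuan4 le0",
               ["sentence prosody", "neutral tone"]),
]
```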
2. Recording speech
2.1 Determination of the speaker:
The speaker is selected to be a native speaker with clear articulation, a moderate speech rate and proficient command of the local language, who is willing to cooperate actively with the investigation; the speaker's language environment should be relatively stable and the speaker should have a certain level of education.
2.2 Speech collection:
The speaking mode during recording directly determines the purpose the speech library can serve. Because of the particular nature of the corpus being collected, two modes are adopted according to the different research purposes: one is reading aloud with a prompt text, the prompt being written Chinese material that the speaker converts into his or her own native language and reads aloud; the other is natural speech, in which the speaker, given prompts, tells folk stories, describes local folk life, hums local folk songs, and so on.
3. Parameter analysis of the speech files:
After the pronunciation texts have been recorded, the speech data must be analyzed to obtain the different features of the speech signal; this is key to the design of the speech corpus and a necessary basis for later speech processing. The invention focuses on speech information, so the basic attributes of the speech waveform must be labeled and the relevant acoustic parameters extracted at the same time.
3.1 Annotation of the speech library:
Speech annotation is carried out hierarchically with the Praat software, with reference to the Chinese segmental labeling system SAMPA-C. The labels in the speech library comprise character labels and syllable labels; the syllable "bao" is taken as an example, as shown in FIG. 1.
The first part is character labeling, where "Chinese character + pinyin" transcribes the pronunciation; recording the speech information in Chinese characters both serves the recognition system and provides material for linguistic research. Character labeling must mark the basic character information and any paralinguistic phenomena, which can be represented in the basic labels by general paralinguistic symbols.
The second part is syllable labeling; Mandarin syllables are labeled with standard Mandarin syllable labels, and the syllable labels carry tone marks: 0 indicates the neutral tone, 1 the yin-ping (first) tone, 2 the yang-ping (second) tone, 3 the shang (third) tone, and 4 the qu (falling, fourth) tone.
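A minimal sketch of the two-tier labeling described above (character tier plus tone-marked syllable tier), using the syllable "bao" of FIG. 1 as an example, is given below; the record structure, the tone digit chosen and the boundary times are assumptions, and in practice such tiers would typically be stored as Praat TextGrid files.

```python
# Hypothetical two-tier annotation record for the syllable "bao" (FIG. 1).
# Tone digits follow the convention in the text: 0 neutral, 1 yin-ping,
# 2 yang-ping, 3 shang (third) tone, 4 qu (falling) tone.
TONE_NAMES = {0: "neutral", 1: "yin-ping", 2: "yang-ping", 3: "shang", 4: "qu"}

annotation = {
    "character_tier": {"hanzi": "宝", "pinyin": "bao3"},   # character + pinyin
    "syllable_tier": [                                     # boundaries in seconds (illustrative values)
        {"label": "b",   "start": 0.000, "end": 0.085},    # initial
        {"label": "ao3", "start": 0.085, "end": 0.410},    # final with tone digit
    ],
    "paralinguistic": [],                                  # e.g. cough or laughter markers
}

tone = int(annotation["character_tier"]["pinyin"][-1])
print(TONE_NAMES[tone])    # -> "shang"
```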
3.2 Extraction of acoustic parameters:
For the recorded speech signals, the acoustic parameters of each segment must be extracted. In practice, the recording is first segmented and silent sections are removed, so that the analyzed objects are single characters; then the start and end points of the speech signal are determined in the waveform data and the range of the final is marked; finally, the corresponding fundamental frequency and formant data are obtained using an autocorrelation algorithm, taking the syllable "bao" as an example, as shown in FIG. 2.
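A minimal sketch, under assumed parameter values, of the endpoint detection and autocorrelation-based fundamental-frequency estimation described above; the frame length, energy threshold and 75-500 Hz pitch range are illustrative assumptions, and formant extraction (commonly done with LPC in tools such as Praat) is omitted here.

```python
# Hypothetical endpoint detection + autocorrelation F0 estimation for one
# labeled segment. Frame length, thresholds and pitch range are assumptions.
import numpy as np
from scipy.io import wavfile

def estimate_f0(path, frame_ms=40, fmin=75, fmax=500):
    sr, x = wavfile.read(path)
    x = x.astype(np.float64) / 32768.0
    n = int(sr * frame_ms / 1000)

    # crude endpoint detection: keep frames above 5% of the peak frame energy
    frames = [x[i:i + n] for i in range(0, len(x) - n, n // 2)]
    energy = np.array([np.mean(f ** 2) for f in frames])
    active = [f for f, e in zip(frames, energy) if e > 0.05 * energy.max()]

    f0 = []
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    for f in active:
        f = f - f.mean()
        ac = np.correlate(f, f, mode="full")[len(f) - 1:]    # non-negative lags
        if ac[0] <= 0:
            continue
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))  # strongest period in range
        f0.append(sr / lag)
    return float(np.median(f0)) if f0 else None              # median F0 (Hz) of voiced frames
```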
4. Establishing a database management system:
4.1 Database selection
Regarding the choice of database: the speech database must store a large amount of speech waveform data, which is characterized by large volume and variable length, while the requirements on transaction processing and recovery, security, network support and the like are relatively low. An SQL database management system, which is easier to implement, can therefore be chosen.
4.2 Creation of the database management system
The database management system of the speech corpus needs to store four kinds of material: first, speaker attribute data, such as the speaker's age, gender, education, command of Standard Chinese and native-language use; second, pronunciation text material, recording the texts read by the speaker together with the corresponding dialect pronunciations and the Mandarin International Phonetic Alphabet transcriptions; third, the actual speech data, mainly the raw parameters of the recorded speech waveforms; and fourth, acoustic analysis parameter data, i.e. the acoustic parameters extracted from the processed speech waveforms.
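As one possible illustration of the four storage categories, a minimal SQLite sketch is given below; any SQL database management system could be substituted, and the table and column names are assumptions rather than part of the claimed design.

```python
# Hypothetical SQLite schema for the four kinds of material described above:
# speaker attributes, pronunciation texts, raw speech data, and extracted
# acoustic parameters. Table and column names are illustrative only.
import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS speaker (
    speaker_id   INTEGER PRIMARY KEY,
    age          INTEGER,
    gender       TEXT,
    education    TEXT,
    mandarin     TEXT,      -- command of Standard Chinese
    native_lang  TEXT       -- native language / dialect use
);
CREATE TABLE IF NOT EXISTS prompt_text (
    text_id      INTEGER PRIMARY KEY,
    content      TEXT,      -- prompt shown to the speaker
    dialect_ipa  TEXT,      -- dialect pronunciation (IPA)
    mandarin_ipa TEXT       -- Mandarin pronunciation (IPA)
);
CREATE TABLE IF NOT EXISTS recording (
    rec_id       INTEGER PRIMARY KEY,
    speaker_id   INTEGER REFERENCES speaker(speaker_id),
    text_id      INTEGER REFERENCES prompt_text(text_id),
    wav_path     TEXT,      -- location of the raw waveform
    sample_rate  INTEGER
);
CREATE TABLE IF NOT EXISTS acoustic_param (
    rec_id       INTEGER REFERENCES recording(rec_id),
    f0_hz        REAL,      -- fundamental frequency
    formant1_hz  REAL,
    formant2_hz  REAL
);
"""

conn = sqlite3.connect("dysarthria_corpus.db")
conn.executescript(schema)
conn.close()
```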
5. Data analysis with big data technology
Big data refers to data sets so large that they greatly exceed the acquisition, storage, management and analysis capabilities of traditional database software tools, and it has four characteristics: large scale, rapid data flow, diverse data types and low value density. The strategic significance of big data technology lies not in possessing huge amounts of data but in the specialized processing of meaningful data. In other words, if big data is compared to an industry, the key to profitability is improving the "processing capability" of the data and achieving "value added" through that processing. In the construction of the word library, the important value of big data technology is that targeted analysis and study of the data make it possible to evaluate the quality of the speech elements in the library, so that the library becomes more complete.
Sharing the word library through a network platform facilitates testing of different populations and obtaining more data samples, thereby enriching the speech library. In the future, more targeted word libraries for patients with motor dysarthria can be established for different regions and different dialects, providing richer and more reliable data samples for subsequent automatic recognition of disease classification and grading.
As shown in FIG. 3, a speech classification mechanism based on the Hadoop platform is proposed. It includes collecting a large number of speech files, constructing a speech database, and extracting effective features for speech classification; then, on the Hadoop platform, the big data speech classification problem is subdivided with the Map function and the sub-problems are solved in parallel and distributed across multiple nodes to obtain the corresponding speech classification results; finally, the Reduce function combines the classification results of the sub-problems so as to meet the online requirements of big data speech classification.
As shown in FIG. 4, the big data speech classification flow based on the Hadoop platform comprises the following specific steps:
(1) the Client submits a speech classification job to the JobTracker of the Hadoop platform, and the JobTracker copies the speech feature data to the distributed file system;
(2) the speech classification job is initialized and placed in a task queue, and the JobTracker assigns tasks to the corresponding nodes (TaskTrackers) according to the processing capacity of the different nodes;
(3) according to its assigned tasks, each TaskTracker uses a support vector machine to fit the relation between the speech features to be classified and the speech feature library, obtaining the class of each speech sample;
(4) the class of each speech sample is taken as a Key/Value pair and stored on the local disk;
(5) intermediate results with the same Key are merged and handed to Reduce for processing to obtain the speech classification results, which are written to the distributed file system;
(6) the JobTracker clears the task state, and the user obtains the speech classification results from the distributed file system.
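The flow above could be prototyped, for example, with Hadoop Streaming, where the Map and Reduce steps are plain scripts reading from standard input and the script is submitted with the standard hadoop-streaming jar. The sketch below is only such an illustration: it assumes a pre-trained scikit-learn SVM saved as svm_model.pkl, an input format of "utterance_id<TAB>comma-separated features", and a simple "class<TAB>count" key/value convention, none of which are prescribed by the method.

```python
# Hypothetical Hadoop Streaming mapper/reducer for the classification flow.
# Assumptions: each input line is "utterance_id<TAB>comma-separated features";
# svm_model.pkl is a scikit-learn SVC trained offline on the feature library
# and shipped to each node (e.g. via the -files option).
import sys
import pickle
import numpy as np

def mapper():
    with open("svm_model.pkl", "rb") as fh:
        svm = pickle.load(fh)
    for line in sys.stdin:
        utt_id, feats = line.rstrip("\n").split("\t")
        x = np.array([float(v) for v in feats.split(",")]).reshape(1, -1)
        label = svm.predict(x)[0]          # step (3): SVM assigns a class
        print(f"{label}\t1")               # step (4): emit class as Key, count as Value

def reducer():
    counts = {}
    for line in sys.stdin:                 # step (5): merge identical keys
        label, cnt = line.rstrip("\n").split("\t")
        counts[label] = counts.get(label, 0) + int(cnt)
    for label, cnt in counts.items():
        print(f"{label}\t{cnt}")           # written back to the distributed file system

if __name__ == "__main__":
    (mapper if sys.argv[1:] == ["map"] else reducer)()
```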
The invention has been described above with reference to the accompanying drawings. It is obvious that implementation of the invention is not limited to the manner described above; various modifications of the inventive method concept and technical solution, or direct application of the inventive concept and solution to other situations without modification, all fall within the protection scope of the invention.