
CN111583914B - Big data voice classification method based on Hadoop platform - Google Patents

Info

Publication number
CN111583914B
Authority
CN
China
Prior art keywords
voice
speech
big data
classification
hadoop platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010395559.4A
Other languages
Chinese (zh)
Other versions
CN111583914A (en)
Inventor
杜炜
马春
谷宗运
陈鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Traditional Chinese Medicine AHUTCM
Original Assignee
Anhui University of Traditional Chinese Medicine AHUTCM
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Traditional Chinese Medicine AHUTCM
Priority to CN202010395559.4A
Publication of CN111583914A
Application granted
Publication of CN111583914B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65 Clustering; Classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Public Health (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a big data speech classification method based on the Hadoop platform, comprising the following steps: step 1, constructing a speech library; step 2, on the basis of the speech library and the Hadoop platform, using a Map function to subdivide the big data speech classification problem into sub-problems, and solving the sub-problems in parallel on multiple distributed nodes to obtain the corresponding speech classification results; step 3, using a Reduce function to combine the speech classification results of the sub-problems, so as to meet the online requirements of big data speech classification.

Description

A Big Data Speech Classification Method Based on the Hadoop Platform

Technical Field

The invention relates to a big data speech classification method based on the Hadoop platform.

Background Art

(1) Research status of motor dysarthria:

Motor dysarthria refers to a group of speech disorders caused by disturbed muscle control resulting from damage to the central or peripheral nervous system. It typically manifests as slowed, weakened, imprecise, and uncoordinated movement of the speech-related musculature, and may also affect respiration, resonance, control of laryngeal phonation, articulation, and prosody. Clinically it is often simply called dysarthria. Common causes include traumatic brain injury, cerebral palsy, amyotrophic lateral sclerosis, multiple sclerosis, stroke, Parkinson's disease, and spinocerebellar ataxia. According to neuroanatomical and speech-acoustic characteristics, dysarthria can be divided into flaccid, spastic, ataxic, hypokinetic, hyperkinetic, and mixed types. Among communication disorders associated with brain injury, the incidence of dysarthria is as high as 54%. At present, clinical examination of voice, resonance, and prosody can reflect the speech-acoustic characteristics of dysarthria from both subjective and objective perspectives, which helps to provide targeted treatment and to explain the speech-acoustic pathology of dysarthria comprehensively and scientifically.

Few studies, in China or abroad, report the overall incidence of motor dysarthria. In a study of 125 patients with Parkinson's disease, Miller et al. found that 69.6% of patients had mean speech intelligibility lower than that of the normal control group, with 51.2% of patients lower by one standard deviation, indicating a high incidence of dysarthria among Parkinson's patients. Bogousslavsky et al. screened 1,000 first-time stroke patients and found that as many as 46% had speech impairment, with 12.4% diagnosed with dysarthria. Hartelius et al. found an incidence of dysarthria of 51% among patients with multiple sclerosis. The incidence of dysarthria is therefore high. There is currently no unified method in China for assessing dysarthria, and no dedicated assessment standard for motor dysarthria. Most practitioners use the Frenchay Dysarthria Assessment or a modified version of it, together with the dysarthria checklist of the China Rehabilitation Research Center; a clinician or rehabilitation physician examines, scores, records, and evaluates the degree and type of dysarthria.

(2) Research status of speech databases in China:

With the development of information technology and computer science, speech technology has made interaction between machines and human natural language possible. Whether the task is speech synthesis, speech recognition, or speaker identification, research necessarily depends on a well-built speech corpus at the back end. Speech databases outside China are relatively mature, and research on Chinese speech databases has advanced rapidly over the past decade or so, with databases studied and built for different languages and cultural contexts. The construction of a speech database for motor dysarthria, however, is still at the research stage.

Research in China on the assessment of articulation and speech function has focused mainly on subjective assessment, and only a few researchers distinguish between the concepts of articulation and speech. Huang Zhaoming et al. proposed the "Chinese Articulation Ability Test Word List", which contains 50 characters. By evaluating a subject's articulation of these 50 characters, a speech therapist can comprehensively assess the subject's ability to articulate the 21 initials and 4 tones, and can assess the subject's phoneme contrast ability through 18 phoneme contrasts and 37 minimal pairs. Chen Sanding et al. evaluated the Mandarin initials, finals, and tones of 50 deaf children, revealed how articulation develops in Mandarin-speaking deaf children, and further proposed the speech rehabilitation education principles of "early, sequential, fault-tolerant, and consolidated". Dr. Zhang Jing of East China Normal University studied the main error patterns of hearing-impaired children in consonant articulation, analyzed their causes, and proposed a corresponding framework for consonant phoneme therapy for hearing-impaired children.

(3) Research status of big data in the medical field:

At present, a widely used definition of big data is: data that exceeds the ability of typical database software tools to capture, store, process, and analyze. Big data differs from traditional concepts such as very-large-scale data and massive data, and has four basic characteristics: volume, variety, timeliness, and value. In its "Guiding Opinions on Actively Promoting the 'Internet Plus' Action", the State Council called for vigorously developing emerging models of online and offline interaction carried by the Internet, and for accelerating the development of emerging services such as Internet-based medical and health care. Kayyali B et al. studied the impact of big data on the medical industry in the United States and pointed out that its value to the industry will become increasingly significant over time. Big data in the medical field currently comes mainly from pharmaceutical companies, clinical diagnosis data, patient visit data, health management, and social network data. For example, drug research and development is a relatively data-intensive process; even for small and medium-sized enterprises, the data for a single drug development project exceeds the terabyte scale. Hospital data also grows very quickly every day: a single dual-source CT examination of one patient produces about 3,000 images, or roughly 1.5 GB of image data, and a standard pathology examination produces nearly 5 GB of images. Together with patient visit data and electronic medical records, the volume grows rapidly every day. Research methods based on the analysis of massive data have prompted reflection on scientific methodology. Research need not involve direct contact with the research subject; new findings can be obtained by directly analyzing and mining massive data, which may give rise to a new model of scientific research.

Building a speech corpus is a tedious and complex task, and some issues remain to be improved in its later refinement, such as making full use of existing inter-word tone sandhi rules so that tone sandhi and the neutral tone are reflected as they actually occur. Where corpus material is insufficient, the utilization of existing material can be improved at the preprocessing stage. For these reasons, the speech library should be an open database, so that entries can be added and modified at any time and the database can be improved. Because speech conditions vary, building any specific speech corpus will encounter a variety of difficulties. The issues discussed here are only an exploration of how to build a speech corpus, in the hope of providing data support for speech research and contributing to the development of the language and the improvement of speech corpora.

In addition, a large data volume is undoubtedly a major advantage of network big data analysis, but ensuring the quality of massive data, and cleaning, managing, and analyzing it, is also a major technical difficulty of this research. Massive network big data is multi-source and heterogeneous, interactive, time-sensitive, bursty, and noisy. As a result, although its value is great, its noise is also high and its value density is low. This poses a serious challenge to ensuring data quality in network big data analysis.

Summary of the Invention

The present invention provides a big data speech classification method based on the Hadoop platform. The technical problem it addresses is that, although a large data volume is undoubtedly a major advantage of network big data analysis, ensuring the quality of massive data and cleaning, managing, and analyzing it remain a major technical difficulty.

To solve the above technical problem, the present invention adopts the following scheme:

A big data speech classification method based on the Hadoop platform, comprising the following steps:

Step 1. Construct the speech library;

Step 2. On the basis of this speech library and the Hadoop platform, use a Map function to subdivide the big data speech classification problem into sub-problems, and solve the sub-problems in parallel on multiple distributed nodes to obtain the corresponding speech classification results;

Step 3. Finally, use a Reduce function to combine the speech classification results of the sub-problems, so as to meet the online requirements of big data speech classification.

Preferably, step 2 comprises the following: (1) the Client submits a speech classification job to the Job Tracker of the Hadoop platform, and the Job Tracker copies the speech feature data into the local distributed file processing system;

(2) the speech classification job is initialized and placed in the task queue, and the Job Tracker assigns tasks to the corresponding nodes, that is, the Task Trackers, according to the processing capacity of each node;

(3) each Task Tracker, according to its assigned task, uses a support vector machine to fit the relationship between the speech features to be classified and the speech feature library, and obtains the category of each utterance;

(4) the category of each utterance is saved as a Key/Value pair on the local disk;

(5) intermediate speech classification results that have the same Key/Value are merged, and the merged results are passed to Reduce for processing to obtain the speech classification results, which are written to the distributed file processing system;

(6) the Job Tracker clears the job status, and the user retrieves the speech classification results from the distributed file processing system.
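The six-part flow above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it does not call any Hadoop API, the function names (`split`, `map_task`, `reduce_task`, `classify_job`) are hypothetical, and a nearest-centroid rule stands in for the support vector machine that the method actually specifies, purely to keep the example dependency-free.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Feature = Tuple[float, ...]
Record = Tuple[str, Feature]  # (utterance id, feature vector)


def make_centroid_classifier(library: Dict[str, List[Feature]]) -> Callable[[Feature], str]:
    """Stand-in for the per-node classifier (the method itself uses an SVM)."""
    centroids = {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in library.items()
    }

    def classify(x: Feature) -> str:
        # Pick the label whose centroid has the smallest squared distance to x.
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    return classify


def split(records: List[Record], n_nodes: int) -> List[List[Record]]:
    """Round-robin split of the job into one sub-problem per node."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def map_task(chunk: List[Record], classify: Callable[[Feature], str]) -> List[Tuple[str, str]]:
    """Map: classify each utterance and emit (category, utterance id) pairs."""
    return [(classify(feat), utt_id) for utt_id, feat in chunk]


def reduce_task(intermediate: Iterable[List[Tuple[str, str]]]) -> Dict[str, List[str]]:
    """Reduce: merge pairs that share a key into category -> sorted utterance ids."""
    merged: Dict[str, List[str]] = defaultdict(list)
    for pairs in intermediate:
        for key, value in pairs:
            merged[key].append(value)
    return {key: sorted(values) for key, values in merged.items()}


def classify_job(records: List[Record], library: Dict[str, List[Feature]],
                 n_nodes: int = 3) -> Dict[str, List[str]]:
    """Run the split -> map -> reduce pipeline in a single process."""
    classify = make_centroid_classifier(library)
    return reduce_task(map_task(chunk, classify) for chunk in split(records, n_nodes))
```

In a real deployment the map and reduce functions would run as Hadoop tasks on separate nodes and read and write the distributed file system; here they are ordinary function calls so that the data flow is easy to follow.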

Preferably, constructing the speech library in step 1 comprises the following steps: step 11, designing the pronunciation text; step 12, recording the speech; step 13, annotating the speech files; step 14, analyzing the acoustic parameters of the speech files; step 15, establishing the database management system.

Preferably, designing the pronunciation text in step 11 includes selecting the pronunciation text, and the principles for selecting the corpus of pronunciation text include one or more of the following:

a. The single characters in the corpus should cover all initial and final phenomena as far as possible, so that they reflect the phonological characteristics of different patients' speech better and more conveniently;

b. The vocabulary in the corpus is based on the commonly used Chinese survey word list, so it can be compared easily with Mandarin Chinese;

c. The sentences in the corpus are obtained mainly from conversations with patients on several relevant topics, so they are closer to the real situations that speech recognition faces; the "several relevant topics" include daily life or medical history, for example asking about the time of first onset and the medical history;

d. The sentences in the corpus are complete in content and meaning, so they reflect the prosodic information of a sentence as fully as possible;

e. Triphones are selected without being grouped into classes, which effectively addresses the problem of sparse training data.

Preferably, designing the pronunciation text in step 11 further includes compiling the pronunciation text, and the principles for compiling it include one or more of the following:

a. Single characters: commonly used characters for the initials, finals, and tones listed in the survey character list serve as the main recording material for this speech library;

b. Vocabulary: based on a word list of at least four thousand words, relevant words are recorded according to existing conclusions about the relevant phonology, aiming to reflect its phonetic characteristics fully, including segmental and suprasegmental features; for particularly distinctive phonetic phenomena, example words may be added to reflect them. Here, "recording relevant words according to conclusions about the relevant phonology" means compiling commonly used vocabulary according to the sounds used in a given language, their combination rules, and the characteristics of rhythm and intonation. "Distinctive phonetic phenomena" refers to sounds that are easily mispronounced in dialects, such as flat-tongue and retroflex sounds that are hard to tell apart, or f and h not being distinguished.

c. Sentence material: the amount of material is determined by each speaker's command of the language. The selection should cover as wide a range as possible while remaining representative; "representative" here means sentences that are broadly typical and reflect the speech characteristics of motor dysarthria.

d. Natural dialogue: with daily life as the topic, 20-40 minutes of speech material is recorded from each speaker in the form of question answering and free conversation; where everyday spoken vocabulary differs from Mandarin, the speaker is asked to say it in dialect.

Preferably, the speech recording in step 12 includes selecting the speakers. The selection principle is to choose native speakers who articulate clearly, speak at a moderate rate (120-150 characters per minute), are proficient in the local language, and are willing to cooperate actively with the survey; their language environment should be relatively stable, and they should have a certain level of education;

and/or, the speech recording further includes speech collection by a speech collector, performed in two ways: one is reading aloud from prompt text, where the prompt is written material in Chinese and the speaker converts it into his or her native language and reads it aloud; the other is natural speech, where the speaker uses prompts to tell folk stories, describe local living conditions, and hum local folk songs.

Preferably, the acoustic parameter analysis of the speech files in step 14 includes speech annotation of the speech library. Basic speech annotation includes segmenting and aligning the initial and final of each syllable and annotating the initial, final, and tone. It consists of two parts:

The first part is text annotation: Chinese characters plus pinyin, that is, a transcription of characters and their pronunciation. The speech information is recorded in Chinese characters so that it can be supplied to the recognition system and can also serve as material for linguistic research. Text annotation must mark the basic textual information and paralinguistic phenomena; paralinguistic phenomena in the basic annotation can be represented with general paralinguistic symbols;

The second part is syllable annotation. Mandarin syllables are annotated with standard Mandarin syllable labels, and the labels include tone. In tone annotation, 0 denotes the neutral tone, 1 the first tone (yinping), 2 the second tone (yangping), 3 the third tone (shangsheng), and 4 the fourth tone (qusheng).
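The tone codes above lend themselves to a simple lookup. The following is a minimal Python sketch, assuming each syllable label is written as a lowercase pinyin syllable followed by its tone digit (for example `bao3`). That label format and the names `TONE_NAMES`, `Syllable`, `parse_syllable`, and `parse_transcription` are illustrative assumptions; the source only fixes the meaning of the digits 0 to 4.

```python
import re
from typing import List, NamedTuple

# Tone codes as defined for the syllable annotation.
TONE_NAMES = {
    0: "neutral",
    1: "yinping",
    2: "yangping",
    3: "shangsheng",
    4: "qusheng",
}


class Syllable(NamedTuple):
    pinyin: str
    tone: int
    tone_name: str


# Assumed label format: letters (including "ü") followed by one digit 0-4.
_LABEL = re.compile(r"^([a-zü]+)([0-4])$")


def parse_syllable(label: str) -> Syllable:
    """Split a label such as 'bao3' into its pinyin and tone parts."""
    match = _LABEL.match(label.strip().lower())
    if match is None:
        raise ValueError(f"not a tone-annotated syllable: {label!r}")
    tone = int(match.group(2))
    return Syllable(match.group(1), tone, TONE_NAMES[tone])


def parse_transcription(line: str) -> List[Syllable]:
    """Parse a whitespace-separated sequence of syllable labels."""
    return [parse_syllable(token) for token in line.split()]
```

Rejecting labels that lack a tone digit keeps untoned entries from slipping into a corpus that is meant to be tone-annotated throughout.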

Preferably, the acoustic parameter analysis of the speech files in step 14 further includes extracting the acoustic parameters;

First, the recorded speech is segmented and silent segments are removed, so that each object of analysis is a single character or word, phrase, sentence, or dialogue. Next, the start and end of the speech signal are determined in the waveform data and the speech is annotated. Finally, the corresponding acoustic analysis parameters, fundamental frequency and formants, are obtained using an autocorrelation algorithm.
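As an illustration of the last step, the sketch below estimates the fundamental frequency of one voiced segment by picking the lag that maximizes the autocorrelation. It is a simplified example, not the procedure used in the source: the amplitude threshold, the 60-500 Hz search range, and the function names `trim_silence` and `estimate_f0` are assumptions, and formant estimation is not covered.

```python
from typing import List, Optional


def trim_silence(samples: List[float], threshold: float = 0.02) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below the threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]: loud[-1] + 1]


def estimate_f0(samples: List[float], sample_rate: int,
                f_min: float = 60.0, f_max: float = 500.0) -> Optional[float]:
    """Estimate F0 in Hz from the autocorrelation peak, or None if too short."""
    lag_min = int(sample_rate / f_max)  # shortest period considered
    lag_max = int(sample_rate / f_min)  # longest period considered
    if len(samples) <= lag_max:
        return None
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the segment at this lag.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None
```

A typical use is to call `trim_silence` on a recorded segment first and then pass the result to `estimate_f0`. Production systems usually work frame by frame and normalize the autocorrelation; this version processes the whole segment at once for brevity.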

Preferably, establishing the database management system in step 15 includes selecting the database; an SQL database management system, which is relatively easy to implement, is chosen;

and/or, the database management system established in step 15 stores four kinds of material: first, speaker attribute material; second, pronunciation text material, that is, the patient pronunciation material together with text such as the corresponding pronunciation and the International Phonetic Alphabet transcription for Mandarin; third, the actual speech data, which stores the original parameters of the recorded speech waveforms; fourth, acoustic analysis parameter data, that is, the acoustic parameters extracted from the processed speech waveforms.
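One possible layout for these four kinds of material is one table each. The sketch below uses Python's built-in `sqlite3` module as a convenient SQL engine; the choice of SQLite and every table and column name are hypothetical, since the source specifies only that an SQL database management system stores the four kinds of material.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE speaker (              -- 1. speaker attributes
    speaker_id   INTEGER PRIMARY KEY,
    age          INTEGER,
    sex          TEXT,
    dialect      TEXT,
    diagnosis    TEXT
);
CREATE TABLE prompt_text (          -- 2. pronunciation text material
    prompt_id    INTEGER PRIMARY KEY,
    hanzi        TEXT NOT NULL,
    pinyin       TEXT,
    ipa          TEXT
);
CREATE TABLE recording (            -- 3. actual speech data
    recording_id INTEGER PRIMARY KEY,
    speaker_id   INTEGER NOT NULL REFERENCES speaker(speaker_id),
    prompt_id    INTEGER NOT NULL REFERENCES prompt_text(prompt_id),
    wav_path     TEXT NOT NULL,
    sample_rate  INTEGER
);
CREATE TABLE acoustic_params (      -- 4. acoustic analysis parameters
    recording_id INTEGER NOT NULL REFERENCES recording(recording_id),
    f0_hz        REAL,
    f1_hz        REAL,
    f2_hz        REAL
);
"""


def create_corpus_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the four tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping the waveform on disk and storing only its path in `recording` is a common design choice for audio corpora; storing the samples in the database itself would also satisfy the description.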

The big data speech classification method based on the Hadoop platform has the following beneficial effects:

(1) The method addresses technical problems of big data speech classification such as long processing time and poor real-time performance. By using a Hadoop-based classification mechanism that exploits the advantages of cloud computing and aims at an ideal big data speech classification result, it overcomes the drawbacks of current speech classification mechanisms and greatly shortens classification time. The classification speed can meet the online requirements of big data speech classification, and the overall classification performance is clearly better than that of other current speech classification mechanisms.

(2) The invention aims to study the speech characteristics of patients with motor dysarthria caused by nervous system diseases. Relying on the advantages of an open network platform, it can measure large populations and collect related information, build speech libraries for Mandarin, dialects, healthy speakers, patients, and so on, and on that basis build a lexicon suited to diagnosing the condition of patients with motor dysarthria.

(3) As the speech library continues to grow, the invention ultimately builds a rich data resource center organized by Mandarin, dialect, medical history, condition, and other information. This gives patients with nervous system diseases a route to self-diagnosis over the network, can assist doctors in clinical diagnosis and treatment, and provides a rich and accurate data platform for quantifying the condition of nervous system diseases.

(4) On the basis of the speech library and the Hadoop platform, the invention uses a Map function to subdivide the big data speech classification problem into sub-problems and solves them in parallel on multiple distributed nodes to obtain the corresponding speech classification results; finally, a Reduce function combines the speech classification results of the sub-problems, so as to meet the online requirements of big data speech classification.

Description of the Drawings

Figure 1: Example of speech annotation for "bao" in an embodiment of the present invention.

Figure 2: Formant data for the speech "bao" in an embodiment of the present invention.

Figure 3: Basic framework of the Hadoop platform in an embodiment of the present invention.

Figure 4: Flow of the big data speech classification of the present invention based on the Hadoop platform.

Detailed Description

The present invention is further described below with reference to Figures 1 to 4:

语音库由清音库、浊音库、声调库、语音合成程序、汉语—拼音转化程序构成。The speech library is composed of unvoiced sound library, voiced sound library, tone library, speech synthesis program, and Chinese-Pinyin conversion program.

1. Establishment of the unvoiced sound library:

Given the characteristics of unvoiced sounds, and in order to improve the quality of the synthesized speech, the unvoiced sound library is built by direct sampling: the unvoiced part that precedes the voiced segment in each pinyin combination is sampled to form the library. Because the unvoiced portion makes up only a small part of a syllable, an unvoiced sound library built from the unvoiced sounds extracted from the more than 400 toneless syllables takes up very little storage.

2. Establishment of the voiced sound library:

Voiced sounds are synthesized by the voiced sound synthesis program, which calls the VTFR of the corresponding voiced sound. The voiced sound library therefore consists of the VTFRs of the various voiced sounds: a VTFR extraction program extracts the VTFR of each voiced sound in turn, and the VTFRs are stored together with the voiced sound synthesis program in a single data package, which forms the voiced sound library. Each extracted VTFR is just one curve, so a library built this way takes up very little space.

The establishment of the speech corpus of the invention comprises the following five main processes: design of the pronunciation text; speech recording; parameter analysis of the speech files; establishment of a database management system; and data analysis using big data technology.

1. Design of the pronunciation text;

1.1 Selection of the pronunciation text:

Choosing the material is the key to building a corpus. To keep the work orderly and effective and to guarantee the quality of the corpus, the selection principles must be worked out before construction begins. The selection principles for this speech corpus are as follows. First, the single characters in the corpus should cover all initial, final and tone phenomena as far as possible, so that they reflect the phonological features of the dialect well and conveniently. Second, the vocabulary is based on the common word lists used in Chinese language surveys, so it can be compared easily with Mandarin. Third, the sentences are chosen mainly from spoken-language material, so they are closer to the real conditions that speech recognition faces. Fourth, the sentences are complete in content and meaning, so they reflect the prosodic information of a sentence as fully as possible. Fifth, triphones are selected without classification, which effectively addresses the problem of sparse training data.

1.2 Compilation of the pronunciation text:

Compiling the pronunciation text is one of the key steps in building the speech database. Following the selection principles above, the pronunciation material comprises four parts. First, the single-character part: common characters for the initials, finals and tones listed in the survey character table serve as the main recording material for the speech library. Second, the vocabulary part: starting from, but not limited to, a 4,000-word list, relevant words are recorded according to existing conclusions about the phonology concerned, aiming to reflect its phonetic features comprehensively, both segmental and suprasegmental; for particularly distinctive phonetic phenomena, example words may be added to reflect them. Third, the sentence part: the amount of material is decided by each speaker's command of the language, and the selection should be as broad as possible while remaining representative. Fourth, the natural dialogue part: with daily life as the topic, about half an hour of speech is recorded from each speaker through question answering and free conversation; where everyday spoken vocabulary differs from the Mandarin expression, the speaker is asked to say it in dialect.

2. Speech recording;

2.1 Selection of speakers:

Speakers are chosen from native speakers who articulate clearly, speak at a moderate pace, are fluent in the local language, and are willing to cooperate actively with the survey. Their language environment should be relatively stable, and they should have a certain level of education.

2.2 Speech collection:

The speaking style used during recording directly determines what the speech library can be used for. Because of the particular nature of the material being collected, two modes are used depending on the research purpose. One is reading aloud from a prompt text: the prompt is written material in Chinese, which the speaker renders in his or her native language and reads aloud. The other is natural speech: using prompts, the speaker tells folk stories, describes the living conditions of his or her ethnic group, hums local folk songs, and so on.

3. Parameter analysis of the speech files:

Once the pronunciation text has been recorded, the speech data must be analysed and processed to obtain the various features of the speech signal. This is the key to designing the speech corpus and the necessary basis for later speech processing. Because the invention focuses on speech information, the basic attributes of the speech waveform need to be annotated and the relevant acoustic parameters extracted.

3.1 Annotation of the speech library:

Speech annotation is done in Praat, with hierarchical labelling that follows SAMPA-C, the segmental labelling system for Chinese. The annotation of the speech library has two parts, text annotation and tonal syllable annotation. The speech "bao" is taken as an example, as shown in Figure 1.

The first part is text annotation: Chinese characters plus pinyin, i.e. an orthographic and phonetic transcription. The speech content is written down in Chinese characters so that it can be supplied to a recognition system and can also serve as material for linguistic research. Text annotation must record the basic textual information and any paralinguistic phenomena; in the basic annotation, paralinguistic phenomena may be represented with common paralinguistic symbols.

The second part is syllable annotation. Mandarin syllables are annotated with standard Mandarin syllable labels, and the labels carry tone. In the tone labels, 0 denotes the neutral tone, 1 the first tone (yinping), 2 the second tone (yangping), 3 the third tone (shangsheng), and 4 the fourth tone (qusheng).
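The tone digits described above lend themselves to a small parsing helper. The following is a minimal sketch, assuming that a tonal syllable label is written as the pinyin syllable followed by a single tone digit (for example `bao3`); that label format, the function name `parse_tonal_syllable`, and the romanized tone names are illustrative assumptions and are not taken from the patent or from SAMPA-C.

```python
from typing import Tuple

# Tone digits as defined in the annotation scheme above.
TONE_NAMES = {
    0: "neutral",
    1: "yinping",
    2: "yangping",
    3: "shangsheng",
    4: "qusheng",
}


def parse_tonal_syllable(label: str) -> Tuple[str, int, str]:
    """Split a label such as 'bao3' into (syllable, tone digit, tone name).

    The 'syllable + digit' layout is an assumption made for this sketch.
    """
    label = label.strip().lower()
    if len(label) < 2 or not label[-1].isdigit():
        raise ValueError(f"expected a syllable followed by a tone digit: {label!r}")
    syllable, tone = label[:-1], int(label[-1])
    if not syllable.isalpha() or tone not in TONE_NAMES:
        raise ValueError(f"invalid tonal syllable label: {label!r}")
    return syllable, tone, TONE_NAMES[tone]


# Hypothetical label; the tone actually used for "bao" in Figure 1 is not stated.
example = parse_tonal_syllable("bao3")
```

Rejecting digits outside 0 to 4 early keeps malformed labels out of the tone library instead of letting them surface later during analysis.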

3.2 Extraction of acoustic parameters:

The acoustic parameters of each segment also have to be extracted from the recorded speech signal. In practice, the recorded speech is first segmented and its silent stretches are removed, so that each object of analysis is a single character or word. The start and end of the speech signal are then determined in the waveform data, and the extent of the final (rhyme) is marked. Finally, the fundamental frequency and formant data are obtained using the autocorrelation algorithm. The speech "bao" is taken as an example, as shown in Figure 2.
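To make the autocorrelation step concrete, here is a minimal sketch of fundamental-frequency estimation for a single voiced frame. It is an illustration under stated assumptions, not the patent's implementation: the function name `estimate_f0`, the 75 to 500 Hz search range, and the synthetic test tone are all choices made for this example. Formant estimation is not shown, since the text does not specify how it is carried out.

```python
import math


def estimate_f0(samples, sample_rate, f_min=75.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    The autocorrelation is evaluated at every lag in the plausible pitch
    range, and the lag with the largest value is taken as the period.
    """
    n = len(samples)
    mean = sum(samples) / n
    x = [s - mean for s in samples]          # remove any DC offset
    lag_min = int(sample_rate / f_max)       # shortest period considered
    lag_max = min(int(sample_rate / f_min), n - 1)
    best_lag, best_r = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(x[i] * x[i + lag] for i in range(n - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag is None:
        return None                          # no positive peak: treat as unvoiced
    return sample_rate / best_lag


# Synthetic check: a 200 Hz sine sampled at 8 kHz for 50 ms.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * t / sr) for t in range(400)]
f0 = estimate_f0(frame, sr)
```

For the synthetic tone the estimate should come out close to 200 Hz. Real recordings would normally be processed frame by frame after the segmentation and silence removal described above.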

4. Establishment of the database management system:

4.1 Selection of the database

The speech library has to store a large amount of speech waveform data. This data is large in volume and variable in length, and it places relatively low demands on transaction processing and recovery, security, and network support. An SQL database management system, which is relatively easy to implement, can therefore be chosen.

4.2 Establishment of the database management system

The database management system for the speech corpus needs to store four kinds of material. The first is speaker attributes, such as the speaker's age, gender, education, command of Chinese, and use of his or her native language. The second is pronunciation text: the speaker's prompt material together with the corresponding dialect pronunciation, Mandarin pronunciation in the International Phonetic Alphabet, and similar text. The third is the actual speech data, mainly the raw parameters of the recorded speech waveforms. The fourth is acoustic analysis data, i.e. the acoustic parameters extracted from the processed waveforms.
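The four kinds of material map naturally onto four tables. The sketch below uses Python's built-in `sqlite3` module purely as a convenient SQL engine for illustration; the patent only says "SQL database management system", so the choice of SQLite, every table name, and every column name here are assumptions.

```python
import sqlite3

# Hypothetical schema: one table per kind of material listed above.
SCHEMA = """
CREATE TABLE speaker (
    speaker_id            INTEGER PRIMARY KEY,
    age                   INTEGER,
    gender                TEXT,
    education             TEXT,
    mandarin_proficiency  TEXT,
    native_language_use   TEXT
);
CREATE TABLE prompt_text (
    prompt_id             INTEGER PRIMARY KEY,
    prompt                TEXT NOT NULL,
    dialect_pronunciation TEXT,
    mandarin_ipa          TEXT
);
CREATE TABLE recording (
    recording_id          INTEGER PRIMARY KEY,
    speaker_id            INTEGER NOT NULL REFERENCES speaker(speaker_id),
    prompt_id             INTEGER NOT NULL REFERENCES prompt_text(prompt_id),
    sample_rate           INTEGER,
    waveform              BLOB
);
CREATE TABLE acoustic_params (
    recording_id          INTEGER NOT NULL REFERENCES recording(recording_id),
    f0_hz                 REAL,
    f1_hz                 REAL,
    f2_hz                 REAL
);
"""


def create_corpus_db(path=":memory:"):
    """Create the four corpus tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_corpus_db()
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

Keeping waveforms in a `BLOB` column matches the "large, variable-length" description in section 4.1; a production system might instead store file paths and keep the audio on disk.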

5. Data analysis using big data technology

Big data is a collection of data so large that acquiring, storing, managing and analysing it goes far beyond the capabilities of traditional database software tools. It has four defining features: massive data scale, rapid data flow, diverse data types, and low value density. The strategic significance of big data technology lies not in holding huge amounts of data but in processing the meaningful data professionally. In other words, if big data is likened to an industry, the key to making that industry profitable is to improve the capacity to "process" data and to add value to the data through that processing. In building the lexicon, the main value of big data technology is that targeted analysis and study of the data makes it possible to assess the quality of the speech elements in the lexicon, so that the lexicon can be improved.

The lexicon is shared through the network platform so that different groups of people can be tested conveniently; this also yields more data samples and enriches the speech library. In future, more targeted lexicons for patients with motor dysarthria can be built for different regions and dialects, providing richer and more reliable data samples for the subsequent automatic recognition of disease type and severity.

As shown in Figure 3, a speech classification mechanism based on the Hadoop platform is proposed. First, a large amount of speech is collected, a speech database is built, and effective features for speech classification are extracted. Then, on the Hadoop platform, the Map function splits the big data speech classification problem into sub-problems, which multiple nodes solve in parallel and in a distributed manner to obtain the corresponding classification results. Finally, the Reduce function combines the sub-problem results, so as to meet the online requirements of big data speech classification.

As shown in Figure 4, the specific steps of the big data speech classification flow based on the Hadoop platform are as follows:

(1) The Client submits a speech classification task to the Job Tracker of the Hadoop platform, and the Job Tracker copies the speech feature data to the local distributed file processing system;

(2) The speech classification task is initialized and placed in the task queue, and the Job Tracker assigns tasks to the corresponding nodes, i.e. the Task Trackers, according to the processing capacity of each node;

(3) Each Task Tracker, according to its assigned task, uses a support vector machine to fit the relationship between the speech features to be classified and the speech feature library, and obtains the category of each speech item;

(4) The category of each speech item is saved as a Key/Value pair to the local disk;

(5) Intermediate classification results with the same Key/Value are merged, the merged result is passed to Reduce for processing to obtain the speech classification result, and the result is written to the distributed file processing system;

(6) The Job Tracker clears the task state, and the user retrieves the speech classification result from the distributed file processing system.
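The map, merge and reduce stages in steps (3) to (5) can be sketched in a single process. This is not Hadoop code and not the patent's implementation: it only mimics the data flow in plain Python. A nearest-centroid rule stands in for the support vector machine so that the example needs no third-party library, and the centroids, class labels, utterance IDs, and feature values are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical class centroids standing in for a trained classifier.
# The patent specifies a support vector machine; nearest-centroid is used
# here only to keep the sketch dependency-free.
CENTROIDS = {"class_a": (0.0, 0.0), "class_b": (1.0, 1.0)}


def classify(features):
    """Return the label of the centroid closest to the feature vector."""
    return min(
        CENTROIDS,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(features, CENTROIDS[c])),
    )


def map_phase(split):
    """Map: classify each (utterance_id, features) record in one split
    and emit (label, utterance_id) key/value pairs."""
    return [(classify(features), utt_id) for utt_id, features in split]


def shuffle(pairs):
    """Merge intermediate pairs that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all utterances assigned to one label."""
    return key, sorted(values)


def run_job(records, n_splits=2):
    """Split the input, run map on each split, then shuffle and reduce."""
    splits = [records[i::n_splits] for i in range(n_splits)]
    intermediate = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


# Made-up feature vectors, for illustration only.
records = [("u1", (0.1, 0.2)), ("u2", (0.9, 0.8)), ("u3", (0.0, 0.1))]
result = run_job(records)
```

On a real cluster each split would be processed by a different Task Tracker and the intermediate pairs would be written to local disk before the merge; here the splits are simply processed one after another.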

The invention has been described above by way of example with reference to the drawings. Its implementation is clearly not limited to the manner described: any improvement that adopts the method concept and technical solution of the invention, and any direct application of that concept and technical solution to other settings without improvement, falls within the scope of protection of the invention.

Claims (9)

1. A big data speech classification method based on the Hadoop platform, comprising the following steps:
Step 1: building a speech library;
Step 2: on the basis of the speech library and the Hadoop platform, using the Map function to split the big data speech classification problem into sub-problems, and solving the sub-problems in parallel and in a distributed manner across multiple nodes to obtain the corresponding speech classification results;
Step 3: finally using the Reduce function to combine the speech classification results of the sub-problems, so as to meet the online requirements of big data speech classification.

2. The big data speech classification method based on the Hadoop platform according to claim 1, characterized in that step 2 comprises the following:
(1) the Client submits a speech classification task to the Job Tracker of the Hadoop platform, and the Job Tracker copies the speech feature data to the local distributed file processing system;
(2) the speech classification task is initialized and placed in the task queue, and the Job Tracker assigns tasks to the corresponding nodes, i.e. the Task Trackers, according to the processing capacity of each node;
(3) each Task Tracker, according to its assigned task, uses a support vector machine to fit the relationship between the speech features to be classified and the speech feature library, and obtains the category of each speech item;
(4) the category of each speech item is saved as a Key/Value pair to the local disk;
(5) intermediate classification results with the same Key/Value are merged, the merged result is passed to Reduce for processing to obtain the speech classification result, and the result is written to the distributed file processing system;
(6) the Job Tracker clears the task state, and the user retrieves the speech classification result from the distributed file processing system.

3. The big data speech classification method based on the Hadoop platform according to claim 2, characterized in that step 1, building the speech library, comprises the following steps: step 11, design of the pronunciation text; step 12, speech recording; step 13, annotation of the speech files; step 14, acoustic parameter analysis of the speech files; step 15, establishment of the database management system.

4. The big data speech classification method based on the Hadoop platform according to claim 3, characterized in that the design of the pronunciation text in step 11 includes selection of the pronunciation text, and the selection principles for the corpus of the pronunciation text include one or more of the following:
a. the single characters in the corpus should cover all initial, final and tone phenomena as far as possible, so that they reflect the phonological features of different patients' speech well and conveniently;
b. the vocabulary in the corpus is based on the common word lists used in Chinese language surveys, so it can be compared easily with Mandarin;
c. the sentences in the corpus are obtained mainly from conversations with patients on several related topics, so they are closer to the real conditions that speech recognition faces;
d. the sentences in the corpus are complete in content and meaning, so they reflect the prosodic information of a sentence as fully as possible;
e. triphones are selected without classification, which effectively addresses the problem of sparse training data.

5. The big data speech classification method based on the Hadoop platform according to claim 4, characterized in that the design of the pronunciation text in step 11 further includes compilation of the pronunciation text, and the compilation principles include one or more of the following:
a. single-character part: common characters for the initials, finals and tones listed in the survey character table serve as the main recording material for the speech library;
b. vocabulary part: starting from, but not limited to, a 4,000-word list, relevant words are recorded according to existing conclusions about the phonology concerned, aiming to reflect its phonetic features comprehensively, both segmental and suprasegmental; for particularly distinctive phonetic phenomena, example words may be added to reflect them;
c. sentence part: the amount of material is decided by each speaker's command of the language, and the selection should be as broad as possible while remaining representative;
d. natural dialogue part: with daily life as the topic, 20 to 40 minutes of speech is recorded from each speaker through question answering and free conversation; where everyday spoken vocabulary differs from the Mandarin expression, the speaker is asked to say it in dialect.

6. The big data speech classification method based on the Hadoop platform according to claim 5, characterized in that the speech recording in step 12 includes selection of speakers, the speakers being chosen from native speakers who articulate clearly, speak at a moderate pace, are fluent in the local language, and are willing to cooperate actively with the survey, whose language environment is relatively stable, and who have a level of education;
and/or the speech recording further includes speech collection by means of a speech collector, the speech collection using two modes: one is reading aloud from a prompt text, the prompt being written material in Chinese which the speaker renders in his or her native language and reads aloud; the other is natural speech, in which the speaker uses prompts to tell folk stories, describe the living conditions of his or her ethnic group, and hum local folk songs.

7. The big data speech classification method based on the Hadoop platform according to claim 6, characterized in that the acoustic parameter analysis of the speech files in step 14 includes speech annotation of the speech library, the basic speech annotation including segmentation and alignment of the initials and finals of each syllable and annotation of initials, finals and tones, in two parts:
the first part is text annotation, namely Chinese characters plus pinyin, i.e. an orthographic and phonetic transcription, in which the speech content is written down in Chinese characters so that it can be supplied to a recognition system and can also serve as material for linguistic research; text annotation must record the basic textual information and any paralinguistic phenomena, and in the basic annotation paralinguistic phenomena may be represented with common paralinguistic symbols;
the second part is syllable annotation, in which Mandarin syllables are annotated with standard Mandarin syllable labels that carry tone; in the tone labels, 0 denotes the neutral tone, 1 yinping, 2 yangping, 3 shangsheng, and 4 qusheng.

8. The big data speech classification method based on the Hadoop platform according to claim 7, characterized in that the acoustic parameter analysis of the speech files in step 14 further includes extraction of acoustic parameters:
the recorded speech is first segmented and its silent stretches are removed, so that the objects of analysis are single characters or words, phrases, sentences, and dialogues; the start and end of the speech signal are then determined in the waveform data and the speech is annotated; finally, the fundamental frequency and formant acoustic analysis data are obtained using the autocorrelation algorithm.

9. The big data speech classification method based on the Hadoop platform according to claim 8, characterized in that the establishment of the database management system in step 15 includes selection of the database, an SQL database management system that is relatively easy to implement being chosen;
and in that four kinds of material are stored in establishing the database management system in step 15: first, speaker attributes; second, pronunciation text, namely the patient's prompt material together with the corresponding pronunciation, Mandarin pronunciation in the International Phonetic Alphabet, and similar text; third, the actual speech data, namely the raw parameters of the recorded speech waveforms; fourth, acoustic analysis data, i.e. the acoustic parameters extracted from the processed waveforms.
CN202010395559.4A 2020-05-12 2020-05-12 Big data voice classification method based on Hadoop platform Active CN111583914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010395559.4A CN111583914B (en) 2020-05-12 2020-05-12 Big data voice classification method based on Hadoop platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010395559.4A CN111583914B (en) 2020-05-12 2020-05-12 Big data voice classification method based on Hadoop platform

Publications (2)

Publication Number Publication Date
CN111583914A CN111583914A (en) 2020-08-25
CN111583914B true CN111583914B (en) 2023-03-28

Family

ID=72112583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010395559.4A Active CN111583914B (en) 2020-05-12 2020-05-12 Big data voice classification method based on Hadoop platform

Country Status (1)

Country Link
CN (1) CN111583914B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067520A (en) * 1995-12-29 2000-05-23 Lee And Li System and method of recognizing continuous mandarin speech utilizing chinese hidden markou models
CN104538036A (en) * 2015-01-20 2015-04-22 浙江大学 Speaker recognition method based on semantic cell mixing model
CN105261246A (en) * 2015-12-02 2016-01-20 武汉慧人信息科技有限公司 Spoken English error correcting system based on big data mining technology
CN106128450A (en) * 2016-08-31 2016-11-16 西北师范大学 The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
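Chen Xiaoying; Chen Chen; Hua Kan; Yu Hongzhi. 语音语料库的设计研究 [Research on the design of speech corpora]. 科技信息 (Science & Technology Information), 2008, (36), full text. *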

Also Published As

Publication number Publication date
CN111583914A (en) 2020-08-25

Similar Documents

Publication Publication Date Title
Hua Phonological development in specific contexts: Studies of Chinese-speaking children
Jing et al. Prominence features: Effective emotional features for speech emotion recognition
CN112599119B (en) Method for establishing and analyzing mobility dysarthria voice library in big data background
Mohanta et al. Analysis and classification of speech sounds of children with autism spectrum disorder using acoustic features
CN109841231B (en) Early AD (AD) speech auxiliary screening system for Chinese mandarin
Jessen Speaker classification in forensic phonetics and acoustics
Stasak et al. An investigation of linguistic stress and articulatory vowel characteristics for automatic depression classification
CN109727608A (en) A kind of ill voice appraisal procedure based on Chinese speech
Stelma et al. Intonation units in spoken interaction: Developing transcription skills
Wells et al. Prosodic impairments
Cox Eriksson Children's vocabulary development: the role of parental input, vocabulary composition and early communicative skills
Mieskes et al. Preparing data from psychotherapy for natural language processing
Zhang et al. Adolescent depression detection model based on multimodal data of interview audio and text
Green et al. Range in the use and realization of BIN in African American English
Wehrle et al. Filled pauses produced by autistic adults differ in prosodic realisation, but not rate or lexical type
Porretta et al. Perception of non-native consonant length contrast: The role of attention in phonetic processing
CN114916921A (en) A rapid speech cognitive assessment method and device
Holt et al. F0 declination and reset in read speech of African American and White American women
CN111583914B (en) Big data voice classification method based on Hadoop platform
Alsulaiman Arabic fluency assessment: Procedures for assessing stuttering in arabic preschool children
Lai et al. Intonation and voice quality of Northern Appalachian English: A first look
Nunes Whispered speech segmentation based on Deep Learning
Wayland et al. Lenition in L2 Spanish: The Impact of Study Abroad on Phonological Acquisition
Huang et al. An Experimental Study on Declarative and Interrogative Sentences in Shanghai Chinese
Jiang How does modification affect the processing of formulaic language? Evidence from L1 and L2 speakers of Chinese

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant