
CN113611287A - Pronunciation error correction method and system based on machine learning - Google Patents

Pronunciation error correction method and system based on machine learning

Info

Publication number
CN113611287A
CN113611287A
Authority
CN
China
Prior art keywords
signal
pronunciation
ultrasonic
ultrasonic reflection
frequency
Prior art date
Legal status
Granted
Application number
CN202110730385.7A
Other languages
Chinese (zh)
Other versions
CN113611287B (en)
Inventor
伍楷舜
王璐
王泰华
涂栋亮
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202110730385.7A
Publication of CN113611287A
Application granted
Publication of CN113611287B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01H MEASUREMENT OF MECHANICAL VIBRATIONS OR ULTRASONIC, SONIC OR INFRASONIC WAVES
    • G01H13/00 Measuring resonant frequency
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a pronunciation error correction method and system based on machine learning. The method includes: generating ultrasonic waves through a speaker built into an electronic device, and using a microphone as the signal receiving end to receive the user's pronunciation signal together with the ultrasonic reflection signal; denoising the user's pronunciation signal and the ultrasonic reflection signal to extract an effective pronunciation signal and the corresponding effective ultrasonic reflection signal; performing peak normalization uniformly on the effective pronunciation signal and the effective ultrasonic reflection signal; segmenting the normalized ultrasonic reflection signal into time windows, calculating the formant frequencies of the pronunciation signal, and calculating the frequency offset of the ultrasonic reflection signal based on the Doppler effect; and inputting the time series of formant frequency values together with the corresponding frequency offset values of the ultrasonic reflection signal into a trained machine learning model to identify the speech phoneme classification. The invention can accurately recognize the pronunciation of multiple languages and assist pronunciation training, and is broadly applicable.

Description

Pronunciation error correction method and system based on machine learning
Technical Field
The invention relates to the technical field of machine learning, in particular to a pronunciation error correction method and system based on machine learning.
Background
In recent years, the demand for multilingual talent across professional fields has kept increasing, and the foreign language learning market has grown substantially worldwide. However, in foreign language education, pronunciation learning is more challenging than learning other language skills. In actual teaching, because pronunciation teaching resources are scarce and class time is limited, many teachers find it difficult to teach students correct pronunciation, so students are usually left to practice pronunciation on their own. Currently, automatic speech recognition software has been applied to computer-aided pronunciation training to facilitate pronunciation learning, but its accuracy still needs continuous improvement. In language assessment, automatic speech recognition software has been employed with success. For example, exams such as the TOEFL have replaced assessment by human listeners with computer-based testing, and some educational service providers score a test taker's pronunciation and accent using an automatic speech recognition system known as a "speech scorer".
However, automatic speech recognition systems also present problems and challenges in language teaching. Their development tends to focus on evaluating the intonation and fluency of an utterance, and the process of adjusting one's pronunciation through self-correction with an automatic speech recognition system often does not achieve the desired effect. For example, many non-native speakers do not understand the specific use of intonation and instead blindly imitate the intonation they hear, which misguides their pronunciation learning. In recent years, automatic speech recognition systems have attracted researchers from different disciplines, including computer science, linguistics, speech recognition, signal processing, psycholinguistics, education, and hearing and pronunciation studies. In practice, however, most automatic speech recognition systems focus on pronunciation error detection and evaluation, and attend only to the intonation and fluency of the pronunciation rather than the pronunciation itself. When learners realize from a score that their pronunciation is incorrect, neither the score nor the speech signal tells them how to adjust toward the correct pronunciation.
In general, existing automatic speech recognition systems focus primarily on producing scores rather than encouraging language learners to learn by themselves. As a result, most second-language learners are inadvertently misled by pronunciation training applications. For example, learners attempt to achieve the highest score by adjusting their pitch or intonation (the prosodic level) to approximate the examples, rather than adjusting their pronunciation according to pronunciation rules (the phonetic level). Despite the many drawbacks of automated voice scoring, there is no authoritative assessment of whether such pronunciation systems are used correctly, and existing pronunciation training applications on the market mainly address two aspects: how to pronounce phonemes, words, or sentences; and assessing the learner's pronunciation quality with speech recognition techniques to flag incorrect phonemes. In other words, methods based on pronunciation error detection scores cannot diagnose pronunciation errors or identify a specific error pattern.
In addition, second-language (non-native) learners typically substitute the foreign phoneme with the closest phoneme of their native (first) language. This leads to phoneme replacement, insertion, and deletion, which is explained by the transfer model in linguistics. It puts language learners at a great disadvantage: the user cannot learn the foreign language effectively. For example, English speakers learning Mandarin have trouble with the vowel /y/ because English has no such vowel, so they often produce /u/ instead of /y/, repeatedly substituting the syllable rather than actively correcting the error. Since the formant frequencies of /u/ and /y/ lie in the same frequency band, the user's mispronunciation cannot be distinguished by a language model that only recognizes the acoustic signal.
Disclosure of Invention
The present invention is directed to overcoming the above-mentioned drawbacks of the prior art and providing a pronunciation error correction method and system based on machine learning.
According to a first aspect of the present invention, a pronunciation error correction method based on machine learning is provided. The method comprises the following steps:
step S1, generating ultrasonic waves through a speaker built into the electronic device, and receiving a pronunciation signal from the user and an ultrasonic reflection signal, which is the sound signal of the user's lips reflected by the ultrasonic waves, using a microphone as the signal receiving end;
step S2, respectively carrying out denoising processing on the received user pronunciation signal and the ultrasonic reflection signal to extract an effective pronunciation signal and a corresponding effective ultrasonic reflection signal;
step S3, uniformly performing peak value normalization processing on the effective pronunciation signal and the effective ultrasonic reflection signal;
step S4, performing time window segmentation on the normalized ultrasonic reflection signal, calculating the formant frequency of the sounding signal, and calculating the frequency deviation value of the ultrasonic reflection signal based on the Doppler effect;
step S5, inputting the formant frequency value in a time sequence and the frequency offset value of the corresponding ultrasonic reflection signal into the trained machine learning model, and identifying the phonetic phoneme classification.
According to a second aspect of the present invention, a machine learning based pronunciation correction system is provided. The system comprises:
a signal receiving unit for generating ultrasonic waves through a speaker built in the electronic device and receiving a vocal signal from a user and an ultrasonic reflection signal, which is a sound signal of a user's lips reflected by the ultrasonic waves, using a microphone as a signal receiving end;
the preprocessing unit is used for respectively carrying out denoising processing on the received user pronunciation signal and the received ultrasonic reflection signal so as to extract an effective pronunciation signal and a corresponding effective ultrasonic reflection signal;
a normalization processing unit for performing peak normalization processing on the effective sound signal and the effective ultrasonic reflection signal in a unified manner;
a feature extraction unit for performing time window segmentation on the normalized ultrasonic reflection signal, calculating a formant frequency of the sounding signal, and calculating a frequency offset value of the ultrasonic reflection signal based on a doppler effect;
and the classification and identification unit is used for inputting the formant frequency value in a time sequence and the frequency deviation value of the corresponding ultrasonic reflection signal into the trained machine learning model and identifying the phonetic phoneme classification.
Compared with the prior art, the invention uses the change of the voice formants during pronunciation and the Doppler effect produced by the reflected ultrasonic signal to judge, respectively, the position of the tongue and the movement of the lips, and thus provides a more accurate pronunciation recognition scheme. In addition, the invention can run on commercial electronic devices (such as a smartphone), requires no additional equipment, and reduces application cost.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram of a method for machine learning-based pronunciation error correction according to one embodiment of the present invention;
FIG. 2 is a process diagram of a machine learning based pronunciation error correction method according to one embodiment of the invention;
FIG. 3 is a block diagram of a framework of a machine learning based pronunciation correction system according to an embodiment of the present invention;
FIG. 4 is an experimental layout according to one embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The invention aims to provide a method for speech evaluation and intelligent pronunciation correction using common electronic devices, which can identify pronunciation, in particular monophthong vowels, according to the phonetic rules of tongue position and lip shape. The pronunciation recognition and analysis system can run on commercial electronic devices, including but not limited to smartphones, tablets, desktop computers, personal digital assistants (PDAs), vehicle-mounted devices, and smart wearable devices. For clarity, the following description takes a smartphone as the device and a user pronouncing vowels as the example.
Specifically, as shown in fig. 1 and fig. 2, the provided pronunciation error correction method based on machine learning includes the following steps.
In step S100, ultrasonic waves are generated by a speaker built into the smartphone, and a microphone is used as the signal receiving end to receive the sound signal and the ultrasonic reflection signal from the user.
The sound signal is the sound directly produced when the user reads the speech material; the ultrasonic reflection signal is the ultrasonic wave reflected by the user's lip or facial movement; the microphone receives a mixture of the two. In practical application, the smartphone can be placed in front of the user and generate a 20 kHz ultrasonic wave through the built-in loudspeaker, as sketched below.
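As a minimal illustration of this sensing step, the following sketch generates a 20 kHz probe tone and records the microphone at 44 kHz. It assumes the third-party sounddevice library for simultaneous playback and recording; the library choice, tone amplitude, and recording duration are not specified by the patent and are illustrative only.

```python
# Illustrative sketch (not from the patent): play a 20 kHz probe tone through
# the speaker while recording the microphone, which picks up both the user's
# voice and the reflected ultrasound. "sounddevice" is an assumed dependency.
import numpy as np
import sounddevice as sd

FS = 44_000          # sampling rate mentioned in the description (44 kHz)
TONE_FREQ = 20_000   # ultrasonic carrier frequency (20 kHz)
DURATION = 3.0       # seconds per recording session (illustrative)

t = np.arange(int(FS * DURATION)) / FS
probe_tone = 0.5 * np.sin(2 * np.pi * TONE_FREQ * t)   # sinusoidal ultrasonic tone

mixed = sd.playrec(probe_tone.astype(np.float32), samplerate=FS, channels=1)
sd.wait()            # block until playback/recording completes
mixed = mixed[:, 0]  # mixed signal: pronunciation + ultrasonic reflection
```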
In one embodiment, step S100 includes the following sub-steps:
step S101, collecting sound signals (including user pronunciation signals and emission ultrasonic signals) of a microphone of the smart phone;
and step S102, carrying out data processing on the collected sound signals, and removing the interference of high-frequency environmental noise and electromagnetic noise.
For example, denoising is performed using a Butterworth low-pass filter with the cutoff frequency set to 5500 Hz, and the microphone data sampling rate is set to 44 kHz.
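A minimal sketch of this denoising step, assuming SciPy's filter design routines and applying the filter to the voice-band path of the mixed signal; the filter order is an assumption, while the cutoff and sampling rate follow the text.

```python
# Hedged sketch of the Butterworth low-pass denoising described above.
# The filter order (4) is assumed; cutoff and sampling rate follow the text.
from scipy.signal import butter, filtfilt

FS = 44_000  # microphone sampling rate (Hz)

def denoise_voice_band(signal, cutoff_hz=5500.0, fs=FS, order=4):
    """Suppress high-frequency environmental/electromagnetic noise above the cutoff."""
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, signal)  # zero-phase filtering avoids phase distortion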
Step S200, performing preliminary extraction and preprocessing on the received signal.
In one embodiment, the following substeps may be performed using a smartphone:
step S201, removing high-frequency and low-frequency signals in background human voice sound signals and reflected ultrasonic waves by using a notch filter;
step S202, removing noise in the sound signal by using a least mean square filter;
in step S203, a threshold method is used to detect the start point and the end point of the utterance in the sound signal.
For example, in signal detection, whether the signal exceeds a set threshold is judged over time; when the signal keeps exceeding the threshold for a period t (t greater than 1 s), it is regarded as a valid signal to be extracted rather than noise, and is further cut out. The signals at the cutting start point and end point are extracted. Specifically, the start point of the signal is determined first: it is chosen, for example, M1 sampling points before the first point exceeding the threshold, where M1 is roughly the product of the sampling rate and the duration of one utterance. The end point of the utterance signal is then determined, for example, as the data point after the first point that falls below the threshold following M1 consecutive sampling points of the signal.
In step S200, the notch filtering applied to the human voice signal retains, for example, 100 to 5500 Hz, the notch filtering applied to the reflected ultrasonic signal retains, for example, 19.90 to 21.10 kHz, and the sampling time of the least-mean-square filter is set to, for example, 10 ms. A sketch of this preprocessing is given below.
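The sketch below illustrates step S200 under stated assumptions: the filtering that keeps 100-5500 Hz (voice) and 19.90-21.10 kHz (ultrasonic reflection) is realized as Butterworth band-pass filters, the LMS stage is omitted, and the threshold test is simplified; the hold time t and the M1 rule follow the description.

```python
# Hedged sketch of step S200: band separation plus threshold-based endpoint
# detection. Band edges, hold time and the M1 rule follow the description;
# filter order and the simplified "above threshold" test are assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 44_000

def band_filter(signal, lo_hz, hi_hz, fs=FS, order=4):
    sos = butter(order, [lo_hz / (fs / 2), hi_hz / (fs / 2)], btype="band", output="sos")
    return sosfiltfilt(sos, signal)

def split_bands(mixed):
    voice = band_filter(mixed, 100.0, 5500.0)        # user's pronunciation band
    ultra = band_filter(mixed, 19_900.0, 21_100.0)   # reflected-ultrasound band
    return voice, ultra

def extract_valid_segment(signal, threshold, hold_s=1.0, utter_s=1.0, fs=FS):
    """Keep the segment only if the signal stays above the threshold for more
    than hold_s seconds; start M1 samples before the first crossing, end just
    after the last point above the threshold."""
    idx = np.flatnonzero(np.abs(signal) > threshold)
    if idx.size == 0 or (idx[-1] - idx[0]) < hold_s * fs:
        return None                          # treated as noise, nothing extracted
    m1 = int(utter_s * fs)                   # M1 ~ sampling rate x one utterance time
    start = max(idx[0] - m1, 0)
    return signal[start:idx[-1] + 1]
```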
And step S300, carrying out normalization processing for eliminating interference generated by the placement distance of the smart phone in the signal transmission process.
In one embodiment, step S300 includes the following sub-steps:
step S301, searching peak values of the sound signals of the training set in all the signals, and carrying out peak value normalization processing on all the received sound signals;
step S302, in the test sample, processing the peak data exceeding the training set sample with the signal value being 1.
The normalization processing is carried out on the received mixed sound signal (comprising the pronunciation signal and the ultrasonic reflection signal), so that the interference caused by the distance difference between the smart phone and the lips of the user can be eliminated, and the accuracy of the subsequent pronunciation phoneme recognition is improved. For example, the normalization process formula is expressed as:
y = (x - x_min) / (x_max - x_min)
where x and y are the signal before and after normalization respectively, x_min is the minimum peak before normalization, and x_max is the maximum peak before normalization.
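A minimal sketch of this peak normalization, with the training-set peaks supplied as arguments; the clipping of test values that exceed the training peaks implements step S302.

```python
# Hedged sketch of the min-max peak normalization (formula above).
import numpy as np

def peak_normalize(x, x_min, x_max):
    """Map the signal into [0, 1] using the training-set minimum/maximum peaks."""
    y = (x - x_min) / (x_max - x_min)
    return np.clip(y, 0.0, 1.0)   # peaks beyond the training-set peaks are treated as 1

# The peaks are found once over all training recordings, e.g.:
# x_min, x_max = float(train_signals.min()), float(train_signals.max())
```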
And step S400, performing feature extraction on the normalized ultrasonic reflection signal, and cutting a time window to obtain the frequency deviation of the resonance peak of the ultrasonic reflection signal and the frequency deviation of the ultrasonic reflection signal during pronunciation.
In one embodiment, step S400 includes the following sub-steps:
In step S401, the signal is processed with the Fourier transform, only the frequency amplitude information between 100 and 5500 Hz is extracted from the resulting spectrum, and the formant frequencies of the sound signal are obtained using linear predictive coding;
in step S402, the segmented data is divided into time windows, and the frequency offset of the ultrasonic wave emitted by the smartphone and reflected by the user, in the band around 20 kHz, is detected through the Fourier transform by exploiting the Doppler effect.
Specifically, the actual speech sample values can be predicted by determining a set of coefficients of the predictor by using linear predictive coding, and the formant frequency at the moment can be obtained by calculating the coefficients of the predictor, so that the position of the tongue and the phoneme of the pronunciation at the moment are mapped. The basic idea of linear prediction is that there is a correlation between speech sample values, so that the current speech sample value can be approximated by combining several past speech sample values. A unique set of prediction coefficients is determined by minimizing the sum of the squares of the differences between the actual speech sample values and the predicted values, i.e., minimizing the mean square error.
For example, the Levinson-Durbin autocorrelation method is used: the autocorrelation function of each utterance signal is computed and the predictor coefficients are obtained by the recursive algorithm.
After the prediction coefficients are obtained, the polynomial coefficients of the linear prediction filter are decomposed, e.g., the complex polynomial roots θ_i are found by eigenvalue decomposition, and the formant frequency f_i is calculated from the complex root and the sampling period value T, expressed as:
f_i = arg(θ_i) / (2πT)
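A sketch of this formant computation under stated assumptions: the predictor coefficients come from a hand-written Levinson-Durbin recursion over the frame autocorrelation, the polynomial roots are found with numpy.roots (which uses a companion-matrix eigenvalue decomposition), and each retained root's angle gives f_i = arg(θ_i)/(2πT). The LPC order, the Hamming window, and the low-frequency floor are illustrative choices not fixed by the patent.

```python
# Hedged sketch of LPC-based formant estimation for one frame of the voice band.
import numpy as np

def lpc_coeffs(frame, order=12):
    """Levinson-Durbin recursion on the frame autocorrelation (the recursive
    algorithm mentioned in the description)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        new_a = a.copy()
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        new_a[i] = k
        a, err = new_a, err * (1.0 - k * k)
    return a

def formant_frequencies(frame, fs=44_000, order=12):
    a = lpc_coeffs(frame * np.hamming(len(frame)), order)
    roots = np.roots(a)                   # companion-matrix eigenvalue decomposition
    roots = roots[np.imag(roots) > 0.0]   # keep one root per conjugate pair
    T = 1.0 / fs
    freqs = np.angle(roots) / (2.0 * np.pi * T)   # f_i = arg(theta_i) / (2*pi*T)
    return np.sort(freqs[freqs > 90.0])   # drop near-DC roots (illustrative floor)
```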
by the resonant peak frequency fiMapping the lip position of the mouth roughly determines the kind of phoneme pronounced at this time, but still cannot be completely separated by this method due to the similar vowel phonemes, e.g./u/and/y/with similar formant frequencies. The underlying reason is that the tongue position is substantially the same when the vowel is emitted. Therefore, in order to solve this problem, the present invention has been madeOne step is to detect the motion of the lips by measuring the lips using doppler shift.
The Doppler effect is the phenomenon whereby, when the transmitting wave source and the receiving target move relative to each other, the frequency of the wave received by the observer deviates from the frequency emitted by the source.
In one embodiment, a short-time Fourier transform is used to capture the signal at frequencies around 20 kHz that is emitted by the handset speaker and reflected back to the microphone; backward movement of the lips produces a negative frequency shift, while forward movement of the lips produces a positive frequency shift. For example, when unrounded vowels such as /i/ and /e/ are pronounced, the lips are spread and generally retracted, the distance to the receiving end increases, and the Doppler shift appears as the center frequency of the reflected wave moving toward a lower band. When rounded vowels such as /o/ and /u/ are pronounced, the lips are rounded and tend to move toward the receiving end, and the Doppler shift appears as the center frequency of the reflected wave moving toward a higher band.
In step S400, the signal formant frequencies are calculated by linear predictive coding, and the frequency shift of the reflected ultrasonic wave caused by the Doppler effect of the lip motion is calculated in the frequency domain by the Fourier transform. The frequency shift value reflects both the direction of the shift and the degree of frequency change, so the positions of the lips and the tongue can be captured according to the pronunciation rules.
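A sketch of the Doppler-shift feature, assuming a short-time Fourier transform of the ultrasonic band and taking, per time window, the offset of the strongest bin from the 20 kHz carrier; the window length and hop size are illustrative.

```python
# Hedged sketch of step S400's Doppler feature: per-window offset of the
# reflected carrier from 20 kHz. Window/hop values are assumptions.
import numpy as np
from scipy.signal import stft

FS = 44_000
CARRIER = 20_000.0

def doppler_offsets(ultra_signal, fs=FS, win=2048, hop=512):
    """Return the per-window frequency offset (Hz) of the reflected carrier.
    Negative values: lips moving away; positive values: lips moving closer."""
    f, t, Z = stft(ultra_signal, fs=fs, nperseg=win, noverlap=win - hop)
    band = (f >= 19_900) & (f <= 21_100)          # ultrasonic band of interest
    fb, Zb = f[band], np.abs(Z[band, :])
    peak_freq = fb[np.argmax(Zb, axis=0)]         # strongest bin per window
    return peak_freq - CARRIER
```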
And step S500, inputting the extracted features into a machine learning model to perform pronunciation phoneme classification detection.
In this step, the extracted features, including the formant frequency value and the frequency offset value of the reflected ultrasonic wave at the time of pronunciation, are input into the machine learning model, and the pronunciation phoneme type can be identified. For example, features filtered in a time sequence are input into a support vector machine for classification, and other machine learning models, such as convolutional neural networks, can also be used.
In one embodiment, step S500 includes the following sub-steps:
step S501, inputting the processed time sequence formant frequency information into a support vector machine based on a classification method of the support vector machine;
In step S502, the support vector machine has a training state and a testing state. In the training state, the hyperplane parameters established by the support vector machine are trained according to the labels supplied with the data; in the testing state, the predicted label is returned according to the support vector machine classification. When the support vector machine model is trained, the input features are the formant frequencies, in each divided time window, of the sound signal processed through S100-S400, and the output classification labels are the vowel phonemes identified by the classification, i.e., the vowel phoneme types.
In the above embodiment, a support vector machine model is preferably used to classify the vowel pronunciation signal. The support vector machine is a suitable single classifier: its kernel maps the low-dimensional input space into a higher-dimensional space, turning an inseparable problem into a separable one by adding dimensions; a Gaussian kernel function is employed here for that reason. In this way, vowels with the same tongue position or the same lip shape can be distinguished through the chosen formants and the determination of the Doppler shift.
In summary, in this step S500, correct or incorrect pronunciation of the user is detected by using the support vector machine model, and pronunciation feedback is provided to let the user know the position of the mouth and tongue when pronouncing.
It should be understood that, by training the support vector machine with the training data set and evaluating the training effect of the model with the testing data set, an optimized support vector machine model can be obtained for phoneme type recognition and analysis during actual pronunciation. The training data set reflects the corresponding relationship between the formant frequency value in time sequence, the frequency deviation value of the reflected ultrasonic wave during pronunciation and the pronunciation phoneme type.
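A minimal sketch of the classification step, assuming scikit-learn's SVC as the support-vector-machine implementation with a Gaussian (RBF) kernel. The feature layout (per-window formant frequencies concatenated with Doppler offsets), the placeholder arrays, and the hyperparameters are illustrative assumptions.

```python
# Hedged sketch of step S500: RBF-kernel SVM over per-utterance feature vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: one row per utterance, concatenating the time-ordered formant frequencies
#    and Doppler offsets of its windows; y: vowel phoneme labels, e.g. "u", "y".
X_train = np.random.rand(40, 20)            # placeholder training features
y_train = np.array(["u", "y"] * 20)         # placeholder labels

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
model.fit(X_train, y_train)

X_test = np.random.rand(5, 20)              # features of new utterances
print(model.predict(X_test))                # predicted vowel phoneme per utterance
```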
Step S600, the detection result and the details which should be noticed when pronouncing are fed back to the user.
After the detection result is obtained, the user can be informed through the smartphone of the details to pay attention to when pronouncing. For example, step S600 includes the following sub-steps:
step S601, displaying the classified pronunciation result according to the result returned in step S500;
step S602, judging the position of the lips and the tongue of the user according to the change of the formant frequency information in the time window analyzed in the step S500, and displaying the change on the smart phone;
in step S603, the change of the classification result and the resonance information reflects details that the user needs to pay attention to when pronouncing, for example, the tongue should be moved up or down when pronouncing, or the lips should be moved forward or backward to provide auxiliary training.
For example, vowel pronunciations are collected from speakers of different nationalities and languages; through data analysis and classification training, correction can be achieved for vowel pronunciations in multiple languages, and the analysis results can be fed back to the user.
Correspondingly, the invention also provides a pronunciation error correction system based on machine learning, which is used for realizing one or more aspects of the method, wherein the involved modules or units can be realized by adopting a special processor or an FPGA (field programmable gate array). For example, referring to FIG. 3, the system includes a sensing and processing module, a computing module, and a feedback module.
The sensing and processing module performs the following processing: when the user speaks, ultrasonic waves are generated by the smartphone speaker and the returned acoustic signal is received at the smartphone microphone. The acoustic signal is normalized, and the original acoustic signal is processed with a Butterworth band-pass filter, a notch filter, and a least-mean-square filter; the formant frequencies are calculated with linear predictive coding, and the Doppler shift of the ultrasonic reflection signal is calculated with the Fourier transform. Within one acquisition period, the acoustic signal is cut into a number of time units, the formant frequency of each unit is calculated, and the movement of the user's lips and tongue during pronunciation is judged using the Doppler effect.
The calculation module performs the following processing: the formant frequencies of several time windows within one sampling are calculated, the formant frequencies of these windows are used to train the support vector machine classification model, and the result is fed back to the machine learning model. After the data of a number of training samples have been obtained, the model can perform classification prediction on the formant frequency data of the time windows.
the feedback module is used for executing the following processes: the method feeds back the phoneme of the vowel spoken by the user, and meanwhile, the position of the tongue and the shape of the lips can be judged through the change of the formant frequency on the time window in the pronunciation process of the user, and the change is displayed visually. And, by suggesting this pronunciation of the sub-phoneme by reference to a standard library of formant frequencies in a time window, the user is told that the lip pronunciation should be expanded or closed.
In one embodiment, the sensing and processing module comprises the following elements:
a sound transmitting unit, which uses the loudspeaker to transmit a sinusoidal ultrasonic tone;
a sound collecting unit, which uses the microphone to collect the sound signal reflected from the user's face (vocal organs) and extracts the time-sliced signal by threshold judgment;
the normalization unit is used for eliminating the influence brought by the size of the vocalization of the user;
a filtering unit for filtering the sound signal data by using a filter;
the analog-digital unit is used for converting the analog sound signal into a digital signal;
an extraction unit calculates a formant frequency by using a linear predictive coding technique and calculates a Doppler shift by processing a signal of the ultrasonic wave reflection by Fourier transform.
In one embodiment, the calculation module comprises the following units:
a time-slice cutting unit for cutting the signal into time slices;
and the computing unit is used for carrying out classification prediction on the features in the divided units by utilizing the machine learning model.
In one embodiment, the feedback module comprises the following elements:
the error correction unit is used for displaying the pronunciation types of the vowel phonemes obtained by classifying and predicting the input features by using a machine learning model;
and a visualization unit for generating a feedback graph of the positions of the lips and tongue from the change of the time-series features over the segmented time windows, so that the user can identify pronunciation errors by comparison with a standard graph.
The invention focuses on the clarity of the user's pronunciation in the target language, and achieves assisted pronunciation training by judging and correcting vowel phonemes. It was verified that, in practical applications, the invention can recognize 30 vowels in four languages (French, Japanese, Korean, and Mandarin) from the positions of the lips and the tongue. As shown in Fig. 4, the invention can be built into a smartphone: when correcting pronunciation, the user reads the selected vowel phoneme to be trained; after reading, the user is told whether the pronunciation was correct and how to adjust the articulators to better produce the vowel phoneme of the language being learned.
In conclusion, the invention provides a method for pronunciation error correction based on machine learning, a novel pronunciation training mode. It uses the formants in the user's sound waves and, based on the Doppler effect, the reflected ultrasonic waves to detect the positions of the tongue and the lips respectively, improving detection accuracy. Processing the signals with a Butterworth band-pass filter, a notch filter, a least-mean-square filter, time slicing, normalization, and the Fourier transform significantly improves signal quality. The proposed method achieves a high recognition rate for vowel phoneme classification using a support vector machine. The invention also provides feedback of the detection results, further improving usability and functionality: it can identify the positions of the lips and tongue and can also support vowel pronunciation training.
Furthermore, conventional methods cannot classify rounded and unrounded vowels by analyzing the speech waveforms of /i/, /y/ and the like, because the distinction depends on the lip shape. To address this problem, the invention builds a rounded/unrounded vowel classifier, called the frequency-shift model, by analyzing the frequency-shift direction of the ultrasonic signal, and can thus extract more standard pronunciation features of vowel phonemes.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A pronunciation error correction method based on machine learning, comprising the following steps:
step S1, generating ultrasonic waves through a speaker built into an electronic device, and using a microphone as the signal receiving end to receive a pronunciation signal from a user and an ultrasonic reflection signal, the ultrasonic reflection signal being the sound signal of the user's lips reflected by the ultrasonic waves;
step S2, denoising the received user pronunciation signal and the ultrasonic reflection signal respectively, to extract an effective pronunciation signal and a corresponding effective ultrasonic reflection signal;
step S3, performing peak normalization uniformly on the effective pronunciation signal and the effective ultrasonic reflection signal;
step S4, performing time-window segmentation on the normalized ultrasonic reflection signal, calculating the formant frequency of the pronunciation signal, and calculating the frequency offset value of the ultrasonic reflection signal based on the Doppler effect;
step S5, inputting the formant frequency values in a time sequence and the corresponding frequency offset values of the ultrasonic reflection signal into a trained machine learning model to identify the speech phoneme classification.

2. The method according to claim 1, wherein step S2 comprises:
using a notch filter to remove the high-frequency and low-frequency components of the background sound signal and of the ultrasonic reflection signal respectively;
using a least-mean-square filter to remove noise from the sound signal;
detecting the start point and end point of the effective pronunciation signal by setting a threshold, the start point being set M1 sampling points before the first point exceeding the threshold, and the end point being set to the data point after the first point that falls below the threshold following M1 consecutive sampling points of the signal.

3. The method according to claim 1, wherein, in step S4, the formant frequency of the pronunciation signal is calculated using linear predictive coding, and the Fourier transform is used to analyze, in the frequency domain, the frequency offset value produced on the reflected ultrasonic wave by the Doppler effect of the lip motion.

4. The method according to claim 1, wherein the machine learning model is trained according to the following steps:
training the constructed machine learning model with a training set, the training set reflecting the correspondence between the time-series formant frequency values, the frequency offset values of the ultrasonic reflection signal, and the pronunciation phoneme types;
evaluating the trained machine learning model with a test set and outputting the classification labels of the pronunciation phonemes, the test set containing the time-series formant frequency values of the test samples and the frequency offset values of the corresponding ultrasonic reflection signals, wherein formant peak data exceeding that of the training-set samples is treated as a signal value of 1.

5. The method according to claim 1, further comprising:
displaying the recognized speech phoneme classification to the user;
determining the positions of the user's lips and tongue according to the changes of the formant frequency information and the Doppler frequency offset in the time window, and displaying the changes of the lip and tongue position information during pronunciation as feedback for assisted pronunciation training.

6. The method according to claim 1, wherein the machine learning model is a support vector machine.

7. The method according to claim 3, wherein the formant frequencies are calculated according to the following steps:
calculating the autocorrelation function of each pronunciation signal with a recursive algorithm to obtain the coefficients of the predictor;
decomposing the polynomial coefficients of the linear prediction filter, finding the polynomial roots θ_i using an eigenvalue decomposition method, and calculating the formant frequency f_i from the polynomial roots, expressed as:
f_i = arg(θ_i) / (2πT)
where T is the sampling period value.

8. A pronunciation error correction system based on machine learning, comprising:
a signal receiving unit, for generating ultrasonic waves through a speaker built into an electronic device, and using a microphone as the signal receiving end to receive a pronunciation signal from a user and an ultrasonic reflection signal, the ultrasonic reflection signal being the sound signal of the user's lips reflected by the ultrasonic waves;
a preprocessing unit, for denoising the received user pronunciation signal and the ultrasonic reflection signal respectively, to extract an effective pronunciation signal and a corresponding effective ultrasonic reflection signal;
a normalization processing unit, for performing peak normalization uniformly on the effective pronunciation signal and the effective ultrasonic reflection signal;
a feature extraction unit, for performing time-window segmentation on the normalized ultrasonic reflection signal, calculating the formant frequency of the pronunciation signal, and calculating the frequency offset value of the ultrasonic reflection signal based on the Doppler effect;
a classification and identification unit, for inputting the formant frequency values in a time sequence and the corresponding frequency offset values of the ultrasonic reflection signal into a trained machine learning model to identify the speech phoneme classification.

9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.

10. A computer device, comprising a memory and a processor, a computer program executable on the processor being stored in the memory, wherein the processor implements the steps of the method according to any one of claims 1 to 8 when executing the program.
CN202110730385.7A 2021-06-29 2021-06-29 Pronunciation error correction method and system based on machine learning Active CN113611287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110730385.7A CN113611287B (en) 2021-06-29 2021-06-29 Pronunciation error correction method and system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110730385.7A CN113611287B (en) 2021-06-29 2021-06-29 Pronunciation error correction method and system based on machine learning

Publications (2)

Publication Number Publication Date
CN113611287A true CN113611287A (en) 2021-11-05
CN113611287B CN113611287B (en) 2023-09-12

Family

ID=78336957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110730385.7A Active CN113611287B (en) 2021-06-29 2021-06-29 Pronunciation error correction method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN113611287B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110071830A1 (en) * 2009-09-22 2011-03-24 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
CN104217218A (en) * 2014-09-11 2014-12-17 广州市香港科大霍英东研究院 Lip language recognition method and system
US20170069306A1 (en) * 2015-09-04 2017-03-09 Foundation of the Idiap Research Institute (IDIAP) Signal processing method and apparatus based on structured sparsity of phonological features
CN106328141A (en) * 2016-09-05 2017-01-11 南京大学 Ultrasonic lip reading recognition device and method for mobile terminal
CN108346427A (en) * 2018-02-05 2018-07-31 广东小天才科技有限公司 Voice recognition method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ASLAN B.WONG等: "Pronunciation Training through Sensing of Tongue and Lip Motion via Smartphone", 2021 IEEE INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING AND COMMUNICATION WORKSHOPS AND OTHER AFFILIATED EVENTS(PERCOM WORKSHOPS), pages 420 - 421 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115200695A (en) * 2022-07-14 2022-10-18 扬州大学 Safety early warning method and system for truck brakes based on abnormal sound wave feature recognition
CN115644848A (en) * 2022-10-26 2023-01-31 深圳大学 Lung function parameter measuring method and system based on voice signals
CN115670434A (en) * 2022-10-26 2023-02-03 深圳大学 A voice signal-based diagnosis method and system for chronic obstructive pulmonary disease

Also Published As

Publication number Publication date
CN113611287B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
EP2387031B1 (en) Methods and systems for grammar fitness evaluation as speech recognition error predictor
Mak et al. PLASER: Pronunciation learning via automatic speech recognition
Dudy et al. Automatic analysis of pronunciations for children with speech sound disorders
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
CN113611287B (en) Pronunciation error correction method and system based on machine learning
US20110123965A1 (en) Speech Processing and Learning
US9489864B2 (en) Systems and methods for an automated pronunciation assessment system for similar vowel pairs
Arora et al. Phonological feature-based speech recognition system for pronunciation training in non-native language learning
KR20160122542A (en) Method and apparatus for measuring pronounciation similarity
Bailey Automatic detection of sociolinguistic variation using forced alignment
Doremalen et al. Automatic pronunciation error detection in non-native speech: The case of vowel errors in Dutch
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
KR20200087623A (en) Apparatus and method for evaluating pronunciation accuracy for foreign language education
CN108648527A (en) A kind of pronunciation of English matching correcting method
CN108470476B (en) An English pronunciation matching correction system
Minematsu et al. Structural representation of the pronunciation and its use for CALL
Peabody Methods for pronunciation assessment in computer aided language learning
Maqsood et al. A comparative study of classifier based mispronunciation detection system for confusing
KR20080018658A (en) Voice comparison system for user selection section
Kaur et al. Segmentation of continuous punjabi speech signal into syllables
Khanal et al. Mispronunciation detection and diagnosis for Mandarin accented English speech
JP2006227587A (en) Pronunciation rating device and program
Terbeh et al. Identification of pronunciation defects in spoken Arabic language
Koniaris et al. On mispronunciation analysis of individual foreign speakers using auditory periphery models
Alqadasi et al. Improving Automatic Forced Alignment for Phoneme Segmentation in Quranic Recitation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant