Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
The invention aims to provide a method for performing voice evaluation and intelligent pronunciation training on common electronic equipment, which can recognize pronunciation, in particular the pronunciation of single (monophthong) vowels, according to the articulatory rules of tongue position and lip shape. The pronunciation recognition and analysis system can be carried on commercial electronic equipment, including but not limited to a smart phone, a tablet computer, a desktop computer, a personal digital assistant (PDA), a vehicle-mounted device, a smart wearable device, and the like. For clarity, the following description takes a smart phone as an example, with the user pronouncing vowels.
Specifically, as shown in fig. 1 and fig. 2, the provided pronunciation error correction method based on machine learning includes the following steps.
In step S100, ultrasonic waves are generated by a speaker built into the smart phone, and a microphone serves as the signal receiving end to receive the sound signal and the ultrasonic reflection signal from the user.
The sound signal is the signal produced directly when the user reads the speech material; the ultrasonic reflection signal is the transmitted ultrasonic wave reflected off the user's lip or face movements; the microphone receives a mixture of the two. In practical application, the smart phone can be placed in front of the user, and ultrasonic waves at a frequency of 20 kHz are generated through the built-in loudspeaker.
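As a minimal illustration of the transmitting side, the 20 kHz probe tone can be synthesized as a sine wave at the device's sampling rate. The sketch below uses Python with NumPy; the 44 kHz rate is taken from the sampling-rate example later in this description, and the routine only generates samples (playback through the loudspeaker is left to the platform's audio API).

```python
import numpy as np

FS = 44_000        # sampling rate (Hz), per the example later in this description
F_TX = 20_000      # ultrasonic probe-tone frequency (Hz)

def ultrasound_tone(duration_s: float, f: float = F_TX, fs: int = FS) -> np.ndarray:
    """Generate a continuous sine tone at frequency f; at 20 kHz this is
    inaudible to most adults yet still within the speaker's range."""
    t = np.arange(int(duration_s * fs)) / fs
    return np.sin(2 * np.pi * f * t).astype(np.float32)
```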
In one embodiment, step S100 includes the following sub-steps:
step S101, collecting the sound signals at the smart phone's microphone (including the user's pronunciation signal and the transmitted ultrasonic signal);
and step S102, performing data processing on the collected sound signals to remove interference from high-frequency environmental noise and electromagnetic noise.
For example, denoising is performed using a Butterworth low-pass filter with the cutoff frequency set to 5500 Hz, the microphone sampling rate being set to 44 kHz.
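A minimal sketch of this denoising step in Python with SciPy, applied to the voice-band branch of the recorded signal (the 20 kHz ultrasound band is handled separately in step S201 below); the filter order of 4 is an assumed value not specified above.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 44_000       # microphone sampling rate (Hz)
CUTOFF = 5_500    # low-pass cutoff (Hz) for the voice band

def denoise_voice_band(x: np.ndarray, order: int = 4) -> np.ndarray:
    """Suppress high-frequency environmental and electromagnetic noise with
    a Butterworth low-pass filter, run forward and backward (zero phase)."""
    b, a = butter(order, CUTOFF / (FS / 2), btype="low")
    return filtfilt(b, a, x)
```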
Step S200, performing preliminary extraction and preprocessing on the received signal.
In one embodiment, the following substeps may be performed using a smartphone:
step S201, removing out-of-band high-frequency and low-frequency components from the background human voice signal and the reflected ultrasonic wave by using notch filters;
step S202, removing noise in the sound signal by using a least mean square filter;
and step S203, detecting the start point and the end point of the utterance in the sound signal by a threshold method.
For example, during signal detection, the moment at which the signal exceeds a set threshold is located in time; when the signal stays above the threshold for a duration t (t greater than 1 s), it is regarded not as noise but as a valid signal to be extracted, and the signal is then cut at its start and end points. Specifically, the start point of the cut is determined first: for example, M1 sampling points are taken before the first point that exceeds the threshold (M1 is approximately the product of the sampling rate and the duration of one pronunciation). The end point of the utterance signal is then determined, for example, as the data point after the first point that falls below the threshold following M1 consecutive sample points.
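The following sketch illustrates this threshold-based endpoint detection in Python; the 10 ms moving-average energy envelope is an assumption added to make raw-waveform thresholding robust, while the 1 s minimum duration comes from the description above.

```python
import numpy as np
from typing import Optional

def extract_utterance(x: np.ndarray, fs: int, threshold: float,
                      min_duration: float = 1.0) -> Optional[np.ndarray]:
    """Cut out the utterance between the first point whose short-term
    envelope exceeds `threshold` and the first point where it falls back
    below it; segments shorter than `min_duration` seconds are rejected."""
    k = max(1, int(0.01 * fs))                         # 10 ms envelope window (assumed)
    env = np.convolve(np.abs(x), np.ones(k) / k, mode="same")
    above = env > threshold
    if not above.any():
        return None
    start = int(np.argmax(above))                      # first sample above threshold
    rest = above[start:]
    below = np.flatnonzero(~rest)
    end = start + (int(below[0]) if below.size else rest.size)
    if (end - start) / fs < min_duration:              # held less than t = 1 s: noise
        return None
    return x[start:end]
```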
In step S200, the notch filter used for the human voice signal passes, for example, 100 to 5500 Hz; the notch filter used for the reflected ultrasonic signal passes, for example, 19.90 to 21.10 kHz; and the sampling time of the least mean square filter is set to, for example, 10 ms.
Step S300, performing normalization to eliminate the interference that the placement distance of the smart phone introduces during signal transmission.
In one embodiment, step S300 includes the following sub-steps:
step S301, finding the peak values of the training-set sound signals among all the signals, and performing peak normalization on all received sound signals;
and step S302, in the test samples, setting to 1 any peak data that exceeds the training-set range.
Normalizing the received mixed sound signal (comprising the pronunciation signal and the ultrasonic reflection signal) eliminates the interference caused by the varying distance between the smart phone and the user's lips, and improves the accuracy of the subsequent phoneme recognition. For example, the normalization is expressed as:

y = (x - x_min) / (x_max - x_min)

where x and y are the signal before and after normalization, respectively, x_min is the minimum peak before normalization, and x_max is the maximum peak before normalization.
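A minimal sketch of steps S301 and S302 in Python; x_min and x_max are the extreme peaks found over the training set, and test-time values outside that range are clipped so that nothing exceeds 1, as described above.

```python
import numpy as np

def peak_normalize(x: np.ndarray, x_min: float, x_max: float) -> np.ndarray:
    """Min-max normalization against the training-set peaks; test-time
    values beyond the training range are clipped to [0, 1] (step S302)."""
    y = (x - x_min) / (x_max - x_min)
    return np.clip(y, 0.0, 1.0)

# Example: peaks of 0.2 and 0.9 were found over the training set.
x = np.array([0.2, 0.55, 0.9, 1.3])      # 1.3 exceeds the training-set peak
print(peak_normalize(x, 0.2, 0.9))       # -> [0.0, 0.5, 1.0, 1.0]
```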
Step S400, performing feature extraction on the normalized signals and cutting them into time windows, to obtain the formant frequencies of the sound signal and the frequency offset of the ultrasonic reflection signal during pronunciation.
In one embodiment, step S400 includes the following sub-steps:
step S401, processing the signal with a Fourier transform, extracting only the frequency-amplitude information in the 100-5500 Hz range of the resulting spectrum, and obtaining the formant frequencies of the sound signal by the linear predictive coding technique;
and step S402, performing time-window segmentation on the cut data, and detecting, via the Fourier transform and the Doppler effect, the frequency offset of the ultrasonic wave emitted by the smart phone and reflected by the user in the band around 20 kHz.
Specifically, linear predictive coding determines a set of predictor coefficients with which the actual speech sample values can be predicted; the formant frequency at a given moment can be obtained from these predictor coefficients, thereby mapping to the tongue position and the phoneme being pronounced at that moment. The basic idea of linear prediction is that speech sample values are correlated, so the current sample value can be approximated by a combination of several past sample values. A unique set of prediction coefficients is determined by minimizing the sum of squared differences between the actual speech sample values and the predicted values, i.e., by minimizing the mean square error.
For example, the solution uses the Levinson-Durbin autocorrelation method: the autocorrelation function of each utterance signal is computed, and a recursive algorithm yields the predictor coefficients.
After the prediction coefficients are obtained, the polynomial of the linear prediction filter is decomposed, e.g., by finding its complex roots θ_i using eigenvalue decomposition. The formant frequency f_i is then calculated from the complex root and the sampling period T, expressed as:

f_i = arg(θ_i) / (2πT)
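A sketch of this formant computation in Python/NumPy, assuming a single voice-band frame; the LPC order of 12, the Hamming window, and the 90 Hz floor are assumed values. The Levinson-Durbin recursion follows the standard autocorrelation method, and the roots of the prediction polynomial are converted to frequencies by f_i = arg(θ_i) / (2πT).

```python
import numpy as np

def lpc_coefficients(frame: np.ndarray, order: int) -> np.ndarray:
    """Levinson-Durbin recursion on the autocorrelation sequence
    (minimizes the mean squared prediction error)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def formant_frequencies(frame: np.ndarray, fs: int, order: int = 12) -> np.ndarray:
    """Roots of the LPC polynomial -> formants, f_i = arg(theta_i) / (2*pi*T)."""
    a = lpc_coefficients(frame * np.hamming(len(frame)), order)
    roots = np.roots(a)                     # companion-matrix eigenvalue decomposition
    roots = roots[np.imag(roots) > 0]       # keep one root per conjugate pair
    T = 1.0 / fs                            # sampling period
    f = np.angle(roots) / (2 * np.pi * T)
    return np.sort(f[f > 90])               # drop near-DC roots (assumed 90 Hz floor)
```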
by the resonant peak frequency fiMapping the lip position of the mouth roughly determines the kind of phoneme pronounced at this time, but still cannot be completely separated by this method due to the similar vowel phonemes, e.g./u/and/y/with similar formant frequencies. The underlying reason is that the tongue position is substantially the same when the vowel is emitted. Therefore, in order to solve this problem, the present invention has been madeOne step is to detect the motion of the lips by measuring the lips using doppler shift.
The Doppler effect is the phenomenon whereby, when a transmitting wave source and a receiving target move relative to each other, the frequency received by the observer deviates from the frequency emitted by the source.
In one embodiment, a short-time Fourier transform is used to capture the signal at frequencies around 20 kHz that is emitted by the handset speaker and reflected back to the microphone; backward movement of the lips produces a negative frequency shift, while forward movement produces a positive frequency shift. For example, when unrounded vowels such as /i/ and /e/ are pronounced, the lips are flat and generally retracted, the distance to the receiving end increases, and the Doppler shift appears as the center frequency of the reflected wave moving to a lower band. When rounded vowels such as /o/ and /u/ are pronounced, the lips are rounded and tend to move toward the receiving end, and the Doppler shift appears as the center frequency of the reflected wave moving to a higher band.
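A sketch of this Doppler measurement in Python with SciPy, assuming the 20 kHz probe tone and a 44 kHz recording; the FFT length of 8192 (roughly 5.4 Hz per bin) is an assumed trade-off between frequency and time resolution, and the 19.90-21.10 kHz search band follows step S201.

```python
import numpy as np
from scipy.signal import stft

FS = 44_000
F_TX = 20_000                                   # transmitted probe tone (Hz)

def doppler_shift_per_frame(x: np.ndarray) -> np.ndarray:
    """Track the peak of the reflected ~20 kHz tone over short-time frames;
    positive values = lips moving toward the phone, negative = away."""
    f, t, Z = stft(x, fs=FS, nperseg=8192)
    band = (f >= 19_900) & (f <= 21_100)        # reflected-ultrasound band
    peaks = f[band][np.argmax(np.abs(Z[band, :]), axis=0)]
    return peaks - F_TX                          # frequency shift in Hz per frame
```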
In step S400, the formant frequency of the signal is calculated by the linear predictive coding technique, and the frequency-shift value of the ultrasonic wave reflected under the Doppler effect of the lip motion is calculated in the frequency domain by the Fourier transform method; the frequency-shift value reflects both the direction of the shift and the degree of frequency change, so the positions of the lips and tongue can be captured according to the rules of pronunciation.
Step S500, inputting the extracted features into a machine learning model to perform pronunciation-phoneme classification detection.
In this step, the extracted features, including the formant frequency values and the frequency-offset values of the reflected ultrasonic wave during pronunciation, are input into the machine learning model, which identifies the pronounced phoneme type. For example, the time-sequenced filtered features are input into a support vector machine for classification; other machine learning models, such as convolutional neural networks, may also be used.
In one embodiment, step S500 includes the following sub-steps:
step S501, in a support-vector-machine-based classification method, inputting the processed time-sequence formant frequency information into the support vector machine;
and step S502, the support vector machine has a training state and a testing state: in the training state, the hyperplane parameters established by the support vector machine are trained according to the labels provided with the data; in the testing state, the predicted label result is returned according to the support vector machine's classification. When the support vector machine model is trained, the input features are the formant frequencies, in each divided time window, of the sound signals processed through steps S100 to S400, and the output classification labels are the vowel phonemes identified by the classification, i.e., the vowel phoneme types.
In the above embodiment, the vowel sound signal is preferably classified using a support vector machine model, which is well suited as a phoneme classifier: the support vector machine kernel maps the low-dimensional input space to a higher-dimensional space, that is, it converts an inseparable problem into a separable one by adding more dimensions. A Gaussian kernel function is employed for this purpose. In this way, vowels with the same tongue position or the same lip shape can be distinguished through the choice of formants together with the determination of the Doppler shift.
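A minimal sketch of the classifier using scikit-learn's SVC with a Gaussian (RBF) kernel; the feature layout (three formants plus a Doppler-shift value per window) and the placeholder data are illustrative assumptions, not the invention's actual data set.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))            # per window: [F1, F2, F3, doppler_shift]
y_train = rng.integers(0, 5, size=200)         # placeholder vowel-phoneme labels
X_test = rng.normal(size=(50, 4))
y_test = rng.integers(0, 5, size=50)

# The Gaussian kernel lifts the features into a higher-dimensional space
# where phonemes inseparable in the raw feature space become separable.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)                      # training state: fit the hyperplane
print("accuracy:", clf.score(X_test, y_test))  # testing state: return label results
```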
In summary, in step S500, correct or incorrect pronunciation by the user is detected using the support vector machine model, and pronunciation feedback is provided so that the user knows the positions of the mouth and tongue when pronouncing.
It should be understood that by training the support vector machine with a training data set and evaluating the model with a testing data set, an optimized support vector machine model can be obtained for phoneme-type recognition and analysis during actual pronunciation. The training data set reflects the correspondence between the time-sequenced formant frequency values, the frequency-offset values of the reflected ultrasonic wave during pronunciation, and the pronunciation phoneme types.
Step S600, feeding back to the user the detection result and the details to be noted when pronouncing.
After the detection result is obtained, the details the user should pay attention to when pronouncing can be communicated through the smart phone. For example, step S600 includes the following sub-steps:
step S601, displaying the classified pronunciation result according to the result returned in step S500;
step S602, judging the positions of the user's lips and tongue according to the changes in the formant frequency information within the time windows analyzed in step S500, and displaying the changes on the smart phone;
and step S603, the classification result and the changes in the formant information reflect the details the user needs to attend to when pronouncing, for example whether the tongue should be moved up or down, or the lips moved forward or backward, thereby providing auxiliary training, as illustrated in the sketch below.
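As an illustration of step S603, the sketch below maps feature differences to coaching hints; the reference values, the 50 Hz tolerance, and the helper name are hypothetical. It relies on two hedged facts: the first formant F1 falls as the tongue rises, and, as described above, a positive Doppler shift corresponds to the lips moving forward.

```python
from typing import List

def pronunciation_hints(f1_user: float, f1_ref: float,
                        shift_user: float, shift_ref: float,
                        tol_hz: float = 50.0) -> List[str]:
    """Turn formant/Doppler deviations from a reference pronunciation into
    the kind of feedback described in step S603 (hypothetical thresholds)."""
    hints = []
    if f1_user - f1_ref > tol_hz:
        hints.append("raise the tongue")        # F1 too high: tongue too low
    elif f1_ref - f1_user > tol_hz:
        hints.append("lower the tongue")        # F1 too low: tongue too high
    if shift_user < shift_ref - 1.0:
        hints.append("move the lips forward (round them more)")
    elif shift_user > shift_ref + 1.0:
        hints.append("retract or spread the lips")
    return hints or ["pronunciation looks correct"]
```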
For example, vowel pronunciations by speakers of different native languages are actually collected; through data analysis and classification training, correction can be achieved for vowel pronunciation in multiple languages, and the analysis results can be fed back to the user.
Correspondingly, the invention also provides a machine-learning-based pronunciation error correction system for implementing one or more aspects of the above method, in which the modules or units involved may be implemented using a dedicated processor or an FPGA (Field Programmable Gate Array). For example, referring to fig. 3, the system includes a sensing and processing module, a computing module, and a feedback module.
The sensing and processing module is used to perform the following process: when the user speaks, ultrasonic waves are generated by the smart phone's speaker, and the returned sound-wave signal is received at the smart phone's microphone. The sound signals are normalized, and the original signals are processed using a Butterworth band-pass filter, a notch filter, and a least mean square filter; the linear predictive coding technique is used to calculate the formant frequencies, and a Fourier transform is used to calculate the Doppler shift of the ultrasonic reflection signal. Within one acquisition period, the sound-wave signal is cut into a number of time units, the formant frequency of each unit is calculated, and the movement of the user's lips and tongue during pronunciation is judged using the Doppler effect.
The computing module is used to perform the following process: the formant frequencies of a plurality of time windows within one sampling are calculated, and the formant frequencies of the plurality of windows are fed to the support vector machine classification model for training. After data from a number of training samples have been obtained, the model can perform classification prediction on the formant frequency data of the time windows.
the feedback module is used for executing the following processes: the method feeds back the phoneme of the vowel spoken by the user, and meanwhile, the position of the tongue and the shape of the lips can be judged through the change of the formant frequency on the time window in the pronunciation process of the user, and the change is displayed visually. And, by suggesting this pronunciation of the sub-phoneme by reference to a standard library of formant frequencies in a time window, the user is told that the lip pronunciation should be expanded or closed.
In one embodiment, the sensing and processing module comprises the following elements:
the voice transmitting unit, which uses a loudspeaker to transmit ultrasonic waves oscillating as a regular sine wave;
the voice collecting unit, which uses a microphone to collect the sound signal reflected from the user's face (vocal organs) and extracts time-sliced signals by threshold judgment;
the normalization unit, which eliminates the influence of the volume of the user's vocalization;
the filtering unit, which filters the sound signal data using filters;
the analog-to-digital unit, which converts the analog sound signal into a digital signal;
and the extraction unit, which calculates the formant frequencies using the linear predictive coding technique and calculates the Doppler shift by processing the ultrasonic reflection signal with a Fourier transform.
In one embodiment, the calculation module comprises the following units:
the time-slice cutting unit, which cuts the signal into time-slice signals;
and the computing unit, which performs classification prediction on the features within the divided units using the machine learning model.
In one embodiment, the feedback module comprises the following elements:
the error correction unit, which displays the vowel-phoneme pronunciation types obtained by classification prediction on the input features using the machine learning model;
and the visualization unit, which generates a feedback graph of the lip and tongue positions from the changes in the time-series features across the segmented time windows, and presents it to the user so that pronunciation errors are made clear by comparison with a standard graph.
The invention focuses on the clarity of the user's speech in the target language and realizes auxiliary pronunciation training by judging and correcting vowel phonemes. It has been verified that, in practical applications, the invention can recognize 30 vowels across four languages (French, Japanese, Korean, and Mandarin) from the positions of the lips and tongue. As shown in fig. 4, the invention can be built into a smart phone: when correcting pronunciation, the user reads the selected vowel phoneme to be trained; after reading, the user is informed whether the reading was correct and how to adjust the vocal organs to better produce the vowel phoneme of the language being learned.
In conclusion, the invention provides a method for pronunciation error correction based on machine learning, a novel mode of pronunciation training. It uses the resonance in the user's sound wave, and uses ultrasound based on the Doppler effect to detect the positions of the user's tongue and lips separately, which improves detection accuracy. Processing the signals with methods such as a Butterworth band-pass filter, a notch filter, a least mean square filter, time slicing, normalization, and the Fourier transform markedly improves the accuracy of the signals. The proposed method achieves a high recognition rate for vowel-phoneme classification using a support vector machine. The invention also implements feedback of the detection result, further improving usability and functionality: it can identify the positions of the lips and tongue and can also conduct vowel pronunciation training.
Furthermore, conventional methods cannot distinguish rounded from unrounded vowels, such as /i/ and /y/, by analyzing the voice waveform alone, because these vowels differ in lip shape. To address this problem, the invention establishes a rounded/unrounded vowel classifier model, called the frequency-shift model, by analyzing the frequency-shift direction of the ultrasonic signal, and can thereby extract the pronunciation features of more standard vowel phonemes.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), with state information of computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.