Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
The invention aims to provide a method for performing voice evaluation and intelligent pronunciation training on common electronic equipment, which can recognize pronunciation, in particular the pronunciation of single (monophthong) vowels, according to the articulatory rules of tongue position and lip shape. The pronunciation recognition and analysis system can be carried on commercial electronic equipment, including but not limited to a smart phone, a tablet computer, a desktop computer, a personal digital assistant (PDA), a vehicle-mounted device, a smart wearable device, and the like. For clarity, the following description takes a smart phone as an example, with the user pronouncing vowels.
Specifically, as shown in fig. 1 and fig. 2, the provided pronunciation error correction method based on machine learning includes the following steps.
In step S100, ultrasonic waves are generated by a speaker built into the smart phone, and a microphone serves as the signal receiving end to receive the sound signal and the ultrasonic reflection signal from the user.
The sound signal is the signal produced directly when the user reads the speech material; the ultrasonic reflection signal is the transmitted ultrasonic wave reflected off the user's lip or face movements; the microphone receives a mixture of the two. In practical application, the smart phone can be placed in front of the user, and ultrasonic waves at a frequency of 20 kHz are generated through the built-in loudspeaker.
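As a minimal illustration of the transmitting side, the 20 kHz probe tone can be synthesized as a sine wave at the device's sampling rate. The sketch below uses Python with NumPy; the 44 kHz rate is taken from the sampling-rate example later in this description, and the routine only generates samples (playback through the loudspeaker is left to the platform's audio API).

```python
import numpy as np

FS = 44_000        # sampling rate (Hz), per the example later in this description
F_TX = 20_000      # ultrasonic probe-tone frequency (Hz)

def ultrasound_tone(duration_s: float, f: float = F_TX, fs: int = FS) -> np.ndarray:
    """Generate a continuous sine tone at frequency f; at 20 kHz this is
    inaudible to most adults yet still within the speaker's range."""
    t = np.arange(int(duration_s * fs)) / fs
    return np.sin(2 * np.pi * f * t).astype(np.float32)
```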
In one embodiment, step S100 includes the following sub-steps:
step S101, collecting the sound signals at the smart phone's microphone (including the user's pronunciation signal and the transmitted ultrasonic signal);
and step S102, performing data processing on the collected sound signals to remove interference from high-frequency environmental noise and electromagnetic noise.
For example, denoising is performed using a Butterworth low-pass filter with the cutoff frequency set to 5500 Hz, the microphone sampling rate being set to 44 kHz.
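A minimal sketch of this denoising step in Python with SciPy, applied to the voice-band branch of the recorded signal (the 20 kHz ultrasound band is handled separately in step S201 below); the filter order of 4 is an assumed value not specified above.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 44_000       # microphone sampling rate (Hz)
CUTOFF = 5_500    # low-pass cutoff (Hz) for the voice band

def denoise_voice_band(x: np.ndarray, order: int = 4) -> np.ndarray:
    """Suppress high-frequency environmental and electromagnetic noise with
    a Butterworth low-pass filter, run forward and backward (zero phase)."""
    b, a = butter(order, CUTOFF / (FS / 2), btype="low")
    return filtfilt(b, a, x)
```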
Step S200, performing preliminary extraction and preprocessing on the received signal.
In one embodiment, the following substeps may be performed using a smartphone:
step S201, removing out-of-band high-frequency and low-frequency components from the background human voice signal and the reflected ultrasonic wave by using notch filters;
step S202, removing noise in the sound signal by using a least mean square filter;
and step S203, detecting the start point and the end point of the utterance in the sound signal by a threshold method.
For example, during signal detection, the moment at which the signal exceeds a set threshold is located in time; when the signal stays above the threshold for a duration t (t greater than 1 s), it is regarded not as noise but as a valid signal to be extracted, and the signal is then cut at its start and end points. Specifically, the start point of the cut is determined first: for example, M1 sampling points are taken before the first point that exceeds the threshold (M1 is approximately the product of the sampling rate and the duration of one pronunciation). The end point of the utterance signal is then determined, for example, as the data point after the first point that falls below the threshold following M1 consecutive sample points.
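The following sketch illustrates this threshold-based endpoint detection in Python; the 10 ms moving-average energy envelope is an assumption added to make raw-waveform thresholding robust, while the 1 s minimum duration comes from the description above.

```python
import numpy as np
from typing import Optional

def extract_utterance(x: np.ndarray, fs: int, threshold: float,
                      min_duration: float = 1.0) -> Optional[np.ndarray]:
    """Cut out the utterance between the first point whose short-term
    envelope exceeds `threshold` and the first point where it falls back
    below it; segments shorter than `min_duration` seconds are rejected."""
    k = max(1, int(0.01 * fs))                         # 10 ms envelope window (assumed)
    env = np.convolve(np.abs(x), np.ones(k) / k, mode="same")
    above = env > threshold
    if not above.any():
        return None
    start = int(np.argmax(above))                      # first sample above threshold
    rest = above[start:]
    below = np.flatnonzero(~rest)
    end = start + (int(below[0]) if below.size else rest.size)
    if (end - start) / fs < min_duration:              # held less than t = 1 s: noise
        return None
    return x[start:end]
```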
In step S200, the notch filter used for the human voice signal passes, for example, 100 to 5500 Hz; the notch filter used for the reflected ultrasonic signal passes, for example, 19.90 to 21.10 kHz; and the sampling time of the least mean square filter is set to, for example, 10 ms.
Step S300, performing normalization to eliminate the interference that the placement distance of the smart phone introduces during signal transmission.
In one embodiment, step S300 includes the following sub-steps:
step S301, finding the peak values of the training-set sound signals among all the signals, and performing peak normalization on all received sound signals;
and step S302, in the test samples, setting to 1 any peak data that exceeds the training-set range.
Normalizing the received mixed sound signal (comprising the pronunciation signal and the ultrasonic reflection signal) eliminates the interference caused by the varying distance between the smart phone and the user's lips, and improves the accuracy of the subsequent phoneme recognition. For example, the normalization is expressed as:

y = (x - x_min) / (x_max - x_min)

where x and y are the signal before and after normalization, respectively, x_min is the minimum peak before normalization, and x_max is the maximum peak before normalization.
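A minimal sketch of steps S301 and S302 in Python; x_min and x_max are the extreme peaks found over the training set, and test-time values outside that range are clipped so that nothing exceeds 1, as described above.

```python
import numpy as np

def peak_normalize(x: np.ndarray, x_min: float, x_max: float) -> np.ndarray:
    """Min-max normalization against the training-set peaks; test-time
    values beyond the training range are clipped to [0, 1] (step S302)."""
    y = (x - x_min) / (x_max - x_min)
    return np.clip(y, 0.0, 1.0)

# Example: peaks of 0.2 and 0.9 were found over the training set.
x = np.array([0.2, 0.55, 0.9, 1.3])      # 1.3 exceeds the training-set peak
print(peak_normalize(x, 0.2, 0.9))       # -> [0.0, 0.5, 1.0, 1.0]
```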
Step S400, performing feature extraction on the normalized signals and cutting them into time windows, to obtain the formant frequencies of the sound signal and the frequency offset of the ultrasonic reflection signal during pronunciation.
In one embodiment, step S400 includes the following sub-steps:
step S401, processing the signal with a Fourier transform, extracting only the frequency-amplitude information in the 100-5500 Hz range of the resulting spectrum, and obtaining the formant frequencies of the sound signal by the linear predictive coding technique;
and step S402, performing time-window segmentation on the cut data, and detecting, via the Fourier transform and the Doppler effect, the frequency offset of the ultrasonic wave emitted by the smart phone and reflected by the user in the band around 20 kHz.
Specifically, linear predictive coding determines a set of predictor coefficients with which the actual speech sample values can be predicted; the formant frequency at a given moment can be obtained from these predictor coefficients, thereby mapping to the tongue position and the phoneme being pronounced at that moment. The basic idea of linear prediction is that speech sample values are correlated, so the current sample value can be approximated by a combination of several past sample values. A unique set of prediction coefficients is determined by minimizing the sum of squared differences between the actual speech sample values and the predicted values, i.e., by minimizing the mean square error.
For example, the solution uses the Levinson-Durbin autocorrelation method: the autocorrelation function of each utterance signal is computed, and a recursive algorithm yields the predictor coefficients.
After the prediction coefficients are obtained, the polynomial of the linear prediction filter is decomposed, e.g., by finding its complex roots θ_i using eigenvalue decomposition. The formant frequency f_i is then calculated from the complex root and the sampling period T, expressed as:

f_i = arg(θ_i) / (2πT)
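A sketch of this formant computation in Python/NumPy, assuming a single voice-band frame; the LPC order of 12, the Hamming window, and the 90 Hz floor are assumed values. The Levinson-Durbin recursion follows the standard autocorrelation method, and the roots of the prediction polynomial are converted to frequencies by f_i = arg(θ_i) / (2πT).

```python
import numpy as np

def lpc_coefficients(frame: np.ndarray, order: int) -> np.ndarray:
    """Levinson-Durbin recursion on the autocorrelation sequence
    (minimizes the mean squared prediction error)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def formant_frequencies(frame: np.ndarray, fs: int, order: int = 12) -> np.ndarray:
    """Roots of the LPC polynomial -> formants, f_i = arg(theta_i) / (2*pi*T)."""
    a = lpc_coefficients(frame * np.hamming(len(frame)), order)
    roots = np.roots(a)                     # companion-matrix eigenvalue decomposition
    roots = roots[np.imag(roots) > 0]       # keep one root per conjugate pair
    T = 1.0 / fs                            # sampling period
    f = np.angle(roots) / (2 * np.pi * T)
    return np.sort(f[f > 90])               # drop near-DC roots (assumed 90 Hz floor)
```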
by the resonant peak frequency fiMapping the lip position of the mouth roughly determines the kind of phoneme pronounced at this time, but still cannot be completely separated by this method due to the similar vowel phonemes, e.g./u/and/y/with similar formant frequencies. The underlying reason is that the tongue position is substantially the same when the vowel is emitted. Therefore, in order to solve this problem, the present invention has been madeOne step is to detect the motion of the lips by measuring the lips using doppler shift.
The Doppler effect is the phenomenon whereby, when a transmitting wave source and a receiving target move relative to each other, the frequency received by the observer deviates from the frequency emitted by the source.
In one embodiment, a short-time Fourier transform is used to capture the signal at frequencies around 20 kHz that is emitted by the handset speaker and reflected back to the microphone; backward movement of the lips produces a negative frequency shift, while forward movement produces a positive frequency shift. For example, when unrounded vowels such as /i/ and /e/ are pronounced, the lips are flat and generally retracted, the distance to the receiving end increases, and the Doppler shift appears as the center frequency of the reflected wave moving to a lower band. When rounded vowels such as /o/ and /u/ are pronounced, the lips are rounded and tend to move toward the receiving end, and the Doppler shift appears as the center frequency of the reflected wave moving to a higher band.
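A sketch of this Doppler measurement in Python with SciPy, assuming the 20 kHz probe tone and a 44 kHz recording; the FFT length of 8192 (roughly 5.4 Hz per bin) is an assumed trade-off between frequency and time resolution, and the 19.90-21.10 kHz search band follows step S201.

```python
import numpy as np
from scipy.signal import stft

FS = 44_000
F_TX = 20_000                                   # transmitted probe tone (Hz)

def doppler_shift_per_frame(x: np.ndarray) -> np.ndarray:
    """Track the peak of the reflected ~20 kHz tone over short-time frames;
    positive values = lips moving toward the phone, negative = away."""
    f, t, Z = stft(x, fs=FS, nperseg=8192)
    band = (f >= 19_900) & (f <= 21_100)        # reflected-ultrasound band
    peaks = f[band][np.argmax(np.abs(Z[band, :]), axis=0)]
    return peaks - F_TX                          # frequency shift in Hz per frame
```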
In step S400, the formant frequency of the signal is calculated by the linear predictive coding technique, and the frequency-shift value of the ultrasonic wave reflected under the Doppler effect of the lip motion is calculated in the frequency domain by the Fourier transform method; the frequency-shift value reflects both the direction of the shift and the degree of frequency change, so the positions of the lips and tongue can be captured according to the rules of pronunciation.
Step S500, inputting the extracted features into a machine learning model to perform pronunciation-phoneme classification detection.
In this step, the extracted features, including the formant frequency values and the frequency-offset values of the reflected ultrasonic wave during pronunciation, are input into the machine learning model, which identifies the pronounced phoneme type. For example, the time-sequenced filtered features are input into a support vector machine for classification; other machine learning models, such as convolutional neural networks, may also be used.
In one embodiment, step S500 includes the following sub-steps:
step S501, in a support-vector-machine-based classification method, inputting the processed time-sequence formant frequency information into the support vector machine;
and step S502, the support vector machine has a training state and a testing state: in the training state, the hyperplane parameters established by the support vector machine are trained according to the labels provided with the data; in the testing state, the predicted label result is returned according to the support vector machine's classification. When the support vector machine model is trained, the input features are the formant frequencies, in each divided time window, of the sound signals processed through steps S100 to S400, and the output classification labels are the vowel phonemes identified by the classification, i.e., the vowel phoneme types.
In the above embodiment, the vowel sound signal is preferably classified using a support vector machine model, which is well suited as a phoneme classifier: the support vector machine kernel maps the low-dimensional input space to a higher-dimensional space, that is, it converts an inseparable problem into a separable one by adding more dimensions. A Gaussian kernel function is employed for this purpose. In this way, vowels with the same tongue position or the same lip shape can be distinguished through the choice of formants together with the determination of the Doppler shift.
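A minimal sketch of the classifier using scikit-learn's SVC with a Gaussian (RBF) kernel; the feature layout (three formants plus a Doppler-shift value per window) and the placeholder data are illustrative assumptions, not the invention's actual data set.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))            # per window: [F1, F2, F3, doppler_shift]
y_train = rng.integers(0, 5, size=200)         # placeholder vowel-phoneme labels
X_test = rng.normal(size=(50, 4))
y_test = rng.integers(0, 5, size=50)

# The Gaussian kernel lifts the features into a higher-dimensional space
# where phonemes inseparable in the raw feature space become separable.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)                      # training state: fit the hyperplane
print("accuracy:", clf.score(X_test, y_test))  # testing state: return label results
```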
In summary, in step S500, correct or incorrect pronunciation by the user is detected using the support vector machine model, and pronunciation feedback is provided so that the user knows the positions of the mouth and tongue when pronouncing.
It should be understood that by training the support vector machine with a training data set and evaluating the model with a testing data set, an optimized support vector machine model can be obtained for phoneme-type recognition and analysis during actual pronunciation. The training data set reflects the correspondence between the time-sequenced formant frequency values, the frequency-offset values of the reflected ultrasonic wave during pronunciation, and the pronunciation phoneme types.
Step S600, feeding back to the user the detection result and the details to be noted when pronouncing.
After the detection result is obtained, the details the user should pay attention to when pronouncing can be communicated through the smart phone. For example, step S600 includes the following sub-steps:
step S601, displaying the classified pronunciation result according to the result returned in step S500;
step S602, judging the positions of the user's lips and tongue according to the changes in the formant frequency information within the time windows analyzed in step S500, and displaying the changes on the smart phone;
and step S603, the classification result and the changes in the formant information reflect the details the user needs to attend to when pronouncing, for example whether the tongue should be moved up or down, or the lips moved forward or backward, thereby providing auxiliary training, as illustrated in the sketch below.
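As an illustration of step S603, the sketch below maps feature differences to coaching hints; the reference values, the 50 Hz tolerance, and the helper name are hypothetical. It relies on two hedged facts: the first formant F1 falls as the tongue rises, and, as described above, a positive Doppler shift corresponds to the lips moving forward.

```python
from typing import List

def pronunciation_hints(f1_user: float, f1_ref: float,
                        shift_user: float, shift_ref: float,
                        tol_hz: float = 50.0) -> List[str]:
    """Turn formant/Doppler deviations from a reference pronunciation into
    the kind of feedback described in step S603 (hypothetical thresholds)."""
    hints = []
    if f1_user - f1_ref > tol_hz:
        hints.append("raise the tongue")        # F1 too high: tongue too low
    elif f1_ref - f1_user > tol_hz:
        hints.append("lower the tongue")        # F1 too low: tongue too high
    if shift_user < shift_ref - 1.0:
        hints.append("move the lips forward (round them more)")
    elif shift_user > shift_ref + 1.0:
        hints.append("retract or spread the lips")
    return hints or ["pronunciation looks correct"]
```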
For example, vowel pronunciations by speakers of different native languages are actually collected; through data analysis and classification training, correction can be achieved for vowel pronunciation in multiple languages, and the analysis results can be fed back to the user.
Correspondingly, the invention also provides a machine-learning-based pronunciation error correction system for implementing one or more aspects of the above method, in which the modules or units involved may be implemented using a dedicated processor or an FPGA (Field Programmable Gate Array). For example, referring to fig. 3, the system includes a sensing and processing module, a computing module, and a feedback module.
The sensing and processing module is used to perform the following process: when the user speaks, ultrasonic waves are generated by the smart phone's speaker, and the returned sound-wave signal is received at the smart phone's microphone. The sound signals are normalized, and the original signals are processed using a Butterworth band-pass filter, a notch filter, and a least mean square filter; the linear predictive coding technique is used to calculate the formant frequencies, and a Fourier transform is used to calculate the Doppler shift of the ultrasonic reflection signal. Within one acquisition period, the sound-wave signal is cut into a number of time units, the formant frequency of each unit is calculated, and the movement of the user's lips and tongue during pronunciation is judged using the Doppler effect.
The computing module is used to perform the following process: the formant frequencies of a plurality of time windows within one sampling are calculated, and the formant frequencies of the plurality of windows are fed to the support vector machine classification model for training. After data from a number of training samples have been obtained, the model can perform classification prediction on the formant frequency data of the time windows.
the feedback module is used for executing the following processes: the method feeds back the phoneme of the vowel spoken by the user, and meanwhile, the position of the tongue and the shape of the lips can be judged through the change of the formant frequency on the time window in the pronunciation process of the user, and the change is displayed visually. And, by suggesting this pronunciation of the sub-phoneme by reference to a standard library of formant frequencies in a time window, the user is told that the lip pronunciation should be expanded or closed.
In one embodiment, the sensing and processing module comprises the following elements:
the voice transmitting unit, which uses a loudspeaker to transmit ultrasonic waves oscillating as a regular sine wave;
the voice collecting unit, which uses a microphone to collect the sound signal reflected from the user's face (vocal organs) and extracts time-sliced signals by threshold judgment;
the normalization unit, which eliminates the influence of the volume of the user's vocalization;
the filtering unit, which filters the sound signal data using filters;
the analog-to-digital unit, which converts the analog sound signal into a digital signal;
and the extraction unit, which calculates the formant frequencies using the linear predictive coding technique and calculates the Doppler shift by processing the ultrasonic reflection signal with a Fourier transform.
In one embodiment, the calculation module comprises the following units:
the time-slice cutting unit, which cuts the signal into time-slice signals;
and the computing unit, which performs classification prediction on the features within the divided units using the machine learning model.
In one embodiment, the feedback module comprises the following elements:
the error correction unit, which displays the vowel-phoneme pronunciation types obtained by classification prediction on the input features using the machine learning model;
and the visualization unit, which generates a feedback graph of the lip and tongue positions from the changes in the time-series features across the segmented time windows, and presents it to the user so that pronunciation errors are made clear by comparison with a standard graph.
The invention focuses on the clarity of the user's speech in the target language and realizes auxiliary pronunciation training by judging and correcting vowel phonemes. It has been verified that, in practical applications, the invention can recognize 30 vowels across four languages (French, Japanese, Korean, and Mandarin) from the positions of the lips and tongue. As shown in fig. 4, the invention can be built into a smart phone: when correcting pronunciation, the user reads the selected vowel phoneme to be trained; after reading, the user is informed whether the reading was correct and how to adjust the vocal organs to better produce the vowel phoneme of the language being learned.
In conclusion, the invention provides a method for pronunciation error correction based on machine learning, a novel mode of pronunciation training. It uses the resonance in the user's sound wave, and uses ultrasound based on the Doppler effect to detect the positions of the user's tongue and lips separately, which improves detection accuracy. Processing the signals with methods such as a Butterworth band-pass filter, a notch filter, a least mean square filter, time slicing, normalization, and the Fourier transform markedly improves the accuracy of the signals. The proposed method achieves a high recognition rate for vowel-phoneme classification using a support vector machine. The invention also implements feedback of the detection result, further improving usability and functionality: it can identify the positions of the lips and tongue and can also conduct vowel pronunciation training.
Furthermore, conventional methods cannot distinguish rounded from unrounded vowels, such as /i/ and /y/, by analyzing the voice waveform alone, because these vowels differ in lip shape. To address this problem, the invention establishes a rounded/unrounded vowel classifier model, called the frequency-shift model, by analyzing the frequency-shift direction of the ultrasonic signal, and can thereby extract the pronunciation features of more standard vowel phonemes.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), with state information of computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.