
CN109872727B - Voice quality evaluation device, method and system

Info

Publication number: CN109872727B
Application number: CN201910290416.4A
Authority: CN (China)
Prior art keywords: voice, user, speech, accent, predetermined text
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN109872727A
Inventor: 林晖
Current assignee: Shanghai Liulishuo Information Technology Co ltd
Original assignee: Shanghai Liulishuo Information Technology Co ltd
Application filed by Shanghai Liulishuo Information Technology Co ltd
Priority to: CN201910290416.4A
Publication of CN109872727A
Application granted; publication of CN109872727B

Landscapes

  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an accent-based voice quality evaluation device, method and system, a data processing device and method, a voice processing device and method, and a mobile terminal, aiming to overcome the problem that existing voice technology does not consider information related to voice stress when evaluating a user's pronunciation. The voice quality evaluation apparatus includes: a storage unit adapted to store a predetermined text and a reference accent feature corresponding to the predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; a user voice receiving unit adapted to receive user voice entered by a user for the predetermined text; a feature acquisition unit adapted to acquire user accent features of the user voice; and a speech quality calculation unit adapted to calculate the speech quality of the user voice based on a correlation between the reference accent feature and the user accent features. The technology of the invention can be applied to the technical field of voice.

Description

Voice quality evaluation device, method and system
This application is a divisional application of the invention patent application with application number 201410736334.5, entitled "Voice quality evaluation device, method and system", filed by the applicant on December 4, 2014.
Technical Field
The present invention relates to the field of voice technologies, and in particular, to an accent-based voice quality evaluation device, method, and system, a data processing device and method, a voice processing device and method, and a mobile terminal.
Background
With the development of the internet, internet-based language learning applications have also developed rapidly. In some language learning applications, the application provider sends learning materials to a client through the internet; the user obtains the learning materials through the client, performs operations on the client according to the materials' instructions (such as entering text, recording voice, or making selections), and receives feedback, thereby improving his or her language ability.
For language learning, in addition to grammar and vocabulary, an important aspect is the listening and speaking abilities of a language, particularly the ability to speak. In each language, speech in different scenes tends to carry different stress; for example, different sentences and different words are stressed differently from scene to scene. Generally, stress refers to which words should be emphasized in a whole sentence (hereinafter referred to as speaking stress) or which syllables should be emphasized in a word (hereinafter referred to as pronunciation stress). Thus, when learning to speak a language, the user may also need to learn its speaking stress and/or pronunciation stress.
In existing voice technology, a user records voice through a recording device of a client, the system splits the recorded voice according to the text corresponding to it, and compares the user's voice with an existing acoustic model word by word, thereby giving the user feedback on whether each word's pronunciation is correct. However, existing voice technology ignores information about the stress of the voice when evaluating the user's pronunciation, and thus does not allow the learner to learn speaking stress and/or pronunciation stress.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical elements of the invention, nor to delimit its scope. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description discussed later.
In view of the above, the present invention provides an accent-based speech quality evaluation apparatus, method and system, a data processing apparatus and method, a speech processing apparatus and method, and a mobile terminal, so as to at least solve the problem that existing speech technology ignores information about speech stress when evaluating a user's pronunciation.
According to an aspect of the present invention, there is provided an accent-based speech quality assessment apparatus including: a storage unit adapted to store a predetermined text and a reference accent feature corresponding to the predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; a user voice receiving unit adapted to receive user voice entered by a user for the predetermined text; a feature acquisition unit adapted to acquire user accent features of the user voice; and a speech quality calculation unit adapted to calculate the speech quality of the user voice based on a correlation between the reference accent feature and the user accent features.
According to another aspect of the present invention, there is also provided a data processing apparatus adapted to be executed in a server, comprising: a server storage unit adapted to store a predetermined text and at least one piece of reference voice corresponding to the predetermined text; and an accent calculation unit adapted to calculate feature parameters of the reference voice from the reference voice, or to obtain reference accent features of the at least one piece of reference voice from the feature parameters, and to store the result in the server storage unit.
According to another aspect of the present invention, there is also provided a speech processing apparatus adapted to be executed in a computer and comprising: a reference voice receiving unit adapted to receive voice entered by a specific user for a predetermined text as reference voice; and an accent calculation unit adapted to calculate feature parameters of the reference speech from the reference speech and transmit them to a predetermined server in association with the predetermined text, or to obtain reference accent features of the reference speech from the feature parameters and transmit them to the predetermined server in association with the predetermined text.
According to another aspect of the present invention, there is also provided an accent-based speech quality assessment method, including the steps of: receiving user voice entered by a user for a predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; acquiring user accent features of the user voice; and calculating the speech quality of the user voice based on a correlation between the reference accent features corresponding to the predetermined text and the user accent features.
According to another aspect of the present invention, there is also provided a data processing method adapted to be executed in a server and including the steps of: storing a predetermined text and at least one piece of reference voice corresponding to the predetermined text; and calculating feature parameters of the reference voice from the reference voice and storing them, or obtaining reference accent features of the at least one piece of reference voice from the feature parameters and storing them.
According to another aspect of the present invention, there is also provided a speech processing method adapted to be executed in a computer and including the steps of: receiving voice entered by a specific user for a predetermined text as reference voice; and calculating feature parameters of the reference voice from the reference voice and transmitting them to a predetermined server in association with the predetermined text, or obtaining reference accent features of the reference voice from the feature parameters and transmitting them to the predetermined server in association with the predetermined text.
According to another aspect of the present invention, there is also provided a mobile terminal including the accent-based speech quality assessment apparatus as described above.
According to still another aspect of the present invention, there is also provided an accent-based speech quality assessment system comprising an accent-based speech quality assessment apparatus as described above and a data processing apparatus as described above.
The above accent-based speech quality evaluation scheme according to embodiments of the present invention calculates the speech quality of the user voice based on the correlation between the obtained user accent features and the reference accent features, and can obtain at least one of the following benefits: information about voice stress is considered when calculating the speech quality of the user voice, so the user can learn how accurate the recorded voice is in terms of stress and judge whether his or her speaking stress and/or pronunciation stress needs correction; the calculation and evaluation of the user voice are completed on the client computer or client mobile terminal, so the user can learn offline; the amount of calculation is small; time is saved; operation is simpler and more convenient; and when the representation form of the user accent features changes, the reference accent features calculated from the accent information of the reference voice can conveniently be represented in the same form as the user accent features, making the speech quality evaluation apparatus more flexible, more convenient, and more practical.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Drawings
The invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like reference numerals are used throughout the figures to indicate like or similar parts. The accompanying drawings, which are incorporated in and form a part of this specification, illustrate preferred embodiments of the present invention and, together with the detailed description, serve to further explain the principles and advantages of the invention. In the drawings:
fig. 1 is a block diagram schematically showing the structure of a mobile terminal 100;
fig. 2 is a block diagram schematically showing an exemplary structure of an accent-based speech quality evaluation apparatus 200 according to an embodiment of the present invention;
fig. 3 is a block diagram schematically illustrating one possible structure of the feature acquisition unit 230 shown in fig. 2;
fig. 4 is a block diagram schematically showing an exemplary structure of an accent-based speech quality evaluation apparatus 400 according to another embodiment of the present invention;
FIG. 5 is a block diagram that schematically illustrates an exemplary architecture of a data processing apparatus 500, in accordance with an embodiment of the present invention;
FIG. 6 is a block diagram schematically illustrating an exemplary structure of a speech processing apparatus 600 according to an embodiment of the present invention;
FIG. 7 is a flow diagram that schematically illustrates an exemplary process of an accent-based speech quality assessment method, in accordance with an embodiment of the present invention;
FIG. 8 is a flow chart schematically illustrating an exemplary process of a data processing method according to an embodiment of the present invention;
FIG. 9 is a flow diagram schematically illustrating an exemplary process of a speech processing method according to an embodiment of the present invention; and
fig. 10 is a flowchart schematically illustrating another exemplary process of a voice processing method according to an embodiment of the present invention.
Skilled artisans appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve the understanding of the embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the device structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
An embodiment of the present invention provides an accent-based speech quality evaluation apparatus, including: a storage unit adapted to store a predetermined text and a reference accent feature corresponding to the predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; a user voice receiving unit adapted to receive user voice entered by a user for the predetermined text; a feature acquisition unit adapted to acquire user accent features of the user voice; and a speech quality calculation unit adapted to calculate the speech quality of the user voice based on a correlation between the reference accent feature and the user accent features.
The above accent-based voice quality evaluation apparatus according to an embodiment of the present invention may be an application executed in a conventional desktop or laptop computer (not shown) or the like, a client application executed in a mobile terminal (such as one of the applications 154 in the mobile terminal 100 shown in fig. 1), or a web application accessed through a browser on such a computer or mobile terminal.
Fig. 1 is a block diagram of a mobile terminal 100. The multi-touch capable mobile terminal 100 may include a memory interface 102, one or more data processors, image processors and/or central processing units 104, and a peripheral interface 106.
The memory interface 102, the one or more processors 104, and/or the peripherals interface 106 can be discrete components or can be integrated in one or more integrated circuits. In the mobile terminal 100, the various elements may be coupled by one or more communication buses or signal lines. Sensors, devices, and subsystems can be coupled to peripheral interface 106 to facilitate a variety of functions. For example, motion sensors 110, light sensors 112, and distance sensors 114 may be coupled to peripheral interface 106 to facilitate directional, lighting, and ranging functions. Other sensors 116 may also be coupled to the peripheral interface 106, such as a positioning system (e.g., a GPS receiver), a temperature sensor, a biometric sensor, or other sensing device, to facilitate related functions.
The camera subsystem 120 and optical sensor 122, which may be, for example, a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) optical sensor, may be used to facilitate implementation of camera functions such as recording photographs and video clips.
Communication functions may be facilitated by one or more wireless communication subsystems 124, which may include radio frequency receivers and transmitters and/or optical (e.g., infrared) receivers and transmitters. The particular design and implementation of the wireless communication subsystem 124 may depend on the one or more communication networks supported by the mobile terminal 100. For example, the mobile terminal 100 may include a communication subsystem 124 designed to support a GSM network, a GPRS network, an EDGE network, a Wi-Fi or WiMax network, and a Bluetooth network.
The audio subsystem 126 may be coupled to a speaker 128 and a microphone 130 to facilitate voice-enabled functions, such as voice recognition, voice replication, digital recording, and telephony functions.
The I/O subsystem 140 may include a touch screen controller 142 and/or one or more other input controllers 144.
The touch screen controller 142 may be coupled to a touch screen 146. For example, the touch screen 146 and touch screen controller 142 may detect contact and movement or pauses made therewith using any of a variety of touch sensing technologies, including but not limited to capacitive, resistive, infrared, and surface acoustic wave technologies.
One or more other input controllers 144 may be coupled to other input/control devices 148 such as one or more buttons, rocker switches, thumbwheels, infrared ports, USB ports, and/or pointing devices such as styluses. The one or more buttons (not shown) may include up/down buttons for controlling the volume of the speaker 128 and/or microphone 130.
The memory interface 102 may be coupled with a memory 150. The memory 150 may include high speed random access memory and/or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and/or flash memory (e.g., NAND, NOR).
The memory 150 may store an operating system 152, such as an operating system like Android, IOS or Windows Phone. The operating system 152 may include instructions for handling basic system services and performing hardware dependent tasks. The memory 150 may also store applications 154. In operation, these applications are loaded from memory 150 onto processor 104 and run on top of an operating system already run by processor 104, and utilize interfaces provided by the operating system and underlying hardware to implement various user-desired functions, such as instant messaging, web browsing, picture management, and the like. The application may be provided independently of the operating system or may be native to the operating system. The application 154 comprises a speech quality assessment device 200 according to the present invention.
Fig. 2 shows an example of an accent-based speech quality assessment apparatus 200 according to an embodiment of the present invention. As shown in fig. 2, the voice quality evaluation apparatus 200 includes a storage unit 210, a user voice receiving unit 220, a feature acquisition unit 230, and a voice quality calculation unit 240.
As described above, the speech quality assessment device 200 is suitable for being executed in a computer or a mobile terminal, wherein the mobile terminal may be a mobile communication device such as a mobile phone (e.g. a smart phone) or a tablet computer.
The storage unit 210 may be, for example, the memory 150 of the mobile terminal, which can store data, information, parameters, and the like. In this embodiment, a predetermined text downloaded in advance from, for example, a predetermined server, together with the reference accent features corresponding to the predetermined text, is stored in the storage unit 210. The predetermined text comprises one or more sentences, and each sentence comprises one or more words; each word in a sentence may typically comprise a plurality of letters or at least one character. The predetermined server referred to here may be, for example, the server on which the data processing apparatus 500 described below in connection with fig. 5 resides. In this mode the amount of local calculation is small and no extra time is needed to compute the reference accent features, which saves time and makes operation simpler and more convenient.
According to one implementation, when the language of the predetermined text is one in which words are composed of letters, such as English, the predetermined text may optionally include, in addition to text content such as one or more sentences and the words of each sentence, information such as the syllables and/or phonemes of each word and the correspondence between that information and the letters constituting the word. Although this example uses English, the language of the predetermined text is not limited to English and may be any language, such as Chinese, French, or German.
Furthermore, according to other implementations, the reference accent features stored in the storage unit 210 may also be obtained by local computation. For example, the predetermined text and the feature parameters of at least one piece of reference voice may be downloaded from a predetermined server in advance, and the reference accent features calculated from those feature parameters and stored in the storage unit 210. In this way, when the representation form of the user accent features changes, the reference accent features calculated from the feature parameters of the reference speech can conveniently be represented in the same form as the user accent features, making the processing of the speech quality evaluation apparatus 200 more flexible, more convenient, and more practical. The process of calculating reference accent features from the feature parameters of reference speech follows the processing described below in conjunction with fig. 5 and is not detailed here.
Here, the reference voice may be a voice previously recorded for the predetermined text by a specific user (for example, a native speaker of the language of the predetermined text, or a professional teacher of that language). The feature parameters may relate to a single piece of reference speech or to multiple segments of reference speech; the reference accent feature over multiple segments of reference speech may be obtained by averaging the accent features of the individual segments.
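By way of illustration only (the patent gives no code; the per-word 0/1 vector encoding used here is the one introduced later in the description, and all names are invented), averaging per-segment accent features might look like this:

```python
import numpy as np

def average_reference_accent_features(segment_features):
    """Average per-word stress vectors from several reference recordings
    of the same sentence into one reference accent feature.

    segment_features: list of equal-length 0/1 vectors, one per recording,
    where 1 means the corresponding word was stressed.
    """
    stacked = np.stack([np.asarray(f, dtype=float) for f in segment_features])
    return stacked.mean(axis=0)  # e.g. 0.33 = stressed in 1 of 3 recordings

# Three teachers read the same 4-word sentence:
refs = [[1, 0, 0, 1], [1, 0, 1, 1], [1, 0, 0, 1]]
print(average_reference_accent_features(refs))  # ~[1., 0., 0.33, 1.]
```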
When the user activates the speech quality evaluation device 200, the predetermined text and the corresponding reference accent features are already stored in the storage unit 210 as described above. Text content corresponding to the voice to be entered (i.e., the predetermined text) is then presented to the user through a display device such as the touch screen 146 of the mobile terminal 100, and the user is prompted to record the corresponding voice. The user can then enter the corresponding voice as user voice through an input device such as the microphone 130 of the mobile terminal 100, and this user voice is received by the user voice receiving unit 220.
Then, the user speech receiving unit 220 forwards the user speech it receives to the feature obtaining unit 230, and the feature obtaining unit 230 obtains the user accent feature of the user speech.
Fig. 3 shows one possible example structure of the feature acquisition unit 230. In this example, the feature acquisition unit 230 may include an alignment subunit 310 and a feature calculation subunit 320.
As shown in fig. 3, the alignment subunit 310 may perform forced alignment of the user speech with the predetermined text using a predetermined acoustic model, to determine the correspondence between each word, and/or each syllable in each word, and/or each phoneme of each syllable in the predetermined text and a part of the user speech. Generally, an acoustic model is trained from recordings of a large number of native speakers; it can be used to calculate the likelihood that an input speech corresponds to known text, and the input speech can thereby be forcibly aligned with that text. Here, the "input voice" may be the user voice or the reference voice mentioned later, and the "known text" may be the predetermined text.
The relevant technology of acoustic models can be found at http://mi.eng.cam.ac.uk/~mjfg/ASRU_talk09.pdf, and the relevant technology of forced alignment at http://www.isip.piconepress.com/projects/speed/software/tools/procedures/fuels/v1.0/section_04/s04_04_p01.html and http://www.phon.ox.ac.uk/jcoleman/BAAP_ASR.pdf; other related techniques may also be used and are not detailed here.
Furthermore, it should be noted that by forced alignment between the user speech and the predetermined text, the correspondence between each sentence in the predetermined text and a part of the user speech (such as a certain speech segment) can be determined; that is, the speech segment corresponding to each sentence of the predetermined text can be located in the user speech.
In addition, as described above, any one or more of the following three correspondences can be obtained as needed by forced alignment: a correspondence between each word in the predetermined text and a part of the user speech (such as a certain speech block); a correspondence between each syllable of each word in the predetermined text and a part of the user speech (such as a certain speech block); and a correspondence between each phoneme of each syllable of each word in the predetermined text and a part of the user speech (such as a certain speech block).
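As an illustration of what these correspondences might look like as a data structure (a hypothetical sketch; the patent prescribes no particular representation), one nested form is:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AlignedUnit:
    """One text unit (word, syllable, or phoneme) mapped onto a span
    of the user speech by forced alignment."""
    text: str          # e.g. "record" or the syllable "re"
    start_ms: int      # start of the corresponding speech block
    end_ms: int        # end of the corresponding speech block
    children: List["AlignedUnit"] = field(default_factory=list)

# One sentence of the predetermined text aligned against the user audio:
sentence = AlignedUnit("I want to record", 0, 1800, children=[
    AlignedUnit("I", 0, 150),
    AlignedUnit("want", 160, 600),
    AlignedUnit("to", 610, 750),
    AlignedUnit("record", 760, 1800,
                children=[AlignedUnit("re", 760, 1100),
                          AlignedUnit("cord", 1110, 1800)]),
])
```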
In this way, the feature calculation subunit 320 may calculate the user accent feature of the user's voice based on the correspondence determined by the alignment subunit 310.
For example, for each sentence in the predetermined text, the feature calculation subunit 320 may obtain, based on the correspondence determined above, the feature parameters of the speech block corresponding in the user speech to each word in the sentence and/or each syllable in each word, and then obtain the stress attribute of each speech block (i.e., whether it is stressed) using a trained predetermined expert model and the obtained feature parameters of each speech block.
According to one implementation, each speech block may comprise a segment of sound wave, and the feature parameters of each speech block may comprise, for example, at least one of the following: the peaks and troughs of the sound wave corresponding to the speech block; the absolute values of those peaks and troughs and the energy of the wave; the duration of the speech block, or its normalized duration; the average of the pitch information (i.e., fundamental frequency information) obtained from the speech block; the average of the differences obtained by differencing the pitch information obtained from the speech block; and a plurality of correlation values obtained by calculating the degree of correlation between the shape of the pitch information obtained from the speech block and a plurality of predefined pitch models.
In one example, the feature parameters of each speech block may include the following: the absolute values of the peaks and troughs of the corresponding sound wave and the energy of the wave; the duration of the speech block or its normalized duration; and the average of the pitch information obtained from the speech block. Performing the subsequent calculation with only these three parameters per speech block keeps the amount of calculation relatively small, and because these three parameters contribute most to the accent features, the accent features calculated from them are more accurate than those calculated from other feature parameters.
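A rough sketch of extracting these three parameters from one speech block follows (illustrative only; the waveform handling and the pitch tracker that produces pitch_track are assumptions, as the patent does not specify the signal processing):

```python
import numpy as np

def block_features(samples, sample_rate, pitch_track):
    """Compute the three feature parameters named above for one speech block.

    samples: 1-D float array holding the block's waveform.
    pitch_track: per-frame F0 estimates (Hz) for the block; how they are
    produced (autocorrelation, YIN, ...) is left open here.
    """
    samples = np.asarray(samples, dtype=float)
    pitch_track = np.asarray(pitch_track, dtype=float)
    peak_abs = float(np.max(np.abs(samples)))     # |peak/trough| amplitude
    energy = float(np.sum(samples ** 2))          # energy of the wave
    duration = len(samples) / sample_rate         # duration in seconds
    voiced = pitch_track[pitch_track > 0]         # ignore unvoiced frames
    mean_pitch = float(voiced.mean()) if voiced.size else 0.0
    return np.array([peak_abs, energy, duration, mean_pitch])
```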
The plurality of correlation values can be obtained, for example, by up/down-sampling each pitch model to obtain a sequence with the same number of points as the input pitch (i.e., the pitch information obtained from the speech block), and then performing a correlation calculation on the two sequences; the technical details of the correlation calculation can be found in the disclosure at http://en.
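As a sketch of this resampling-plus-correlation step (the pitch-contour models below are invented placeholders, not the patent's predefined models):

```python
import numpy as np

# Hypothetical predefined pitch-contour models (shapes, not absolute Hz):
PITCH_MODELS = {
    "rising": np.linspace(0.0, 1.0, 50),
    "falling": np.linspace(1.0, 0.0, 50),
    "rise_fall": np.concatenate([np.linspace(0, 1, 25), np.linspace(1, 0, 25)]),
}

def model_correlations(pitch):
    """Correlate a block's pitch contour against each predefined model,
    after resampling every model to the contour's length."""
    pitch = np.asarray(pitch, dtype=float)
    n = len(pitch)
    values = []
    for model in PITCH_MODELS.values():
        # Up/down-sample the model to n points by linear interpolation.
        resampled = np.interp(np.linspace(0, 1, n),
                              np.linspace(0, 1, len(model)), model)
        # Pearson correlation between the two sequences
        # (assumes neither sequence is constant).
        values.append(float(np.corrcoef(pitch, resampled)[0, 1]))
    return values
```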
In this way, for each word or syllable in each sentence, the feature parameter values of the corresponding speech block in the user speech (for example, a feature vector composed of those values) are used as the feature parameter information of the speech block; this feature parameter information is then input into the trained expert model to obtain a conclusion as to whether the word or syllable is stressed. The expert model itself can be trained according to the prior art and is not described here.
For example, for a word or syllable, if it is determined to be stressed, "1" may be used as its stress attribute value; if it is determined not to be stressed, "0" may be used. In this way, the vector formed by the stress attribute values of the speech blocks corresponding to the words of a sentence in the user speech can be used as the accent feature of the speech segment corresponding to that sentence.
For the whole user voice, the accent features of the speech segments corresponding to the individual sentences together form the accent features of the user voice, i.e., the user accent features.
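As a sketch of this step (the expert model is represented by any trained binary classifier with a scikit-learn-style predict() method, which is an assumption of convenience, not the patent's model):

```python
import numpy as np

def sentence_accent_feature(word_blocks, expert_model):
    """Map each word's speech-block feature vector to a 0/1 stress
    attribute and collect them into the sentence's accent feature.

    word_blocks: list of feature vectors, one per word, e.g. the output
    of block_features() above.
    expert_model: any trained binary classifier with a predict() method.
    """
    X = np.stack(word_blocks)
    stressed = expert_model.predict(X)   # 1 = stressed, 0 = not stressed
    return np.asarray(stressed, dtype=int)

# e.g. a 3-word sentence yields an accent feature like array([1, 0, 0])
```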
Thus, the speech quality calculation unit 240 can calculate the speech quality of the user speech based on the correlation between the pre-stored reference accent features and the calculated user accent features.
According to one implementation, the voice quality calculation unit 240 may calculate the correlation between the user accent features and the reference accent features and obtain from it a score describing the voice quality of the user voice.
In one example, suppose that for a sentence A in the predetermined text, the user accent feature of the corresponding speech segment in the user speech is (1, 0, 0) (that is, the stress attributes of the three words of sentence A are, in order, stressed, unstressed, unstressed), and the reference accent feature of the corresponding speech segment in the reference speech is (0, 0, 1). The similarity between the user accent feature (1, 0, 0) and the reference accent feature (0, 0, 1) may then be calculated and used as a score describing the speech quality of the user voice. That is, the higher the calculated similarity between the user accent features and the reference accent features, the higher the speech quality of the user voice.
In another example, a distance between the user accent feature and the reference accent feature may be calculated based on the correlation between them, and a score describing the speech quality of the user voice obtained from the distance; for example, the reciprocal of the distance may be taken as the score. That is, the greater the calculated distance between the user accent feature and the reference accent feature, the worse the speech quality of the user voice.
It should be noted that those skilled in the art can realize the calculation of the similarity or distance between vectors from common general knowledge and/or public material, so the details are omitted here.
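One concrete realization of the two scoring options above is sketched here; cosine similarity and inverse Euclidean distance are choices of convenience, not mandated by the patent:

```python
import numpy as np

def similarity_score(user_feat, ref_feat):
    """Cosine similarity between accent-feature vectors; for non-negative
    0/1 stress vectors it lies in [0, 1], higher meaning better quality."""
    u, r = np.asarray(user_feat, float), np.asarray(ref_feat, float)
    denom = np.linalg.norm(u) * np.linalg.norm(r)
    # Two all-zero vectors (no stress anywhere) count as a match.
    return float(u @ r / denom) if denom else float(np.array_equal(u, r))

def distance_score(user_feat, ref_feat):
    """Reciprocal of the Euclidean distance; larger distance, lower score."""
    d = np.linalg.norm(np.asarray(user_feat, float) - np.asarray(ref_feat, float))
    return 1.0 / d if d else float("inf")

print(similarity_score((1, 0, 0), (0, 0, 1)))  # 0.0 -> stress misplaced
print(similarity_score((1, 0, 0), (1, 0, 0)))  # 1.0 -> stress matches
```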
In addition, it should be noted that, if the reference accent features stored in the storage unit 210 are not represented in the same form as the form of the user accent features (such as in the form of vectors), they may be represented in the same form first, and then the similarity or distance between the two may be calculated.
In addition, the speech quality calculation unit 240 may calculate the correlation (i.e., the similarity or the distance) between the user accent features and the reference accent features sentence by sentence, and thus obtain, sentence by sentence, a quality score for the speech segment corresponding to each sentence of the predetermined text in the user speech. Optionally, the speech quality calculation unit 240 may also obtain a quality score describing the entire user voice, namely a weighted sum or weighted average of the quality scores of the speech segments corresponding to the individual sentences of the predetermined text. The weight for each speech segment may be determined empirically or experimentally.
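A short sketch of this per-sentence-to-overall aggregation (the weights are illustrative):

```python
def overall_quality(sentence_scores, weights=None):
    """Weighted average of per-sentence quality scores; with no weights
    given, all sentences count equally."""
    if weights is None:
        weights = [1.0] * len(sentence_scores)
    total = sum(w * s for w, s in zip(weights, sentence_scores))
    return total / sum(weights)

# Three sentences, the second weighted more heavily (e.g. a longer sentence):
print(overall_quality([0.9, 0.4, 1.0], weights=[1.0, 2.0, 1.0]))  # 0.675
```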
Another example of an accent-based speech quality assessment apparatus according to an embodiment of the present invention is described below with reference to fig. 4.
In the example shown in fig. 4, the voice quality evaluation apparatus 400 includes an output unit 450 in addition to the storage unit 410, the user voice receiving unit 420, the feature acquisition unit 430, and the voice quality calculation unit 440. The storage unit 410, the user speech receiving unit 420, the feature obtaining unit 430, and the speech quality calculating unit 440 in the speech quality evaluating apparatus 400 shown in fig. 4 may have the same structures and functions as the corresponding units in the speech quality evaluating apparatus 200 described above with reference to fig. 2, and can achieve similar technical effects, which are not described again here.
The output unit 450 may visually output the calculation result of the voice quality, which may be presented to the user through a display device such as the touch screen 146 of the mobile terminal 100, for example.
According to one implementation, the output unit 450 may output a score reflecting the voice quality as a calculation result of the voice quality.
For example, the output unit 450 may visually output (such as sentence-by-sentence output) a score reflecting the voice quality of each voice segment corresponding to each sentence of the predetermined text in the user voice. Therefore, the user can know the speaking stress and/or pronunciation stress accuracy of each sentence spoken by the user, and particularly when the score of a certain sentence is low, the user can immediately realize that the stress of the sentence needs to be corrected, so that the learning is more targeted.
As another example, the output unit 450 may visually output a score reflecting the voice quality of the entire user voice. Thus, the user can integrally sense whether the stress of the speech spoken by the user is accurate or not.
In addition, in other examples, the output unit 450 may also visually output a score reflecting the voice quality of each speech section corresponding to each sentence of the predetermined text in the user voice and a score reflecting the voice quality of the entire user voice at the same time.
According to another implementation, the output unit 450 may visually output a difference between the user accent features and the reference accent features as a result of the calculation of the voice quality.
For example, the output unit 450 may represent the reference speech and the user speech as two parallel lines of text, in which bold type indicates that a word, or a syllable within a word, is stressed. Where the stress positions agree, the text is displayed in an ordinary manner, for example in green; where they differ, the stress is highlighted, for example in red.
Thus, through the display of the output unit 450, the user can conveniently see where and by how much his or her speaking stress and/or pronunciation stress differs from that of the standard voice (i.e., the reference voice), and can therefore correct it more specifically and accurately.
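A minimal console sketch of such a comparison display (ANSI colors and uppercase stand in for the green/red and bold rendering described above; all names are invented):

```python
GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"

def render_comparison(words, user_stress, ref_stress):
    """Render one sentence: stressed words uppercase; green where the
    user's stress matches the reference, red where it differs."""
    out = []
    for word, u, r in zip(words, user_stress, ref_stress):
        shown = word.upper() if u else word   # stand-in for bold type
        color = GREEN if u == r else RED
        out.append(f"{color}{shown}{RESET}")
    return " ".join(out)

words = ["I", "want", "to", "record"]
print(render_comparison(words, user_stress=[1, 0, 0, 0], ref_stress=[0, 0, 0, 1]))
```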
According to other implementation manners, the output unit 450 may also visually output the score reflecting the speech quality and the difference between the user accent feature and the reference accent feature as the calculation result of the speech quality at the same time, and the specific details of the implementation manner may refer to the description about the above two implementation manners, which is not described herein again.
As is apparent from the above description, the accent-based speech quality evaluation apparatus according to the embodiment of the present invention calculates the speech quality of the user voice based on the correlation between the acquired user accent features and the reference accent features. Because the apparatus considers information about voice stress when calculating the speech quality of the user voice, the user can learn from the result how accurate the recorded voice is in terms of stress, which in turn helps the user judge whether his or her speaking stress and/or pronunciation stress needs correction.
Further, the accent-based speech quality evaluation apparatus according to the embodiment of the present invention corresponds to a user client: the calculation and evaluation of the user speech are performed on the client computer or client mobile terminal, whereas in conventional speech technology they are generally performed on the server side. This allows the user to learn offline (once the stored learning material has been downloaded) rather than only online as in the conventional art.
Furthermore, an embodiment of the present invention also provides a data processing apparatus adapted to be executed in a server and including: a server storage unit adapted to store a predetermined text and at least one piece of reference voice corresponding to the predetermined text; and an accent calculation unit adapted to calculate feature parameters of the reference voice from the reference voice and store them in the server storage unit, or to obtain reference accent features of the at least one piece of reference voice from the feature parameters and store them in the server storage unit.
Fig. 5 shows an example of a data processing device 500 according to an embodiment of the invention. As shown in fig. 5, the data processing apparatus 500 includes a server storage unit 510 and an accent calculation unit 520.
The data processing device 500 may be implemented, for example, as an application residing on a server. The server may comprise, for example, a web server that may communicate with a user client (e.g., voice quality assessment device 200 or 400 described above) using http protocol, but is not so limited.
The server storage unit 510 may store text materials, i.e., predetermined texts, of various language learning materials. Here, the server storage unit 510 may store at least one piece of reference voice corresponding to a predetermined text in addition to the predetermined text for each language, or may receive and store at least one piece of reference voice from an external device such as the voice processing device 600 to be described later. It should be understood that the predetermined text mentioned here is similar to the predetermined text mentioned above, and may optionally include information such as syllables and/or phonemes of each word (for example, when the language of the predetermined text is a language such as english in which words are composed of letters) and correspondence between information such as syllables and/or phonemes of each word and letters constituting the word, in addition to the text contents including one or more sentences and one or more words of each sentence.
Then, the accent calculation unit 520 may obtain the feature parameters of at least one piece of reference speech by calculation to store the feature parameters in the server storage unit 510. The process of obtaining the feature parameters of the reference speech may be similar to the process of obtaining the feature parameters of the user speech described above, and will be illustrated below, and a description of part of the same contents is omitted.
According to one implementation, the stress calculation unit 520 may store the obtained feature parameters of at least one piece of reference speech in the server storage unit 510. In such an implementation, in subsequent processing, the data processing device 500 may provide its stored predetermined text and at least one piece of feature parameters of the reference speech to the user client (e.g., the speech quality assessment device 200 or 400 described above).
In addition, according to another implementation manner, the accent calculating unit 520 may also obtain the reference accent features of the at least one piece of reference speech according to the obtained feature parameters of the at least one piece of reference speech, and store the obtained reference accent features in the server storage unit 510. In such an implementation, in subsequent processing, the data processing device 500 may provide its stored predetermined text and the reference accent feature of at least one piece of reference speech to the user client (e.g., the speech quality assessment device 200 or 400 described above).
The accent feature of each piece of reference speech can be obtained by the same processing as described above for obtaining the user accent features, with similar technical effects, and is not described again here.
It should be noted that, the processing performed in the data processing device 500 according to the embodiment of the present invention, which is the same as the processing performed in the accent-based speech quality assessment device 200 or 400 described above with reference to fig. 2 or 4, can obtain similar technical effects, and is not described in detail here.
Furthermore, an embodiment of the present invention also provides a speech processing apparatus adapted to be executed in a computer and including: a reference voice receiving unit adapted to receive voice recorded by a specific user for a predetermined text as reference voice and send the reference voice to a predetermined server. The speech processing device may further include an accent calculation unit adapted to calculate feature parameters of the reference speech from the reference speech and transmit them to a predetermined server in association with a predetermined text, or to obtain reference accent features of the reference speech from the feature parameters and transmit them to the predetermined server in association with the predetermined text.
FIG. 6 shows an example of a speech processing device 600 according to an embodiment of the invention. As shown in fig. 6, the voice processing apparatus 600 includes a reference voice receiving unit 610. Optionally, the speech processing device 600 may further comprise an accent calculation unit 620.
As shown in fig. 6, according to one implementation, when the voice processing apparatus 600 includes only the reference voice receiving unit 610, it is possible to receive, as reference voice, voice entered by a specific user (such as a user who is native to a predetermined text language or a professional language teacher related to the language) for a predetermined text through the reference voice receiving unit 610, and transmit the reference voice to a predetermined server (such as a server where the data processing apparatus 500 described above in connection with fig. 5 resides).
Furthermore, according to another implementation, the speech processing device 600 may further comprise an accent calculating unit 620. The accent calculation unit 620 calculates a feature parameter of the reference speech from the reference speech received by the reference speech reception unit 610 to transmit the feature parameter to a predetermined server in association with a predetermined text, or obtains a reference accent feature of the reference speech from the feature parameter (the process may refer to the above-mentioned related description) to transmit the reference accent feature to the predetermined server in association with the predetermined text.
In practical applications, the speech processing device 600 may correspond to a teacher client provided on a computer or other terminal, for example implemented in software.
The user of the teacher client can record a standard voice for each sentence of the predetermined text, which is sent to the corresponding server as reference voice, and the server performs the subsequent processing. In this case, the server can conveniently collect reference voice over the internet without taking part in the recording itself, saving time and operation.
In addition, the teacher client can also process and analyze the recorded standard voice (i.e., the reference voice) directly and locally, generate the parameters corresponding to the standard voice (such as the reference accent features), and transmit those parameters together with the predetermined text to the server for storage, which can reduce the processing load of the server.
In addition, the embodiment of the invention also provides a mobile terminal which comprises the stress-based voice quality evaluation device. The mobile terminal may have the functions of the above-described accent-based speech quality assessment apparatus 200 or 400, and may achieve similar technical effects, which will not be described in detail herein.
Further, an embodiment of the present invention also provides an accent-based speech quality assessment system including the accent-based speech quality assessment apparatus 200 or 400 as described above and the data processing apparatus 500 as described above.
According to one implementation, the voice quality evaluation system may optionally include the voice processing apparatus 600 described above, in addition to the voice quality evaluation apparatus 200 or 400 and the data processing apparatus 500. In this implementation, the voice quality evaluation apparatus 200 or 400 in the system may correspond to a user client provided on a computer or mobile terminal, the data processing apparatus 500 may correspond to a server, and the voice processing apparatus 600 may correspond to a teacher client. In actual processing, the teacher client provides reference voice (or, optionally, feature parameters of the reference voice or reference accent features) to the server; the server stores this information together with the predetermined text; and the user client downloads the information from the server to analyze the user voice entered by the user and complete the voice quality evaluation. Details of the processing can be found in the descriptions given above in conjunction with fig. 2 or 4, fig. 5, and fig. 6, respectively, and are not repeated here.
In addition, an embodiment of the invention also provides an accent-based speech quality evaluation method, comprising the steps of: receiving user voice entered by a user for a predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; acquiring user accent features of the user voice; and calculating the speech quality of the user voice based on a correlation between the reference accent features corresponding to the predetermined text and the user accent features.
An exemplary process of the above-described accent-based speech quality assessment method is described below with reference to fig. 7. As shown in fig. 7, an exemplary process flow 700 of the accent-based speech quality assessment method according to one embodiment of the present invention starts at step S710 and then performs step S720.
In step S720, a user voice entered by a user for a predetermined text is received, the predetermined text including one or more sentences, each sentence including one or more words. Then, step S730 is performed. The processing in step S720 may be the same as the processing of the user speech receiving unit 220 described above with reference to fig. 2, and similar technical effects can be achieved, which is not described herein again.
According to one implementation, the predetermined text and reference accent features are previously downloaded from a predetermined server.
According to another implementation, the predetermined text is downloaded from a predetermined server in advance, and the reference accent feature is calculated according to the feature parameters of at least one reference voice downloaded from the predetermined server in advance.
In step S730, a user accent feature of the user' S voice is acquired. Then, step S740 is performed. The processing in step S730 may be the same as the processing of the feature obtaining unit 230 described above with reference to fig. 2, and similar technical effects can be achieved, which is not described herein again.
According to one implementation, in step S730, the user speech may be forcibly aligned with the predetermined text using, for example, a predetermined acoustic model, to determine the correspondence between each word and/or each syllable in each word and/or each phoneme of each syllable in the predetermined text and a part of the user speech, and the user accent features of the user speech may then be obtained based on this correspondence.
The step of obtaining the user accent features of the user voice based on the correspondence may be implemented, for example, as follows. For each sentence of the predetermined text: acquire, based on the correspondence, the feature parameters of the speech block corresponding to each word and/or each syllable in each word in the user speech; and obtain the stress attribute of each speech block using the trained predetermined expert model and the feature parameters of each speech block. Then, the user accent features are formed from the obtained stress attributes of the speech blocks corresponding to the words of each sentence and/or the syllables of each word.
According to one implementation, each speech block includes a segment of sound wave, and the feature parameters of the speech block include at least one of the following: the peaks and troughs of the sound wave corresponding to the speech block; the absolute values of those peaks and troughs and the energy of the wave; the duration of the speech block or its normalized duration; the average of the pitch information obtained from the speech block; the average of the differences obtained by differencing the pitch information obtained from the speech block; and a plurality of correlation values obtained by calculating the degree of correlation between the shape of the pitch information obtained from the speech block and a plurality of predefined pitch models.
In step S740, the speech quality of the user speech is calculated based on the correlation between the reference accent features corresponding to the predetermined text and the user accent features. The processing in step S740 may be the same as that of the speech quality calculation unit 240 described above with reference to fig. 2, with similar technical effects, and is not described again here. Then, the process flow 700 ends at step S750.
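Tying steps S720 to S740 together, a schematic driver loop might look as follows; every helper is one of the sketches given earlier, the aligner is an assumed black box, and the whole flow is an illustration rather than the patent's normative procedure:

```python
def evaluate_user_speech(audio, sentences, ref_features, expert_model, aligner):
    """S720-S740 in miniature: align, extract per-word feature parameters,
    classify stress, and score each sentence against its reference feature.

    aligner(audio, sentence) is assumed to return, for each word, a
    (samples, sample_rate, pitch_track) tuple for its speech block.
    """
    sentence_scores = []
    for sentence, ref_feat in zip(sentences, ref_features):
        blocks = aligner(audio, sentence)                  # forced alignment
        feats = [block_features(s, sr, f0) for s, sr, f0 in blocks]
        user_feat = sentence_accent_feature(feats, expert_model)
        sentence_scores.append(similarity_score(user_feat, ref_feat))
    return overall_quality(sentence_scores)                # whole-voice score
```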
Furthermore, according to another implementation manner, after the step S740, the following steps may be further optionally included: and visually outputting the calculation result of the voice quality.
Wherein, the calculation result of the voice quality may include: a score reflecting speech quality; and/or differences between the user accent features and the reference accent features.
As is apparent from the above description, the above-described accent-based speech quality assessment method according to an embodiment of the present invention calculates the speech quality of a user's speech based on the correlation between the acquired user accent features of the user's speech and the reference accent features. Because the method considers the information about the voice stress in the process of calculating the voice quality of the voice of the user, the user can know the accuracy of the recorded voice in the stress aspect according to the calculation result, and the method is further beneficial for the user to judge whether the speaking stress and/or pronunciation stress of the user needs to be corrected.
Further, the accent-based speech quality evaluation method according to this embodiment of the present invention runs on the user client: the calculation and evaluation of the user speech are performed on a client computer or a client mobile terminal. Conventional speech technology typically performs such processing on the server side, whereas the speech quality evaluation method of the present invention allows a user to learn offline (once the stored learning material has been downloaded) instead of requiring online learning as in the conventional art.
In addition, an embodiment of the present invention further provides a data processing method, adapted to be executed in a server and comprising the following steps: storing a predetermined text and at least one piece of reference speech corresponding to the predetermined text; and calculating the feature parameters of the reference speech for storage, or calculating, from the feature parameters, the reference accent feature of the at least one piece of reference speech for storage.
An exemplary process of the above data processing method is described below with reference to fig. 8. As shown in fig. 8, an exemplary process flow 800 of the data processing method according to one embodiment of the present invention starts at step S810 and then proceeds to step S820.
In step S820, a predetermined text and at least one piece of reference speech corresponding to the predetermined text are stored; alternatively, the predetermined text is stored and the at least one piece of reference speech is received from the outside and stored. Then step S830 is performed. The processing in step S820 may be the same as that of the server storage unit 510 described above with reference to fig. 5 and achieves similar technical effects, so it is not repeated here.
In step S830, the feature parameters of the at least one piece of reference speech are calculated for storage, or the reference accent feature of the at least one piece of reference speech is calculated from the feature parameters for storage. The processing in step S830 may be the same as that of the obtaining unit 520 described above with reference to fig. 5 and achieves similar technical effects, so it is not repeated here. The process flow 800 then ends in step S840.
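A compact sketch of this server-side flow is given below, with an in-memory dict standing in for real storage; `compute_features` and `compute_accents` are placeholders for the calculations described above.

```python
store = {}  # in-memory stand-in for the server storage unit

def ingest(text_id, text, reference_wave, compute_features,
           compute_accents=None):
    """Store the predetermined text and reference speech, then precompute
    the feature parameters and, optionally, the reference accent feature."""
    entry = {"text": text, "reference": reference_wave}   # step S820
    entry["feature_params"] = compute_features(reference_wave)  # step S830
    if compute_accents is not None:
        entry["reference_accents"] = compute_accents(entry["feature_params"])
    store[text_id] = entry  # saved, keyed by the predetermined text
```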
In addition, an embodiment of the present invention further provides a speech processing method, adapted to be executed in a computer and comprising the following steps: receiving, as a reference speech, speech recorded by a specific user for a predetermined text, and sending the reference speech to a predetermined server; or calculating the feature parameters of the reference speech from the reference speech and transmitting them to the predetermined server in association with the predetermined text; or obtaining the reference accent feature of the reference speech from the feature parameters and transmitting it to the predetermined server in association with the predetermined text.
An exemplary process of the above-described speech processing method is described below in conjunction with fig. 9. As shown in fig. 9, an exemplary process flow 900 of the speech processing method according to one embodiment of the present invention starts at step S910 and then proceeds to step S920.
In step S920, speech recorded by a specific user for a predetermined text is received as the reference speech. Then step S930 is performed.
In step S930, the reference voice is transmitted to a predetermined server. The process flow 900 then ends in step S940.
The processing of the processing flow 900 may be the same as the processing of the reference speech receiving unit 610 described above with reference to fig. 6, and similar technical effects can be achieved, which is not described herein again.
Further, fig. 10 shows another exemplary process of the above-described voice processing method. As shown in fig. 10, an exemplary process flow 1000 of a speech processing method according to one embodiment of the present invention begins at step S1010 and then proceeds to step S1020.
In step S1020, speech recorded by a specific user for a predetermined text is received as the reference speech. Then step S1030 is performed.
According to one implementation, the feature parameters of the reference speech may be obtained in step S1030 to be transmitted to a predetermined server in association with a predetermined text. The process flow 1000 then ends in step S1040.
According to another implementation, the reference accent feature of the reference speech may be obtained according to the feature parameter to transmit the reference accent feature to a predetermined server in association with a predetermined text in step S1030. The process flow 1000 then ends in step S1040.
The processing of the processing flow 1000 may be the same as the processing of the receiving and obtaining unit 620 described above with reference to fig. 6, and similar technical effects can be achieved, which is not described herein again.
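Both client-side variants can be sketched as follows; the HTTP endpoint, payload shape, and field names are assumptions made for illustration and are not part of the disclosure.

```python
import json
import urllib.request

SERVER_URL = "http://example.com/reference"  # hypothetical endpoint

def upload(text_id, payload):
    """Send data to the predetermined server, associated with the text."""
    body = json.dumps({"text_id": text_id, **payload}).encode("utf-8")
    req = urllib.request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# Flow 900: send the raw reference speech.
#   upload("lesson-1", {"reference_wave": samples})
# Flow 1000: send derived data instead.
#   upload("lesson-1", {"feature_params": params})
#   upload("lesson-1", {"reference_accents": accents})
```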
A11: in the speech quality evaluation method according to the present invention, the step of obtaining the user accent feature of the user speech includes: and forcibly aligning the user voice with the predetermined text by using a predetermined acoustic model to determine the corresponding relation between each word and/or each syllable in each word and/or each phoneme of each syllable in the predetermined text and the part of the user voice, and obtaining the user stress characteristics of the user voice based on the corresponding relation. A12: the speech quality evaluation method according to a11, wherein the step of obtaining the user accent features of the user speech based on the correspondence includes: for each sentence of the predetermined text: acquiring the characteristic parameters of the voice block corresponding to each word and/or each syllable in each word in the voice of the user based on the corresponding relation, and acquiring the re-reading attribute of each voice block by using a trained preset expert model and the characteristic parameters of each voice block; and forming accent characteristics of the user voice based on the obtained accent attributes of the words of the sentences and/or the voice blocks corresponding to the syllables in the words. A13: in the speech quality evaluation method according to a12, each speech block includes a segment of sound wave, and the characteristic parameters of the speech block include at least one of the following parameters: the voice block corresponds to the wave crest and the wave trough of the sound wave; the voice block corresponds to the absolute values of the wave crest and the wave trough of the sound wave and the energy value of the wave; the duration of the speech block or the normalized duration of the speech block; an average value of pitch information obtained from the speech block; an average value of difference values obtained by differentiating pitch information obtained from the speech block; and a plurality of correlation values obtained by calculating a degree of correlation between a shape of pitch information obtained from the speech block and a plurality of pitch models which are predefined. A14: the voice quality evaluation method according to the present invention further includes: and visually outputting the calculation result of the voice quality. A15: the speech quality evaluation method according to a14, wherein the result of the speech quality calculation includes: a score reflecting the speech quality; and/or a difference between the user accent feature and the reference accent feature. A16: in the speech quality evaluation method according to the present invention, the predetermined text and the reference accent feature are previously downloaded from a predetermined server; or the predetermined text is downloaded from a predetermined server in advance, and the reference accent features are calculated according to the feature parameters of at least one piece of reference voice downloaded from the predetermined server in advance. 
A17: a data processing method, the method being adapted to be executed in a server and comprising the steps of: storing a predetermined text; storing at least one piece of reference voice corresponding to the predetermined text; and calculating the characteristic parameters of the at least one section of reference voice for storage, and/or calculating the reference accent characteristics of the at least one section of reference voice according to the characteristic parameters for storage. A18: a method of speech processing, the method being adapted to be executed in a computer and comprising the steps of: receiving voice recorded by a specific user aiming at a preset text as reference voice; and calculating a characteristic parameter of the reference voice according to the reference voice to transmit the characteristic parameter to a predetermined server in association with the predetermined text, and/or calculating a reference accent characteristic of the reference voice according to the characteristic parameter to transmit the reference accent characteristic to the predetermined server in association with the predetermined text. A19: a mobile terminal includes an accent-based speech quality assessment apparatus according to the present invention. A20 an accent-based speech quality assessment system comprising an accent-based speech quality assessment apparatus according to the present invention and a data processing apparatus. A21 is an accent-based speech quality evaluation system, comprising an accent-based speech quality evaluation apparatus according to the present invention; a server; and a speech processing device according to the invention.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some, but not other, features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (12)

1. A data processing apparatus, the apparatus being adapted to execute in a server and comprising:
a server storage unit adapted to store a predetermined text and at least one piece of reference voice corresponding to the predetermined text, the reference voice being a voice recorded by a specific user in advance for the predetermined text; and
a stress calculation unit adapted to calculate the feature parameters of the reference speech from the at least one piece of reference speech, and to calculate, from the feature parameters, the reference accent feature of the at least one piece of reference speech for storage in the server storage unit, so that a user client can calculate the speech quality of user speech based on the correlation between the reference accent feature and the user accent feature of the user speech.
2. The data processing device of claim 1, wherein the stress calculation unit is adapted to:
forcibly aligning the reference speech with the predetermined text using a predetermined acoustic model to determine a correspondence between each word and/or each syllable in each word and/or each phoneme of each syllable in the predetermined text and a portion of the reference speech; and
calculating the reference accent feature of the reference speech based on the correspondence.
3. The data processing device of claim 2, wherein the stress calculation unit is adapted to:
for each sentence of the predetermined text:
acquiring, based on the correspondence, the feature parameters of the speech block corresponding to each word and/or each syllable in each word in the reference speech; and
obtaining the stress attribute of each speech block using a trained predetermined expert model and the feature parameters of each speech block; and
forming the reference accent feature of the reference speech based on the obtained stress attributes of the speech blocks corresponding to the words of each sentence and/or the syllables in each word.
4. The data processing apparatus according to claim 3, wherein each speech block comprises a segment of a sound wave, and the feature parameters of the speech block comprise at least one of:
the peak and the trough of the sound-wave segment corresponding to the speech block;
the absolute values of the peak and the trough of the sound-wave segment corresponding to the speech block, and the energy value of the wave;
the duration of the speech block or the normalized duration of the speech block;
an average value of the pitch information obtained from the speech block;
an average value of the differences obtained by differencing the pitch information obtained from the speech block; and
a plurality of correlation values obtained by calculating the degree of correlation between the shape of the pitch information obtained from the speech block and a plurality of predefined pitch models.
5. A speech processing apparatus, the apparatus being adapted to be executed in a computer and comprising:
a reference voice receiving unit adapted to receive a voice, which is entered by a specific user for a predetermined text, as a reference voice; and
a stress calculation unit adapted to calculate the feature parameters of the reference speech from the reference speech and send them to a predetermined server in association with the predetermined text, or to obtain the reference accent feature of the reference speech from the feature parameters and send it to the predetermined server in association with the predetermined text, so that a user client either downloads the predetermined text and the feature parameters of the reference speech from the predetermined server and calculates, from the feature parameters, the reference accent feature corresponding to the predetermined text, or downloads the predetermined text and the corresponding reference accent feature from the predetermined server, and then calculates the speech quality of user speech based on the correlation between the reference accent feature and the user accent feature of the user speech.
6. The speech processing device of claim 5, wherein the stress calculation unit is adapted to:
forcibly aligning the reference speech with the predetermined text using a predetermined acoustic model to determine a correspondence between each word and/or each syllable in each word and/or each phoneme of each syllable in the predetermined text and a portion of the reference speech; and
calculating the reference accent feature of the reference speech based on the correspondence.
7. The speech processing device according to claim 6, wherein the stress calculation unit is adapted to:
for each sentence of the predetermined text:
acquiring, based on the correspondence, the feature parameters of the speech block corresponding to each word and/or each syllable in each word in the reference speech; and
obtaining the stress attribute of each speech block using a trained predetermined expert model and the feature parameters of each speech block; and
forming the reference accent feature of the reference speech based on the obtained stress attributes of the speech blocks corresponding to the words of each sentence and/or the syllables in each word.
8. The speech processing apparatus according to claim 7, wherein each speech block includes a segment of a sound wave, and the feature parameters of the speech block include at least one of:
the peak and the trough of the sound-wave segment corresponding to the speech block;
the absolute values of the peak and the trough of the sound-wave segment corresponding to the speech block, and the energy value of the wave;
the duration of the speech block or the normalized duration of the speech block;
an average value of the pitch information obtained from the speech block;
an average value of the differences obtained by differencing the pitch information obtained from the speech block; and
a plurality of correlation values obtained by calculating the degree of correlation between the shape of the pitch information obtained from the speech block and a plurality of predefined pitch models.
9. A data processing method, the method being adapted to be executed in a server and comprising the steps of:
storing a predetermined text;
storing at least one piece of reference voice corresponding to the predetermined text, the reference voice being a voice recorded by a specific user in advance for the predetermined text; and
calculating the feature parameters of the at least one piece of reference speech for storage, and/or calculating, from the feature parameters, the reference accent feature of the at least one piece of reference speech for storage, so that a user client can calculate the speech quality of user speech based on the correlation between the reference accent feature and the user accent feature of the user speech.
10. A method of speech processing, the method being adapted to be executed in a computer and comprising the steps of:
receiving, as a reference speech, speech recorded by a specific user for a predetermined text; and
calculating the feature parameters of the reference speech from the reference speech and transmitting them to a predetermined server in association with the predetermined text, and/or calculating, from the feature parameters, the reference accent feature of the reference speech and transmitting it to the predetermined server in association with the predetermined text, so that a user client either downloads the predetermined text and the feature parameters of the reference speech from the predetermined server and calculates, from the feature parameters, the reference accent feature corresponding to the predetermined text, or downloads the predetermined text and the corresponding reference accent feature from the predetermined server, and then calculates the speech quality of user speech based on the correlation between the reference accent feature and the user accent feature of the user speech.
11. An accent-based speech quality assessment system, comprising an accent-based speech quality assessment apparatus and the data processing apparatus according to claim 1; wherein
The accent-based speech quality assessment apparatus includes:
a storage unit adapted to store a predetermined text and a reference accent feature corresponding to the predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words;
a user speech receiving unit adapted to receive user speech recorded by a user for the predetermined text;
a feature acquisition unit adapted to acquire the user accent feature of the user speech; and
a speech quality calculation unit adapted to calculate a speech quality of the user speech based on a correlation between the reference accent feature and the user accent feature.
12. An accent-based speech quality assessment system, comprising: an accent-based speech quality assessment apparatus;
a server; and
the speech processing device of claim 5; wherein
The accent-based speech quality assessment apparatus includes:
a storage unit adapted to store a predetermined text and a reference accent feature corresponding to the predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words;
a user speech receiving unit adapted to receive user speech recorded by a user for the predetermined text;
a feature acquisition unit adapted to acquire the user accent feature of the user speech; and
a speech quality calculation unit adapted to calculate a speech quality of the user speech based on a correlation between the reference accent feature and the user accent feature.
CN201910290416.4A 2014-12-04 2014-12-04 Voice quality evaluation device, method and system Active CN109872727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910290416.4A CN109872727B (en) 2014-12-04 2014-12-04 Voice quality evaluation device, method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410736334.5A CN104485116B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system
CN201910290416.4A CN109872727B (en) 2014-12-04 2014-12-04 Voice quality evaluation device, method and system

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201410736334.5A Division CN104485116B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Publications (2)

Publication Number Publication Date
CN109872727A CN109872727A (en) 2019-06-11
CN109872727B true CN109872727B (en) 2021-06-08

Family

ID=52759655

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910290416.4A Active CN109872727B (en) 2014-12-04 2014-12-04 Voice quality evaluation device, method and system
CN201410736334.5A Active CN104485116B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201410736334.5A Active CN104485116B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Country Status (1)

Country Link
CN (2) CN109872727B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105261362B (en) * 2015-09-07 2019-07-05 科大讯飞股份有限公司 A kind of call voice monitoring method and system
CN106611603A (en) * 2015-10-26 2017-05-03 腾讯科技(深圳)有限公司 Audio processing method and audio processing device
CN106971743B (en) * 2016-01-14 2020-07-24 广州酷狗计算机科技有限公司 User singing data processing method and device
CN106847308A (en) * 2017-02-08 2017-06-13 西安医学院 A kind of pronunciation of English QA system
CN110085261B (en) * 2019-05-16 2021-08-24 上海流利说信息技术有限公司 Pronunciation correction method, device, equipment and computer readable storage medium
CN111951827B (en) * 2019-05-16 2022-12-06 上海流利说信息技术有限公司 Continuous reading identification correction method, device, equipment and readable storage medium
CN110136748A (en) * 2019-05-16 2019-08-16 上海流利说信息技术有限公司 A kind of rhythm identification bearing calibration, device, equipment and storage medium
CN110085260A (en) * 2019-05-16 2019-08-02 上海流利说信息技术有限公司 A kind of single syllable stress identification bearing calibration, device, equipment and medium
CN112309429A (en) * 2019-07-30 2021-02-02 上海流利说信息技术有限公司 Method, device and equipment for explosion loss detection and computer readable storage medium
CN112309371A (en) * 2019-07-30 2021-02-02 上海流利说信息技术有限公司 Intonation detection method, apparatus, device and computer readable storage medium
CN111508525B (en) * 2020-03-12 2023-05-23 上海交通大学 Full-reference audio quality evaluation method and device
CN111583961A (en) * 2020-05-07 2020-08-25 北京一起教育信息咨询有限责任公司 Stress evaluation method and device and electronic equipment
CN112086094B (en) * 2020-08-21 2023-03-14 广东小天才科技有限公司 Method for correcting pronunciation, terminal equipment and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3154725B2 (en) * 1995-01-19 2001-04-09 ジーメンス アクティエンゲゼルシャフト Method for transmitting audio information
CN101251956A (en) * 2008-03-24 2008-08-27 合肥讯飞数码科技有限公司 Interactive teaching device and teaching method
CN101551952A (en) * 2009-05-21 2009-10-07 无敌科技(西安)有限公司 Device and method for evaluating pronunciation
CN101630448A (en) * 2008-07-15 2010-01-20 上海启态网络科技有限公司 Language learning client and system
CN101751919A (en) * 2008-12-03 2010-06-23 中国科学院自动化研究所 Spoken Chinese stress automatic detection method
CN102237081A (en) * 2010-04-30 2011-11-09 国际商业机器公司 Method and system for estimating rhythm of voice
KR101188982B1 (en) * 2011-07-20 2012-10-08 포항공과대학교 산학협력단 Stress studying system and method for studying foreign language

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5487671A (en) * 1993-01-21 1996-01-30 Dsp Solutions (International) Computerized system for teaching speech
JP2000019941A (en) * 1998-06-30 2000-01-21 Oki Hokuriku System Kaihatsu:Kk Pronunciation learning apparatus
US20080147404A1 (en) * 2000-05-15 2008-06-19 Nusuara Technologies Sdn Bhd System and methods for accent classification and adaptation
US7571101B2 (en) * 2006-05-25 2009-08-04 Charles Humble Quantifying psychological stress levels using voice patterns
CN101727903B (en) * 2008-10-29 2011-10-19 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
JP5528850B2 (en) * 2010-02-18 2014-06-25 Kddi株式会社 Mobile terminal device, stress estimation system, operation method, stress estimation program
CN101996635B (en) * 2010-08-30 2012-02-08 清华大学 Evaluation method of English pronunciation quality based on stress prominence
CN102436807A (en) * 2011-09-14 2012-05-02 苏州思必驰信息科技有限公司 Method and system for automatically generating voice with stressed syllables
CN102800314B (en) * 2012-07-17 2014-03-19 广东外语外贸大学 English sentence recognizing and evaluating system with feedback guidance and method
US9478146B2 (en) * 2013-03-04 2016-10-25 Xerox Corporation Method and system for capturing reading assessment data
JP6263868B2 (en) * 2013-06-17 2018-01-24 富士通株式会社 Audio processing apparatus, audio processing method, and audio processing program
CN103544311A (en) * 2013-11-04 2014-01-29 北京中搜网络技术股份有限公司 News client evaluation system and method based on mobile phone

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3154725B2 (en) * 1995-01-19 2001-04-09 ジーメンス アクティエンゲゼルシャフト Method for transmitting audio information
CN101251956A (en) * 2008-03-24 2008-08-27 合肥讯飞数码科技有限公司 Interactive teaching device and teaching method
CN101630448A (en) * 2008-07-15 2010-01-20 上海启态网络科技有限公司 Language learning client and system
CN101751919A (en) * 2008-12-03 2010-06-23 中国科学院自动化研究所 Spoken Chinese stress automatic detection method
CN101551952A (en) * 2009-05-21 2009-10-07 无敌科技(西安)有限公司 Device and method for evaluating pronunciation
CN102237081A (en) * 2010-04-30 2011-11-09 国际商业机器公司 Method and system for estimating rhythm of voice
KR101188982B1 (en) * 2011-07-20 2012-10-08 포항공과대학교 산학협력단 Stress studying system and method for studying foreign language

Also Published As

Publication number Publication date
CN104485116A (en) 2015-04-01
CN104485116B (en) 2019-05-14
CN109872727A (en) 2019-06-11

Similar Documents

Publication Publication Date Title
CN109872727B (en) Voice quality evaluation device, method and system
CN104485115B (en) Pronounce valuator device, method and system
US9881615B2 (en) Speech recognition apparatus and method
US9548048B1 (en) On-the-fly speech learning and computer model generation using audio-visual synchronization
CN104361896B (en) Voice quality assessment equipment, method and system
US11217245B2 (en) Customizable keyword spotting system with keyword adaptation
JP2019102063A (en) Method and apparatus for controlling page
CN104505103B (en) Voice quality assessment equipment, method and system
JP6172417B1 (en) Language learning system and language learning program
US20160118039A1 (en) Sound sample verification for generating sound detection model
US20160372110A1 (en) Adapting voice input processing based on voice input characteristics
CN112840396A (en) Electronic device for processing user words and control method thereof
CN106796788A (en) Automatic speech recognition is improved based on user feedback
US10741174B2 (en) Automatic language identification for speech
CN104361895B (en) Voice quality assessment equipment, method and system
US11210964B2 (en) Learning tool and method
KR20220037819A (en) Artificial intelligence apparatus and method for recognizing plurality of wake-up word
CN109326284A (en) The method, apparatus and storage medium of phonetic search
CN112908308B (en) Audio processing method, device, equipment and medium
KR20130052800A (en) Apparatus for providing speech recognition service, and speech recognition method for improving pronunciation error detection thereof
US20140156256A1 (en) Interface device for processing voice of user and method thereof
KR102622350B1 (en) Electronic apparatus and control method thereof
KR20130137367A (en) System and method for providing book-related service based on image
KR20200056754A (en) Apparatus and method for generating personalization lip reading model
JP7495220B2 (en) Voice recognition device, voice recognition method, and voice recognition program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant