CN114170997B - Pronunciation skill detection method, device, storage medium and electronic device - Google Patents

Info

Publication number
CN114170997B
Authority
CN
China
Prior art keywords
matrix
feature
phoneme
pronunciation
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111620731.2A
Other languages
Chinese (zh)
Other versions
CN114170997A (en)
Inventor
李芳足
吴奎
金海�
李�浩
盛志超
竺博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111620731.2A
Publication of CN114170997A
Application granted
Publication of CN114170997B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract


A pronunciation skill detection method, device, storage medium, and electronic device. The method includes obtaining a text to be detected and converting it into a corresponding phoneme sequence; obtaining the audio produced by a speaker speaking the text to be detected and extracting its acoustic features; and inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result. The first detection result characterizes whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result characterizes whether the speaker actually used the pronunciation skill when speaking the text. The present application can improve the accuracy of pronunciation skill detection.

Description

Pronunciation skill detection method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of voice recognition, in particular to a pronunciation skill detection method and device, a storage medium and electronic equipment.
Background
Currently, for any language, whether Chinese or English, spoken language is a major emphasis in mastering the language. For example, for English learners, spoken pronunciation is often both the key point and the weak point of the learning process, and whether pronunciation skills such as continuous reading (liaison), loss of plosion, and voicing are applied accurately reflects the learner's spoken-language ability. In the related art, manual listening tests are generally used to assess a speaker's pronunciation ability; however, factors such as subjective judgment and auditory fatigue affect the accuracy of the pronunciation skill test results.
Disclosure of Invention
The application provides a pronunciation skill detection method, a pronunciation skill detection device, a storage medium and electronic equipment, which can improve the accuracy of pronunciation skill detection.
The application provides a pronunciation skill detection method, which comprises the following steps:
Obtaining a text to be detected, and converting the text to be detected into a corresponding phoneme sequence;
Acquiring audio to be detected, which is obtained by a speaker speaking a text to be detected, and extracting acoustic characteristics of the audio to be detected;
inputting the phoneme sequence and the acoustic characteristics into a trained pronunciation skill detection model to carry out pronunciation skill detection processing, so as to obtain a first detection result and a second detection result;
The first detection result is used to characterize whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result is used to characterize whether the speaker actually used the pronunciation skill when speaking the text to be detected.
The application provides a pronunciation skill detecting device, comprising:
the first acquisition module is used for acquiring a text to be detected and converting the text to be detected into a corresponding phoneme sequence;
The second acquisition module is used for acquiring the audio to be detected, which is obtained by speaking the text to be detected, of the speaker and extracting the acoustic characteristics of the audio to be detected;
The detection module is used for inputting the phoneme sequence and the acoustic characteristics into the trained pronunciation skill detection model to carry out pronunciation skill detection processing, so as to obtain a first detection result and a second detection result;
The first detection result is used to characterize whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result is used to characterize whether the speaker actually used the pronunciation skill when speaking the text to be detected.
The present application provides a storage medium having stored thereon a computer program which, when loaded by a processor, performs the steps in the pronunciation skill detection method as provided by the present application.
The electronic equipment provided by the application comprises a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the steps in the pronunciation skill detection method provided by the application by loading the computer program.
According to the application, a text to be detected is obtained, and a speaker speaks the text to produce the audio to be detected; the audio and the text are then used to detect the speaker's pronunciation. Specifically, the text to be detected is converted into a corresponding phoneme sequence, the acoustic features of the audio to be detected are extracted, and the phoneme sequence and acoustic features are input into a trained pronunciation skill detection model for pronunciation skill detection processing, yielding a first detection result and a second detection result. The first detection result characterizes whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result characterizes whether the speaker actually used the pronunciation skill. Compared with the related art, replacing the traditional manual listening test with this artificial-intelligence-based detection avoids subjective judgment and auditory fatigue, thereby improving the accuracy of pronunciation skill detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a pronunciation skill detection system according to an embodiment of the present application.
Fig. 2 is a flow chart of a pronunciation skill detection method according to an embodiment of the present application.
Fig. 3 is an exemplary diagram of extracting acoustic features in an embodiment of the present application.
Fig. 4 is a block diagram of a pronunciation skill detection model according to an embodiment of the present application.
Fig. 5 is a block diagram of a structure of a phoneme feature extraction network in a pronunciation skill detection model.
Fig. 6 is a block diagram of the structure of a phoneme feature extraction sub-module inside a phoneme feature module in a phoneme feature extraction network.
Fig. 7 is another block diagram of a phoneme feature extraction network in a pronunciation skill detection model.
FIG. 8 is a block diagram of the feature encoding module within the acoustic feature enhancement network in the pronunciation skill detection model.
FIG. 9 is a detailed block diagram of an acoustic feature enhancement module within an acoustic feature enhancement network in a pronunciation skill detection model.
Fig. 10 is a block diagram of a feature fusion network in a pronunciation skill detection model.
Fig. 11 is a block diagram of the structure of the first voicing skill detection network in the voicing skill detection model.
Fig. 12 is a block diagram of a branch detection network within a second pronunciation skill detection network in the pronunciation skill detection model.
Fig. 13 is a block diagram of a pronunciation skill detection apparatus according to an embodiment of the present application.
Fig. 14 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
It should be noted that the principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on illustrative embodiments of the application and should not be taken as limiting other embodiments of the application not described in detail herein. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
Relational terms such as first and second, and the like may be used solely to distinguish one object or operation from another object or operation without necessarily limiting the actual sequential relationship between the objects or operations. In the description of the embodiments of the present application, the meaning of "plurality" is two or more, unless explicitly defined otherwise.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in ways similar to human intelligence. Artificial intelligence research covers the design principles and implementation methods of various intelligent machines, giving machines the abilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, spanning both hardware-level and software-level techniques. Its infrastructure technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software techniques mainly include machine learning (ML), within which deep learning (DL) is a research direction introduced to bring machine learning closer to its original goal, artificial intelligence. At present, deep learning is mainly applied in fields such as computer vision and natural language processing.
Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during such learning greatly aids the interpretation of data such as text, images, and sound. Using deep learning techniques and corresponding training data sets, network models realizing different functions can be trained; for example, a deep learning network for gender classification can be trained on one training data set, a deep learning network for image optimization on another, and so on.
In order to improve the efficiency of pronunciation skill detection, the application introduces deep learning into pronunciation skill detection, and correspondingly provides a pronunciation skill detection method, a pronunciation skill detection device, a storage medium and an electronic device. Wherein the pronunciation skill detection method may be performed by an electronic device.
Referring to fig. 1, the present application further provides a pronunciation skill detection system. As shown in fig. 1, the system includes an electronic device 100. For example, the electronic device may obtain a text to be detected and convert it into a corresponding phoneme sequence. When the electronic device is further equipped with a microphone, it may collect audio while the speaker speaks the text to be detected, thereby obtaining the audio to be detected, and extract its acoustic features. The obtained phoneme sequence and acoustic features are then input into a trained pronunciation skill detection model for pronunciation skill detection processing, yielding a first detection result indicating whether the text to be detected needs to be spoken using a pronunciation skill and a second detection result indicating whether the speaker actually spoke the text using the pronunciation skill.
The electronic device 100 may be any device equipped with a processor and having processing capabilities, such as a mobile electronic device with a processor, e.g., a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a stationary electronic device with a processor, e.g., a desktop computer, a television, a server, etc.
In addition, as shown in fig. 1, the pronunciation skill detection system may further include a storage device 200 for storing data obtained during the pronunciation skill detection process, including but not limited to raw data, intermediate data, and result data. For example, the electronic device 100 may store in the storage device 200 the acquired text to be detected, the audio to be detected, the phoneme sequence converted from the text, the acoustic features extracted from the audio, and the first and second detection results output by the pronunciation skill detection model.
It should be noted that the schematic diagram of the pronunciation skill detection system shown in fig. 1 is only an example. The system and scenario described in the embodiments of the present application are intended to describe the technical solutions more clearly and do not limit them; those skilled in the art will appreciate that, as pronunciation skill detection systems evolve and new service scenarios appear, the technical solutions provided by the embodiments of the present application remain equally applicable to similar technical problems.
Referring to fig. 2, fig. 2 is a flowchart of a pronunciation skill detection method according to an embodiment of the present application. As shown in fig. 2, the flow of the pronunciation skill detection method provided by the embodiment of the present application may be as follows:
in S310, a text to be detected is acquired, and the text to be detected is converted into a corresponding phoneme sequence.
Wherein, the text to be detected refers to the text for detecting the pronunciation skill, and the pronunciation skill detection comprises detecting whether the text to be detected needs to be spoken by using the pronunciation skill or not, and detecting whether the speaker speaks the text to be detected by using the pronunciation skill or not.
It should be noted that different languages such as Chinese and English have their own pronunciation skills; for example, English has pronunciation skills such as continuous reading (liaison), loss of plosion, and voicing.
Continuous reading means that the final phoneme of one word and the initial phoneme of the adjacent word are naturally blended together, with no pause in between.
Loss of plosion means that when two plosives (such as p, b, t, d, k, g) are adjacent, the former plosive only forms the mouth shape and articulatory closure at its place of articulation but is not released; after a brief pause, the latter consonant is pronounced. The former plosive is then called a lost plosive, as in goo(d)bye.
Voicing occurs when a voiceless consonant is preceded by /s/ and followed by a vowel; the voiceless consonant is then read as its corresponding voiced consonant. Taking "speak" as an example: the voiceless consonant /p/ is preceded by /s/ and followed by the vowel /i:/, and the voiced counterpart of /p/ is /b/, so the original /spi:k/ is read as /sbi:k/.
As described above, the pronunciation skill detection method provided by the application can detect pronunciation skills in any language; correspondingly, the text to be detected may be in any language according to the actual detection needs. After obtaining the text to be detected, the electronic device converts it into a corresponding phoneme sequence, for example according to a pronunciation dictionary.
In an alternative embodiment, converting the text to be detected into a corresponding phoneme sequence includes:
Removing text units which are not pronounciated in the text to be detected to obtain a new text to be detected;
and converting each text unit in the new text to be detected into a corresponding phoneme unit to obtain a phoneme sequence.
It will be appreciated that not all text units in any text need be pronounced when the text is spoken, e.g., punctuation in the text does not.
Therefore, to eliminate the interference of non-pronounced text units and improve the accuracy of pronunciation skill detection, when converting the text to be detected into a corresponding phoneme sequence, the electronic device first removes the non-pronounced text units (such as punctuation marks and emoticons) from the text to obtain a new text to be detected, and then converts each text unit in the new text into the corresponding phoneme unit according to a pronunciation dictionary.
For example, when the pronunciation skill of English needs to be detected and the electronic device obtains the English text to be detected "Please turn on the light.", the period "." is a text unit that is not pronounced. After removing it, the new text to be detected is "Please turn on the light", and each text unit in the new text is then converted into the corresponding phoneme unit according to the pronunciation dictionary to obtain the phoneme sequence.
In addition, to more clearly characterize the phoneme sequence, the electronic device may also add a start flag and an end flag before and after the phoneme sequence, respectively, the start of the phoneme sequence being characterized by the start flag and the end of the phoneme sequence being characterized by the end flag. The specific configuration of the start flag and the end flag is not particularly limited herein, and may be configured by those skilled in the art according to actual needs.
For example, the start flag may be configured as "&lt;bos&gt;" and the end flag as "&lt;eos&gt;"; the electronic device then adds "&lt;bos&gt;" to the beginning and "&lt;eos&gt;" to the end of the phoneme sequence obtained above.
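The conversion described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tiny pronunciation dictionary and its phoneme symbols are assumptions (real systems use a full lexicon such as CMUdict), and only punctuation removal is shown for the non-pronounced units.

```python
import re

# Hypothetical pronunciation dictionary; entries are illustrative only.
PRON_DICT = {
    "please": ["p", "l", "iy", "z"],
    "turn":   ["t", "er", "n"],
    "on":     ["aa", "n"],
    "the":    ["dh", "ah"],
    "light":  ["l", "ay", "t"],
}

def text_to_phonemes(text):
    """Remove non-pronounced text units, look up each word in the
    pronunciation dictionary, and wrap the result with <bos>/<eos>."""
    cleaned = re.sub(r"[^\w\s']", "", text)   # drop punctuation etc.
    phonemes = ["<bos>"]
    for word in cleaned.lower().split():
        phonemes.extend(PRON_DICT[word])
    phonemes.append("<eos>")
    return phonemes

print(text_to_phonemes("Please turn on the light."))
```

A production system would also need out-of-vocabulary handling (e.g., grapheme-to-phoneme fallback), which is omitted here.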
In S320, the audio to be detected obtained by the speaker speaking the text to be detected is obtained, and the acoustic features of the audio to be detected are extracted.
In this embodiment, the electronic device obtains the audio to be detected obtained by the speaker speaking the text to be detected, in addition to converting the text to be detected into the corresponding phoneme sequence. The data format of the audio to be detected is not particularly limited here, and may be configured by those skilled in the art according to the actual detection needs.
The speaker may be a real person or a virtual person.
For example, when the speaker is a real person, the electronic device may collect the audio of the person speaking the text to be detected through a configured audio collection device (either internal or external) and take the collected audio as the audio to be detected. Alternatively, the electronic device may obtain from other electronic devices the audio to be detected that those devices collected. With the audio obtained in this way, the electronic device can apply the pronunciation skill detection method provided by the application to evaluate the real person's pronunciation ability.
For another example, when the speaker is a virtual person, such as an artificial-intelligence-based speech synthesizer, the electronic device may directly input the text to be detected into speech synthesis software, which performs speech synthesis and outputs the synthesized audio as the audio to be detected. The electronic device can then use this audio to evaluate the speech synthesis capability of the software.
As described above, after obtaining the audio to be detected, the electronic device extracts its acoustic features. Acoustic features are physical quantities that represent the acoustic characteristics of speech, such as the energy concentration areas, formant frequencies, formant intensities, and bandwidths that characterize timbre, and the duration, fundamental frequency, and average speech power that characterize prosody.
In an alternative embodiment, to further improve accuracy of pronunciation skill detection, extracting acoustic features of the audio to be detected includes:
Extracting Filterbank features, fundamental frequency features and energy features of the audio to be detected;
and fusing the Filterbank characteristic, the fundamental frequency characteristic and the energy characteristic to obtain the acoustic characteristic.
In this embodiment, the Filterbank feature, the fundamental frequency feature and the energy feature are used as acoustic features related to the pronunciation skill, and accordingly, when the acoustic features for detecting the pronunciation skill are extracted, the electronic device extracts the Filterbank feature, the fundamental frequency feature and the energy feature of the audio to be detected. The dimensions of the extracted Filterbank features are not particularly limited herein, and may be configured by those skilled in the art according to actual needs, for example, in this embodiment, the electronic device may extract Filterbank features of 40 dimensions of the audio to be detected.
As above, after extracting the Filterbank feature, the fundamental frequency feature and the energy feature of the audio to be detected, the electronic device further fuses the Filterbank feature, the fundamental frequency feature and the energy feature according to the configured fusion strategy to obtain a fusion feature, and the fusion feature is used as an acoustic feature for detecting pronunciation skills. The configuration of the fusion policy is not particularly limited herein, and may be configured by those skilled in the art according to actual needs.
For example, referring to fig. 3, the fusion strategy configured in this embodiment is to splice the Filterbank feature, the fundamental frequency feature and the energy feature according to the time dimension, so as to obtain the acoustic feature for detecting the pronunciation skill.
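The fusion step can be sketched with NumPy. This is a sketch under stated assumptions: the three features are taken as already extracted and frame-aligned, the frame count is arbitrary, and the 40-dimensional Filterbank size follows the text while the 1-dimensional F0 and energy features are illustrative.

```python
import numpy as np

T = 120                           # number of audio frames (assumed)
fbank  = np.random.randn(T, 40)   # 40-dim Filterbank features per frame
f0     = np.random.randn(T, 1)    # fundamental-frequency feature per frame
energy = np.random.randn(T, 1)    # energy feature per frame

# Splice the three features frame-wise into one acoustic feature matrix:
# each frame's Filterbank, F0, and energy values are concatenated.
acoustic = np.concatenate([fbank, f0, energy], axis=1)
print(acoustic.shape)             # (120, 42)
```

The result is one matrix with one row per frame, which is what the detection model consumes as its acoustic input.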
In S330, the phoneme sequence and the acoustic feature are input into a trained pronunciation skill detection model to perform a pronunciation skill detection process, so as to obtain a first detection result and a second detection result.
It should be noted that, according to the present application, the corresponding pronunciation skill detection model is pre-trained for different languages, for example, the pronunciation skill detection model for detecting pronunciation skill of chinese is pre-trained for chinese, and the pronunciation skill detection model for detecting pronunciation skill of english is pre-trained for english. The structure and training mode of the pronunciation skill detection model are not particularly limited, and can be selected by those skilled in the art according to actual needs.
The pronunciation skill detection model is configured to take as inputs the acoustic features of the audio to be detected (obtained from the speaker speaking the text to be detected) and the phoneme sequence (obtained from the text to be detected), and to output a detection result characterizing whether the text to be detected needs to be spoken using a pronunciation skill and a detection result characterizing whether the speaker actually used the pronunciation skill.
Correspondingly, in this embodiment, after the above phoneme sequence and acoustic features are obtained, the electronic device inputs them into a trained pronunciation skill detection model matched with the language of the text to be detected and performs pronunciation skill detection processing to obtain the first and second detection results output by the model. The first detection result characterizes whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result characterizes whether the speaker actually used the pronunciation skill when speaking the text.
Taking the above text to be detected "Please turn on the light" as an example: according to expert knowledge, the phonemes "n" and "a" form a continuous-reading collocation, so "turn" and "on" need to be read continuously. For the phoneme sequence and acoustic features corresponding to this text, the first detection result output by the pronunciation skill detection model will indicate that the text needs to be spoken using the continuous-reading skill, and the model will output a second detection result indicating whether the speaker actually applied continuous reading when speaking the text.
In addition, the phoneme sequence may be input as the original phoneme symbols, or it may first be digitally encoded into a numeric phoneme sequence in which each phoneme is represented by a number. Correspondingly, if the phoneme sequence is input in digital form, the phoneme sequence samples used when training the pronunciation skill detection model must also be in digital form.
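A minimal sketch of this digital encoding: each phoneme, including the start and end markers, is mapped to an integer index. The vocabulary below is illustrative (an assumption), not the patent's actual symbol table.

```python
# Illustrative phoneme vocabulary; index 0 is reserved for padding,
# a common convention when batching variable-length sequences.
vocab = ["<pad>", "<bos>", "<eos>", "p", "l", "iy", "z", "t", "er", "n", "aa"]
phone2id = {p: i for i, p in enumerate(vocab)}

def encode(phonemes):
    """Convert a symbolic phoneme sequence into its numeric form."""
    return [phone2id[p] for p in phonemes]

ids = encode(["<bos>", "t", "er", "n", "aa", "n", "<eos>"])
print(ids)   # [1, 7, 8, 9, 10, 9, 2]
```

As the text notes, whichever form is used at inference time must match the form used for the training samples.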
In an alternative embodiment, the pronunciation skill detection model includes a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network, and a second pronunciation skill detection network, and the inputting the phoneme sequence and the acoustic feature into the trained pronunciation skill detection model to perform pronunciation skill detection processing to obtain a first detection result and a second detection result, including:
Inputting the phoneme sequence into a phoneme feature extraction network to perform feature extraction processing to obtain a phoneme feature matrix;
Inputting the phoneme feature matrix into a first pronunciation skill detection network to carry out pronunciation skill detection processing to obtain a first detection result;
if the first detection result indicates that the text to be detected needs to be spoken with a pronunciation skill, inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;
inputting the enhanced acoustic feature matrix and the phoneme feature matrix into a feature fusion network to perform feature fusion processing to obtain a fusion feature matrix;
and inputting the fusion feature matrix into a second pronunciation skill detection network to carry out pronunciation skill detection processing to obtain a second detection result.
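The five steps above can be sketched as a single control-flow function. This is a minimal sketch assuming simple stand-ins for the five learned networks; all names and the dummy networks are hypothetical, and only the data flow (including skipping the second stage when the first result is negative) follows the patent.

```python
import numpy as np

# Control-flow sketch of the five-network pipeline; the real networks are
# trained models, so trivial stand-ins are used here purely to show the wiring.
def detect(phoneme_seq, acoustic_feat, nets):
    phone_mat = nets["phone_extract"](phoneme_seq)      # phoneme feature matrix
    first = nets["detect1"](phone_mat)                  # does the text need the skill?
    if not first:
        return first, None                              # skip the second stage
    enhanced = nets["acoustic_enhance"](acoustic_feat)  # enhanced acoustic feature matrix
    fused = nets["fuse"](enhanced, phone_mat)           # fusion feature matrix
    second = nets["detect2"](fused)                     # did the speaker use the skill?
    return first, second

# Toy stand-ins (hypothetical) just to exercise the control flow.
nets = {
    "phone_extract": lambda p: np.asarray(p, dtype=float),
    "detect1": lambda m: bool(m.mean() > 0),
    "acoustic_enhance": lambda a: np.asarray(a, dtype=float) * 2.0,
    "fuse": lambda a, p: np.concatenate([a, p]),
    "detect2": lambda f: bool(f.sum() > 1.0),
}
first, second = detect([1, 2, 3], [0.5, 0.5], nets)
```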
Referring to fig. 4, the pronunciation skill detection model provided in the present embodiment is composed of 5 major parts, which are a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network and a second pronunciation skill detection network.
The phoneme feature extraction network is configured to perform feature extraction on an input phoneme sequence to obtain a phoneme feature matrix reflecting the inter-phoneme relationship in the phoneme sequence.
The acoustic feature enhancement network is configured to perform feature enhancement processing on the input acoustic features to enhance features therein that are more relevant to pronunciation skills, resulting in an enhanced acoustic feature matrix.
The feature fusion network is configured to perform feature fusion on the input phoneme feature matrix and enhanced acoustic feature matrix, so that the phoneme information and the acoustic information interact, yielding a fusion feature matrix.
The first pronunciation skill detection network is configured to perform pronunciation skill detection processing on the input phoneme feature matrix and output a first detection result for representing whether the text to be detected needs to be uttered by using pronunciation skill.
The second pronunciation skill detection network is configured to perform pronunciation skill detection processing on the input fusion feature matrix, and output a second detection result for representing whether the speaker speaks the text to be detected by using pronunciation skill.
Accordingly, in this embodiment, when inputting the phoneme sequence and the acoustic features into the trained pronunciation skill detection model for pronunciation skill detection processing, the electronic device may first input the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain the phoneme feature matrix, and then input the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain the first detection result.
At the same time, the electronic device inputs the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix, inputs the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fusion feature matrix, and inputs the fusion feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.
In addition, the electronic device may decide whether to output the second detection result according to the first detection result. After obtaining the first and second detection results from the pronunciation skill detection model, the electronic device determines from the first detection result whether the text to be detected needs to be spoken with a pronunciation skill. If it does, the electronic device outputs both results: the first detection result indicates that the text needs to be spoken with the pronunciation skill, and the second detection result indicates whether the speaker actually used it. If the text does not need to be spoken with a pronunciation skill, the electronic device may discard the second detection result and output only the first.
In other embodiments, if the first detection result indicates that the text to be detected needs to be spoken with a pronunciation skill, the electronic device inputs the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix. The electronic device then inputs the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fusion feature matrix. Finally, the electronic device inputs the fusion feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.
In addition, if the first detection result indicates that the text to be detected does not need to be spoken with a pronunciation skill, no further pronunciation skill detection is necessary; in this case the electronic device no longer uses the acoustic features for pronunciation skill detection and may output only the first detection result.
In an alternative embodiment, the phoneme feature extraction network includes a phoneme embedding module and a phoneme feature extraction module, and inputting the phoneme sequence into the phoneme feature extraction network to perform feature extraction processing to obtain a phoneme feature matrix, including:
Inputting the phoneme sequence into a phoneme embedding module for embedding treatment to obtain a phoneme vector matrix;
Inputting the phoneme vector matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix.
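The embedding step can be sketched as a table lookup: each phoneme id selects a row of a learned embedding table. This is a hedged NumPy sketch, not the patent's implementation; the table is randomly initialized here, and the vocabulary size and embedding dimension are illustrative.

```python
import numpy as np

# Hypothetical embedding table standing in for the trained phoneme embedding
# module; in practice its rows are learned parameters.
rng = np.random.default_rng(0)
vocab_size, embed_dim = 50, 8
embedding_table = rng.standard_normal((vocab_size, embed_dim))

def embed(phoneme_ids):
    # Row lookup: returns a (sequence_length, embed_dim) phoneme vector matrix.
    return embedding_table[np.asarray(phoneme_ids)]

phoneme_vec = embed([3, 7, 3, 12])  # 4 phonemes -> a 4 x 8 phoneme vector matrix
```

Note that identical phoneme ids (the two 3s above) map to identical rows, which is what lets the later attention layers model inter-phoneme relationships over a consistent representation.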
Referring to fig. 5, in the present embodiment, the phoneme feature extraction network is composed of two parts, namely a phoneme embedding module and a phoneme feature extraction module. The phoneme embedding module is configured to perform embedding processing on the input phoneme sequence, vectorizing it into a phoneme vector matrix; the phoneme feature extraction module is configured to perform feature extraction on the input phoneme vector matrix to obtain a phoneme feature matrix reflecting the inter-phoneme relationships in the phoneme sequence.
Correspondingly, in this embodiment, when inputting a phoneme sequence into a phoneme feature extraction network to perform feature extraction processing, the electronic device first inputs the phoneme sequence into a phoneme embedding module to perform embedding processing to obtain a phoneme vector matrix, and then inputs the phoneme vector matrix into a phoneme feature extraction module to perform feature extraction processing to obtain a phoneme feature matrix.
The phoneme feature extraction module includes at least 1 phoneme feature extraction submodule, and inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix includes:
when the number of the phoneme feature extraction submodules is 1, inputting the phoneme vector matrix into the phoneme feature extraction submodule to perform feature extraction processing to obtain a phoneme feature matrix, or
When the number of the phoneme feature extraction submodules is N, inputting the phoneme vector matrix into the N phoneme feature extraction submodules to sequentially perform feature extraction processing to obtain a phoneme feature matrix, wherein N is an integer larger than 1. The value of N is not particularly limited, and may be configured by those skilled in the art according to actual needs, for example, N may be configured to be 2.
In this embodiment, the phoneme feature extraction module may be composed of 1 phoneme feature extraction submodule, or of N phoneme feature extraction submodules connected in sequence. When the module is composed of N submodules, each submodule performs the same feature extraction processing. The feature extraction process of 1 phoneme feature extraction submodule is described below as an example.
Referring to fig. 6, the phoneme feature extraction submodule is composed of 3 sub-layers, namely a first matrix conversion layer, a first multi-head attention layer, and a first matrix fusion layer, wherein,
The first matrix conversion layer is configured to perform matrix conversion processing on an input matrix, and convert the input matrix into a query matrix, a key matrix and a value matrix respectively;
the first multi-head attention layer is configured to perform attention enhancement processing on the input query matrix, the key matrix and the value matrix to obtain an attention enhancement matrix;
the first matrix fusion layer is configured to perform matrix fusion processing on an input matrix of the first matrix conversion layer and an output matrix of the first multi-head attention layer to obtain a fusion matrix.
Correspondingly, when the number of the phoneme feature extraction submodules is 1, the electronic device can extract and obtain a phoneme feature matrix according to the following mode:
Inputting the phoneme vector matrix into a first matrix conversion layer for matrix conversion processing to obtain a query matrix, a key matrix and a value matrix, and respectively marking the query matrix, the key matrix and the value matrix as a first query matrix, a first key matrix and a first value matrix;
Inputting the first query matrix, the first key matrix and the first value matrix into a first multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, and marking the attention enhancement matrix as the first attention enhancement matrix;
and inputting the first attention enhancement matrix and the phoneme vector matrix into a first matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as a phoneme characteristic matrix.
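The three steps above can be sketched in NumPy. This is a simplified, hedged illustration: a single attention head is shown (the patent's layer is multi-head, which repeats this computation per head and concatenates), the projection matrices are randomly initialized stand-ins for learned weights, and the fusion layer is assumed to be the add-and-layer-normalize configuration described below.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance.
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def phoneme_submodule(x, wq, wk, wv):
    # First matrix conversion layer: project the input into query/key/value.
    q, k, v = x @ wq, x @ wk, x @ wv
    # One attention head of the first multi-head attention layer: scaled
    # dot-product attention of every phoneme over every other phoneme.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    # First matrix fusion layer: residual addition plus layer normalization.
    return layer_norm(x + attn)

rng = np.random.default_rng(1)
d = 8
x = rng.standard_normal((5, d))                 # phoneme vector matrix (5 phonemes)
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
phone_feat = phoneme_submodule(x, wq, wk, wv)   # phoneme feature matrix
```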
It can be understood that when the number of phoneme feature extraction submodules is N, the phoneme vector matrix need only be input into the 1st of the N sequentially connected submodules; the N submodules then perform feature extraction processing in turn in the manner described above, and the fusion matrix output by the N-th submodule serves as the phoneme feature matrix.
In this embodiment, the matrix fusion manner of the first matrix fusion layer is not particularly limited and may be configured by those skilled in the art according to actual needs.
For example, the first matrix fusion layer may include two sub-layers, namely an addition layer and a layer normalization layer, and when the matrices are fused, the addition layer adds the two input matrices to obtain a sum matrix, and then the layer normalization layer performs layer normalization on the sum matrix to obtain a fusion matrix.
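The addition-then-layer-normalization fusion described above is straightforward to sketch; the helper name is illustrative. With two inputs that sum to a constant row, layer normalization maps the sum to zeros, which makes the behavior easy to verify.

```python
import numpy as np

def matrix_fusion(a, b, eps=1e-5):
    # Addition layer: element-wise sum of the two input matrices.
    s = a + b
    # Layer normalization layer: normalize each row of the sum matrix.
    mu, var = s.mean(-1, keepdims=True), s.var(-1, keepdims=True)
    return (s - mu) / np.sqrt(var + eps)

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[3.0, 2.0, 1.0]])
fused = matrix_fusion(a, b)  # sum is [4, 4, 4]; normalizing a constant row gives zeros
```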
In an alternative embodiment, referring to fig. 7, the phoneme feature extracting network further includes a first position encoding module and a second matrix fusion layer, and before inputting the phoneme vector matrix into the phoneme feature extracting module to perform feature extraction processing, the method further includes:
Inputting the phoneme vector matrix into a first position coding module for position coding processing to obtain a first position coding matrix;
Inputting the first position coding matrix and the phoneme vector matrix into a second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix;
and inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix includes:
Inputting the phoneme position fusion matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix.
In this embodiment, in order to further improve the accuracy of pronunciation skill detection, the original phoneme vector matrix is not input directly into the phoneme feature extraction module; instead, it is first position-encoded, and the phoneme vector matrix carrying position information is then input into the phoneme feature extraction module for feature extraction.
The electronic device inputs the phoneme vector matrix into a first position coding module to perform position coding processing to obtain a position coding matrix, and records the position coding matrix as a first position coding matrix, wherein the first position coding matrix characterizes the position information of each matrix unit in the phoneme vector matrix, and can be relative position information or absolute position information.
After the first position coding matrix is obtained, the electronic equipment inputs the first position coding matrix and the phoneme vector matrix into a second matrix fusion layer to perform matrix fusion processing, so as to obtain a fusion matrix, and the fusion matrix is recorded as a phoneme position fusion matrix. And the electronic equipment further inputs the phoneme position fusion matrix carrying the position information into a phoneme feature extraction module to perform feature extraction processing to obtain a phoneme feature matrix. For how the phoneme feature extraction module performs feature extraction, please refer to the related description of the above embodiments, and details are not repeated here.
In addition, it should be noted that, in this embodiment, the matrix fusion manner of the second matrix fusion layer is not particularly limited, and may be configured by those skilled in the art according to actual needs. For example, the second matrix fusion layer is configured to perform addition processing on two matrices inputted, and output the sum matrix obtained by addition as a fusion matrix.
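The patent does not fix a position-encoding scheme (relative or absolute are both allowed), so the sketch below uses the well-known sinusoidal absolute encoding as one plausible choice; the function name and dimensions are illustrative. The second matrix fusion layer is taken to be simple addition, as in the example configuration above.

```python
import numpy as np

def position_encoding(seq_len, dim):
    # Sinusoidal absolute position encoding: even columns use sin, odd use cos.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = position_encoding(6, 8)                 # first position coding matrix
phoneme_vec = np.zeros((6, 8))               # stand-in phoneme vector matrix
# Second matrix fusion layer configured as element-wise addition:
phoneme_pos_fused = phoneme_vec + pe         # phoneme position fusion matrix
```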
In an alternative embodiment, the acoustic feature enhancement network includes a feature encoding module and at least 1 acoustic feature enhancement module, and inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix includes:
inputting the acoustic characteristics into a characteristic coding module for characteristic coding treatment to obtain an acoustic characteristic matrix;
When the number of the acoustic feature enhancement modules is 1, inputting the acoustic feature matrix into the acoustic feature enhancement modules for feature enhancement processing to obtain an enhanced acoustic feature matrix, or
When the number of the acoustic feature enhancement modules is M, inputting the acoustic feature matrix into the M acoustic feature enhancement modules to sequentially perform feature enhancement processing, so as to obtain an enhanced acoustic feature matrix, wherein M is an integer larger than 1. The value of M is not particularly limited, and may be configured by those skilled in the art according to actual needs, for example, M may be configured to be 4.
It should be noted that, in the above embodiment, the obtained Filterbank features, fundamental frequency features, and energy features are presented in the form of feature maps; correspondingly, the acoustic features obtained by fusing them are also presented in the form of a feature map.
In order to effectively perform feature enhancement processing on acoustic features, in this embodiment, the acoustic feature enhancement network is composed of 1 feature encoding module and at least 1 acoustic feature enhancement module, wherein the feature encoding module is configured to encode acoustic features and compress feature dimensions to obtain corresponding acoustic feature matrices, and the acoustic feature enhancement module is configured to perform feature enhancement processing on input acoustic feature matrices to enhance features related to pronunciation skills therein and obtain enhanced acoustic feature matrices.
Correspondingly, when the acoustic features are input into the acoustic feature enhancement network to perform feature enhancement processing, the electronic equipment firstly inputs the acoustic features into the feature coding module to perform feature coding processing to obtain an acoustic feature matrix.
Referring to fig. 8, the feature encoding module is composed of 4 sub-layers including a first convolution layer, a first pooling layer, a second convolution layer and a second pooling layer, wherein,
The first convolution layer is configured to carry out convolution processing on the input feature map to obtain a corresponding convolution result;
the first pooling layer is configured to pooling the convolution result output by the first convolution layer to obtain a corresponding pooling result;
the second convolution layer is configured to carry out convolution processing on the pooling result output by the first pooling layer to obtain a corresponding convolution result;
the second pooling layer is configured to pool the convolution result output by the second convolution layer to obtain a feature matrix of the corresponding feature map.
Accordingly, the electronic device may input the acoustic features into the feature encoding module and perform feature encoding processing as follows:
Inputting the acoustic features into a first convolution layer for convolution processing to obtain a convolution result, and recording the convolution result as a first convolution result;
Inputting the first convolution result into a first pooling layer for pooling treatment to obtain a pooling result, and marking the pooling result as a first pooling result;
inputting the first pooling result into a second convolution layer for convolution treatment to obtain a convolution result, and recording the convolution result as a second convolution result;
And inputting the second convolution result into a second pooling layer for pooling treatment to obtain an acoustic feature matrix.
It should be noted that, in this embodiment, the convolution kernel sizes, step sizes, and padding sizes of the first convolution layer and the second convolution layer are not specifically limited, and may be configured by those skilled in the art according to actual needs.
For example, in this embodiment, the convolution kernel size of the first convolution layer is [3,3], the step size is [1,1], the padding size is [1,1], the convolution kernel size of the second convolution layer is [3,3], the step size is [1,1], the padding size is [1,1], the pooling type of the first pooling layer is configured as maximum pooling, the pooling kernel size is [2,2], the step size is [1,1], the pooling type of the second pooling layer is configured as maximum pooling, the pooling kernel size is [2,2], and the step size is [1,1].
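With the example configuration above, the output size of each layer follows the standard convolution/pooling formula `(size + 2*pad - kernel) / stride + 1`. The trace below assumes zero padding for the pooling layers (the patent does not state pooling padding); each 3x3/stride-1/pad-1 convolution preserves the spatial size, while each 2x2/stride-1 max pool shrinks it by 1 per dimension.

```python
def conv_out(size, kernel, stride, pad):
    # Standard convolution/pooling output-size formula.
    return (size + 2 * pad - kernel) // stride + 1

# Trace one spatial dimension of the feature map through the four sub-layers:
t = 100                      # e.g. 100 time frames in the acoustic feature map
t = conv_out(t, 3, 1, 1)     # first convolution layer  -> 100 (size preserved)
t = conv_out(t, 2, 1, 0)     # first pooling layer      -> 99
t = conv_out(t, 3, 1, 1)     # second convolution layer -> 99 (size preserved)
t = conv_out(t, 2, 1, 0)     # second pooling layer     -> 98
```

So a T x F feature map comes out of the feature encoding module as a (T-2) x (F-2) acoustic feature matrix under this configuration.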
Further, when the number of the acoustic feature enhancement modules is 1, the electronic device directly inputs the acoustic feature matrix into the acoustic feature enhancement modules to perform feature enhancement processing, so as to obtain an enhanced acoustic feature matrix.
The feature enhancement processing procedure of 1 acoustic feature enhancement module is described below as an example.
Referring to fig. 9, the acoustic feature enhancement module is composed of 6 sub-layers, namely a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer, and a fourth matrix fusion layer, wherein,
The second matrix conversion layer is configured to perform matrix conversion processing on the input matrix, and convert the input matrix into a query matrix, a key matrix and a value matrix respectively;
the second multi-head attention layer is configured to perform attention enhancement processing on the input query matrix, the key matrix and the value matrix to obtain an attention enhancement matrix;
The third matrix fusion layer is configured to perform matrix fusion processing on the input matrix of the second matrix conversion layer and the output matrix of the second multi-head attention layer to obtain a fusion matrix;
The third convolution layer is configured to carry out convolution processing on the input fusion matrix to obtain a convolution result;
the deconvolution layer is configured to carry out deconvolution processing on the input convolution result to obtain a matrix-form deconvolution result;
The fourth matrix fusion layer is configured to perform matrix fusion processing on the fusion matrix output by the third matrix fusion layer and the deconvolution result output by the deconvolution layer to obtain a fusion matrix.
Accordingly, when the number of acoustic feature enhancement modules is 1, the electronic device may obtain the enhanced acoustic feature matrix as follows:
inputting the acoustic feature matrix into a second matrix conversion layer for matrix conversion processing to obtain a query matrix, a key matrix and a value matrix, and respectively marking the query matrix, the key matrix and the value matrix as a second query matrix, a second key matrix and a second value matrix;
inputting the second query matrix, the second key matrix and the second value matrix into a second multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, and marking the attention enhancement matrix as a second attention enhancement matrix;
inputting the second attention enhancement matrix and the acoustic feature matrix into a third matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and marking the fusion matrix as an acoustic fusion matrix;
Inputting the acoustic fusion matrix into a third convolution layer for convolution processing to obtain a third convolution result;
inputting the third convolution result into the deconvolution layer for deconvolution treatment to obtain a deconvolution result in a matrix form;
and inputting the acoustic fusion matrix and the deconvolution result into a fourth matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and marking the fusion matrix as an enhanced acoustic feature matrix.
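The convolution/deconvolution pair in steps four and five can be sketched in one dimension to show why the fourth matrix fusion layer can add the deconvolution result back to the acoustic fusion matrix: a valid convolution shrinks the length from n to n-k+1, and the transposed (de)convolution restores it to n. This is a hedged NumPy sketch with a single illustrative kernel, not the patent's 2-D layers.

```python
import numpy as np

def conv1d_valid(x, w):
    # 1-D valid convolution (cross-correlation): output length is n - k + 1.
    n, k = len(x), len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(n - k + 1)])

def deconv1d(y, w):
    # 1-D transposed convolution: scatter each input value through the kernel,
    # producing length m + k - 1 and undoing the valid-conv shape change.
    m, k = len(y), len(w)
    out = np.zeros(m + k - 1)
    for i, v in enumerate(y):
        out[i:i + k] += v * w
    return out

x = np.arange(6, dtype=float)        # stands in for one row of the fusion matrix
w = np.array([1.0, 0.5, 0.25])       # illustrative kernel
shrunk = conv1d_valid(x, w)          # length 4: third convolution layer output
restored = deconv1d(shrunk, w)       # length 6: deconvolution layer output
```

Only the shape round trip is exact here; the values are transformed, which is the point — the conv/deconv pair re-expresses the features while keeping them addable to the residual path.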
The method for fusing the third matrix fusion layer and the fourth matrix fusion layer is not particularly limited in this embodiment, and may be configured by those skilled in the art according to actual needs.
For example, the third matrix fusion layer and the fourth matrix fusion layer have the same structure and respectively comprise two sub-layers, namely an addition layer and a layer normalization layer, when the matrixes are fused, the addition layer is used for adding the two input matrixes to obtain a sum matrix, and then the layer normalization layer is used for carrying out layer normalization on the sum matrix to obtain the fusion matrix.
It should be noted that when the number of acoustic feature enhancement modules is M, each module performs the same feature enhancement processing; the acoustic feature matrix need only be input into the 1st of the M sequentially connected modules, the M modules perform feature enhancement processing in turn in the manner described above, and the fusion matrix output by the M-th module serves as the enhanced acoustic feature matrix.
In addition, the convolution kernel sizes, step sizes, and padding sizes of the above third convolution layer and the deconvolution layer are not particularly limited in this embodiment and may be configured by those skilled in the art according to actual needs.
It will be appreciated that by introducing convolution processing and deconvolution processing into the enhancement of the acoustic features, the present implementation can more effectively enhance acoustic features presented in the form of feature maps, ultimately extracting the features most relevant to pronunciation skills.
In an optional embodiment, the acoustic feature enhancement network further includes a second position encoding module and a fifth matrix fusion layer, and before inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing, the method further includes:
Inputting the acoustic feature matrix into a second position coding module for position coding processing to obtain a second position coding matrix;
inputting the second position coding matrix and the acoustic feature matrix into a fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix;
Inputting the acoustic feature matrix into an acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix, comprising:
And inputting the acoustic position fusion matrix into an acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix.
In this embodiment, in order to further improve the accuracy of pronunciation skill detection, the original acoustic feature matrix is not input directly into the acoustic feature enhancement module; instead, it is first position-encoded, and the acoustic feature matrix carrying position information is then input into the acoustic feature enhancement module for feature enhancement.
The electronic device inputs the acoustic feature matrix into a second position coding module to perform position coding processing to obtain a position coding matrix, and records the position coding matrix as a second position coding matrix, wherein the second position coding matrix characterizes the position information of each matrix unit in the acoustic feature matrix, and can be relative position information or absolute position information.
After the second position coding matrix is obtained, the electronic equipment inputs the second position coding matrix and the acoustic feature matrix into a fifth matrix fusion layer to perform matrix fusion processing, so as to obtain a fusion matrix, and the fusion matrix is recorded as an acoustic position fusion matrix. And then, the electronic equipment further inputs the acoustic position fusion matrix carrying the position information into an acoustic feature enhancement module to perform feature enhancement processing to obtain an enhanced acoustic feature matrix. For how the acoustic feature enhancement module performs the feature enhancement process, please refer to the related description of the above embodiments, and the description is omitted here.
In addition, it should be noted that, in this embodiment, the matrix fusion manner of the fifth matrix fusion layer is not specifically limited, and may be configured by those skilled in the art according to actual needs. For example, the fifth matrix fusion layer is configured to perform addition processing on the two matrices inputted, and output the sum matrix obtained by addition as a fusion matrix.
In an alternative embodiment, referring to fig. 10, the feature fusion network includes a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feedforward network layer, and a seventh matrix fusion layer, wherein,
The third matrix conversion layer is configured to perform matrix conversion processing on the input matrix to obtain a key matrix and a value matrix;
the fourth matrix conversion layer is configured to perform matrix conversion processing on the input matrix to obtain a query matrix;
The third multi-head attention layer is configured to perform attention enhancement processing on the key matrix, the value matrix and the query matrix output by the fourth matrix conversion layer to obtain an attention enhancement matrix;
The sixth matrix fusion layer is configured to perform matrix fusion processing on the attention enhancement matrix output by the third multi-head attention layer and the query matrix output by the fourth matrix conversion layer to obtain a fusion matrix;
the feedforward network layer is configured to perform feedforward calculation processing on the fusion matrix output by the sixth matrix fusion layer to obtain a feedforward matrix;
The seventh matrix fusion layer is configured to perform matrix fusion processing on the feedforward matrix output by the feedforward network layer and the fusion matrix output by the sixth matrix fusion layer to obtain a fusion feature matrix.
Accordingly, the electronic device may input the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network to perform feature fusion processing in the following manner:
inputting the enhanced acoustic feature matrix into a third matrix conversion layer for matrix conversion treatment to obtain a key matrix and a value matrix which are respectively marked as a third key matrix and a third value matrix;
Inputting the phoneme characteristic matrix into a fourth matrix conversion layer for matrix conversion treatment to obtain a query matrix, and marking the query matrix as a third query matrix;
Inputting the third query matrix, the third key matrix and the third value matrix into a third multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, and marking the attention enhancement matrix as a third attention enhancement matrix;
inputting the third attention enhancement matrix and the third query matrix into a sixth matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as an acoustic phoneme fusion matrix;
Inputting the acoustic phoneme fusion matrix into a feedforward network layer for feedforward calculation processing to obtain a feedforward matrix;
And inputting the feedforward matrix and the acoustic phoneme fusion matrix into a seventh matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as a fusion feature matrix.
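The fusion steps above amount to cross-attention: the query comes from the phoneme side while the key and value come from the acoustic side, so each phoneme attends over the audio time steps. The NumPy sketch below shows one attention head with random stand-in weights and a residual addition for the sixth fusion layer; the feedforward layer and seventh fusion layer are omitted for brevity, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(phone_mat, acoustic_mat, wq, wk, wv):
    # Fourth matrix conversion layer: the query comes from the phoneme side.
    q = phone_mat @ wq
    # Third matrix conversion layer: key and value come from the acoustic side.
    k, v = acoustic_mat @ wk, acoustic_mat @ wv
    # Third multi-head attention layer (one head shown): each phoneme attends
    # over all acoustic time steps.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    # Sixth matrix fusion layer sketched as residual addition with the query.
    return q + attn

rng = np.random.default_rng(2)
d = 8
phone_mat = rng.standard_normal((5, d))       # 5 phonemes
acoustic_mat = rng.standard_normal((20, d))   # 20 acoustic time steps
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
fused = cross_attention_fuse(phone_mat, acoustic_mat, wq, wk, wv)
```

Note the output has one row per phoneme, not per time step: the fusion feature matrix aligns the acoustic evidence to the phoneme sequence, which is what the second pronunciation skill detection network consumes.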
The matrix fusion manner of the sixth matrix fusion layer and the seventh matrix fusion layer is not particularly limited in this embodiment and may be configured by those skilled in the art according to actual needs.
For example, the sixth matrix fusion layer and the seventh matrix fusion layer have the same structure, each comprising two sub-layers: an addition layer and a layer normalization layer. When fusing matrices, the addition layer first adds the two input matrices to obtain a sum matrix, and the layer normalization layer then performs layer normalization on the sum matrix to obtain the fusion matrix.
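The add-and-layer-normalize fusion just described can be sketched as follows; the epsilon constant and the absence of learned gain/bias parameters are simplifying assumptions.

```python
import numpy as np

def matrix_fusion(a, b, eps=1e-5):
    """Add-&-normalize fusion: element-wise sum, then per-row layer norm."""
    s = a + b                                   # addition layer: sum matrix
    mean = s.mean(axis=-1, keepdims=True)
    var = s.var(axis=-1, keepdims=True)
    return (s - mean) / np.sqrt(var + eps)      # layer normalization layer

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[0.0, 1.0, 2.0]])
fused = matrix_fusion(a, b)
print(fused)    # each row is normalized to zero mean and unit variance
```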
In an alternative embodiment, referring to fig. 11, the first pronunciation skill detection network includes a first full-connection layer and a first classification function layer, and inputting the phoneme feature matrix into the first pronunciation skill detection network to perform pronunciation skill detection processing to obtain a first detection result includes:
Inputting the phoneme characteristic matrix into a first full-connection layer for full-connection processing to obtain a first full-connection result;
And inputting the first full-connection result into a first classification function layer for classification processing to obtain a first detection result.
It should be noted that, since the present embodiment is directed to detecting multiple classes of pronunciation skills, any multi-class classification function may be used for the first classification function layer.
Taking the Softmax function as an example, the dimension of its output vector matches the number of pronunciation skills to be detected. Taking the English language as an example, if the pronunciation skills to be detected include continuous reading, loss of plosion, and blushing, the output vector of the Softmax function contains four elements: one element indicates whether the pronunciation skill 'continuous reading' is needed to speak the text to be detected, one element indicates whether 'loss of plosion' is needed, one element indicates whether 'blushing' is needed, and one element indicates that no pronunciation skill is needed to speak the text to be detected.
Correspondingly, the first full-connection result is input into the Softmax function to obtain its 4-dimensional output vector, which serves as the first detection result. From the first detection result it can be determined whether the text to be detected needs to be spoken with a pronunciation skill and, when it does, which pronunciation skill should be used.
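The four-way Softmax decision described above can be sketched as follows; the skill names are placeholders taken from the running example, and the argmax decision rule is an illustrative assumption.

```python
import numpy as np

SKILLS = ["continuous reading", "loss of plosion", "blushing"]  # placeholder skill set

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def first_detection(fc_output):
    """4-way Softmax head: 3 pronunciation skills + 1 'no skill needed' class."""
    probs = softmax(fc_output)       # 4-dimensional output vector
    idx = int(np.argmax(probs))
    if idx == len(SKILLS):           # last element: no pronunciation skill required
        return None
    return SKILLS[idx]

logits = np.array([0.2, 2.5, 0.1, 0.3])  # toy first full-connection result
print(first_detection(logits))           # → loss of plosion
```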
In an alternative embodiment, the second pronunciation skill detection network includes L branch detection networks, each branch detection network corresponds to a different pronunciation skill, L is an integer greater than 1, and inputting the fused feature matrix into the second pronunciation skill detection network to perform pronunciation skill detection processing to obtain a second detection result includes:
Inputting the fusion feature matrix into each branch detection network to detect pronunciation skills to obtain a branch pronunciation skill detection result of each branch detection network, wherein the branch pronunciation skill detection result of each branch detection network represents whether a speaker adopts pronunciation skills corresponding to each branch detection network to speak a text to be detected;
And obtaining a second detection result according to the branch pronunciation skill detection result of each branch detection network.
Each branch detection network corresponds to one pronunciation skill and is configured to detect whether the speaker spoke the text to be detected with that pronunciation skill; correspondingly, the number of branch pronunciation skill detection results equals the number of branch detection networks.
For example, taking the English language as an example, when the pronunciation skills to be detected include continuous reading, loss of plosion, and blushing, L takes a value of 3, that is, the second pronunciation skill detection network includes 3 branch detection networks: 1 branch detection network corresponds to the pronunciation skill 'continuous reading', 1 corresponds to 'loss of plosion', and 1 corresponds to 'blushing'. Correspondingly, the 3 branch detection networks each output 1 branch pronunciation skill detection result, and the 3 branch pronunciation skill detection results together form the second detection result. The second detection result thus characterizes whether the speaker spoke the text to be detected with pronunciation skills and, when the speaker did, which pronunciation skills were specifically used.
It should be noted that, the structure of each branch detection network is the same, and a branch detection network is taken as an example for description, referring to fig. 12, the branch detection network includes a second full connection layer and a second classification function layer, the fusion feature matrix is input into each branch detection network to perform pronunciation skill detection, so as to obtain a branch pronunciation skill detection result of each branch detection network, which includes:
Inputting the fusion feature matrix into a second full-connection layer for full-connection processing to obtain a second full-connection result;
And inputting the second full-connection result into a second classification function layer for classification processing to obtain a branch pronunciation skill detection result.
It should be noted that the second classification function layer may employ any two-classification function.
Taking a sigmoid function as an example, its output value lies in [0,1]; after model training, the output of the sigmoid function represents the probability that the speaker spoke the text to be detected with the pronunciation skill corresponding to the branch detection network. For example, when the output value of the sigmoid function reaches a preset threshold (whose value can be set by those skilled in the art according to actual needs), it can be determined that the speaker spoke the text to be detected with the pronunciation skill corresponding to the branch detection network.
Correspondingly, the second full-connection result is input into the sigmoid function, the output value of the sigmoid function is obtained, and the output value is used as a branch pronunciation skill detection result. According to the branch pronunciation skill detection result, whether the speaker adopts the pronunciation skill corresponding to the branch detection network to speak the text to be detected can be determined.
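The per-branch sigmoid decisions can be sketched as follows; the skill names, the 0.5 threshold, and the use of one scalar per branch are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def branch_detections(fc_outputs, skills, threshold=0.5):
    """One sigmoid per branch detection network: each output is an
    independent yes/no for its pronunciation skill."""
    probs = sigmoid(np.asarray(fc_outputs, dtype=float))
    return {s: bool(p >= threshold) for s, p in zip(skills, probs)}

skills = ["continuous reading", "loss of plosion", "blushing"]  # placeholder names
second_fc = [1.2, -0.8, 0.1]    # toy second full-connection results, one per branch
print(branch_detections(second_fc, skills))
# → {'continuous reading': True, 'loss of plosion': False, 'blushing': True}
```

Unlike the 4-way Softmax of the first detection network, the branches are independent, so several skills can be detected in the same utterance.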
In an alternative embodiment, the method further includes, before obtaining the text to be detected and converting the text to be detected into the corresponding phoneme sequence:
Acquiring a plurality of types of first sample texts that are known to require utterance with different pronunciation skills, and converting each type of first sample text into a corresponding positive sample phoneme sequence;
acquiring first sample audio obtained by a sample user speaking each type of first sample text with the corresponding pronunciation skill, and extracting positive sample acoustic features of the first sample audio of each type of first sample text;
acquiring a second sample text that is known not to require utterance with any pronunciation skill, and converting the second sample text into a corresponding negative sample phoneme sequence;
acquiring second sample audio obtained by the sample user speaking the second sample text, and extracting negative sample acoustic features of the second sample audio;
And performing model training according to each type of positive sample phoneme sequence, each type of positive sample acoustic feature, negative sample phoneme sequence and negative sample acoustic feature to obtain a pronunciation skill detection model.
In this embodiment, the acoustic feature samples and the phoneme sequence samples are not artificially constructed; instead, following a data-driven approach, the model learns different pronunciation skills from a large amount of data. The following description takes a specific language as an example.
For this language, the electronic device obtains a plurality of types of first sample text, respectively, known to require utterances with different pronunciation skills. The number of the first sample text of each type of pronunciation skill acquired is not particularly limited herein, and may be configured by those skilled in the art according to actual needs.
For each type of pronunciation skill, the electronic device converts the corresponding first sample text into a phoneme sequence, recorded as a positive sample phoneme sequence. For how to convert the first sample text into a phoneme sequence, the method of converting the text to be detected into a phoneme sequence in the above embodiment may be correspondingly implemented, which is not described herein.
The electronic device also obtains audio of each type of first sample text spoken by the sample user with different pronunciation skills, and marks the audio as first sample audio, and extracts acoustic features of the first sample audio of each type of first sample text, and marks the acoustic features as positive sample acoustic features. The sample user may be a real person with pronunciation skills or a virtual person with pronunciation skills, and accordingly, for how to obtain the first sample audio of each type of the first sample text and how to extract the acoustic features of the positive sample, the method of obtaining the audio to be detected and extracting the acoustic features of the audio to be detected in the above embodiment may be correspondingly implemented, which is not described herein.
The electronic device also obtains text known not to require pronunciation skills to speak, notes as a second sample text, and converts the second sample text into a corresponding phoneme sequence, notes as a negative sample phoneme sequence. For how to convert the second sample text into the phoneme sequence, the method of converting the text to be detected into the phoneme sequence in the above embodiment may be correspondingly implemented, which is not described herein.
The electronic device also obtains audio of the second sample text spoken by the sample user, noted as second sample audio, and extracts acoustic features of the second sample audio, noted as negative sample acoustic features. For how to obtain the second sample audio of the second sample text and how to extract the negative sample acoustic feature, the method of obtaining the audio to be detected and extracting the acoustic feature of the audio to be detected in the above embodiment may be correspondingly implemented, which is not described herein.
It should be noted that the number of sample users is not particularly limited in this embodiment and may be configured by those skilled in the art according to actual needs; for example, in this embodiment the positive sample phoneme sequences, positive sample acoustic features, negative sample phoneme sequences, and negative sample acoustic features are obtained from 500 sample users.
After the positive sample phoneme sequences, positive sample acoustic features, negative sample phoneme sequence, and negative sample acoustic features are obtained, the electronic device performs model training according to each type of positive sample phoneme sequence, each type of positive sample acoustic feature, the negative sample phoneme sequence, and the negative sample acoustic feature until a preset stopping condition is met, thereby obtaining the pronunciation skill detection model. The preset stopping condition may be that the number of training iterations reaches a preset number, or that the model converges.
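One way the positive and negative samples above could be assembled into labeled training examples is sketched below; the one-hot label layout (one class per skill plus a final 'no skill' class, matching the 4-way Softmax example) and all container shapes are assumptions, since the patent does not specify the label encoding.

```python
def build_training_pairs(positive_sets, negative_set):
    """positive_sets: {skill_name: [(phoneme_seq, acoustic_feat), ...]}
    negative_set: [(phoneme_seq, acoustic_feat), ...]
    Returns (phoneme_seq, acoustic_feat, one_hot_label) triples."""
    skills = sorted(positive_sets)
    examples = []
    for i, skill in enumerate(skills):
        for phonemes, acoustics in positive_sets[skill]:
            label = [0] * (len(skills) + 1)
            label[i] = 1                  # this pronunciation skill is required
            examples.append((phonemes, acoustics, label))
    for phonemes, acoustics in negative_set:
        label = [0] * (len(skills) + 1)
        label[-1] = 1                     # final class: no skill required
        examples.append((phonemes, acoustics, label))
    return examples

pairs = build_training_pairs(
    {"continuous reading": [("ph_a", "ac_a")], "loss of plosion": [("ph_b", "ac_b")]},
    [("ph_c", "ac_c")],
)
print(len(pairs), pairs[-1][2])   # 3 [0, 0, 1]
```

The branch detection networks would additionally need per-skill binary targets, which can be read off the same one-hot labels.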
Referring to fig. 13, in order to better execute the pronunciation skill detection method provided by the present application, the present application further provides a pronunciation skill detection device 400, as shown in fig. 13, the pronunciation skill detection device 400 includes:
the first obtaining module 410 is configured to obtain a text to be detected, and convert the text to be detected into a corresponding phoneme sequence;
the second obtaining module 420 is configured to obtain audio to be detected obtained by a speaker speaking a text to be detected, and extract acoustic features of the audio to be detected;
the detection module 430 is configured to input the phoneme sequence and the acoustic feature into a trained pronunciation skill detection model to perform pronunciation skill detection processing, so as to obtain a first detection result and a second detection result;
the first detection result is used for representing whether the text to be detected needs to be spoken with a pronunciation skill, and the second detection result is used for representing whether the speaker spoke the text to be detected with a pronunciation skill.
In an alternative embodiment, the pronunciation skill detection model includes a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network, and a second pronunciation skill detection network, and the detection module 430 is configured to:
Inputting the phoneme sequence into a phoneme feature extraction network to perform feature extraction processing to obtain a phoneme feature matrix;
Inputting the phoneme feature matrix into a first pronunciation skill detection network to carry out pronunciation skill detection processing to obtain a first detection result;
if the first detection result representation needs to adopt pronunciation skills to speak the text to be detected, inputting the acoustic features into an acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;
inputting the enhanced acoustic feature matrix and the phoneme feature matrix into a feature fusion network to perform feature fusion processing to obtain a fusion feature matrix;
and inputting the fusion feature matrix into a second pronunciation skill detection network to carry out pronunciation skill detection processing to obtain a second detection result.
In an alternative embodiment, the phoneme feature extraction network comprises a phoneme embedding module and a phoneme feature extraction module, and the detection module 430 is configured to:
Inputting the phoneme sequence into a phoneme embedding module for embedding processing to obtain a phoneme vector matrix;
Inputting the phoneme vector matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix.
In an alternative embodiment, the phoneme feature extraction module comprises at least 1 phoneme feature extraction sub-module, and the detection module 430 is configured to:
when the number of the phoneme feature extraction submodules is 1, inputting the phoneme vector matrix into the phoneme feature extraction submodule to perform feature extraction processing to obtain a phoneme feature matrix, or
When the number of the phoneme feature extraction submodules is N, inputting the phoneme vector matrix into the N phoneme feature extraction submodules to sequentially perform feature extraction processing to obtain a phoneme feature matrix, wherein N is an integer larger than 1.
In an alternative embodiment, the phoneme feature extracting submodule includes a first matrix conversion layer, a first multi-headed attention layer and a first matrix fusion layer, and the detecting module 430 is configured to:
Inputting the phoneme vector matrix into a first matrix conversion layer for matrix conversion processing to obtain a first query matrix, a first key matrix and a first value matrix;
Inputting a first query matrix, a first key matrix and a first value matrix into a first multi-head attention layer for attention enhancement processing to obtain a first attention enhancement matrix;
inputting the first attention enhancement matrix and the phoneme vector matrix into a first matrix fusion layer for matrix fusion processing to obtain a phoneme characteristic matrix.
In an alternative embodiment, the phoneme feature extracting network further includes a first position encoding module and a second matrix fusion layer, and before inputting the phoneme vector matrix into the phoneme feature extracting module for feature extraction processing, the detecting module 430 is further configured to:
Inputting the phoneme vector matrix into a first position coding module for position coding processing to obtain a first position coding matrix;
Inputting the first position coding matrix and the phoneme vector matrix into a second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix;
when the phoneme vector matrix is input to the phoneme feature extraction module to perform feature extraction processing to obtain a phoneme feature matrix, the detection module 430 is configured to input the phoneme position fusion matrix to the phoneme feature extraction module to perform feature extraction processing to obtain a phoneme feature matrix.
In an alternative embodiment, the acoustic feature enhancement network includes a feature encoding module and at least 1 acoustic feature enhancement module, and the detecting module 430 is configured to:
inputting the acoustic features into the feature encoding module for feature encoding processing to obtain an acoustic feature matrix;
When the number of the acoustic feature enhancement modules is 1, inputting the acoustic feature matrix into the acoustic feature enhancement modules for feature enhancement processing to obtain an enhanced acoustic feature matrix, or
When the number of the acoustic feature enhancement modules is M, inputting the acoustic feature matrix into the M acoustic feature enhancement modules to sequentially perform feature enhancement processing, so as to obtain an enhanced acoustic feature matrix, wherein M is an integer larger than 1.
In an alternative embodiment, the feature encoding module includes a first convolution layer, a first pooling layer, a second convolution layer, and a second pooling layer, and the detection module 430 is configured to:
Inputting the acoustic features into a first convolution layer for convolution processing to obtain a first convolution result;
inputting the first convolution result into a first pooling layer for pooling processing to obtain a first pooling result;
Inputting the first pooling result into a second convolution layer for convolution processing to obtain a second convolution result;
And inputting the second convolution result into a second pooling layer for pooling processing to obtain an acoustic feature matrix.
In an alternative embodiment, the acoustic feature enhancement module includes a second matrix conversion layer, a second multi-headed attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer, and a fourth matrix fusion layer, and the detection module 430 is configured to:
Inputting the acoustic feature matrix into a second matrix conversion layer for matrix conversion processing to obtain a second query matrix, a second key matrix and a second value matrix;
Inputting the second query matrix, the second key matrix and the second value matrix into a second multi-head attention layer for attention enhancement processing to obtain a second attention enhancement matrix;
Inputting the second attention enhancement matrix and the acoustic feature matrix into a third matrix fusion layer for matrix fusion processing to obtain an acoustic fusion matrix;
Inputting the acoustic fusion matrix into a third convolution layer for convolution processing to obtain a third convolution result;
Inputting the third convolution result into the deconvolution layer for deconvolution processing to obtain a deconvolution result;
and inputting the acoustic fusion matrix and the deconvolution result into a fourth matrix fusion layer for matrix fusion processing to obtain the enhanced acoustic feature matrix.
In an alternative embodiment, the acoustic feature enhancement network further includes a second position encoding module and a fifth matrix fusion layer, and before the acoustic feature matrix is input to the acoustic feature enhancement module for feature enhancement processing, the detection module 430 is further configured to:
Inputting the acoustic feature matrix into a second position coding module for position coding processing to obtain a second position coding matrix;
inputting the second position coding matrix and the acoustic feature matrix into a fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix;
When the acoustic feature matrix is input to the acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix, the detection module 430 is configured to:
And inputting the acoustic position fusion matrix into an acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix.
In an alternative embodiment, the feature fusion network includes a third matrix conversion layer, a fourth matrix conversion layer, a third multi-headed attention layer, a sixth matrix fusion layer, a feed forward network layer, and a seventh matrix fusion layer, and the detection module 430 is configured to:
inputting the enhanced acoustic feature matrix into a third matrix conversion layer for matrix conversion processing to obtain a third key matrix and a third value matrix;
inputting the phoneme feature matrix into a fourth matrix conversion layer for matrix conversion processing to obtain a third query matrix;
Inputting the third query matrix, the third key matrix and the third value matrix into a third multi-head attention layer for attention enhancement processing to obtain a third attention enhancement matrix;
inputting the third attention enhancement matrix and the third query matrix into a sixth matrix fusion layer for matrix fusion processing to obtain an acoustic phoneme fusion matrix;
Inputting the acoustic phoneme fusion matrix into a feedforward network layer for feedforward calculation processing to obtain a feedforward matrix;
And inputting the feedforward matrix and the acoustic phoneme fusion matrix into a seventh matrix fusion layer for matrix fusion processing to obtain a fusion feature matrix.
In an alternative embodiment, the first sounding skill detection network includes a first full connection layer and a first classification function layer, and the detection module 430 is configured to:
Inputting the phoneme characteristic matrix into a first full-connection layer for full-connection processing to obtain a first full-connection result;
And inputting the first full-connection result into a first classification function layer for classification processing to obtain a first detection result.
In an alternative embodiment, the second pronunciation skill detection network includes L branch detection networks, each branch detection network corresponding to a different pronunciation skill, L being an integer greater than 1, and the detection module 430 is configured to:
Inputting the fusion feature matrix into each branch detection network to detect pronunciation skills to obtain a branch pronunciation skill detection result of each branch detection network, wherein the branch pronunciation skill detection result of each branch detection network represents whether a speaker adopts pronunciation skills corresponding to each branch detection network to speak a text to be detected;
And obtaining a second detection result according to the branch pronunciation skill detection result of each branch detection network.
In an alternative embodiment, the branch detection network includes a second full connection layer and a second classification function layer, and the detection module 430 is configured to:
Inputting the fusion feature matrix into a second full-connection layer for full-connection processing to obtain a second full-connection result;
And inputting the second full-connection result into a second classification function layer for classification processing to obtain a branch pronunciation skill detection result.
In an alternative embodiment, the pronunciation skill detecting device provided by the application further includes a training module for:
Acquiring a plurality of types of first sample texts that are known to require utterance with different pronunciation skills, and converting each type of first sample text into a corresponding positive sample phoneme sequence;
acquiring first sample audio obtained by a sample user speaking each type of first sample text with the corresponding pronunciation skill, and extracting positive sample acoustic features of the first sample audio of each type of first sample text;
acquiring a second sample text that is known not to require utterance with any pronunciation skill, and converting the second sample text into a corresponding negative sample phoneme sequence;
acquiring second sample audio obtained by the sample user speaking the second sample text, and extracting negative sample acoustic features of the second sample audio;
And performing model training according to each type of positive sample phoneme sequence, each type of positive sample acoustic feature, negative sample phoneme sequence and negative sample acoustic feature to obtain a pronunciation skill detection model.
In an alternative embodiment, the second obtaining module 420 is configured to:
Extracting Filterbank features, fundamental frequency features and energy features of the audio to be detected;
and fusing the Filterbank characteristic, the fundamental frequency characteristic and the energy characteristic to obtain the acoustic characteristic.
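The text does not specify how the Filterbank, fundamental frequency, and energy features are fused; one common choice, assumed here purely for illustration, is frame-wise concatenation, which requires all three features to share the same number of frames.

```python
import numpy as np

def fuse_acoustic_features(fbank, f0, energy):
    """Frame-wise concatenation of filterbank, fundamental-frequency, and
    energy features; all inputs must share the same number of frames T."""
    f0 = np.asarray(f0).reshape(-1, 1)        # (T,) -> (T, 1)
    energy = np.asarray(energy).reshape(-1, 1)
    return np.concatenate([fbank, f0, energy], axis=1)   # (T, D + 2)

fbank = np.zeros((100, 40))   # toy: 100 frames of 40-dim filterbank features
f0 = np.zeros(100)            # one fundamental-frequency value per frame
energy = np.zeros(100)        # one energy value per frame
feats = fuse_acoustic_features(fbank, f0, energy)
print(feats.shape)            # (100, 42)
```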
In an alternative embodiment, the first obtaining module 410 is configured to:
Removing text units that are not pronounced in the text to be detected to obtain a new text to be detected;
and converting each text unit in the new text to be detected into a corresponding phoneme unit to obtain a phoneme sequence.
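The two-step conversion above (drop non-pronounced units, then map each text unit to phoneme units) can be sketched as follows; the tiny `LEXICON` dictionary, the regex-based cleanup, and the phoneme symbols are illustrative assumptions, since a real system would use a full pronunciation lexicon.

```python
import re

# hypothetical grapheme-to-phoneme dictionary for illustration only
LEXICON = {"good": ["G", "UH", "D"], "morning": ["M", "AO", "R", "N", "IH", "NG"]}

def text_to_phonemes(text):
    """Strip text units that are not pronounced (punctuation etc.), then
    convert each remaining text unit into its phoneme units."""
    cleaned = re.sub(r"[^A-Za-z\s]", "", text)   # remove non-pronounced units
    phonemes = []
    for word in cleaned.lower().split():
        phonemes.extend(LEXICON.get(word, []))
    return phonemes

print(text_to_phonemes("Good morning!"))
# → ['G', 'UH', 'D', 'M', 'AO', 'R', 'N', 'IH', 'NG']
```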
It should be noted that, the pronunciation skill detecting device 400 provided in the embodiment of the present application and the pronunciation skill detecting method in the above embodiment belong to the same concept, and detailed implementation processes thereof are described in the above related embodiments, which are not repeated here.
The embodiment of the application also provides an electronic device, which comprises a memory and a processor, wherein the processor is used for executing the steps in the pronunciation skill detection method provided by the embodiment by calling the computer program stored in the memory.
Referring to fig. 14, fig. 14 is a schematic structural diagram of an electronic device 100 according to an embodiment of the application.
The electronic device 100 may include a network interface 110, a memory 120, a processor 130, screen components, and the like. Those skilled in the art will appreciate that the configuration of the electronic device 100 shown in fig. 14 does not constitute a limitation of the electronic device 100, and may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.
The network interface 110 may be used to make network connections between devices.
Memory 120 may be used to store computer programs and data. The memory 120 stores a computer program having executable code included therein. The computer program may be divided into various functional modules. The processor 130 executes various functional applications and data processing by running a computer program stored in the memory 120.
The processor 130 is a control center of the electronic device 100, connects various parts of the entire electronic device 100 using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing computer programs stored in the memory 120 and calling data stored in the memory 120, thereby controlling the electronic device 100 as a whole.
In the embodiment of the present application, the processor 130 in the electronic device 100 loads one or more executable codes corresponding to one or more computer programs into the memory 120 according to the following instructions, and the steps in the pronunciation skill detection method provided by the present application are executed by the processor 130, for example:
Obtaining a text to be detected, and converting the text to be detected into a corresponding phoneme sequence;
Acquiring audio to be detected, which is obtained by a speaker speaking a text to be detected, and extracting acoustic characteristics of the audio to be detected;
inputting the phoneme sequence and the acoustic characteristics into a trained pronunciation skill detection model to carry out pronunciation skill detection processing, so as to obtain a first detection result and a second detection result;
the first detection result is used for representing whether the text to be detected needs to be spoken with a pronunciation skill, and the second detection result is used for representing whether the speaker spoke the text to be detected with a pronunciation skill.
It should be noted that, the electronic device 100 provided in the embodiment of the present application and the method for detecting pronunciation skills in the above embodiment belong to the same concept, and detailed implementation processes of the method are described in the above related embodiments, which are not repeated here.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed on a processor of an electronic device provided by an embodiment of the present application, causes the processor of the electronic device to perform the steps in the pronunciation skill detection method applicable to the electronic device. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The pronunciation skill detection method, device, storage medium, and electronic device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the above descriptions of the embodiments are only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and application scope according to the idea of the present application; in summary, the contents of this specification should not be construed as limiting the present application.

Claims (19)

1. A pronunciation skill detection method, comprising: obtaining a text to be detected, and converting the text to be detected into a corresponding phoneme sequence; acquiring audio to be detected obtained by a speaker speaking the text to be detected, and extracting acoustic features of the audio to be detected; inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result; wherein the first detection result indicates whether the text to be detected needs to be spoken with a pronunciation skill, and the second detection result indicates whether the speaker spoke the text to be detected with a pronunciation skill; the pronunciation skill detection model comprises a phoneme feature extraction network and a first pronunciation skill detection network, the first pronunciation skill detection network comprising a first fully connected layer and a first classification function layer, and obtaining the first detection result comprises: inputting the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain a phoneme feature matrix; inputting the phoneme feature matrix into the first fully connected layer for full connection processing to obtain a first fully connected result; inputting the first fully connected result into the first classification function layer for classification processing to obtain the first detection result.

2. The method according to claim 1, wherein the pronunciation skill detection model further comprises an acoustic feature enhancement network, a feature fusion network and a second pronunciation skill detection network, and obtaining the second detection result comprises: inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix; inputting the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fused feature matrix; inputting the fused feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.
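As a concrete illustration of claim 1's first detection head, here is a minimal numpy sketch: the phoneme feature matrix is aggregated, passed through a fully connected layer, and classified by a softmax layer. The mean-pooling step, all shapes, and the two-class output are assumptions for illustration; the patent does not specify them.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def first_detection_head(phoneme_feature_matrix, W, b):
    """First pronunciation skill detection network: fully connected layer
    followed by a classification function layer (claim 1). Mean-pooling
    over the phoneme axis is an assumed aggregation step."""
    pooled = phoneme_feature_matrix.mean(axis=0)   # (d,)
    logits = pooled @ W + b                        # (2,): needs-skill vs. not
    return softmax(logits)

rng = np.random.default_rng(0)
feats = rng.standard_normal((7, 16))               # 7 phonemes, 16-dim features
W, b = rng.standard_normal((16, 2)), np.zeros(2)
probs = first_detection_head(feats, W, b)          # first detection result
```

The output is a probability distribution over "needs a pronunciation skill" vs. "does not", matching the binary nature of the first detection result.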
3. The method according to claim 1, wherein the phoneme feature extraction network comprises a phoneme embedding module and a phoneme feature extraction module, and obtaining the phoneme feature matrix comprises: inputting the phoneme sequence into the phoneme embedding module for embedding processing to obtain a phoneme vector matrix; inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.

4. The method according to claim 3, wherein the phoneme feature extraction module comprises at least one phoneme feature extraction submodule, and obtaining the phoneme feature matrix comprises: when there is one phoneme feature extraction submodule, inputting the phoneme vector matrix into the submodule for feature extraction processing to obtain the phoneme feature matrix; or, when there are N phoneme feature extraction submodules, inputting the phoneme vector matrix into the N submodules for sequential feature extraction processing to obtain the phoneme feature matrix, N being an integer greater than 1.
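The phoneme embedding module of claim 3 is, in essence, a table lookup. A minimal sketch, with an assumed phoneme inventory size and embedding dimension:

```python
import numpy as np

def embed_phonemes(phoneme_ids, embedding_table):
    """Phoneme embedding module (claim 3): look up each phoneme id in the
    sequence to build the phoneme vector matrix (one row per phoneme)."""
    return embedding_table[np.asarray(phoneme_ids)]

rng = np.random.default_rng(1)
table = rng.standard_normal((40, 8))   # assumed 40-phoneme inventory, 8-dim vectors
seq = [3, 17, 5, 5, 22]                # a toy phoneme id sequence
vec_matrix = embed_phonemes(seq, table)
```

Repeated phoneme ids map to identical rows, which is exactly what lets the downstream attention layers treat recurring phonemes consistently.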
5. The method according to claim 4, wherein the phoneme feature extraction submodule comprises a first matrix conversion layer, a first multi-head attention layer and a first matrix fusion layer, and the feature extraction processing comprises: inputting the phoneme vector matrix into the first matrix conversion layer for matrix conversion processing to obtain a first query matrix, a first key matrix and a first value matrix; inputting the first query matrix, the first key matrix and the first value matrix into the first multi-head attention layer for attention enhancement processing to obtain a first attention enhancement matrix; inputting the first attention enhancement matrix and the phoneme vector matrix into the first matrix fusion layer for matrix fusion processing to obtain the phoneme feature matrix.
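The submodule of claim 5 has the shape of a standard Transformer self-attention block. A numpy sketch under stated assumptions: the "matrix conversion layer" is realized as three learned projections, and the "matrix fusion layer" as a residual add (the patent names the layers but not the exact operations):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def extraction_submodule(X, Wq, Wk, Wv, n_heads):
    """Claim 5's submodule: a matrix conversion layer producing query/key/
    value matrices, a multi-head attention layer, and a matrix fusion
    layer realized here as a residual add (the exact fusion op is assumed)."""
    T, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # first query/key/value matrices
    dh = d // n_heads
    heads = []
    for h in range(n_heads):                   # scaled dot-product attention per head
        q, k, v = (M[:, h*dh:(h+1)*dh] for M in (Q, K, V))
        heads.append(softmax(q @ k.T / np.sqrt(dh)) @ v)
    attention = np.concatenate(heads, axis=1)  # first attention enhancement matrix
    return X + attention                       # first matrix fusion layer

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 8))                # phoneme vector matrix
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = extraction_submodule(X, Wq, Wk, Wv, n_heads=2)
```

Stacking N such submodules sequentially, as claim 4 allows, deepens the phoneme representation while the residual fusion keeps the original embedding information flowing through.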
6. The method according to claim 3, wherein the phoneme feature extraction network further comprises a first position encoding module and a second matrix fusion layer, and before the phoneme vector matrix is input into the phoneme feature extraction module, the method further comprises: inputting the phoneme vector matrix into the first position encoding module for position encoding processing to obtain a first position encoding matrix; inputting the first position encoding matrix and the phoneme vector matrix into the second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix; and obtaining the phoneme feature matrix comprises: inputting the phoneme position fusion matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.
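The position encoding module of claim 6 can be sketched with the common sinusoidal scheme; the patent only names the module, so both the sinusoidal form and the additive fusion are assumptions:

```python
import numpy as np

def sinusoidal_positions(T, d):
    """One common choice for a position encoding module: sine on even
    feature indices, cosine on odd ones, with geometrically spaced
    wavelengths over positions 0..T-1."""
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

vec_matrix = np.zeros((6, 8))              # stand-in phoneme vector matrix
pe = sinusoidal_positions(6, 8)            # first position encoding matrix
phoneme_position_fusion = vec_matrix + pe  # second matrix fusion layer as addition
```

Because attention itself is order-agnostic, fusing in a position code before the extraction module is what lets the model distinguish the same phoneme appearing at different places in the sequence.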
7. The method according to claim 2, wherein the acoustic feature enhancement network comprises a feature encoding module and at least one acoustic feature enhancement module, and obtaining the enhanced acoustic feature matrix comprises: inputting the acoustic features into the feature encoding module for feature encoding processing to obtain an acoustic feature matrix; when there is one acoustic feature enhancement module, inputting the acoustic feature matrix into the module for feature enhancement processing to obtain the enhanced acoustic feature matrix; or, when there are M acoustic feature enhancement modules, inputting the acoustic feature matrix into the M modules for sequential feature enhancement processing to obtain the enhanced acoustic feature matrix, M being an integer greater than 1.
8. The method according to claim 7, wherein the feature encoding module comprises a first convolution layer, a first pooling layer, a second convolution layer and a second pooling layer, and obtaining the acoustic feature matrix comprises: inputting the acoustic features into the first convolution layer for convolution processing to obtain a first convolution result; inputting the first convolution result into the first pooling layer for pooling processing to obtain a first pooling result; inputting the first pooling result into the second convolution layer for convolution processing to obtain a second convolution result; inputting the second convolution result into the second pooling layer for pooling processing to obtain the acoustic feature matrix.
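Claim 8's conv → pool → conv → pool pipeline can be sketched in one dimension with hand-written kernels; kernel values, pool sizes, and the single-channel layout are all illustrative assumptions:

```python
import numpy as np

def conv1d(x, kernel):
    """'Valid' 1-D convolution (cross-correlation, as in DL frameworks)."""
    k = len(kernel)
    return np.array([x[i:i+k] @ kernel for i in range(len(x) - k + 1)])

def max_pool(x, size):
    n = len(x) // size
    return x[:n*size].reshape(n, size).max(axis=1)

def feature_encode(signal, k1, k2):
    """Claim 8's feature encoding module: first convolution layer, first
    pooling layer, second convolution layer, second pooling layer."""
    h = max_pool(conv1d(signal, k1), 2)
    return max_pool(conv1d(h, k2), 2)

sig = np.arange(20, dtype=float)               # toy 1-channel acoustic input
out = feature_encode(sig, np.array([1.0, 0.0, -1.0]), np.array([0.5, 0.5]))
```

Each conv/pool pair halves the time resolution while widening the receptive field, which is why this module sits in front of the (quadratic-cost) attention-based enhancement modules.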
9. The method according to claim 7, wherein the acoustic feature enhancement module comprises a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer and a fourth matrix fusion layer, and the feature enhancement processing comprises: inputting the acoustic feature matrix into the second matrix conversion layer for matrix conversion processing to obtain a second query matrix, a second key matrix and a second value matrix; inputting the second query matrix, the second key matrix and the second value matrix into the second multi-head attention layer for attention enhancement processing to obtain a second attention enhancement matrix; inputting the second attention enhancement matrix and the acoustic feature matrix into the third matrix fusion layer for matrix fusion processing to obtain an acoustic fusion matrix; inputting the acoustic fusion matrix into the third convolution layer for convolution processing to obtain a third convolution result; inputting the third convolution result into the deconvolution layer for deconvolution processing to obtain a deconvolution result; inputting the acoustic fusion matrix and the deconvolution result into the fourth matrix fusion layer for matrix fusion processing to obtain the enhanced acoustic feature matrix.
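The data flow of claim 9's enhancement module can be traced with a deliberately compressed sketch: single-head attention stands in for the multi-head layer, and plain matrix multiplies stand in for the convolution and deconvolution layers, so only the residual topology (attention + residual, then conv → deconv, then a second residual) is faithful to the claim:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def enhancement_module(A, Wq, Wk, Wv, conv_w, deconv_w):
    """Compressed stand-in for claim 9's acoustic feature enhancement
    module; see the lead-in for which layers are simplified."""
    Q, K, V = A @ Wq, A @ Wk, A @ Wv                   # second Q/K/V matrices
    att = softmax(Q @ K.T / np.sqrt(A.shape[1])) @ V   # attention enhancement
    fused = A + att                                    # third matrix fusion layer
    conv = fused @ conv_w                              # stand-in third convolution
    deconv = conv @ deconv_w                           # stand-in deconvolution
    return fused + deconv                              # fourth matrix fusion layer

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 6))                       # acoustic feature matrix
Wq, Wk, Wv = (rng.standard_normal((6, 6)) for _ in range(3))
conv_w, deconv_w = rng.standard_normal((6, 4)), rng.standard_normal((4, 6))
enhanced = enhancement_module(A, Wq, Wk, Wv, conv_w, deconv_w)
```

The conv/deconv pair bottlenecks and restores the feature dimension, so the fourth fusion layer can add the reconstruction back onto the attention-enhanced matrix at the original shape.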
10. The method according to claim 7, wherein the acoustic feature enhancement network further comprises a second position encoding module and a fifth matrix fusion layer, and before the acoustic feature matrix is input into the acoustic feature enhancement module, the method further comprises: inputting the acoustic feature matrix into the second position encoding module for position encoding processing to obtain a second position encoding matrix; inputting the second position encoding matrix and the acoustic feature matrix into the fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix; and obtaining the enhanced acoustic feature matrix comprises: inputting the acoustic position fusion matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix.
11. The method according to claim 2, wherein the feature fusion network comprises a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feedforward network layer and a seventh matrix fusion layer, and obtaining the fused feature matrix comprises: inputting the enhanced acoustic feature matrix into the third matrix conversion layer for matrix conversion processing to obtain a third key matrix and a third value matrix; inputting the phoneme feature matrix into the fourth matrix conversion layer for matrix conversion processing to obtain a third query matrix; inputting the third query matrix, the third key matrix and the third value matrix into the third multi-head attention layer for attention enhancement processing to obtain a third attention enhancement matrix; inputting the third attention enhancement matrix and the third query matrix into the sixth matrix fusion layer for matrix fusion processing to obtain an acoustic-phoneme fusion matrix; inputting the acoustic-phoneme fusion matrix into the feedforward network layer for feedforward calculation processing to obtain a feedforward matrix; inputting the feedforward matrix and the acoustic-phoneme fusion matrix into the seventh matrix fusion layer for matrix fusion processing to obtain the fused feature matrix.
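The key structural point of claim 11 is that it is cross-attention: queries come from the phoneme side, keys and values from the acoustic side, so the fused matrix is aligned to the phoneme sequence regardless of how many acoustic frames there are. A simplified sketch (single head, residual fusion, feedforward layer omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_features(phoneme_feats, acoustic_feats, Wq, Wk, Wv):
    """Claim 11's fusion: the third query matrix comes from the phoneme
    features, the third key/value matrices from the enhanced acoustic
    features; single-head attention and the residual fusion are
    simplifications of the claimed multi-head and fusion layers."""
    Q = phoneme_feats @ Wq
    K, V = acoustic_feats @ Wk, acoustic_feats @ Wv
    att = softmax(Q @ K.T / np.sqrt(Q.shape[1])) @ V   # third attention enhancement
    return Q + att                                     # sixth matrix fusion layer

rng = np.random.default_rng(2)
P = rng.standard_normal((5, 8))     # 5 phonemes
A = rng.standard_normal((30, 8))    # 30 acoustic frames
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
fused = fuse_features(P, A, Wq, Wk, Wv)
```

Note the output has 5 rows, one per phoneme, even though 30 acoustic frames went in; each phoneme row is a mixture of the frames it attends to.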
12. The method according to claim 2, wherein the second pronunciation skill detection network comprises L branch detection networks, each branch detection network corresponding to a different pronunciation skill, L being an integer greater than 1, and obtaining the second detection result comprises: inputting the fused feature matrix into each branch detection network for pronunciation skill detection to obtain a branch pronunciation skill detection result of each branch detection network, the branch pronunciation skill detection result indicating whether the speaker spoke the text to be detected with the pronunciation skill corresponding to that branch detection network; obtaining the second detection result according to the branch pronunciation skill detection results of all the branch detection networks.
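The multi-branch structure of claim 12 amounts to L independent heads over one shared fused feature matrix. A minimal sketch; the skill names, the mean-pooling, and the per-branch sigmoid classifier are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def branch_detect(fused, branch_params):
    """Claim 12: L branch detection networks share one fused feature
    matrix; each branch is its own fully connected + classification
    layer producing a per-skill detection score."""
    pooled = fused.mean(axis=0)
    return {skill: float(sigmoid(pooled @ W + b))
            for skill, (W, b) in branch_params.items()}

rng = np.random.default_rng(3)
fused = rng.standard_normal((5, 8))
params = {s: (rng.standard_normal(8), 0.0)
          for s in ("liaison", "stress", "weak_form")}   # hypothetical skills
second_result = branch_detect(fused, params)
```

Thresholding each branch score yields the per-skill yes/no answers that are combined into the second detection result.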
13. The method according to claim 12, wherein the branch detection network comprises a second fully connected layer and a second classification function layer, and obtaining the branch pronunciation skill detection result comprises: inputting the fused feature matrix into the second fully connected layer for full connection processing to obtain a second fully connected result; inputting the second fully connected result into the second classification function layer for classification processing to obtain the branch pronunciation skill detection result.

14. The method according to claim 12, wherein before obtaining the text to be detected, the method further comprises: obtaining multiple classes of first sample texts known to require different pronunciation skills, and converting each class of first sample texts into a corresponding positive-sample phoneme sequence; obtaining first sample audio of sample users speaking each class of first sample texts with the different pronunciation skills, and extracting positive-sample acoustic features of the first sample audio of each class; obtaining second sample texts known not to require a pronunciation skill, and converting the second sample texts into corresponding negative-sample phoneme sequences; obtaining second sample audio of the sample users speaking the second sample texts, and extracting negative-sample acoustic features of the second sample audio; performing model training according to each class of positive-sample phoneme sequences, each class of positive-sample acoustic features, the negative-sample phoneme sequences and the negative-sample acoustic features to obtain the pronunciation skill detection model.

15. The method according to any one of claims 1-14, wherein extracting the acoustic features of the audio to be detected comprises: extracting a Filterbank feature, a fundamental frequency feature and an energy feature of the audio to be detected; fusing the Filterbank feature, the fundamental frequency feature and the energy feature to obtain the acoustic features.

16. The method according to any one of claims 1-14, wherein converting the text to be detected into the corresponding phoneme sequence comprises: removing unpronounced text units from the text to be detected to obtain a new text to be detected; converting each text unit in the new text to be detected into a corresponding phoneme unit to obtain the phoneme sequence.
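The three-way feature fusion of claim 15 can be sketched per frame: compute an energy feature, a spectral feature, and a fundamental-frequency feature, then stack them column-wise. Two stand-ins are used here: raw FFT magnitudes replace a true mel filterbank, and a naive autocorrelation tracker replaces a real F0 estimator, so this is an illustration of the fusion, not of production feature extraction:

```python
import numpy as np

def frame(signal, size, hop):
    """Slice a 1-D signal into overlapping analysis frames."""
    n = 1 + (len(signal) - size) // hop
    return np.stack([signal[i*hop:i*hop+size] for i in range(n)])

def acoustic_features(signal, sr, size=400, hop=160):
    """Claim 15's fusion of Filterbank, fundamental frequency and energy
    features, with simplified stand-ins for the first two."""
    frames = frame(signal, size, hop)
    energy = np.log(np.sum(frames**2, axis=1) + 1e-8)       # energy feature
    spec = np.abs(np.fft.rfft(frames, axis=1))[:, :40]      # crude "filterbank"
    f0 = []
    for f in frames:                                        # search 60-400 Hz
        ac = np.correlate(f, f, mode="full")[size - 1:]
        lag = np.argmax(ac[sr // 400: sr // 60]) + sr // 400
        f0.append(sr / lag)
    return np.column_stack([spec, np.array(f0), energy])    # fused feature matrix

sr = 16000
t = np.arange(sr) / sr
feats = acoustic_features(np.sin(2 * np.pi * 200 * t), sr)  # 1 s of a 200 Hz tone
```

The pitch and energy columns are what let the model see stress, weak forms and intonation that a spectral feature alone would underrepresent.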
17. A pronunciation skill detection device, comprising: a first acquisition module, configured to obtain a text to be detected and convert the text to be detected into a corresponding phoneme sequence; a second acquisition module, configured to acquire audio to be detected obtained by a speaker speaking the text to be detected and extract acoustic features of the audio to be detected; a detection module, configured to input the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result; wherein the first detection result indicates whether the text to be detected needs to be spoken with a pronunciation skill, and the second detection result indicates whether the speaker spoke the text to be detected with a pronunciation skill; the pronunciation skill detection model comprises a phoneme feature extraction network and a first pronunciation skill detection network, the first pronunciation skill detection network comprising a first fully connected layer and a first classification function layer, and the detection module is further configured to: input the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain a phoneme feature matrix; input the phoneme feature matrix into the first fully connected layer for full connection processing to obtain a first fully connected result; input the first fully connected result into the first classification function layer for classification processing to obtain the first detection result.

18. A storage medium storing a computer program, wherein the computer program, when loaded by a processor, executes the steps of the pronunciation skill detection method according to any one of claims 1-16.

19. An electronic device comprising a processor and a memory storing a computer program, wherein the processor is configured to execute the steps of the pronunciation skill detection method according to any one of claims 1-16 by loading the computer program.
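Claim 16's two-step text-to-phoneme conversion (drop unpronounced units, then map each remaining unit to phoneme units) can be sketched with a toy lexicon; the lexicon entries, the ARPAbet-style symbols, and the punctuation set are illustrative, not taken from the patent:

```python
# Hypothetical mini-lexicon with ARPAbet-style symbols; a real system
# would use a full grapheme-to-phoneme lexicon.
LEXICON = {
    "read": ["R", "IY", "D"],
    "the": ["DH", "AH"],
    "text": ["T", "EH", "K", "S", "T"],
}
SILENT_UNITS = ",.!?-"          # text units with no pronunciation

def text_to_phonemes(text):
    """Claim 16's two steps: remove unpronounced text units to get a new
    text, then convert each remaining unit into its phoneme units."""
    units = [u.strip(SILENT_UNITS).lower() for u in text.split()]
    units = [u for u in units if u]            # drop units that were all silent
    seq = []
    for u in units:
        seq.extend(LEXICON.get(u, ["<unk>"]))  # placeholder for unknown words
    return seq

phonemes = text_to_phonemes("Read the text.")
```

Stripping silent units first keeps the phoneme sequence aligned with what is actually audible, which matters when it is later attended against acoustic frames.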
CN202111620731.2A 2021-12-28 2021-12-28 Pronunciation skill detection method, device, storage medium and electronic device Active CN114170997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111620731.2A CN114170997B (en) 2021-12-28 2021-12-28 Pronunciation skill detection method, device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111620731.2A CN114170997B (en) 2021-12-28 2021-12-28 Pronunciation skill detection method, device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN114170997A CN114170997A (en) 2022-03-11
CN114170997B true CN114170997B (en) 2025-06-24

Family

ID=80488185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111620731.2A Active CN114170997B (en) 2021-12-28 2021-12-28 Pronunciation skill detection method, device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN114170997B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6729539B2 * 2017-11-29 2020-07-22 Yamaha Corporation Speech synthesis method, speech synthesis system and program
CN116564350A (en) * 2023-06-09 2023-08-08 腾讯音乐娱乐科技(深圳)有限公司 Pronunciation detection method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968618A * 2020-08-27 2020-11-20 Tencent Technology (Shenzhen) Co., Ltd. Speech synthesis method and device
CN112349300A * 2020-11-06 2021-02-09 Beijing Lexuebang Network Technology Co., Ltd. Voice evaluation method and device
CN113066510A * 2021-04-26 2021-07-02 Institute of Acoustics, Chinese Academy of Sciences Vowel weak reading detection method and device
CN113345467A * 2021-05-19 2021-09-03 Suzhou Qimengzhe Network Technology Co., Ltd. Method, device, medium and equipment for evaluating spoken language pronunciation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8271281B2 (en) * 2007-12-28 2012-09-18 Nuance Communications, Inc. Method for assessing pronunciation abilities
CN110751944B * 2019-09-19 2024-09-24 Ping An Technology (Shenzhen) Co., Ltd. Method, device, equipment and storage medium for constructing voice recognition model
CN111785256A * 2020-06-28 2020-10-16 Beijing Sankuai Online Technology Co., Ltd. Acoustic model training method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN114170997A (en) 2022-03-11

Similar Documents

Publication Publication Date Title
CN113470662B (en) Generating and using text-to-speech data for keyword detection system and speaker adaptation in speech recognition system
CN111312245B (en) Voice response method, device and storage medium
EP3895159B1 (en) Multi-speaker neural text-to-speech synthesis
CN103765506B (en) A method for tone/intonation recognition using auditory attention cues
CN112017644A (en) Sound transformation system, method and application
CN113707125A (en) Training method and device for multi-language voice synthesis model
CN112581963B (en) Voice intention recognition method and system
CN115394287B (en) Mixed language speech recognition method, device, system and storage medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN115132170B (en) Language classification method, device and computer readable storage medium
CN112837669B (en) Speech synthesis method, device and server
CN118364427A (en) Mongolian multi-mode emotion analysis method based on cross-mode transformers
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN104538025A (en) Method and device for converting gestures to Chinese and Tibetan bilingual voices
CN114170997B (en) Pronunciation skill detection method, device, storage medium and electronic device
Choi et al. Learning to maximize speech quality directly using mos prediction for neural text-to-speech
CN115547484A Method and device for detecting Alzheimer's disease based on voice analysis
CN114203159B (en) Speech emotion recognition method, terminal device and computer readable storage medium
CN116416966A (en) Text-to-speech synthesis method, device, device and storage medium
CN119475252B (en) A multimodal emotion recognition method
CN116959417A (en) Dialogue turn detection methods, devices, equipment, media, and program products
CN115130457A (en) Prosody modeling method and modeling system fused with Amdo Tibetan phoneme vectors
CN119229845B (en) Speech synthesis method and device, electronic device and storage medium
Vijaya et al. An Efficient System for Audio-Based Sign Language Translator Through MFCC Feature Extraction
CN119479609A (en) Speech generation method, device, equipment, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant