CN109346058B - A system for expanding speech acoustic features - Google Patents
- Publication number
- CN109346058B (application CN201811443497.9A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- sound
- video
- submodule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The application belongs to the technical field of sound processing and particularly relates to a speech acoustic feature expansion system. In language learning, the acoustic features of speech must be expanded to produce corpus suited to brain perception, which is then used to stimulate the learner's brain. The application provides a speech acoustic feature expansion system comprising a speech acquisition unit, a speech processing unit, and a video editing unit, wherein the speech acquisition unit is connected to the speech processing unit and the speech processing unit is connected to the video editing unit. The speech acquisition unit acquires natural speech; the speech processing unit expands the spectral features of the natural speech to different degrees to produce corpus; and the video editing unit edits speech video together with the processed speech to synthesize video clips. The system can produce corpus better suited to brain perception, helping learners form speech categories in the brain closer to those of native speakers.
Description
Technical Field
The application belongs to the technical field of sound processing and particularly relates to a speech acoustic feature expansion system.
Background
With the rapid development of related fields such as bioengineering, computer science, statistical data processing, and brain imaging, brain science has combined the strengths of these disciplines to explore anew how brain development interacts with the language learning environment. Studies have shown that after about 12 months of age, infants gradually lose sensitivity to non-native speech sounds, creating an obstacle to later foreign-language learning. A person habitually learns a new language through the lens of existing speech perception, so foreign sounds similar to native-language pronunciation are acquired quickly, while sounds absent from the native language are acquired with difficulty. At the same time, when learning a sound similar to one in the native language, the learner is more easily influenced by the native language and so develops an accent. For example, for the same English utterance, an American's brain and a Chinese speaker's brain may perceive it differently.
Because the learner is insensitive to non-native speech sounds, the learner cannot fully receive the language information auditorily in the first place, and so finds it difficult to pronounce the sounds correctly. Moreover, each time a learner acquires a phoneme, a speech category for that phoneme must be established in the brain. This speech category is not a single point but a set. Because the language environment a foreign-language learner is exposed to cannot compare with that of a native speaker, the speech categories established in their brains fall far short of native ones.
In the language learning process, the acoustic features of natural speech are therefore expanded to produce corpus suited to brain perception. This stimulates the learner's nervous system, which has lost sensitivity to non-native speech, to reopen and receive the speech information comprehensively, helping the learner form speech categories in the brain closer to those of native speakers.
Disclosure of Invention
1. Technical problem to be solved
Based on the need, in the language learning process, to expand the acoustic features of natural speech and produce corpus suited to brain perception, so that the learner's nervous system, which has lost sensitivity to non-native speech, is stimulated to reopen and receive speech information comprehensively, thereby helping the learner form speech categories in the brain closer to those of native speakers.
2. Technical solution
In order to achieve the above object, the present application provides a speech acoustic feature expansion system comprising a voice acquisition unit; the voice acquisition unit is connected with a voice processing unit, and the voice processing unit is connected with a video editing unit;
the voice acquisition unit is used for acquiring natural voice;
the voice processing unit is used for expanding the frequency spectrum characteristics in the natural voice to different degrees so as to produce corpus;
the video editing unit is used for editing the voice video and the processed voice to synthesize different video clips.
Optionally, the voice processing unit comprises a MATLAB-based sound processing module.
Optionally, the MATLAB-based sound processing module includes a formant frequency difference expansion sub-module, a pitch synchronization overlap sub-module, a frequency separation sub-module, a bandwidth separation sub-module, and a gap separation sub-module.
Optionally, the MATLAB-based sound processing module includes a sound analysis sub-module and a sound synthesis sub-module.
Optionally, the video editing unit includes a format processing module and a frame rate processing module.
Optionally, the speech processing unit is configured to expand the spectral features in speech to 3 different degrees, namely 300%, 208%, and 144%, so as to produce corpus.
3. Advantageous effects
Compared with the prior art, the speech acoustic feature expansion system provided by the application has the following beneficial effects:
In the speech acoustic feature expansion system provided by the application, a voice acquisition unit, a voice processing unit, and a video editing unit are connected in sequence, and the spectral features of natural speech are expanded to produce video. The system simulates the acoustic features of the speech an infant is exposed to while learning language, and produces corpus suited to brain perception to stimulate the learner's brain, so that a brain whose sensitivity to foreign speech has declined can clearly perceive the physical acoustic features of the speech, establish speech categories similar to native ones, and thereby improve pronunciation accuracy.
Drawings
FIG. 1 is a schematic diagram of a speech acoustic feature augmentation system of the present application;
In the figure: 1-voice acquisition unit; 2-voice processing unit; 3-video editing unit; 4-MATLAB-based sound processing module; 5-formant frequency difference expansion sub-module; 6-pitch-synchronous overlap sub-module; 7-frequency separation sub-module; 8-bandwidth separation sub-module; 9-gap separation sub-module; 10-sound analysis sub-module; 11-sound synthesis sub-module; 12-format processing module; 13-frame rate processing module.
Detailed Description
Hereinafter, specific embodiments of the present application will be described in detail with reference to the accompanying drawings, and according to these detailed descriptions, those skilled in the art can clearly understand the present application and can practice the present application. Features from various embodiments may be combined to obtain new implementations, or substituted for certain features from certain embodiments to obtain further preferred implementations, without departing from the principles of the application.
In infant-directed speech, speech units are exaggerated through the vibration frequency of the vocal cords and the resonance frequencies of the oral, laryngeal, and nasal cavities, and the gaps between the formants characteristic of vowels are artificially widened. This exaggeration not only lets the infant discern speech units easily, but also lets it perceive the key speech elements that distinguish word meanings in the native language. A mother's speech to her child shows great flexibility and variability, and this variation helps the infant establish an effective acoustic pattern for speech classification, that is, a native speech category for each phoneme in the brain. Brain science has found that the process by which infants learn native speech has the following characteristics: 1) infants have the opportunity to hear many different people speaking; 2) they have the opportunity to see the mouth shapes of different people pronouncing; 3) the mother's speech to the infant is exaggerated through the vibration frequency of the vocal cords and the resonance frequencies of the oral, laryngeal, and nasal cavities. These three elements are very useful in helping infants distinguish speech differences and build comprehensive native-language speech categories.
Corpus, i.e., language material, is the content of linguistic study and the basic unit from which a corpus collection is built.
Motherese (infant-directed speech) is the register adults, especially mothers, use when speaking to infants. Its content and form (words, intonation, speed, etc.) are adapted to children's language and cognitive abilities, taking the infant's comprehension into account. Studies have shown that motherese has physical acoustic features that are expanded relative to normal speech.
Referring to fig. 1, the application provides a voice acoustic feature expanding system, which comprises a voice acquisition unit 1, wherein the voice acquisition unit 1 is connected with a voice processing unit 2, and the voice processing unit 2 is connected with a video editing unit 3;
the voice acquisition unit 1 is used for acquiring natural voice;
the voice processing unit 2 is used for expanding the frequency spectrum characteristics in the natural voice to different degrees and manufacturing corpus;
the video editing unit 3 is configured to edit the voice video and the processed voice to synthesize different video clips.
Optionally, the speech processing unit 2 comprises a MATLAB-based sound processing module 4.
Optionally, the MATLAB-based sound processing module 4 includes a formant frequency difference expansion sub-module 5, a pitch-synchronous overlap sub-module 6, a frequency separation sub-module 7, a bandwidth separation sub-module 8, and a gap separation sub-module 9.
Optionally, the MATLAB-based sound processing module 4 includes a sound analysis sub-module 10 and a sound synthesis sub-module 11. The sound analysis sub-module 10 analyzes the acquired sound, and the sound synthesis sub-module 11 then synthesizes a new sound.
Optionally, the video editing unit 3 includes a format processing module 12 and a frame rate processing module 13.
Optionally, the speech processing unit 2 is configured to expand the spectral features in speech to 3 different degrees, namely 300%, 208%, and 144%, so as to produce corpus.
Examples
Amplifying the target speech sounds is important for distinguishing their acoustic elements. For each pair of sounds to be trained, the physical parameters of the specific natural-sound processing must be determined according to the factors that distinguish the acoustic features of the two sounds.
A natural sound recording is obtained by the voice acquisition unit 1 and passed to the voice processing unit 2, where the MATLAB sound processing module 4 expands the spectral features in the speech to 3 different degrees, namely 300%, 208%, and 144%; together with the original speech, this yields a four-level training corpus. For example, for the English /r/-/l/ pair, the 3 parameters are the F3 separation frequency, the F3 bandwidth, and the F3 transition time. During synthesis, the formant frequency difference expansion sub-module 5 widens the formant frequency difference of /r/-/l/ and narrows the F3 bandwidth. The pitch-synchronous overlap sub-module 6 then uses a time-warping technique to additionally expand the temporal characteristics of /r/-/l/. As another example, for an English tense-lax vowel pair such as /i/-/ɪ/, the frequency separation sub-module 7, the bandwidth separation sub-module 8, and the gap separation sub-module 9 separate the frequencies and bandwidths of F1 and F2 and adjust the gap between them.
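The description does not disclose how the percentage levels are applied, and the actual processing is performed in MATLAB. One plausible reading treats each level as a scale factor on the separation between two formants; the hypothetical Python sketch below widens a formant gap about its midpoint under that assumption. The F2/F3 values are invented for illustration and do not come from the patent.

```python
def expand_formant_gap(f_low, f_high, factor):
    """Widen the gap between two formant frequencies (Hz) about their midpoint.

    factor=3.00 triples the original separation (the 300% level named in
    the description); factor=1.0 leaves the formants unchanged.
    """
    midpoint = (f_low + f_high) / 2.0
    half_gap = (f_high - f_low) / 2.0
    return midpoint - half_gap * factor, midpoint + half_gap * factor

# The three expansion levels named in the description.
levels = [3.00, 2.08, 1.44]

# Hypothetical F2/F3 values (Hz) for an English /r/, where F3 sits close to F2.
f2, f3 = 1200.0, 1600.0
for level in levels:
    new_f2, new_f3 = expand_formant_gap(f2, f3, level)
    print(f"{level:.0%}: F2 = {new_f2:.0f} Hz, F3 = {new_f3:.0f} Hz")
```

Scaling about the midpoint keeps the spectral center of the pair fixed while exaggerating only the cue that distinguishes the two sounds, which matches the stated goal of enlarging the acoustic contrast rather than shifting the whole spectrum.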
The "LPC Analysis and Synthesis of Speech" sub-module in the MATLAB sound processing module 4 is used for this processing. LPC stands for linear predictive coding. The module includes the sound analysis sub-module 10 and the sound synthesis sub-module 11, with which new sounds can be analyzed and synthesized. (See the DSP System Toolbox™ functionality available at the MATLAB command line.)
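The analysis-synthesis loop that sub-modules 10 and 11 rely on can be sketched independently of MATLAB. The following is a hypothetical Python re-implementation of the textbook autocorrelation method (Levinson-Durbin recursion), not the patent's actual code; it shows why LPC is a convenient point for modifying spectral parameters: inverse-filtering a frame to a residual and re-filtering through the all-pole model reconstructs the signal exactly, so the model coefficients can be edited in between.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns coefficients a with a[0] = 1."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)
    return a

def lpc_analyze(x, order):
    """Autocorrelation-method LPC: model coefficients and prediction residual."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = levinson_durbin(r, order)
    # Analysis (inverse) filter A(z): residual[n] = sum_k a[k] * x[n-k]
    residual = np.convolve(x, a)[:len(x)]
    return a, residual

def lpc_synthesize(residual, a):
    """All-pole synthesis 1/A(z): out[n] = res[n] - sum_{k>=1} a[k] * out[n-k]."""
    out = np.zeros(len(residual))
    p = len(a) - 1
    for n in range(len(out)):
        lo = max(0, n - p)
        out[n] = residual[n] - np.dot(a[1:n - lo + 1], out[lo:n][::-1])
    return out

rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 0.05 * np.arange(240)) + 0.01 * rng.standard_normal(240)
a, residual = lpc_analyze(frame, order=10)
rebuilt = lpc_synthesize(residual, a)
print(np.allclose(rebuilt, frame))  # analysis and synthesis are exact inverses
```

Because the two filters invert each other sample for sample, any edit to `a` (for instance, moving the pole frequencies that correspond to formants) carries through cleanly to the resynthesized sound.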
After the sound processing is finished, Final Cut Pro 7, comprising the format processing module 12 and the frame rate processing module 13, is used; different formats and frame rates can be mixed and matched on the timeline. The videos are processed by synchronizing slow-motion videos of different versions with the time-stretched audio tracks, and the processed videos and sounds are then edited together into different video clips, which serve as corpus for the further production of training software.
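The bookkeeping behind keeping a time-stretched audio track in sync with slow-motion video can be illustrated with a toy calculation. Final Cut Pro's internal processing is not disclosed; the hypothetical Python sketch below uses a naive linear-interpolation stretch, which, unlike the pitch-synchronous processing described earlier, also lowers the pitch, so it illustrates only the duration arithmetic.

```python
import math

def stretch_track(samples, factor):
    """Naive linear-interpolation time stretch of a mono track.

    factor=2.0 doubles the duration, matching 2x slow-motion video.
    """
    n_out = int(round(len(samples) * factor))
    out = []
    for i in range(n_out):
        pos = i * (len(samples) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

# A 30 fps clip played back at 15 fps runs twice as long, so the audio
# track must be stretched by the same factor to stay synchronized.
factor = 30 / 15
audio = [math.sin(2 * math.pi * 220 * n / 48000) for n in range(4800)]
stretched = stretch_track(audio, factor)
print(len(audio), len(stretched))  # 4800 9600
```

In practice a pitch-preserving stretch (e.g. an overlap-add technique like the one sub-module 6 applies to speech) would be used so the slowed audio does not sound lower in pitch.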
In the speech acoustic feature expansion system provided by the application, a voice acquisition unit, a voice processing unit, and a video editing unit are connected in sequence, and the spectral features of speech are expanded to produce video. The system simulates the acoustic features of the speech an infant is exposed to while learning language, and produces corpus suited to brain perception to stimulate the learner's brain, so that a brain whose sensitivity to foreign speech has declined can auditorily perceive the physical acoustic features of the speech clearly, establish speech categories similar to native ones, and thus improve pronunciation accuracy.
Although the application has been described with reference to specific embodiments, those skilled in the art will appreciate that many modifications are possible in the construction and detail of the application disclosed within the spirit and scope thereof. The scope of the application is to be determined by the appended claims, and it is intended that the claims cover all modifications that are within the literal meaning or range of equivalents of the technical features of the claims.
Claims (1)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811443497.9A CN109346058B (en) | 2018-11-29 | 2018-11-29 | A system for expanding speech acoustic features |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109346058A CN109346058A (en) | 2019-02-15 |
| CN109346058B (en) | 2024-06-28 |
Family
ID=65319541
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811443497.9A Active CN109346058B (en) | 2018-11-29 | 2018-11-29 | A system for expanding speech acoustic features |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109346058B (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1669074A (en) * | 2002-10-31 | 2005-09-14 | 富士通株式会社 | voice enhancement device |
| CN109378015A (en) * | 2018-11-29 | 2019-02-22 | 西安交通大学 | A kind of language learning system and method |
| CN209388698U (en) * | 2018-11-29 | 2019-09-13 | 西安交通大学 | A kind of speech acoustics feature expansion system |
Family Cites Families (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH0493980A (en) * | 1990-08-06 | 1992-03-26 | Takeshige Fujitani | Language learning system |
| GB9714001D0 (en) * | 1997-07-02 | 1997-09-10 | Simoco Europ Limited | Method and apparatus for speech enhancement in a speech communication system |
| US20020128839A1 (en) * | 2001-01-12 | 2002-09-12 | Ulf Lindgren | Speech bandwidth extension |
| KR100427243B1 (en) * | 2002-06-10 | 2004-04-14 | 휴먼씽크(주) | Method and apparatus for analysing a pitch, method and system for discriminating a corporal punishment, and computer readable medium storing a program thereof |
| CN1564245A (en) * | 2004-04-20 | 2005-01-12 | 上海上悦通讯技术有限公司 | Stunt method and device for baby's crying |
| US7676362B2 (en) * | 2004-12-31 | 2010-03-09 | Motorola, Inc. | Method and apparatus for enhancing loudness of a speech signal |
| US20070168187A1 (en) * | 2006-01-13 | 2007-07-19 | Samuel Fletcher | Real time voice analysis and method for providing speech therapy |
| US9117455B2 (en) * | 2011-07-29 | 2015-08-25 | Dts Llc | Adaptive voice intelligibility processor |
| CN105023574B (en) * | 2014-04-30 | 2018-06-15 | 科大讯飞股份有限公司 | A kind of method and system for realizing synthesis speech enhan-cement |
| CN105982641A (en) * | 2015-01-30 | 2016-10-05 | 上海泰亿格康复医疗科技股份有限公司 | Speech and language hypoacousie multi-parameter diagnosis and rehabilitation apparatus and cloud rehabilitation system |
| CN106710604A (en) * | 2016-12-07 | 2017-05-24 | 天津大学 | Formant enhancement apparatus and method for improving speech intelligibility |
- 2018-11-29: application CN201811443497.9A filed in China; granted as patent CN109346058B (active)
Also Published As
| Publication number | Publication date |
|---|---|
| CN109346058A (en) | 2019-02-15 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||