CN111370002B - Method and device for acquiring voice training sample, computer equipment and storage medium - Google Patents
Method and device for acquiring voice training sample, computer equipment and storage medium
- Publication number
- CN111370002B CN111370002B CN202010093613.XA CN202010093613A CN111370002B CN 111370002 B CN111370002 B CN 111370002B CN 202010093613 A CN202010093613 A CN 202010093613A CN 111370002 B CN111370002 B CN 111370002B
- Authority
- CN
- China
- Prior art keywords
- tearing
- spectrogram
- sound
- time
- spectrograms
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Auxiliary Devices For Music (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
The application discloses a method, an apparatus, computer equipment and a storage medium for acquiring voice training samples. The method comprises the following steps: processing a voice signal to obtain a sound spectrogram of the voice signal; randomly selecting a time point in the time direction on the sound spectrogram; and taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample. After an original voice signal is converted into a sound spectrogram, a large number of torn spectrograms, first mask spectrograms and second mask spectrograms are derived from that one sound spectrogram through tearing and masking, which alleviates the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the sample amount for training the model is small.
Description
Technical Field
The present application relates to the field of computer neural network training, and in particular, to a method and an apparatus for obtaining a speech training sample, a computer device, and a storage medium.
Background
Identifying a speaker by voice, i.e., voiceprint recognition, is an important direction in the field of artificial intelligence and an important application of artificial intelligence technology in biometric recognition scenarios. Although the accuracy of voiceprint recognition is high under laboratory conditions, in actual service scenarios voice transmission depends on a transmission channel, such as a telephone line or a broadband network, and the received voice is affected by that channel, so the accuracy of voiceprint recognition is still not high.
Because the speaking voice and the channel cannot be completely separated, the speaker features extracted during voiceprint recognition inevitably carry channel characteristics. For example, the features extracted for speaker A from a telephone recording and from network voice carry the characteristics of the telephone channel and the network channel respectively, which can cause voiceprint recognition errors. The cross-channel problem has therefore long been a difficult problem in the field of voiceprint recognition.
The current mainstream solution in the industry is to collect voice data of each channel and then either train a model for feature transformation between channels or expand the training set of the original model with the collected cross-channel data. The core requirement is to collect enough cross-channel data as samples. In actual production, due to the limitations of sample collection cost and collection conditions, sufficient and effective cross-channel voice data cannot be collected.
Disclosure of Invention
The application mainly aims to provide a method, a device, computer equipment and a storage medium for acquiring a voice training sample, and aims to solve the technical problem that sufficient and effective cross-channel voice data cannot be acquired as a sample in the prior art.
In order to achieve the above object, the present application provides a method for acquiring a speech training sample, including:
processing a voice signal to obtain a sound spectrogram of the voice signal;
randomly selecting a time point in a time direction on the sound spectrogram;
and taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
Further, the step of adding the transition information at the fracture according to a preset rule comprises the following steps:
randomly adding the transition information to the fracture of the torn spectrogram.
Further, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises:
acquiring the time length of the sound spectrogram;
determining the tearing processing times of the sound spectrogram according to the time length;
and selecting the time points with the same number of times as the tearing times so as to perform tearing processing on the sound spectrogram for different times.
Further, the step of selecting the time points with the same number of times as the tearing times to perform tearing for different times on the sound spectrogram comprises:
and equally distributing the time points, in a number corresponding to the tearing times, within the time length, so as to perform tearing processing on the sound spectrogram at the different time points.
Further, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point to complete the tearing process of the sound spectrograms, and adding transition information at the fracture part according to a preset rule to obtain the tearing spectrograms, the method includes:
selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tear spectrogram;
and applying a mask sequence to each first spectrum block to obtain a first mask spectrum diagram.
Further, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing process on the sound spectrograms, and adding transition information at the fracture according to a preset rule to obtain the torn spectrogram, the method further includes:
selecting a plurality of second frequency spectrum blocks of different frequency channels in the frequency direction on the tear spectrogram;
and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
Further, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises:
randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram;
randomly selecting the time point in the time direction on the third mask spectrogram.
The present application further provides an apparatus for obtaining a speech training sample, including:
the conversion unit is used for processing a voice signal to obtain a sound spectrogram of the voice signal;
a selection unit configured to randomly select a time point in a time direction on the sound spectrogram;
and the tearing unit is used for separating the sound spectrogram on the two sides of the tearing point in the time direction by taking the time point as the tearing point, completing the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
The present application further provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
According to the method, the apparatus, the computer equipment and the storage medium for acquiring voice training samples, an original voice signal can be converted into a sound spectrogram, and a large number of torn spectrograms, first mask spectrograms and second mask spectrograms can be derived from that one sound spectrogram through tearing and masking. These can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, it can solve the problem that a voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
Drawings
Fig. 1 is a schematic flowchart of a method for obtaining a speech training sample according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an apparatus for obtaining a speech training sample according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application.
Referring to fig. 1, a method for obtaining a speech training sample includes:
s1, processing a voice signal to obtain a sound spectrogram of the voice signal;
s2, randomly selecting a time point in the time direction on the sound spectrogram;
and S3, taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
In this embodiment, the sample voice signal is first converted into a sound spectrogram, generally a Mel spectrogram; the specific conversion process can be implemented by any existing technique. Tearing the sound spectrogram at a certain time point means separating the sound spectrogram in time at that point. The separation can be done in several ways: for example, the first side of the sound spectrogram at the tearing point is fixed while the second side moves away from the first side; or the first side and the second side each move away from the other. In one embodiment, the first side may be fixed and the second side moved away from it by a distance s; then, on the original sound spectrogram, the second side is fixed and the first side is moved away by s, and so on, so that two torn spectrograms with different moving directions are obtained from one time point. In another embodiment, a side may also be moved a specified distance in a specified direction. Further, the above steps S2 and S3 are repeated with a different time point selected each time, yielding a plurality of torn spectrograms corresponding to the sound spectrogram; finally, the sound spectrogram and the plurality of torn spectrograms form a first voice training sample set. With this technical scheme, a plurality of torn spectrograms can be derived from a single sound spectrogram, which enriches the number of voice training samples and alleviates the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, it addresses the problem that a voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
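For illustration only, the tearing operation can be sketched as follows. This is a minimal sketch rather than the claimed implementation: it assumes the sound spectrogram is a NumPy array of shape (frequency_bins, time_steps), and the function name tear_spectrogram and the zero-filled gap are expository choices.

```python
import numpy as np

def tear_spectrogram(spec: np.ndarray, tear_point: int, S: int) -> np.ndarray:
    """Tear a spectrogram (frequency_bins x time_steps) at tear_point.

    The first side stays fixed and the second side is shifted away by a
    distance s drawn uniformly from [0, S], where S is the time-deformation
    parameter, leaving a gap of s columns at the fracture.
    """
    s = np.random.randint(0, S + 1)        # separation distance s ~ U[0, S]
    left = spec[:, :tear_point]
    right = spec[:, tear_point:]
    gap = np.zeros((spec.shape[0], s))     # fracture, filled with transition info later
    return np.concatenate([left, gap, right], axis=1)
```

The gap left at the fracture is where the transition information described next is added.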
The step of adding the transition information at the fracture according to the preset rule includes:
randomly adding the transition information to the fracture of the torn spectrogram.
In this embodiment, because a fracture exists in the torn spectrogram, there may be a blank at the fracture. To improve the diversity of the training samples, transition information may be added in this blank, for example different smooth signals. The transition information can be preset; generally, a plurality of different pieces of transition information are preset, and one of them is randomly selected and added at the fracture. If the selected transition information does not exactly fill the blank, it can be scaled up or down proportionally so that it fits exactly. In another embodiment, if S is a positive integer, S kinds of transition information are set, each kind including a plurality of pieces of transition information with different contents; when transition information is added, one piece of the kind corresponding to the drawn separation distance s is randomly selected, further improving the diversity of the training samples.
In another embodiment, the preset rule is to add identical data at the fracture, such as all 0s, all 1s, or a continuously repeating pattern such as 010101.
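A sketch of filling the fracture, assuming the gap columns were zero-filled by the hypothetical tear_spectrogram above and that the fracture touches neither edge of the spectrogram; the linear-interpolation "smooth" mode stands in for one possible kind of transition information:

```python
import numpy as np

def fill_fracture(spec: np.ndarray, start: int, width: int, mode: str = "smooth") -> np.ndarray:
    """Fill the gap columns [start, start + width) with transition information.

    'smooth' interpolates linearly between the columns bordering the
    fracture; 'constant' writes identical data (here all 0s).
    Assumes 1 <= start and start + width < spec.shape[1].
    """
    if width == 0:
        return spec
    if mode == "smooth":
        left_col = spec[:, start - 1]
        right_col = spec[:, start + width]
        for i in range(width):
            alpha = (i + 1) / (width + 1)
            spec[:, start + i] = (1 - alpha) * left_col + alpha * right_col
    else:
        spec[:, start:start + width] = 0.0  # all 0s; all 1s or 010101... also possible
    return spec
```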
In one embodiment, the step S2 of randomly selecting a time point in the time direction on the sound spectrogram includes:
s201, acquiring the time length of the sound spectrogram;
s202, determining the tearing processing times of the sound spectrogram according to the time length;
and S203, selecting time points with the same number as the tearing times so as to perform tearing processing on the sound spectrogram for different times.
In this embodiment, the sound spectrogram cannot be torn an unlimited number of times, so the application determines the number of tears according to the length of the time information in the sound spectrogram. Specifically, a mapping table is set, in which one column is a time-length range and another column is the number of tears corresponding to that range; after the time length of the sound spectrogram is determined, the table is checked to find which range the time length falls into, and the corresponding number of tears is selected. The specific time-length ranges and tear counts can be set manually according to experience; the guiding idea is that the longer the time length, the more times the spectrogram can be torn, and vice versa.
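A sketch of such a mapping table; the concrete ranges and tear counts below are assumptions for illustration, since the scheme only fixes the principle that a longer time length allows more tears:

```python
# (min_frames, max_frames, tearing times) -- illustrative values only
TEAR_COUNT_TABLE = [
    (0, 100, 1),
    (100, 300, 2),
    (300, 600, 4),
    (600, float("inf"), 6),
]

def tear_count(num_frames: int) -> int:
    """Look up the number of tearing processes for a given time length."""
    for low, high, count in TEAR_COUNT_TABLE:
        if low <= num_frames < high:
            return count
    return 1
```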
In an embodiment, the step S203 of selecting the same number of time points as the number of times of the tearing process to perform the tearing process on the sound spectrogram for different times includes:
and equally distributing the time points, in a number corresponding to the tearing times, within the time length, so as to perform tearing processing on the sound spectrogram at the different time points.
In this embodiment, the time points are evenly distributed within the time length; this distribution is fast, and the differences between samples are more uniform than with a random distribution.
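The even distribution can be sketched as below, assuming the interior points of a uniform partition are used so that every tear leaves spectrogram content on both sides:

```python
import numpy as np

def evenly_spaced_tear_points(num_frames: int, num_tears: int) -> list:
    """Distribute num_tears tear points evenly over the time length."""
    # Uniform partition of [0, num_frames] with the two endpoints dropped.
    return np.linspace(0, num_frames, num_tears + 2, dtype=int)[1:-1].tolist()
```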
In one embodiment, a sound spectrogram may be torn at only one time point, giving a torn spectrogram with a single fracture; in another embodiment, a sound spectrogram may be torn at a plurality of time points simultaneously, giving a torn spectrogram with a plurality of fractures.
In an embodiment, after the step S3 of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, the method includes:
s4, selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tear spectrogram;
s5, applying a mask sequence to each of the first spectrum blocks to obtain a first mask spectrum map.
In this embodiment, in the time direction of the torn spectrogram, first spectrum blocks of t consecutive time steps [t0, t0 + t) are selected, and then a mask sequence [w1, …] is applied to these first spectrum blocks, where each w is a number randomly selected from the uniform distribution on [0, W] and W is a time-mask parameter. In a specific embodiment, selecting different t yields different first mask spectrograms, so that a plurality of first mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the first mask spectrograms and all the torn spectrograms are put together to form a second voice training sample set, further increasing the number and richness of the samples. In this embodiment, the time length represented by t is smaller than the time length of the torn spectrogram, and t0 is an arbitrary time point in the torn spectrogram, provided the block [t0, t0 + t) fits inside the torn spectrogram.
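A minimal sketch of this first masking step. Whether the mask values replace or scale the spectrogram values is not fixed above, so replacement is an assumption here, as is the single-block form:

```python
import numpy as np

def apply_time_mask(spec: np.ndarray, W: float, t: int) -> np.ndarray:
    """Apply a mask sequence [w1, ..., wt] to t consecutive time steps
    [t0, t0 + t), each w drawn uniformly from [0, W], where W is the
    time-mask parameter."""
    out = spec.copy()
    num_frames = out.shape[1]
    if not 0 < t < num_frames:
        return out                             # block must fit inside the spectrogram
    t0 = np.random.randint(0, num_frames - t)  # random start of the block
    mask = np.random.uniform(0.0, W, size=t)   # mask sequence [w1, ..., wt]
    out[:, t0:t0 + t] = mask                   # broadcast over frequency bins
    return out
```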
In an embodiment, after the step S3 of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, the method further includes:
s6, selecting a plurality of second frequency spectrum blocks of different frequency channels in the frequency direction on the tear spectrogram;
and S7, applying a mask sequence to each second spectrum block to obtain a second mask spectrum map.
In the present embodiment, the second spectrum block is a block in the frequency direction, not a block in time. Specifically, in the frequency direction of the torn spectrogram, a mask sequence [v1, …] is applied to a spectrum block of n consecutive frequency channels [m0, m0 + n), where each v is a number randomly selected from the uniform distribution on [0, V] and V is a frequency-mask parameter. Similarly, selecting different n yields different second mask spectrograms, so that a plurality of second mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the second mask spectrograms and all the torn spectrograms are put together to form a third voice training sample set. In this embodiment, m0 is an arbitrary frequency channel in the torn spectrogram, provided the block [m0, m0 + n) fits inside the torn spectrogram.
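The frequency-direction counterpart, under the same replacement assumption:

```python
import numpy as np

def apply_freq_mask(spec: np.ndarray, V: float, n: int) -> np.ndarray:
    """Apply a mask sequence [v1, ..., vn] to n consecutive frequency
    channels [m0, m0 + n), each v drawn uniformly from [0, V], where V
    is the frequency-mask parameter."""
    out = spec.copy()
    num_channels = out.shape[0]
    if not 0 < n < num_channels:
        return out                                 # block must fit inside the spectrogram
    m0 = np.random.randint(0, num_channels - n)    # random start channel
    mask = np.random.uniform(0.0, V, size=(n, 1))  # mask sequence [v1, ..., vn]
    out[m0:m0 + n, :] = mask                       # broadcast over time steps
    return out
```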
In one embodiment, the step S2 of randomly selecting the time point in the time direction on the sound spectrogram includes:
s21, randomly adding masks in the time direction of the sound spectrogram to obtain a third mask spectrogram;
s22, randomly selecting the time point in the time direction on the third mask spectrogram.
In this embodiment, a mask is first added to the sound spectrogram, and then the time point is randomly selected in the time direction on the third mask spectrogram, so that richer samples can be obtained.
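Putting the sketches together, a hypothetical end-to-end derivation might look like the following; it reuses the illustrative tear_count, evenly_spaced_tear_points, tear_spectrogram, apply_time_mask and apply_freq_mask defined above, and the parameter defaults are placeholders:

```python
def derive_samples(spec, S=10, W=1.0, V=1.0, t=20, n=8):
    """Derive torn, time-masked and frequency-masked spectrograms from
    one sound spectrogram, forming an enlarged training sample set."""
    samples = [spec]
    num_tears = tear_count(spec.shape[1])
    for point in evenly_spaced_tear_points(spec.shape[1], num_tears):
        torn = tear_spectrogram(spec, point, S)
        samples.append(torn)
        samples.append(apply_time_mask(torn, W, t))  # first mask spectrogram
        samples.append(apply_freq_mask(torn, V, n))  # second mask spectrogram
    return samples
```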
According to the method for acquiring voice training samples, an original voice signal can be converted into a sound spectrogram, and a large number of torn spectrograms, first mask spectrograms and second mask spectrograms can be derived from that one sound spectrogram through tearing and masking; these can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, when voice information is acquired separately under different channel scenarios, directly using it as training samples would not yield an accurate voiceprint recognition model because the samples are too few; with this method, however, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, solving the small-sample problem and, in particular, the problem that a single voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
Referring to fig. 2, an embodiment of the present application further provides an apparatus for obtaining a speech training sample, including:
a conversion unit 10, configured to process a speech signal to obtain a sound spectrogram of the speech signal;
a selection unit 20 for randomly selecting a time point in a time direction on the sound spectrogram;
and the tearing unit 30 is configured to separate the sound spectrogram on the two sides of the tearing point in the time direction by using the time point as the tearing point, complete the tearing processing of the sound spectrogram, add transition information at the fracture according to a preset rule to obtain a torn spectrogram, and use the torn spectrogram as the voice training sample, where the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
In this embodiment, the converting unit 10 first converts the sample voice signal into a sound spectrogram, generally a Mel spectrogram; the specific conversion process can be implemented by any existing technique. After the selecting unit 20 randomly selects a time point, the tearing unit 30 tears the sound spectrogram by using the time point as a tearing point, that is, separates the sound spectrogram in time at that point. The separation can be done in several ways: for example, the first side of the sound spectrogram at the tearing point is fixed while the second side moves away from the first side; or the first side and the second side each move away from the other. In one embodiment, the first side may be fixed and the second side moved away from it by a distance s; then, on the original sound spectrogram, the second side is fixed and the first side is moved away by s, and so on, so that two torn spectrograms with different moving directions are obtained from one time point. In another embodiment, a side may also be moved a specified distance in a specified direction. Further, the process of randomly selecting a time point and tearing is repeated with a different time point selected each time, yielding a plurality of torn spectrograms corresponding to the sound spectrogram; finally, the sound spectrogram and the plurality of torn spectrograms form a first voice training sample set. With this technical scheme, a plurality of torn spectrograms can be derived from a single sound spectrogram, which enriches the number of voice training samples and alleviates the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, it addresses the problem that a voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
In one embodiment, the tearing unit 30 further includes:
and the adding unit is used for randomly adding the transition information to the fracture of the torn spectrogram; that is, the preset rule is to randomly add transition information at the fracture.
In this embodiment, because a fracture exists in the torn spectrogram, there may be a blank at the fracture. To improve the diversity of the training samples, transition information may be added in this blank, for example different smooth signals. The transition information can be preset; generally, a plurality of different pieces of transition information are preset, and one of them is randomly selected and added at the fracture. If the selected transition information does not exactly fill the blank, it can be scaled up or down proportionally so that it fits exactly. In another embodiment, if S is a positive integer, S kinds of transition information are set, each kind including a plurality of pieces of transition information with different contents; when transition information is added, one piece of the kind corresponding to the drawn separation distance s is randomly selected, further improving the diversity of the training samples.
In another embodiment, the preset rule is to add identical data at the fracture, such as all 0s, all 1s, or a continuously repeating pattern such as 010101.
In an embodiment, the apparatus for obtaining a speech training sample further includes:
an acquisition unit, configured to acquire a time length of the sound spectrogram;
the determining unit is used for determining the tearing processing times of the sound spectrogram according to the time length;
and the selecting unit is used for selecting the time points with the same number of times as the tearing times so as to tear the sound spectrogram for different times.
In this embodiment, the sound spectrogram cannot be torn an unlimited number of times, so the application determines the number of tears according to the length of the time information in the sound spectrogram. Specifically, a mapping table is set, in which one column is a time-length range and another column is the number of tears corresponding to that range; after the time length of the sound spectrogram is determined, the table is checked to find which range the time length falls into, and the corresponding number of tears is selected. The specific time-length ranges and tear counts can be set manually according to experience; the guiding idea is that the longer the time length, the more times the spectrogram can be torn, and vice versa.
In one embodiment, the selecting unit includes:
and the average selection module is used for evenly distributing the time points, in a number corresponding to the tearing times, within the time length, so as to perform tearing processing on the sound spectrogram at the different time points.
In this embodiment, the time points are evenly distributed within the time length; this distribution is fast, and the differences between samples are more uniform than with a random distribution.
In one embodiment, a sound spectrogram may be torn at only one time point, giving a torn spectrogram with a single fracture; in another embodiment, a sound spectrogram may be torn at a plurality of time points simultaneously, giving a torn spectrogram with a plurality of fractures.
In an embodiment, the apparatus for obtaining a speech training sample further includes:
the time spectrum unit is used for selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tearing spectrogram;
a first mask unit, configured to apply a mask sequence to each first spectrum block to obtain a first mask spectrum map.
In this embodiment, in the time direction of the torn spectrogram, first spectrum blocks of t consecutive time steps [t0, t0 + t) are selected, and then a mask sequence [w1, …] is applied to these first spectrum blocks, where each w is a number randomly selected from the uniform distribution on [0, W] and W is a time-mask parameter. In a specific embodiment, selecting different t yields different first mask spectrograms, so that a plurality of first mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the first mask spectrograms and all the torn spectrograms are put together to form a second voice training sample set, further increasing the number and richness of the samples. In this embodiment, the time length represented by t is smaller than the time length of the torn spectrogram, and t0 is an arbitrary time point in the torn spectrogram, provided the block [t0, t0 + t) fits inside the torn spectrogram.
In an embodiment, the apparatus for obtaining a speech training sample further includes:
a frequency spectrum unit, configured to select a plurality of second spectrum blocks of different frequency channels in a frequency direction on the tear spectrogram;
and the second mask unit is used for applying a mask sequence to each second spectrum block to obtain a second mask spectrum map.
In the present embodiment, the second spectrum block is a block in the frequency direction, not a block in time. Specifically, in the frequency direction of the torn spectrogram, a mask sequence [v1, …] is applied to a spectrum block of n consecutive frequency channels [m0, m0 + n), where each v is a number randomly selected from the uniform distribution on [0, V] and V is a frequency-mask parameter. Similarly, selecting different n yields different second mask spectrograms, so that a plurality of second mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the second mask spectrograms and all the torn spectrograms are put together to form a third voice training sample set. In this embodiment, m0 is an arbitrary frequency channel in the torn spectrogram, provided the block [m0, m0 + n) fits inside the torn spectrogram.
In one embodiment, the selecting unit 20 includes:
the mask module is used for randomly adding masks in the time direction on the sound spectrogram to obtain a third mask spectrogram;
a selection module configured to randomly select the time point in a time direction on the third mask spectrogram.
In this embodiment, a mask is first added to the sound spectrogram, and then the time point is randomly selected in the time direction on the third mask spectrogram, so that richer samples can be obtained.
The apparatus for acquiring voice training samples can convert an original voice signal into a sound spectrogram and derive a large number of torn spectrograms, first mask spectrograms and second mask spectrograms from that one sound spectrogram through tearing and masking; these can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, when voice information is acquired separately under different channel scenarios, directly using it as training samples would not yield an accurate voiceprint recognition model because the samples are too few; with this apparatus, however, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, solving the small-sample problem and, in particular, the problem that a single voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
Referring to fig. 3, an embodiment of the present application further provides a computer device, which includes a memory and a processor, the memory storing a computer program; the internal structure of the computer device may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as sample sets. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for acquiring voice training samples. Specifically, the method comprises the following steps:
a method for acquiring a voice training sample comprises the following steps: processing a voice signal to obtain a sound spectrogram of the voice signal; randomly selecting a time point in a time direction on the sound spectrogram; and taking the time point as a tearing point, separating the sound frequency spectrograms on two sides of the tearing point in the time direction to finish tearing processing of the sound frequency spectrograms, adding transition information at the breakage part according to a preset rule to obtain the tearing frequency spectrograms, and taking the tearing frequency spectrograms as the voice training samples, wherein the separation distance of the sound frequency spectrograms on two sides of the tearing point is S, the S is a number randomly selected from uniform distribution of [0, S ], and the S is a time deformation parameter.
In one embodiment, the step of adding the transition information at the fracture according to a preset rule comprises: randomly adding the transition information to the fracture of the torn spectrogram.
In one embodiment, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises: acquiring the time length of the sound spectrogram; determining the tearing processing times of the sound spectrogram according to the time length; and selecting the time points with the same number of times as the tearing times so as to tear the sound spectrogram for different times.
In one embodiment, the step of selecting the same number of time points as the number of tearing processes to tear the sound spectrogram different times comprises: and equally distributing the time points with the number corresponding to the tearing times in the time length so as to perform tearing on the sound frequency spectrogram for different times.
In an embodiment, after the step of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, the method includes: selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the torn spectrogram; and applying a mask sequence to each first spectrum block to obtain a first mask spectrogram.
In an embodiment, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing process on the sound spectrograms, and adding transition information at the fracture according to a preset rule to obtain the tearing spectrograms, the method further includes: selecting a second spectrum block of a plurality of different frequency channels in the frequency direction on the tear spectrogram; and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
In one embodiment, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises: randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram; randomly selecting the time point in a time direction on the third masked spectrogram.
The computer device of the embodiment of the application can convert an original voice signal into a sound spectrogram and derive a large number of torn spectrograms, first mask spectrograms and second mask spectrograms from that one sound spectrogram through tearing and masking; these can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, when voice information is acquired separately under different channel scenarios, directly using it as training samples would not yield an accurate voiceprint recognition model because the samples are too few; with this method, however, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, solving the small-sample problem and, in particular, the problem that a single voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a method for obtaining a speech training sample. Specifically, the method comprises the following steps:
a method for acquiring a voice training sample comprises the following steps: processing a voice signal to obtain a sound spectrogram of the voice signal; randomly selecting a time point in a time direction on the sound spectrogram; and taking the time point as a tearing point, separating the sound frequency spectrograms on two sides of the tearing point in the time direction to finish tearing processing of the sound frequency spectrograms, adding transition information at the breakage part according to a preset rule to obtain the tearing frequency spectrograms, and taking the tearing frequency spectrograms as the voice training samples, wherein the separation distance of the sound frequency spectrograms on two sides of the tearing point is S, the S is a number randomly selected from uniform distribution of [0, S ], and the S is a time deformation parameter.
In one embodiment, the step of adding the transition information at the fracture according to a preset rule includes: randomly adding the transition information to the fracture of the torn spectrogram.
In one embodiment, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises: acquiring the time length of the sound spectrogram; determining the tearing processing times of the sound spectrogram according to the time length; and selecting the time points with the same number as the tearing times so as to perform tearing processing on the sound spectrogram at different times.
In one embodiment, the step of selecting the same number of time points as the number of tearing processes comprises: equally distributing the time points, in a number corresponding to the tearing times, within the time length, so as to perform tearing processing on the sound spectrogram at the different time points.
In an embodiment, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing process on the sound spectrograms, and adding transition information at the fracture according to a preset rule to obtain the tearing spectrograms, the method includes: selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tear spectrogram; and applying a mask sequence to each first spectrum block to obtain a first mask spectrogram.
In an embodiment, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing process on the sound spectrograms, and adding transition information at the fracture according to a preset rule to obtain the tearing spectrograms, the method further includes: selecting a plurality of second frequency spectrum blocks of different frequency channels in the frequency direction on the tear spectrogram; and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
In one embodiment, the step of randomly selecting time points in a time direction on the sound spectrogram comprises: randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram; randomly selecting the time point in a temporal direction on the third mask spectrogram.
When the computer program is executed by a processor to realize the method for acquiring voice training samples, an original voice signal can be converted into a sound spectrogram, and a large number of torn spectrograms, first mask spectrograms and second mask spectrograms can be derived from that one sound spectrogram through tearing and masking; these can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, when voice information is acquired separately under different channel scenarios, directly using it as training samples would not yield an accurate voiceprint recognition model because the samples are too few; with this method, however, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, solving the small-sample problem and, in particular, the problem that a single voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing related hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
Claims (10)
1. A method for obtaining a voice training sample is characterized by comprising the following steps:
processing a voice signal to obtain a sound spectrogram of the voice signal;
determining the tearing processing times of the sound spectrogram according to the time length of the sound spectrogram;
randomly selecting a time point in the time direction on the sound spectrogram according to the tearing processing times;
and separating the sound spectrogram on the two sides of the tearing point in the time direction by taking the time point as the tearing point to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
2. The method for acquiring the voice training sample according to claim 1, wherein the step of adding the transition information at the fracture according to the preset rule comprises:
randomly adding the transition information to the fracture of the torn spectrogram.
3. The method for acquiring the voice training sample according to claim 1, wherein the determining the number of tearing processes of the sound spectrogram according to the time length of the sound spectrogram comprises:
acquiring the time length of the sound spectrogram;
determining the tearing processing times of the sound spectrogram according to the time length;
and selecting the time points with the same number of times as the tearing times so as to tear the sound spectrogram for different times.
4. The method of claim 3, wherein the step of selecting the same number of time points as the number of times of the tearing process to perform the tearing process on the sound spectrogram for different times comprises:
and equally distributing the time points with the number corresponding to the tearing times in the time length so as to perform tearing on the sound spectrogram for different times.
5. The method for acquiring the voice training sample according to claim 1, wherein after the step of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain the torn spectrogram, the method comprises:
selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tear spectrogram;
and applying a mask sequence to each first spectrum block to obtain a first mask spectrum diagram.
6. The method for acquiring the voice training sample according to claim 1, wherein after the step of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain the torn spectrogram, the method further comprises:
selecting a second spectrum block of a plurality of different frequency channels in the frequency direction on the tear spectrogram;
and applying a mask sequence to each second spectrum block to obtain a second mask spectrum map.
7. The method of claim 1, wherein the step of randomly selecting the time point in the time direction on the sound spectrogram comprises:
randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram;
randomly selecting the time point in a temporal direction on the third mask spectrogram.
8. An apparatus for obtaining a speech training sample, comprising:
the conversion unit is used for processing a voice signal to obtain a sound spectrogram of the voice signal;
the selecting unit is used for determining the tearing processing times of the sound spectrogram according to the time length of the sound spectrogram;
randomly selecting a time point in the time direction on the sound spectrogram according to the tearing processing times;
and the tearing unit is used for separating the sound spectrogram on the two sides of the tearing point in the time direction by taking the time point as the tearing point, completing the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010093613.XA CN111370002B (en) | 2020-02-14 | 2020-02-14 | Method and device for acquiring voice training sample, computer equipment and storage medium |
PCT/CN2020/093092 WO2021159635A1 (en) | 2020-02-14 | 2020-05-29 | Speech training sample obtaining method and apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010093613.XA CN111370002B (en) | 2020-02-14 | 2020-02-14 | Method and device for acquiring voice training sample, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111370002A CN111370002A (en) | 2020-07-03 |
CN111370002B true CN111370002B (en) | 2022-08-19 |
Family
ID=71206253
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010093613.XA Active CN111370002B (en) | 2020-02-14 | 2020-02-14 | Method and device for acquiring voice training sample, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111370002B (en) |
WO (1) | WO2021159635A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112017638A (en) * | 2020-09-08 | 2020-12-01 | 北京奇艺世纪科技有限公司 | Voice semantic recognition model construction method, semantic recognition method, device and equipment |
CN113241062B (en) * | 2021-06-01 | 2023-12-26 | 平安科技(深圳)有限公司 | Enhancement method, device, equipment and storage medium for voice training data set |
CN115580682B (en) * | 2022-12-07 | 2023-04-28 | 北京云迹科技股份有限公司 | Method and device for determining connection and disconnection time of robot dialing |
CN116092512A (en) * | 2022-12-30 | 2023-05-09 | 重庆邮电大学 | Small sample voice separation method based on data generation |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR940002854B1 (en) * | 1991-11-06 | 1994-04-04 | 한국전기통신공사 | Sound synthesizing system |
CN104408681A (en) * | 2014-11-04 | 2015-03-11 | 南昌大学 | Multi-image hiding method based on fractional mellin transform |
CN104484872A (en) * | 2014-11-27 | 2015-04-01 | 浙江工业大学 | Interference image edge extending method based on directions |
US10373073B2 (en) * | 2016-01-11 | 2019-08-06 | International Business Machines Corporation | Creating deep learning models using feature augmentation |
CN106898357B (en) * | 2017-02-16 | 2019-10-18 | 华南理工大学 | A Vector Quantization Method Based on Normal Distribution Law |
CN108830277B (en) * | 2018-04-20 | 2020-04-21 | 平安科技(深圳)有限公司 | Training method and device of semantic segmentation model, computer equipment and storage medium |
CN108922560B (en) * | 2018-05-02 | 2022-12-02 | 杭州电子科技大学 | Urban noise identification method based on hybrid deep neural network model |
CN110148400B (en) * | 2018-07-18 | 2023-03-17 | 腾讯科技(深圳)有限公司 | Pronunciation type recognition method, model training method, device and equipment |
CN109087632B (en) * | 2018-08-17 | 2023-06-06 | 平安科技(深圳)有限公司 | Speech processing method, device, computer equipment and storage medium |
CN110379414B (en) * | 2019-07-22 | 2021-12-03 | 出门问问(苏州)信息科技有限公司 | Acoustic model enhancement training method and device, readable storage medium and computing equipment |
CN110751177A (en) * | 2019-09-17 | 2020-02-04 | 阿里巴巴集团控股有限公司 | Training method, prediction method and device of classification model |
- 2020-02-14: CN application CN202010093613.XA filed; granted as CN111370002B (active)
- 2020-05-29: PCT application PCT/CN2020/093092 filed; published as WO2021159635A1 (application filing)
Also Published As
Publication number | Publication date |
---|---|
WO2021159635A1 (en) | 2021-08-19 |
CN111370002A (en) | 2020-07-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |