CN111370002B - Method and device for acquiring voice training sample, computer equipment and storage medium - Google Patents
Method and device for acquiring voice training sample, computer equipment and storage medium
- Publication number
- CN111370002B CN111370002B CN202010093613.XA CN202010093613A CN111370002B CN 111370002 B CN111370002 B CN 111370002B CN 202010093613 A CN202010093613 A CN 202010093613A CN 111370002 B CN111370002 B CN 111370002B
- Authority
- CN
- China
- Prior art keywords
- tearing
- spectrogram
- sound
- time
- spectrograms
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Auxiliary Devices For Music (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
The application discloses a method, an apparatus, computer equipment and a storage medium for acquiring voice training samples. The method comprises the following steps: processing a voice signal to obtain a sound spectrogram of the voice signal; randomly selecting a time point in the time direction on the sound spectrogram; and taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample. After an original voice signal is converted into a sound spectrogram, a large number of torn spectrograms, first mask spectrograms and second mask spectrograms are derived from that one sound spectrogram through tearing and masking, which alleviates the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the sample amount for training the model is small.
Description
Technical Field
The present application relates to the field of computer neural network training, and in particular, to a method and an apparatus for obtaining a speech training sample, a computer device, and a storage medium.
Background
Identifying a speaker by voice, i.e., voiceprint recognition, is an important direction in the field of artificial intelligence and an important application of artificial intelligence technology in biometric recognition scenarios. Although the accuracy of voiceprint recognition is high under laboratory conditions, in actual service scenarios voice transmission depends on a transmission channel, such as a telephone line or a broadband network, and the received voice is affected by that channel, so the accuracy of voiceprint recognition is still not high.
Because the speaking voice and the channel cannot be completely separated, the speaker features extracted during voiceprint recognition inevitably carry channel characteristics. For example, the features extracted for speaker A from a telephone recording and from network voice carry the characteristics of the telephone channel and the network channel respectively, which can cause voiceprint recognition errors. The cross-channel problem has therefore long been a difficult problem in the field of voiceprint recognition.
The current mainstream solution in the industry is to collect voice data of each channel and then either train a model for feature transformation between channels or expand the training set of the original model with the collected cross-channel data. The core requirement is to collect enough cross-channel data as samples. In actual production, due to the limitations of sample collection cost and collection conditions, sufficient and effective cross-channel voice data cannot be collected.
Disclosure of Invention
The application mainly aims to provide a method, a device, computer equipment and a storage medium for acquiring a voice training sample, and aims to solve the technical problem that sufficient and effective cross-channel voice data cannot be acquired as a sample in the prior art.
In order to achieve the above object, the present application provides a method for acquiring a speech training sample, including:
processing a voice signal to obtain a sound spectrogram of the voice signal;
randomly selecting a time point in a time direction on the sound spectrogram;
and taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
Further, the step of adding the transition information at the fracture according to a preset rule comprises the following steps:
randomly adding the transition information to the fracture of the torn spectrogram.
Further, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises:
acquiring the time length of the sound spectrogram;
determining the tearing processing times of the sound spectrogram according to the time length;
and selecting the time points with the same number of times as the tearing times so as to perform tearing processing on the sound spectrogram for different times.
Further, the step of selecting the time points with the same number of times as the tearing times to perform tearing for different times on the sound spectrogram comprises:
and equally distributing the time points, in a number corresponding to the tearing times, within the time length, so as to perform tearing processing on the sound spectrogram at the different time points.
Further, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point to complete the tearing process of the sound spectrograms, and adding transition information at the fracture part according to a preset rule to obtain the tearing spectrograms, the method includes:
selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tear spectrogram;
and applying a mask sequence to each first spectrum block to obtain a first mask spectrum diagram.
Further, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing process on the sound spectrograms, and adding transition information at the fracture according to a preset rule to obtain the torn spectrogram, the method further includes:
selecting a plurality of second frequency spectrum blocks of different frequency channels in the frequency direction on the tear spectrogram;
and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
Further, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises:
randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram;
randomly selecting the time point in the time direction on the third mask spectrogram.
The present application further provides an apparatus for obtaining a speech training sample, including:
the conversion unit is used for processing a voice signal to obtain a sound spectrogram of the voice signal;
a selection unit configured to randomly select a time point in a time direction on the sound spectrogram;
and the tearing unit is used for separating the sound spectrogram on the two sides of the tearing point in the time direction by taking the time point as the tearing point, completing the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
The present application further provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
According to the method, the apparatus, the computer equipment and the storage medium for acquiring voice training samples, an original voice signal can be converted into a sound spectrogram, and a large number of torn spectrograms, first mask spectrograms and second mask spectrograms can be derived from that one sound spectrogram through tearing and masking. These can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, it can solve the problem that a voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
Drawings
Fig. 1 is a schematic flowchart of a method for obtaining a speech training sample according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an apparatus for obtaining a speech training sample according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application.
Referring to fig. 1, a method for obtaining a speech training sample includes:
s1, processing a voice signal to obtain a sound spectrogram of the voice signal;
s2, randomly selecting a time point in the time direction on the sound spectrogram;
and S3, taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
In this embodiment, the sample voice signal is first converted into a sound spectrogram, generally a Mel spectrogram; the specific conversion process can be implemented by any existing technique. Tearing the sound spectrogram at a certain time point means separating the sound spectrogram in time at that point. The separation can be done in several ways: for example, the first side of the sound spectrogram at the tearing point is fixed while the second side moves away from the first side; or the first side and the second side each move away from the other. In one embodiment, the first side may be fixed and the second side moved away from it by a distance s; then, on the original sound spectrogram, the second side is fixed and the first side is moved away by s, and so on, so that two torn spectrograms with different moving directions are obtained from one time point. In another embodiment, a side may also be moved a specified distance in a specified direction. Further, the above steps S2 and S3 are repeated with a different time point selected each time, yielding a plurality of torn spectrograms corresponding to the sound spectrogram; finally, the sound spectrogram and the plurality of torn spectrograms form a first voice training sample set. With this technical scheme, a plurality of torn spectrograms can be derived from a single sound spectrogram, which enriches the number of voice training samples and alleviates the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, it addresses the problem that a voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
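For illustration only, the tearing operation can be sketched as follows. This is a minimal sketch rather than the claimed implementation: it assumes the sound spectrogram is a NumPy array of shape (frequency_bins, time_steps), and the function name tear_spectrogram and the zero-filled gap are expository choices.

```python
import numpy as np

def tear_spectrogram(spec: np.ndarray, tear_point: int, S: int) -> np.ndarray:
    """Tear a spectrogram (frequency_bins x time_steps) at tear_point.

    The first side stays fixed and the second side is shifted away by a
    distance s drawn uniformly from [0, S], where S is the time-deformation
    parameter, leaving a gap of s columns at the fracture.
    """
    s = np.random.randint(0, S + 1)        # separation distance s ~ U[0, S]
    left = spec[:, :tear_point]
    right = spec[:, tear_point:]
    gap = np.zeros((spec.shape[0], s))     # fracture, filled with transition info later
    return np.concatenate([left, gap, right], axis=1)
```

The gap left at the fracture is where the transition information described next is added.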
The step of adding the transition information at the fracture according to the preset rule includes:
randomly adding the transition information to the fracture of the torn spectrogram.
In this embodiment, because a fracture exists in the torn spectrogram, there may be a blank at the fracture. To improve the diversity of the training samples, transition information may be added in this blank, for example different smooth signals. The transition information can be preset; generally, a plurality of different pieces of transition information are preset, and one of them is randomly selected and added at the fracture. If the selected transition information does not exactly fill the blank, it can be scaled up or down proportionally so that it fits exactly. In another embodiment, if S is a positive integer, S kinds of transition information are set, each kind including a plurality of pieces of transition information with different contents; when transition information is added, one piece of the kind corresponding to the drawn separation distance s is randomly selected, further improving the diversity of the training samples.
In another embodiment, the preset rule is to add identical data at the fracture, such as all 0s, all 1s, or a continuously repeating pattern such as 010101.
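A sketch of filling the fracture, assuming the gap columns were zero-filled by the hypothetical tear_spectrogram above and that the fracture touches neither edge of the spectrogram; the linear-interpolation "smooth" mode stands in for one possible kind of transition information:

```python
import numpy as np

def fill_fracture(spec: np.ndarray, start: int, width: int, mode: str = "smooth") -> np.ndarray:
    """Fill the gap columns [start, start + width) with transition information.

    'smooth' interpolates linearly between the columns bordering the
    fracture; 'constant' writes identical data (here all 0s).
    Assumes 1 <= start and start + width < spec.shape[1].
    """
    if width == 0:
        return spec
    if mode == "smooth":
        left_col = spec[:, start - 1]
        right_col = spec[:, start + width]
        for i in range(width):
            alpha = (i + 1) / (width + 1)
            spec[:, start + i] = (1 - alpha) * left_col + alpha * right_col
    else:
        spec[:, start:start + width] = 0.0  # all 0s; all 1s or 010101... also possible
    return spec
```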
In one embodiment, the step S2 of randomly selecting a time point in the time direction on the sound spectrogram includes:
s201, acquiring the time length of the sound spectrogram;
s202, determining the tearing processing times of the sound spectrogram according to the time length;
and S203, selecting time points with the same number as the tearing times so as to perform tearing processing on the sound spectrogram for different times.
In this embodiment, the sound spectrogram cannot be torn an unlimited number of times, so the application determines the number of tears according to the length of the time information in the sound spectrogram. Specifically, a mapping table is set, in which one column is a time-length range and another column is the number of tears corresponding to that range; after the time length of the sound spectrogram is determined, the table is checked to find which range the time length falls into, and the corresponding number of tears is selected. The specific time-length ranges and tear counts can be set manually according to experience; the guiding idea is that the longer the time length, the more times the spectrogram can be torn, and vice versa.
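A sketch of such a mapping table; the concrete ranges and tear counts below are assumptions for illustration, since the scheme only fixes the principle that a longer time length allows more tears:

```python
# (min_frames, max_frames, tearing times) -- illustrative values only
TEAR_COUNT_TABLE = [
    (0, 100, 1),
    (100, 300, 2),
    (300, 600, 4),
    (600, float("inf"), 6),
]

def tear_count(num_frames: int) -> int:
    """Look up the number of tearing processes for a given time length."""
    for low, high, count in TEAR_COUNT_TABLE:
        if low <= num_frames < high:
            return count
    return 1
```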
In an embodiment, the step S203 of selecting the same number of time points as the number of times of the tearing process to perform the tearing process on the sound spectrogram for different times includes:
and equally distributing the time points, in a number corresponding to the tearing times, within the time length, so as to perform tearing processing on the sound spectrogram at the different time points.
In this embodiment, the time points are evenly distributed within the time length; this distribution is fast, and the differences between samples are more uniform than with a random distribution.
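The even distribution can be sketched as below, assuming the interior points of a uniform partition are used so that every tear leaves spectrogram content on both sides:

```python
import numpy as np

def evenly_spaced_tear_points(num_frames: int, num_tears: int) -> list:
    """Distribute num_tears tear points evenly over the time length."""
    # Uniform partition of [0, num_frames] with the two endpoints dropped.
    return np.linspace(0, num_frames, num_tears + 2, dtype=int)[1:-1].tolist()
```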
In one embodiment, a sound spectrogram may be torn at only one time point, giving a torn spectrogram with a single fracture; in another embodiment, a sound spectrogram may be torn at a plurality of time points simultaneously, giving a torn spectrogram with a plurality of fractures.
In an embodiment, after the step S3 of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, the method includes:
s4, selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tear spectrogram;
s5, applying a mask sequence to each of the first spectrum blocks to obtain a first mask spectrum map.
In this embodiment, in the time direction of the torn spectrogram, first spectrum blocks of t consecutive time steps [t0, t0 + t) are selected, and then a mask sequence [w1, …] is applied to these first spectrum blocks, where each w is a number randomly selected from the uniform distribution on [0, W] and W is a time-mask parameter. In a specific embodiment, selecting different t yields different first mask spectrograms, so that a plurality of first mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the first mask spectrograms and all the torn spectrograms are put together to form a second voice training sample set, further increasing the number and richness of the samples. In this embodiment, the time length represented by t is smaller than the time length of the torn spectrogram, and t0 is an arbitrary time point in the torn spectrogram, provided the block [t0, t0 + t) fits inside the torn spectrogram.
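A minimal sketch of this first masking step. Whether the mask values replace or scale the spectrogram values is not fixed above, so replacement is an assumption here, as is the single-block form:

```python
import numpy as np

def apply_time_mask(spec: np.ndarray, W: float, t: int) -> np.ndarray:
    """Apply a mask sequence [w1, ..., wt] to t consecutive time steps
    [t0, t0 + t), each w drawn uniformly from [0, W], where W is the
    time-mask parameter."""
    out = spec.copy()
    num_frames = out.shape[1]
    if not 0 < t < num_frames:
        return out                             # block must fit inside the spectrogram
    t0 = np.random.randint(0, num_frames - t)  # random start of the block
    mask = np.random.uniform(0.0, W, size=t)   # mask sequence [w1, ..., wt]
    out[:, t0:t0 + t] = mask                   # broadcast over frequency bins
    return out
```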
In an embodiment, after the step S3 of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, the method further includes:
s6, selecting a plurality of second frequency spectrum blocks of different frequency channels in the frequency direction on the tear spectrogram;
and S7, applying a mask sequence to each second spectrum block to obtain a second mask spectrum map.
In the present embodiment, the second spectrum block is a block in the frequency direction, not a block in time. Specifically, in the frequency direction of the torn spectrogram, a mask sequence [v1, …] is applied to a spectrum block of n consecutive frequency channels [m0, m0 + n), where each v is a number randomly selected from the uniform distribution on [0, V] and V is a frequency-mask parameter. Similarly, selecting different n yields different second mask spectrograms, so that a plurality of second mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the second mask spectrograms and all the torn spectrograms are put together to form a third voice training sample set. In this embodiment, m0 is an arbitrary frequency channel in the torn spectrogram, provided the block [m0, m0 + n) fits inside the torn spectrogram.
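The frequency-direction counterpart, under the same replacement assumption:

```python
import numpy as np

def apply_freq_mask(spec: np.ndarray, V: float, n: int) -> np.ndarray:
    """Apply a mask sequence [v1, ..., vn] to n consecutive frequency
    channels [m0, m0 + n), each v drawn uniformly from [0, V], where V
    is the frequency-mask parameter."""
    out = spec.copy()
    num_channels = out.shape[0]
    if not 0 < n < num_channels:
        return out                                 # block must fit inside the spectrogram
    m0 = np.random.randint(0, num_channels - n)    # random start channel
    mask = np.random.uniform(0.0, V, size=(n, 1))  # mask sequence [v1, ..., vn]
    out[m0:m0 + n, :] = mask                       # broadcast over time steps
    return out
```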
In one embodiment, the step S2 of randomly selecting the time point in the time direction on the sound spectrogram includes:
s21, randomly adding masks in the time direction of the sound spectrogram to obtain a third mask spectrogram;
s22, randomly selecting the time point in the time direction on the third mask spectrogram.
In this embodiment, a mask is first added to the sound spectrogram, and then the time point is randomly selected in the time direction on the third mask spectrogram, so that richer samples can be obtained.
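Putting the sketches together, a hypothetical end-to-end derivation might look like the following; it reuses the illustrative tear_count, evenly_spaced_tear_points, tear_spectrogram, apply_time_mask and apply_freq_mask defined above, and the parameter defaults are placeholders:

```python
def derive_samples(spec, S=10, W=1.0, V=1.0, t=20, n=8):
    """Derive torn, time-masked and frequency-masked spectrograms from
    one sound spectrogram, forming an enlarged training sample set."""
    samples = [spec]
    num_tears = tear_count(spec.shape[1])
    for point in evenly_spaced_tear_points(spec.shape[1], num_tears):
        torn = tear_spectrogram(spec, point, S)
        samples.append(torn)
        samples.append(apply_time_mask(torn, W, t))  # first mask spectrogram
        samples.append(apply_freq_mask(torn, V, n))  # second mask spectrogram
    return samples
```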
According to the method for acquiring voice training samples, an original voice signal can be converted into a sound spectrogram, and a large number of torn spectrograms, first mask spectrograms and second mask spectrograms can be derived from that one sound spectrogram through tearing and masking; these can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, when voice information is acquired separately under different channel scenarios, directly using it as training samples would not yield an accurate voiceprint recognition model because the samples are too few; with this method, however, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, solving the small-sample problem and, in particular, the problem that a single voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
Referring to fig. 2, an embodiment of the present application further provides an apparatus for obtaining a speech training sample, including:
a conversion unit 10, configured to process a speech signal to obtain a sound spectrogram of the speech signal;
a selection unit 20 for randomly selecting a time point in a time direction on the sound spectrogram;
and the tearing unit 30 is configured to separate the sound spectrogram on the two sides of the tearing point in the time direction by using the time point as the tearing point, complete the tearing processing of the sound spectrogram, add transition information at the fracture according to a preset rule to obtain a torn spectrogram, and use the torn spectrogram as the voice training sample, where the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
In this embodiment, the converting unit 10 first converts the sample voice signal into a sound spectrogram, generally a Mel spectrogram; the specific conversion process can be implemented by any existing technique. After the selecting unit 20 randomly selects a time point, the tearing unit 30 tears the sound spectrogram by using the time point as a tearing point, that is, separates the sound spectrogram in time at that point. The separation can be done in several ways: for example, the first side of the sound spectrogram at the tearing point is fixed while the second side moves away from the first side; or the first side and the second side each move away from the other. In one embodiment, the first side may be fixed and the second side moved away from it by a distance s; then, on the original sound spectrogram, the second side is fixed and the first side is moved away by s, and so on, so that two torn spectrograms with different moving directions are obtained from one time point. In another embodiment, a side may also be moved a specified distance in a specified direction. Further, the process of randomly selecting a time point and tearing is repeated with a different time point selected each time, yielding a plurality of torn spectrograms corresponding to the sound spectrogram; finally, the sound spectrogram and the plurality of torn spectrograms form a first voice training sample set. With this technical scheme, a plurality of torn spectrograms can be derived from a single sound spectrogram, which enriches the number of voice training samples and alleviates the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, it addresses the problem that a voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
In one embodiment, the tearing unit 30 further includes:
and the adding unit is used for randomly adding the transition information to the fracture of the torn spectrogram; that is, the preset rule is to randomly add transition information at the fracture.
In this embodiment, because a fracture exists in the torn spectrogram, there may be a blank at the fracture. To improve the diversity of the training samples, transition information may be added in this blank, for example different smooth signals. The transition information can be preset; generally, a plurality of different pieces of transition information are preset, and one of them is randomly selected and added at the fracture. If the selected transition information does not exactly fill the blank, it can be scaled up or down proportionally so that it fits exactly. In another embodiment, if S is a positive integer, S kinds of transition information are set, each kind including a plurality of pieces of transition information with different contents; when transition information is added, one piece of the kind corresponding to the drawn separation distance s is randomly selected, further improving the diversity of the training samples.
In another embodiment, the preset rule is to add identical data at the fracture, such as all 0s, all 1s, or a continuously repeating pattern such as 010101.
In an embodiment, the apparatus for obtaining a speech training sample further includes:
an acquisition unit, configured to acquire a time length of the sound spectrogram;
the determining unit is used for determining the tearing processing times of the sound spectrogram according to the time length;
and the selecting unit is used for selecting the time points with the same number of times as the tearing times so as to tear the sound spectrogram for different times.
In this embodiment, the sound spectrogram cannot be torn an unlimited number of times, so the application determines the number of tears according to the length of the time information in the sound spectrogram. Specifically, a mapping table is set, in which one column is a time-length range and another column is the number of tears corresponding to that range; after the time length of the sound spectrogram is determined, the table is checked to find which range the time length falls into, and the corresponding number of tears is selected. The specific time-length ranges and tear counts can be set manually according to experience; the guiding idea is that the longer the time length, the more times the spectrogram can be torn, and vice versa.
In one embodiment, the selecting unit includes:
and the average selection module is used for evenly distributing the time points, in a number corresponding to the tearing times, within the time length, so as to perform tearing processing on the sound spectrogram at the different time points.
In this embodiment, the time points are evenly distributed within the time length; this distribution is fast, and the differences between samples are more uniform than with a random distribution.
In one embodiment, a sound spectrogram may be torn at only one time point, giving a torn spectrogram with a single fracture; in another embodiment, a sound spectrogram may be torn at a plurality of time points simultaneously, giving a torn spectrogram with a plurality of fractures.
In an embodiment, the apparatus for obtaining a speech training sample further includes:
the time spectrum unit is used for selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tearing spectrogram;
a first mask unit, configured to apply a mask sequence to each first spectrum block to obtain a first mask spectrum map.
In this embodiment, in the time direction of the torn spectrogram, first spectrum blocks of t consecutive time steps [t0, t0 + t) are selected, and then a mask sequence [w1, …] is applied to these first spectrum blocks, where each w is a number randomly selected from the uniform distribution on [0, W] and W is a time-mask parameter. In a specific embodiment, selecting different t yields different first mask spectrograms, so that a plurality of first mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the first mask spectrograms and all the torn spectrograms are put together to form a second voice training sample set, further increasing the number and richness of the samples. In this embodiment, the time length represented by t is smaller than the time length of the torn spectrogram, and t0 is an arbitrary time point in the torn spectrogram, provided the block [t0, t0 + t) fits inside the torn spectrogram.
In an embodiment, the apparatus for obtaining a speech training sample further includes:
a frequency spectrum unit, configured to select a plurality of second spectrum blocks of different frequency channels in a frequency direction on the tear spectrogram;
and the second mask unit is used for applying a mask sequence to each second spectrum block to obtain a second mask spectrum map.
In the present embodiment, the second spectrum block is a block in the frequency direction, not a block in time. Specifically, in the frequency direction of the torn spectrogram, a mask sequence [v1, …] is applied to a spectrum block of n consecutive frequency channels [m0, m0 + n), where each v is a number randomly selected from the uniform distribution on [0, V] and V is a frequency-mask parameter. Similarly, selecting different n yields different second mask spectrograms, so that a plurality of second mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the second mask spectrograms and all the torn spectrograms are put together to form a third voice training sample set. In this embodiment, m0 is an arbitrary frequency channel in the torn spectrogram, provided the block [m0, m0 + n) fits inside the torn spectrogram.
In one embodiment, the selecting unit 20 includes:
the mask module is used for randomly adding masks in the time direction on the sound spectrogram to obtain a third mask spectrogram;
a selection module configured to randomly select the time point in a time direction on the third mask spectrogram.
In this embodiment, a mask is first added to the sound spectrogram, and then the time point is randomly selected in the time direction on the third mask spectrogram, so that richer samples can be obtained.
The apparatus for acquiring voice training samples can convert an original voice signal into a sound spectrogram and derive a large number of torn spectrograms, first mask spectrograms and second mask spectrograms from that one sound spectrogram through tearing and masking; these can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, when voice information is acquired separately under different channel scenarios, directly using it as training samples would not yield an accurate voiceprint recognition model because the samples are too few; with this apparatus, however, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, solving the small-sample problem and, in particular, the problem that a single voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
Referring to fig. 3, an embodiment of the present application further provides a computer device, which includes a memory and a processor, the memory storing a computer program; the internal structure of the computer device may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as sample sets. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for acquiring voice training samples. Specifically, the method comprises the following steps:
a method for acquiring a voice training sample comprises the following steps: processing a voice signal to obtain a sound spectrogram of the voice signal; randomly selecting a time point in a time direction on the sound spectrogram; and taking the time point as a tearing point, separating the sound frequency spectrograms on two sides of the tearing point in the time direction to finish tearing processing of the sound frequency spectrograms, adding transition information at the breakage part according to a preset rule to obtain the tearing frequency spectrograms, and taking the tearing frequency spectrograms as the voice training samples, wherein the separation distance of the sound frequency spectrograms on two sides of the tearing point is S, the S is a number randomly selected from uniform distribution of [0, S ], and the S is a time deformation parameter.
In one embodiment, the step of adding the transition information at the fracture according to a preset rule comprises: randomly adding the transition information to the fracture of the torn spectrogram.
In one embodiment, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises: acquiring the time length of the sound spectrogram; determining the tearing processing times of the sound spectrogram according to the time length; and selecting the time points with the same number of times as the tearing times so as to tear the sound spectrogram for different times.
In one embodiment, the step of selecting the same number of time points as the number of tearing processes to tear the sound spectrogram different times comprises: and equally distributing the time points with the number corresponding to the tearing times in the time length so as to perform tearing on the sound frequency spectrogram for different times.
In an embodiment, after the step of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, the method includes: selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the torn spectrogram; and applying a mask sequence to each first spectrum block to obtain a first mask spectrogram.
In an embodiment, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing process on the sound spectrograms, and adding transition information at the fracture according to a preset rule to obtain the tearing spectrograms, the method further includes: selecting a second spectrum block of a plurality of different frequency channels in the frequency direction on the tear spectrogram; and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
In one embodiment, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises: randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram; randomly selecting the time point in a time direction on the third masked spectrogram.
The computer device of the embodiment of the application can convert an original voice signal into a sound spectrogram and derive a large number of torn spectrograms, first mask spectrograms and second mask spectrograms from that one sound spectrogram through tearing and masking; these can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, when voice information is acquired separately under different channel scenarios, directly using it as training samples would not yield an accurate voiceprint recognition model because the samples are too few; with this method, however, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, solving the small-sample problem and, in particular, the problem that a single voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a method for obtaining a speech training sample. Specifically, the method comprises the following steps:
a method for acquiring a voice training sample comprises the following steps: processing a voice signal to obtain a sound spectrogram of the voice signal; randomly selecting a time point in a time direction on the sound spectrogram; and taking the time point as a tearing point, separating the sound frequency spectrograms on two sides of the tearing point in the time direction to finish tearing processing of the sound frequency spectrograms, adding transition information at the breakage part according to a preset rule to obtain the tearing frequency spectrograms, and taking the tearing frequency spectrograms as the voice training samples, wherein the separation distance of the sound frequency spectrograms on two sides of the tearing point is S, the S is a number randomly selected from uniform distribution of [0, S ], and the S is a time deformation parameter.
In one embodiment, the step of adding the transition information at the fracture according to a preset rule includes: randomly adding the transition information to the fracture of the torn spectrogram.
In one embodiment, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises: acquiring the time length of the sound spectrogram; determining the tearing processing times of the sound spectrogram according to the time length; and selecting the time points with the same number as the tearing times so as to perform tearing processing on the sound spectrogram at different times.
In one embodiment, the step of selecting the same number of time points as the number of tearing processes comprises: equally distributing the time points, in a number corresponding to the tearing times, within the time length, so as to perform tearing processing on the sound spectrogram at the different time points.
In an embodiment, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing process on the sound spectrograms, and adding transition information at the fracture according to a preset rule to obtain the tearing spectrograms, the method includes: selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tear spectrogram; and applying a mask sequence to each first spectrum block to obtain a first mask spectrogram.
In an embodiment, after the step of separating the sound spectrograms on both sides of the tearing point in the time direction by using the time point as the tearing point, completing the tearing process on the sound spectrograms, and adding transition information at the fracture according to a preset rule to obtain the tearing spectrograms, the method further includes: selecting a plurality of second frequency spectrum blocks of different frequency channels in the frequency direction on the tear spectrogram; and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
In one embodiment, the step of randomly selecting time points in a time direction on the sound spectrogram comprises: randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram; randomly selecting the time point in a temporal direction on the third mask spectrogram.
When the computer program is executed by a processor to realize the method for acquiring voice training samples, an original voice signal can be converted into a sound spectrogram, and a large number of torn spectrograms, first mask spectrograms and second mask spectrograms can be derived from that one sound spectrogram through tearing and masking; these can all be used as samples for training a voiceprint recognition model, alleviating the problem in the prior art that an accurate voiceprint recognition model cannot be obtained because the number of training samples is small. For example, when voice information is acquired separately under different channel scenarios, directly using it as training samples would not yield an accurate voiceprint recognition model because the samples are too few; with this method, however, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, solving the small-sample problem and, in particular, the problem that a single voiceprint recognition model cannot be trained well when only a few samples are available in different channel scenarios.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing related hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
Claims (10)
1. A method for obtaining a voice training sample is characterized by comprising the following steps:
processing a voice signal to obtain a sound spectrogram of the voice signal;
determining the tearing processing times of the sound spectrogram according to the time length of the sound spectrogram;
randomly selecting a time point in the time direction on the sound spectrogram according to the tearing processing times;
and separating the sound spectrogram on the two sides of the tearing point in the time direction by taking the time point as the tearing point to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
2. The method for acquiring the voice training sample according to claim 1, wherein the step of adding the transition information at the fracture according to the preset rule comprises:
randomly adding the transition information to the fracture of the torn spectrogram.
3. The method for acquiring the voice training sample according to claim 1, wherein the determining the number of tearing processes of the sound spectrogram according to the time length of the sound spectrogram comprises:
acquiring the time length of the sound spectrogram;
determining the tearing processing times of the sound spectrogram according to the time length;
and selecting the time points with the same number of times as the tearing times so as to tear the sound spectrogram for different times.
4. The method of claim 3, wherein the step of selecting the same number of time points as the number of times of the tearing process to perform the tearing process on the sound spectrogram for different times comprises:
and equally distributing the time points with the number corresponding to the tearing times in the time length so as to perform tearing on the sound spectrogram for different times.
5. The method for acquiring the voice training sample according to claim 1, wherein after the step of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain the torn spectrogram, the method comprises:
selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the tear spectrogram;
and applying a mask sequence to each first spectrum block to obtain a first mask spectrum diagram.
6. The method for acquiring the voice training sample according to claim 1, wherein after the step of separating the sound spectrogram on both sides of the tearing point in the time direction by using the time point as the tearing point to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to a preset rule to obtain the torn spectrogram, the method further comprises:
selecting a second spectrum block of a plurality of different frequency channels in the frequency direction on the tear spectrogram;
and applying a mask sequence to each second spectrum block to obtain a second mask spectrum map.
7. The method of claim 1, wherein the step of randomly selecting the time point in the time direction on the sound spectrogram comprises:
randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram;
randomly selecting the time point in a temporal direction on the third mask spectrogram.
8. An apparatus for obtaining a speech training sample, comprising:
the conversion unit is used for processing a voice signal to obtain a sound spectrogram of the voice signal;
the selecting unit is used for determining the tearing processing times of the sound spectrogram according to the time length of the sound spectrogram;
randomly selecting a time point in the time direction on the sound spectrogram according to the tearing processing times;
and the tearing unit is used for separating the sound spectrogram on the two sides of the tearing point in the time direction by taking the time point as the tearing point, completing the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution on [0, S], and S is a time-deformation parameter.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010093613.XA CN111370002B (en) | 2020-02-14 | 2020-02-14 | Method and device for acquiring voice training sample, computer equipment and storage medium |
PCT/CN2020/093092 WO2021159635A1 (en) | 2020-02-14 | 2020-05-29 | Speech training sample obtaining method and apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010093613.XA CN111370002B (en) | 2020-02-14 | 2020-02-14 | Method and device for acquiring voice training sample, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111370002A CN111370002A (en) | 2020-07-03 |
CN111370002B true CN111370002B (en) | 2022-08-19 |
Family
ID=71206253
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010093613.XA Active CN111370002B (en) | 2020-02-14 | 2020-02-14 | Method and device for acquiring voice training sample, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111370002B (en) |
WO (1) | WO2021159635A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112017638A (en) * | 2020-09-08 | 2020-12-01 | 北京奇艺世纪科技有限公司 | Voice semantic recognition model construction method, semantic recognition method, device and equipment |
CN113241062B (en) * | 2021-06-01 | 2023-12-26 | 平安科技(深圳)有限公司 | Enhancement method, device, equipment and storage medium for voice training data set |
CN115580682B (en) * | 2022-12-07 | 2023-04-28 | 北京云迹科技股份有限公司 | Method and device for determining connection and disconnection time of robot dialing |
CN116092512A (en) * | 2022-12-30 | 2023-05-09 | 重庆邮电大学 | Small sample voice separation method based on data generation |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR940002854B1 (en) * | 1991-11-06 | 1994-04-04 | 한국전기통신공사 | Sound synthesizing system |
CN104408681A (en) * | 2014-11-04 | 2015-03-11 | 南昌大学 | Multi-image hiding method based on fractional mellin transform |
CN104484872A (en) * | 2014-11-27 | 2015-04-01 | 浙江工业大学 | Interference image edge extending method based on directions |
US10373073B2 (en) * | 2016-01-11 | 2019-08-06 | International Business Machines Corporation | Creating deep learning models using feature augmentation |
CN106898357B (en) * | 2017-02-16 | 2019-10-18 | 华南理工大学 | A Vector Quantization Method Based on Normal Distribution Law |
CN108830277B (en) * | 2018-04-20 | 2020-04-21 | 平安科技(深圳)有限公司 | Training method and device of semantic segmentation model, computer equipment and storage medium |
CN108922560B (en) * | 2018-05-02 | 2022-12-02 | 杭州电子科技大学 | Urban noise identification method based on hybrid deep neural network model |
CN110148400B (en) * | 2018-07-18 | 2023-03-17 | 腾讯科技(深圳)有限公司 | Pronunciation type recognition method, model training method, device and equipment |
CN109087632B (en) * | 2018-08-17 | 2023-06-06 | 平安科技(深圳)有限公司 | Speech processing method, device, computer equipment and storage medium |
CN110379414B (en) * | 2019-07-22 | 2021-12-03 | 出门问问(苏州)信息科技有限公司 | Acoustic model enhancement training method and device, readable storage medium and computing equipment |
CN110751177A (en) * | 2019-09-17 | 2020-02-04 | 阿里巴巴集团控股有限公司 | Training method, prediction method and device of classification model |
- 2020-02-14: CN application CN202010093613.XA filed; granted as CN111370002B (active)
- 2020-05-29: PCT application PCT/CN2020/093092 filed; published as WO2021159635A1 (application filing)
Also Published As
Publication number | Publication date |
---|---|
WO2021159635A1 (en) | 2021-08-19 |
CN111370002A (en) | 2020-07-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |