
CN110335616B - Voice data noise reduction method, device, computer equipment and storage medium - Google Patents


Info

Publication number: CN110335616B (application number CN201910650447.6A)
Authority: CN (China)
Prior art keywords: audio data, initial, noise reduction, feature combination, feature
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)
Other versions: CN110335616A (Chinese, zh)
Inventors: 欧阳碧云, 王晶晶
Current and original assignee: Ping An Technology Shenzhen Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Ping An Technology Shenzhen Co Ltd
Priority: CN201910650447.6A; published as CN110335616A; granted and published as CN110335616B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application relates to an artificial-intelligence-based method, apparatus, computer device, and storage medium for noise reduction of voice data. The method comprises: receiving a noise reduction request sent by a terminal, and acquiring the feature combinations corresponding to the audio data to be processed together with the association relations between the features in each combination; calculating the distinguishing degree of each feature combination from its features and their association relations; screening the feature combinations against a preset distinguishing degree threshold to obtain initial feature combinations; screening the initial feature combinations with preset evaluation indexes to obtain available feature combinations; acquiring the audio data to be processed corresponding to the available feature combinations and generating first initial audio data based on the distinguishing degree; and performing noise reduction on the first initial audio data with a deep learning noise reduction model to generate noise-reduced voice data. By denoising voice data with a deep learning noise reduction model on the basis of the distinguishing degree, the method improves the noise reduction effect on voice data.

Description

Voice data noise reduction method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method and apparatus for noise reduction of speech data, a computer device, and a storage medium.
Background
With the continuing development of voice processing technology, voice data is widely used in daily life, and different users have different requirements for voice quality. Under the varied conditions of everyday use, interference from various noise sources and equipment signals degrades voice quality to some extent and fails to meet users' requirements; voice noise reduction technology has therefore emerged.
A common current voice noise reduction method determines the voice frames and noise frames in a voice signal from its signal-to-noise-ratio curve and applies noise reduction only to the detected noise frames. This approach is relatively simple, however, and the accuracy with which voice frames and noise frames are distinguished still needs improvement: an inaccurate determination degrades voice quality and reduces the noise reduction effect.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a voice data noise reduction method, apparatus, computer device, and storage medium capable of improving the voice noise reduction effect.
A method of noise reduction of speech data, the method comprising:
receiving a noise reduction request of audio data to be processed, which is sent by a terminal, and acquiring the audio data to be processed;
acquiring the feature combinations corresponding to the audio data to be processed, acquiring the association relations between the features in each feature combination, and calculating the distinguishing degree of each feature combination according to its features and the association relations between them;
screening each feature combination according to a preset distinguishing degree threshold to obtain initial feature combinations;
screening the initial feature combinations by using preset evaluation indexes to obtain available feature combinations conforming to the preset evaluation indexes;
acquiring the audio data to be processed corresponding to the available feature combinations, and generating first initial audio data based on the distinguishing degree;
and carrying out noise reduction processing on the first initial audio data based on the deep learning noise reduction model to generate noise-reduced voice data.
In one embodiment, the method for obtaining the deep learning noise reduction model includes:
acquiring, from a training sample, effective audio data corresponding to feature combinations that do not exceed the distinguishing degree threshold, and second initial audio data corresponding to the effective audio data;
slicing the effective audio data and the second initial audio data according to a preset length;
generating a first voiceprint map of the effective audio data from the sliced effective audio data, and extracting first voiceprint parameters of the effective audio data from the first voiceprint map;
generating a second voiceprint map of the second initial audio data from the sliced second initial audio data, and extracting second voiceprint parameters of the second initial audio data from the second voiceprint map;
and taking the second voiceprint parameters of the second initial audio data as the input of a deep learning model, taking the first voiceprint parameters of the effective audio data at the corresponding moments as the output of the deep learning model, and training the deep learning model to obtain the deep learning noise reduction model.
In one embodiment, obtaining feature combinations corresponding to the audio data to be processed, and calculating a discrimination degree of each feature combination includes:
acquiring the features corresponding to the audio data to be processed and the association relations between the features;
generating the feature combinations corresponding to the audio data to be processed according to the features and the association relations between them;
and calculating the distinguishing degree of each feature combination according to its features and the association relations between them.
In one embodiment, the screening each feature combination according to the preset discrimination threshold to obtain an initial feature combination includes:
comparing the distinguishing degree of each feature combination with the distinguishing degree threshold;
and acquiring the feature combinations whose distinguishing degree exceeds the threshold to generate the initial feature combinations.
In one embodiment, the screening the initial feature combinations by using a preset evaluation index to obtain available feature combinations includes:
acquiring a preset evaluation index, the preset evaluation index comprising an AUC value, an accuracy rate, and a recall rate;
screening the initial feature combinations according to the AUC value, the accuracy rate, and the recall rate;
and obtaining the initial feature combinations meeting the requirements to generate the available feature combinations.
In one embodiment, the noise reduction processing is performed on the first initial audio data based on the deep learning noise reduction model, and the generating noise-reduced voice data includes:
slicing the first initial audio data according to a preset length;
generating a to-be-processed voiceprint map of the first initial audio data from the sliced first initial audio data, and extracting to-be-processed voiceprint parameters of the first initial audio data from the to-be-processed voiceprint map;
and inputting the to-be-processed voiceprint parameters into the deep learning noise reduction model to obtain the noise-reduced voice data.
In one embodiment, before the step of calculating the distinguishing degree of each feature combination, the method further comprises:
acquiring the data type corresponding to each feature combination according to the correspondence between feature combinations and data types, the data types comprising a digital type, a byte type, and a text type;
acquiring the data processing mode corresponding to each data type according to the correspondence between data types and data processing modes, the data processing modes comprising judgment processing, assignment processing, and declaration processing;
and performing data processing on the audio data to be processed corresponding to each feature combination according to the respective data processing mode.
A voice data noise reduction device, the device comprising:
the receiving module is used for receiving a noise reduction request of the audio data to be processed, which is sent by the terminal, and acquiring the audio data to be processed;
the distinguishing degree calculating module is used for obtaining the feature combination corresponding to the audio data to be processed, obtaining the association relation between the features in the feature combination, and calculating the distinguishing degree of each feature combination according to the feature corresponding to the feature combination and the association relation between the features;
the initial feature combination acquisition module is used for screening each feature combination according to a preset distinguishing degree threshold value to acquire an initial feature combination;
the available feature combination acquisition module is used for screening the initial feature combination by utilizing a preset evaluation index to obtain an available feature combination conforming to the preset evaluation index;
the initial audio data generation module is used for acquiring the audio data to be processed corresponding to the available feature combinations and generating first initial audio data based on the distinction;
the noise reduction module is used for carrying out noise reduction processing on the first initial audio data based on the deep learning noise reduction model to generate noise reduced voice data.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
receiving a noise reduction request of audio data to be processed, which is sent by a terminal, and acquiring the audio data to be processed;
acquiring the feature combinations corresponding to the audio data to be processed, acquiring the association relations between the features in each feature combination, and calculating the distinguishing degree of each feature combination according to its features and the association relations between them;
screening each feature combination according to a preset distinguishing degree threshold to obtain initial feature combinations;
screening the initial feature combinations by using preset evaluation indexes to obtain available feature combinations conforming to the preset evaluation indexes;
acquiring the audio data to be processed corresponding to the available feature combinations, and generating first initial audio data based on the distinguishing degree;
and carrying out noise reduction processing on the first initial audio data based on the deep learning noise reduction model to generate noise-reduced voice data.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
receiving a noise reduction request of audio data to be processed, which is sent by a terminal, and acquiring the audio data to be processed;
acquiring the feature combinations corresponding to the audio data to be processed, acquiring the association relations between the features in each feature combination, and calculating the distinguishing degree of each feature combination according to its features and the association relations between them;
screening each feature combination according to a preset distinguishing degree threshold to obtain initial feature combinations;
screening the initial feature combinations by using preset evaluation indexes to obtain available feature combinations conforming to the preset evaluation indexes;
acquiring the audio data to be processed corresponding to the available feature combinations, and generating first initial audio data based on the distinguishing degree;
and carrying out noise reduction processing on the first initial audio data based on the deep learning noise reduction model to generate noise-reduced voice data.
According to the above voice data noise reduction method, apparatus, computer device, and storage medium, the feature combinations corresponding to the audio data to be processed are traversed with a preset distinguishing degree threshold to obtain the combinations that meet the threshold, and those combinations are then screened with preset evaluation indexes to obtain the available feature combinations, which strengthens the reliability of distinguishing voice data from noise data. Noise reduction is performed by a deep learning noise reduction model on the first initial audio data, which is generated on the basis of the distinguishing degree, to obtain noise-reduced voice data. On the basis of the improved distinction between voice data and noise data, the trained deep learning noise reduction model performs noise reduction on the voice data quickly and efficiently, further improving the noise reduction effect.
Drawings
FIG. 1 is an application scenario diagram of a method of noise reduction of speech data in one embodiment;
FIG. 2 is a flow chart of a method of noise reduction of voice data according to one embodiment;
FIG. 3 is a flowchart illustrating steps for obtaining a deep learning noise reduction model in one embodiment;
FIG. 4 is a block diagram of a voice data noise reduction device in one embodiment;
fig. 5 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The voice data noise reduction method provided by this application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 receives a noise reduction request for the audio data to be processed sent by the terminal 102, acquires the audio data, acquires the feature combinations corresponding to it together with the association relations between the features in each combination, and calculates the distinguishing degree of each feature combination from its features and their association relations. The server 104 then screens the feature combinations against a preset distinguishing degree threshold to obtain initial feature combinations, screens those with preset evaluation indexes to obtain available feature combinations conforming to the indexes, acquires the audio data to be processed corresponding to the available combinations, generates first initial audio data based on the distinguishing degree, and performs noise reduction on the first initial audio data with a deep learning noise reduction model to generate noise-reduced voice data. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, or tablet computer; the server 104 may be implemented as a stand-alone server or as a cluster of servers.
In one embodiment, as shown in fig. 2, a method for noise reduction of voice data is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
s202, receiving a noise reduction request of the audio data to be processed, which is sent by the terminal, and acquiring the audio data to be processed.
S204, obtaining the feature combinations corresponding to the audio data to be processed, obtaining the association relations between the features in each feature combination, and calculating the distinguishing degree of each feature combination according to its features and the association relations between them.
Specifically, the audio data to be processed that corresponds to the noise reduction request has a plurality of features. By acquiring these features and the association relations between them, the server can generate the feature combinations corresponding to the audio data to be processed, and can then calculate the distinguishing degree of each feature combination from its features and their association relations.
In this scheme, the features of the audio data include the sampling frequency, bit rate, number of channels, frame rate, zero-crossing rate, and short-time energy. The sampling frequency is the number of sample points taken from the analog signal per unit time; numbering the sampled points converts the signal into a digital signal. The bit rate reflects the number of levels into which the amplitude of the analog signal (which determines loudness) is divided. The number of channels is the number of audio channels; the frame rate is the number of sound frames per unit time, where one frame may contain several sound samples. The zero-crossing rate is the number of zero crossings of the signal within each frame and characterizes the frequency content of the audio. The short-time energy reflects the intensity of the audio signal at different moments. Because different audio data take different values for these features, the feature combinations generated from them also differ.
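The zero-crossing rate and short-time energy described above can be computed per frame with a few lines of NumPy. This is an illustrative sketch, not code from the patent; the frame length and hop size (25 ms and 10 ms at 16 kHz) are common choices assumed here for concreteness.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes in each frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def short_time_energy(frames):
    """Sum of squared samples per frame, reflecting intensity over time."""
    return np.sum(frames.astype(float) ** 2, axis=1)

# Example: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(tone)
zcr = zero_crossing_rate(frames)       # roughly 2 * 440 / 16000 per frame
energy = short_time_energy(frames)     # roughly frame_len / 2 for a unit sine
```

For a pure tone the zero-crossing rate is nearly constant across frames, whereas for a voice-plus-noise mixture both features vary frame to frame, which is what makes them useful for distinguishing voice from noise.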
Further, before the distinguishing degree is calculated, the audio data to be processed undergoes corresponding data processing so as to improve the accuracy of the distinguishing degree computed for each feature combination. Specifically:
Different processing modes are executed for initial audio data of different data types. The data types include the digital type, byte type, and text type, and the corresponding processing modes include judgment processing, assignment processing, and declaration processing. For digital initial audio data, judgment processing is executed: a preset value range is obtained and compared with the value of the data, and whether the value conforms to the preset range is judged; the conforming digital initial audio data is extracted, the noise data among it is deleted, and digital available data is generated.
For byte-type initial audio data, assignment processing is executed: whether the value of the data conforms to a preset value is judged; when it does not, the preset value is assigned to the corresponding byte-type initial audio data, the noise data in the assigned byte-type data is deleted, and byte-type available data is generated.
For text-type initial audio data, declaration processing is executed: the components of the data are obtained and compared with preset components; when they are inconsistent, the text-type initial audio data is declared as the preset components, the noise data in it is deleted, and text-type available data is generated.
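The three per-type treatments above amount to a dispatch on data type. The sketch below is a minimal illustration of that dispatch; the preset range, preset byte value, and preset component are hypothetical placeholders, since the patent does not specify concrete values.

```python
# Hypothetical presets; the patent gives no concrete values for these.
PRESET_RANGE = (0.0, 1.0)    # judgment processing: allowed numeric range
PRESET_BYTE = b"\x00"        # assignment processing: fallback byte value
PRESET_COMPONENT = "pcm"     # declaration processing: expected text component

def process_field(value):
    """Route a raw field to the treatment matching its data type."""
    if isinstance(value, (int, float)):            # digital -> judgment processing
        lo, hi = PRESET_RANGE
        return value if lo <= value <= hi else None    # drop out-of-range noise
    if isinstance(value, bytes):                   # byte -> assignment processing
        return value if value == PRESET_BYTE else PRESET_BYTE
    if isinstance(value, str):                     # text -> declaration processing
        return value if value == PRESET_COMPONENT else PRESET_COMPONENT
    raise TypeError(f"unsupported data type: {type(value).__name__}")

cleaned = [process_field(v) for v in (0.5, 7.2, b"\x01", "wav")]
# -> [0.5, None, b"\x00", "pcm"]
```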
S206, screening the feature combinations according to a preset distinguishing degree threshold to obtain the initial feature combinations. Specifically, the server compares the distinguishing degree of each feature combination with the threshold and obtains the combinations whose distinguishing degree exceeds it, namely the initial feature combinations. The distinguishing degree threshold is used to traverse and screen the feature combinations corresponding to the audio data to be processed, and thereby to obtain the audio data corresponding to the combinations that meet the threshold. That is, the server obtains the preset threshold, traverses the distinguishing degree of each feature combination against it, obtains the combinations whose distinguishing degree exceeds it, and generates the initial feature combinations.
Further, the noise data in the initial audio data that falls below the threshold may be deleted. Data whose distinguishing degree is lower than the initial threshold is invalid audio data, while the data corresponding to the initial feature combinations, whose distinguishing degree exceeds the threshold, is the initial audio data. In this scheme, the distinguishing degree threshold range may be set to 0.8 to 1: invalid audio data below the threshold of 0.8 is noise data on which no noise reduction operation can be performed, is not valid audio data, and is deleted, while initial audio data exceeding the threshold is subjected to noise reduction to generate valid audio data.
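Screening feature combinations against the distinguishing degree threshold is a straightforward filter. The sketch below assumes the distinguishing degrees have already been computed (the combination names and scores are made up for illustration), and uses the 0.8 lower bound mentioned above.

```python
DISCRIMINATION_THRESHOLD = 0.8  # lower bound of the 0.8-1 range given in the text

def screen_combinations(scores):
    """Keep feature combinations whose distinguishing degree exceeds the threshold.

    `scores` maps a feature-combination name to its distinguishing degree;
    combinations at or below the threshold are treated as noise-like and dropped.
    """
    return {name: s for name, s in scores.items() if s > DISCRIMINATION_THRESHOLD}

# Hypothetical distinguishing degrees for three candidate combinations.
scores = {"zcr+energy": 0.91, "bitrate+channels": 0.42, "framerate+zcr": 0.85}
initial = screen_combinations(scores)
# -> keeps "zcr+energy" and "framerate+zcr"; drops "bitrate+channels"
```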
S208, screening the initial feature combinations by using a preset evaluation index to obtain available feature combinations which accord with the preset evaluation index.
Specifically, the preset evaluation index acquired by the server comprises an AUC value, an accuracy rate and a recall rate, and the server screens the initial feature combination according to the acquired AUC value, accuracy rate and recall rate to acquire the initial feature combination meeting the requirements and generate the available feature combination.
The AUC value is the Area Under the ROC Curve and ranges between 0.5 and 1. The ROC (receiver operating characteristic) curve is a sensitivity curve: each point on it reflects the same sensitivity, that is, responses to the same signal stimulus obtained under two different judgment criteria. The curve is plotted with the false positive rate on the horizontal axis and the true positive rate on the vertical axis, using the different results obtained under different judgment criteria for a given stimulus condition.
Accuracy (precision) refers to the ratio of the number of samples correctly classified by the classifier to the total number of samples on a given test dataset, i.e., the accuracy on the test dataset when the loss function is the 0-1 loss. As a formula: accuracy = number of relevant files retrieved by the system / total number of files retrieved by the system.
Recall is a measure of coverage: of all positive examples, how many are classified as positive. As a formula: recall = number of relevant files retrieved by the system / total number of relevant files in the system.
Further, the server acquires a corresponding relation between the AUC value and the initial feature combination, and screens the initial feature combination according to the preset AUC value to acquire the initial feature combination which accords with the preset AUC value.
The server can set the AUC value to 0.8 according to the value range of the AUC value, and screen the initial feature combination by using the AUC value of 0.8 to obtain the initial feature combination conforming to the AUC value. The server acquires the corresponding relation between the accuracy and the initial feature combination, screens the initial feature combination according to the preset accuracy, and acquires the initial feature combination which accords with the preset accuracy. The server acquires the corresponding relation between the recall rate and the initial feature combination, and screens the initial feature combination according to the preset recall rate to obtain the initial feature combination conforming to the preset recall rate. Finally, the server generates available feature combinations according to the initial feature combinations which accord with the preset evaluation index AUC value, accuracy and recall rate.
S210, obtaining the audio data to be processed corresponding to the available feature combinations, and generating first initial audio data based on the distinction degree.
S212, noise reduction processing is carried out on the first initial audio data based on the deep learning noise reduction model, and noise reduced voice data are generated.
Specifically, the server performs slicing processing on the first initial audio data according to a preset length, generates a to-be-processed voiceprint map of the first initial audio data according to the sliced first initial audio data, and extracts to-be-processed voiceprint parameters of the first initial audio data from the to-be-processed voiceprint map. Therefore, the voice print parameters to be processed can be input into the deep learning noise reduction model, and noise reduced voice data are obtained.
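Slicing the audio and deriving a spectrogram (serving here as the "voiceprint map") can be sketched with SciPy. The one-second slice length and STFT parameters are illustrative assumptions, not values from the patent.

```python
import numpy as np
from scipy.signal import spectrogram

def slice_audio(x, sr, slice_sec=1.0):
    """Cut audio into fixed-length slices, discarding a trailing remainder."""
    n = int(sr * slice_sec)
    return [x[i : i + n] for i in range(0, len(x) - n + 1, n)]

def voiceprint_parameters(slice_, sr):
    """Log-magnitude spectrogram of one slice, used as the model's input features."""
    f, t, sxx = spectrogram(slice_, fs=sr, nperseg=256, noverlap=128)
    return np.log(sxx + 1e-10)   # small offset avoids log(0)

sr = 16000
audio = np.sin(2 * np.pi * 300 * np.arange(3 * sr) / sr)   # 3 s test tone
slices = slice_audio(audio, sr)
params = [voiceprint_parameters(s, sr) for s in slices]
# Each element of params is a (frequency bins x time frames) array.
```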
Further, the server obtains the to-be-processed voiceprint parameters of the first initial audio data that requires noise reduction and inputs them into the deep learning noise reduction model. The model matches the second voiceprint parameters against the to-be-processed voiceprint parameters, obtains the first initial audio data whose to-be-processed parameters conform to the second voiceprint parameters, and performs noise reduction on that data using the first voiceprint parameters, yielding the noise-reduced voice data.
According to this voice data noise reduction method, the feature combinations corresponding to the audio data to be processed are traversed with a preset distinguishing degree threshold to obtain the combinations that meet the threshold, and those combinations are then screened with preset evaluation indexes to obtain the available feature combinations, which strengthens the reliability of distinguishing voice data from noise data. Noise reduction is performed by a deep learning noise reduction model on the first initial audio data, which is generated on the basis of the distinguishing degree, to obtain noise-reduced voice data. On the basis of the improved distinction between voice data and noise data, the trained deep learning noise reduction model performs noise reduction on the voice data quickly and efficiently, further improving the noise reduction effect.
In one embodiment, as shown in FIG. 3, there is provided a step of obtaining a deep learning noise reduction model, comprising:
s302, obtaining effective audio data corresponding to the feature combination which does not exceed the distinguishing degree threshold value and corresponding second initial audio data from the training sample.
Specifically, the training sample includes valid audio data corresponding to a feature combination that does not exceed the discrimination threshold, second initial audio data corresponding to the valid audio data, and invalid audio data that breaks through the threshold. And the server acquires effective audio data and second initial audio data corresponding to the effective audio data from the training samples according to the requirements of training the deep learning noise reduction model.
S304, slicing the effective audio data and the second initial audio data according to the preset length.
Specifically, the server also needs to preprocess the valid audio data and the second initial audio data before slicing them, so as to obtain the valid audio data and the second initial audio data in a predetermined format. The server further acquires a predetermined slice length, and slices each pair of valid audio data and second initial audio data in the predetermined format according to that predetermined length.
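The slicing in step S304 can be sketched as follows. The slice length and the sample values are illustrative assumptions, not values fixed by this embodiment; the point is that the valid audio and the second initial audio are cut with the same predetermined length so that slices at the same index correspond to the same moment.

```python
def slice_audio(samples, slice_len):
    """Split a sequence of audio samples into consecutive slices of a
    predetermined length; a final shorter remainder is dropped so that
    valid and initial audio stay aligned pair-by-pair."""
    return [samples[i:i + slice_len]
            for i in range(0, len(samples) - slice_len + 1, slice_len)]

# Placeholder sample sequences standing in for decoded audio.
valid = list(range(10))        # valid (clean) audio samples
initial = list(range(10, 20))  # second initial (noisy) audio samples

valid_slices = slice_audio(valid, 4)
initial_slices = slice_audio(initial, 4)
```

Because both sequences are sliced with the same length, `valid_slices[i]` and `initial_slices[i]` form one clean/noisy training pair for the same moment.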
S306, generating a first voiceprint map of the effective audio data according to the sliced effective audio data, and extracting first voiceprint parameters of the effective audio data from the first voiceprint map.
And S308, generating a second voiceprint map of the second initial audio data according to the sliced second initial audio data, and extracting second voiceprint parameters of the second initial audio data from the second voiceprint map.
Specifically, the server generates a first voiceprint map of the effective audio data and a second voiceprint map of the second initial audio data according to the sliced effective audio data and the sliced second initial audio data, and respectively extracts the first voiceprint parameters of the effective audio data from the first voiceprint map and the second voiceprint parameters of the second initial audio data from the second voiceprint map.
Among them, the currently known voiceprint maps mainly include: broadband voiceprints, narrowband voiceprints, amplitude voiceprints, contour voiceprints, time-spectrum voiceprints, and cross-section voiceprints (the last again divided into broadband and narrowband). The first two display how voice frequency and intensity change over time, and the middle three display how voice intensity or sound pressure changes over time; a cross-section voiceprint is simply a voiceprint graph showing the intensity and frequency characteristics of the sound wave at a certain point in time.
Voiceprint parameters, which are characteristics specific to audio data, refer to content-based digital signatures that can represent the important acoustic features of a piece of audio data; their primary purpose is to establish an efficient mechanism for comparing the perceptual auditory quality of two pieces of audio data. They can include, but are not limited to, acoustic features related to the anatomy of the human pronunciation mechanism, such as the spectrum, cepstrum, formants, pitch, reflection coefficients, and the like.
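A minimal stdlib sketch of extracting voiceprint parameters from one sliced frame is given below. A practical system would use richer features (e.g., cepstral coefficients); here only two illustrative quantities are computed: the short-time energy of the frame, and a naive DFT magnitude spectrum, which is essentially what a voiceprint map stacks over successive slices. The frame values are hypothetical.

```python
import cmath

def short_time_energy(frame):
    """Short-time energy: sum of squared sample amplitudes in one frame."""
    return sum(s * s for s in frame)

def magnitude_spectrum(frame):
    """Naive O(n^2) DFT magnitude spectrum of one frame; successive frame
    spectra stacked over time form a time-spectrum voiceprint map."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

frame = [0.0, 1.0, 0.0, -1.0]  # one sliced frame (illustrative values)
params = [short_time_energy(frame)] + magnitude_spectrum(frame)
```

For this frame the energy is 2.0 and the spectrum concentrates at the single oscillation frequency, which is the kind of per-slice parameter vector the later matching step compares.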
S310, taking the second voiceprint parameters of the second initial audio data as the input of the deep learning model, taking the first voiceprint parameters of the effective audio data at corresponding moments as the output of the deep learning model, and training the deep learning model to obtain the deep learning noise reduction model.
Specifically, the second voiceprint parameters of the second initial audio data are used as the input of the deep learning model and correspond to the audio data needing noise reduction, while the first voiceprint parameters of the effective audio data corresponding to the second initial audio data are used as the output of the deep learning model and correspond to the audio data after noise reduction. By acquiring effective audio data and second initial audio data from the training samples multiple times, extracting the corresponding second voiceprint parameters and first voiceprint parameters, and training the deep learning model, the deep learning noise reduction model can be obtained.
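The pairing of inputs and outputs in step S310 can be illustrated with a deliberately tiny stand-in for the deep learning model: a single linear unit trained by gradient descent to map the noisy (second) voiceprint parameter at each moment to the clean (first) voiceprint parameter at the corresponding moment. The model, learning rate, and toy data are all assumptions; only the input/output pairing scheme comes from the embodiment.

```python
def train(noisy, clean, epochs=2000, lr=0.1):
    """Fit clean ≈ w * noisy + b by gradient descent on mean squared error.
    Stands in for training the deep learning noise reduction model."""
    w, b = 0.0, 0.0
    n = len(noisy)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(noisy, clean)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(noisy, clean)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: the clean parameter is the noisy one with a constant offset
# removed (purely illustrative of "noisy in, clean out" supervision).
noisy_params = [1.0, 2.0, 3.0, 4.0]
clean_params = [0.5, 1.5, 2.5, 3.5]
w, b = train(noisy_params, clean_params)
```

The trained pair recovers the underlying mapping (w near 1, b near -0.5); a real deep network is trained the same way, only with voiceprint parameter vectors per slice instead of scalars.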
In the above steps, the server acquires the effective audio data corresponding to the feature combination and the second initial audio data corresponding to the feature combination which do not exceed the discrimination threshold from the training sample, performs slicing processing on the effective audio data and the second initial audio data according to a predetermined length, generates a first voiceprint map of the effective audio data according to the sliced effective audio data, extracts a first voiceprint parameter of the effective audio data from the first voiceprint map, generates a second voiceprint map of the second initial audio data according to the sliced second initial audio data, and extracts a second voiceprint parameter of the second initial audio data from the second voiceprint map. Therefore, the second voiceprint parameters of the second initial audio data can be used as the input of the deep learning model, the first voiceprint parameters of the effective audio data at the corresponding moment are used as the output of the deep learning model, the training of the deep learning model is realized, the deep learning noise reduction model applicable to the voice data is obtained, and the noise reduction effect on the voice data is improved.
In one embodiment, there is provided a method for acquiring feature combinations corresponding to audio data to be processed, and calculating a distinction degree of each feature combination, including:
The method comprises the steps that a server obtains features corresponding to audio data to be processed and association relations among the features; generating a feature combination corresponding to the audio data to be processed according to the features and the association relation among the features; and respectively calculating the distinguishing degree of each feature combination according to the corresponding feature of each feature combination and the association relation between the features.
Specifically, the audio data to be processed corresponding to the noise reduction processing request corresponds to a plurality of features, and the corresponding feature combination can be generated according to the plurality of features and each association relationship by acquiring the association relationship between the plurality of features. The server can generate the feature combination corresponding to the audio data to be processed according to the feature and the association relation between the features by acquiring the feature corresponding to the audio data to be processed and the association relation between the features. Therefore, the distinguishing degree of each feature combination can be calculated according to the corresponding feature of each feature combination and the association relation between the features.
In this scheme, the characteristics of the audio data include the sampling frequency, bit rate, channel number, frame rate, zero-crossing rate, and short-time energy. The sampling frequency represents the number of sampling points taken on the analog signal per unit time; each sampled point is assigned a numerical value, so that the analog signal can be converted into a digital signal. The bit rate represents the number of discrete levels into which the amplitude of the analog signal (which affects the loudness of the sound) is divided. The channel number represents the number of channels of the audio; the frame rate represents the number of sound frames per unit time, where one frame may contain a plurality of sound samples. The zero-crossing rate represents the number of signal zero crossings in each frame of the signal and is used to represent the frequency characteristics of the audio. The short-time energy is used to reflect the intensity of the audio signal at different moments. Since the values of the features corresponding to different audio data are different, the feature combinations generated from the features corresponding to different audio data are also different.
Further, different audio data have different values of the corresponding features, that is, the sampling frequency, bit rate, channel number, frame rate, zero-crossing rate, and short-time energy differ, and the association relationships between the corresponding features are likewise inconsistent, so the server can calculate the degree of distinction between the feature combinations of different audio data from the differing values and association relationships of the features.
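The embodiment does not fix a concrete formula for the degree of distinction, so the sketch below adopts one plausible assumption: combine two of the listed features, zero-crossing rate and short-time energy, into a single score, exploiting the fact that voiced speech tends to have high energy and a low zero-crossing rate while broadband noise shows the opposite. Both the combination and the score formula are illustrative, not the patented method.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs in a frame whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Sum of squared sample amplitudes in one frame."""
    return sum(s * s for s in frame)

def discrimination(frame):
    """Hypothetical degree-of-distinction score in [0, 1]: high for
    energetic low-ZCR (speech-like) frames, low for weak high-ZCR
    (noise-like) frames."""
    e, z = short_time_energy(frame), zero_crossing_rate(frame)
    return e / (e + z) if (e + z) > 0 else 0.0

speech_like = [0.9, 1.0, 0.8, 0.9, 1.0]    # smooth, high-amplitude frame
noise_like = [0.1, -0.1, 0.1, -0.1, 0.1]   # rapidly alternating, weak frame
```

With these toy frames the speech-like frame scores near 1 and the noise-like frame near 0, which is exactly the separation a useful feature combination is meant to provide.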
In the above steps, the server generates feature combinations corresponding to the audio data to be processed according to the features and the association relations between the features, and calculates the distinguishing degree of each feature combination according to the features corresponding to each feature combination and the association relations between the features. According to the method, corresponding distinguishing degrees are calculated for different feature combinations, effective audio data in voice data and noise data to be subjected to noise reduction processing are distinguished more quickly, and the working efficiency is improved.
In one embodiment, there is provided a method for screening each feature combination according to a preset discrimination threshold to obtain an initial feature combination, including:
The server compares the distinguishing degree of each feature combination with a distinguishing degree threshold value respectively; and acquiring the feature combination corresponding to the degree of distinction exceeding the threshold value of the degree of distinction, and generating an initial feature combination.
Specifically, the server obtains the feature combination corresponding to the discrimination degree exceeding the discrimination degree threshold by comparing the discrimination degree of each feature combination with the discrimination degree threshold, namely the initial feature combination. That is, the server obtains the preset distinguishing degree threshold value, and traverses the distinguishing degree corresponding to each feature combination according to the distinguishing degree threshold value to obtain the feature combination corresponding to the distinguishing degree exceeding the distinguishing degree threshold value, so as to generate the initial feature combination.
Further, noise data that breaks through the threshold value in the initial audio data may also be deleted. Noise data breaking through the threshold value is data whose degree of distinction is lower than the distinguishing degree threshold, that is, invalid audio data; data corresponding to the initial feature combinations is data whose degree of distinction exceeds the distinguishing degree threshold, that is, initial audio data. In this scheme, the distinguishing degree threshold may be set to 0.8, with valid degrees of distinction ranging from 0.8 to 1: invalid audio data whose degree of distinction falls below the threshold of 0.8 is noise data on which no noise reduction operation can be performed, is not valid audio data either, and is deleted, while initial audio data whose degree of distinction exceeds the threshold of 0.8 is subjected to noise reduction processing to generate valid audio data.
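The threshold screening can be sketched as a simple filter. The 0.8 threshold comes from this embodiment; the feature-combination names and scores below are illustrative assumptions.

```python
DISCRIMINATION_THRESHOLD = 0.8  # value given in this embodiment

def screen(combinations):
    """Keep feature combinations whose degree of distinction exceeds the
    threshold (the initial feature combinations); combinations below it
    correspond to invalid audio data and are discarded."""
    return [c for c in combinations
            if c["distinction"] > DISCRIMINATION_THRESHOLD]

combos = [
    {"name": "sr+zcr", "distinction": 0.92},
    {"name": "bitrate+energy", "distinction": 0.55},   # invalid, deleted
    {"name": "framerate+zcr", "distinction": 0.81},
]
initial_combinations = screen(combos)
```

Only the two combinations above the threshold survive; the low-scoring one is dropped before any noise reduction work is spent on its audio data.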
In the above step, the server compares the discrimination degrees of the feature combinations with the discrimination degree threshold value respectively, and obtains the feature combination corresponding to the discrimination degree exceeding the discrimination degree threshold value, so as to generate the initial feature combination. Because the comparison of the distinguishing degree and the distinguishing degree threshold value corresponding to each characteristic combination is considered, invalid data in the audio data to be processed can be deleted, and when the initial audio data needing noise reduction processing is acquired, the screening workload is reduced, and the working efficiency is improved.
In one embodiment, there is provided a method for screening an initial feature set by using a preset evaluation index to obtain a usable feature set, including:
The server acquires a preset evaluation index; the preset evaluation index comprises an AUC value, an accuracy rate and a recall rate; screening the initial feature combination according to the AUC value, the accuracy and the recall rate; and obtaining an initial feature combination meeting the requirements, and generating a usable feature combination.
Specifically, the AUC value is the Area Under the ROC Curve (Area Under the Curve) and ranges between 0.5 and 1. Accuracy (Precision) refers to the ratio of the number of samples correctly classified by the classifier to the total number of samples for a given test dataset, i.e., the accuracy on the test dataset when the loss function is the 0-1 loss. It may be formulated as: accuracy = the number of relevant files retrieved by the system / all files retrieved by the system. Recall is a measure of coverage: of all the actual positive examples, it measures how many are classified as positive. It may be formulated as: recall = the number of relevant files retrieved by the system / all relevant files in the system.
Further, the server acquires a corresponding relation between the AUC value and the initial feature combination, and screens the initial feature combination according to the preset AUC value to acquire the initial feature combination which accords with the preset AUC value.
The server can set the AUC value to 0.8 according to the value range of the AUC value, and screen the initial feature combination by using the AUC value of 0.8 to obtain the initial feature combination conforming to the AUC value. The server acquires the corresponding relation between the accuracy and the initial feature combination, screens the initial feature combination according to the preset accuracy, and acquires the initial feature combination which accords with the preset accuracy. The server acquires the corresponding relation between the recall rate and the initial feature combination, and screens the initial feature combination according to the preset recall rate to obtain the initial feature combination conforming to the preset recall rate. Finally, the server generates available feature combinations according to the initial feature combinations which accord with the preset evaluation index AUC value, accuracy and recall rate.
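The three-way screening can be sketched as below. The AUC cut-off of 0.8 follows the embodiment; the precision and recall cut-offs, the candidate names, and the confusion counts are assumed values for illustration.

```python
def precision(tp, fp):
    """Precision = true positives / all retrieved positives."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Recall = true positives / all actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0

# Each initial feature combination carries evaluation results obtained on
# a test set (counts here are illustrative).
candidates = [
    {"name": "sr+zcr",        "auc": 0.86, "tp": 90, "fp": 5,  "fn": 10},
    {"name": "framerate+zcr", "auc": 0.74, "tp": 70, "fp": 30, "fn": 30},
]

usable = [c for c in candidates
          if c["auc"] >= 0.8                       # AUC screen (embodiment)
          and precision(c["tp"], c["fp"]) >= 0.9   # assumed cut-off
          and recall(c["tp"], c["fn"]) >= 0.85]    # assumed cut-off
```

Only combinations passing all three screens become available feature combinations; the second candidate is rejected at the AUC step alone.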
In the above steps, the server screens the initial feature combination according to the preset evaluation index AUC value, accuracy and recall rate to generate an available feature combination, screens the feature combination again, and further improves the acquisition efficiency of the initial audio data.
In one embodiment, there is provided a method for generating noise-reduced speech data by performing noise reduction processing on first initial audio data based on a deep learning noise reduction model, including:
The server performs slicing processing on the first initial audio data according to a preset length; generating a to-be-processed voiceprint map of the first initial audio data according to the sliced first initial audio data, and extracting to-be-processed voiceprint parameters of the first initial audio data from the to-be-processed voiceprint map; inputting the voice print parameters to be processed into the deep learning noise reduction model to obtain noise reduced voice data.
Specifically, the server obtains the to-be-processed voiceprint parameters of the first initial audio data to be subjected to noise reduction processing, inputs the to-be-processed voiceprint parameters into the deep learning noise reduction model, and matches the second voiceprint parameters with the to-be-processed voiceprint parameters to obtain the first initial audio data corresponding to the to-be-processed voiceprint parameters that accord with the second voiceprint parameters; the server then performs noise reduction processing on that first initial audio data by utilizing the first voiceprint parameters, so as to obtain the noise-reduced voice data.
In the above steps, the server performs slicing processing on the first initial audio data according to the predetermined length, generates a to-be-processed voiceprint map of the first initial audio data according to the sliced first initial audio data, extracts to-be-processed voiceprint parameters of the first initial audio data from the to-be-processed voiceprint map, and inputs the to-be-processed voiceprint parameters into the deep learning noise reduction model to obtain noise-reduced voice data. Noise reduction processing is carried out on the first initial audio data based on the distinction degree by utilizing the deep learning noise reduction model, so that the noise reduction effect of the voice data is improved.
In one embodiment, a method for noise reduction of voice data is provided, further comprising:
The server obtains the data types corresponding to the feature combinations according to the corresponding relation between the feature combinations and the data types; the data types include a digital type, a byte type, and a text type; acquiring a data processing mode corresponding to the data type according to the corresponding relation between the data type and the data processing mode; the data processing mode comprises judgment processing, assignment processing and statement processing; and respectively carrying out data processing on the audio data to be processed corresponding to each characteristic combination according to each data processing mode.
Specifically, for the digital initial data, a judgment process is executed to obtain a preset value range, the preset value range and the value of the digital initial data are compared, whether the value of the digital initial data accords with the preset value range is judged, the digital initial data which accords with the preset value range is extracted, noise data in the digital initial data is deleted, and digital available data is generated.
And executing assignment processing on the byte-type initial data, judging whether the value of the byte-type initial data accords with a preset value, assigning the preset value to the corresponding byte-type initial data when the value of the byte-type initial data does not accord with the preset value, deleting noise data in the assigned byte-type initial data, and generating byte-type available data.
And executing declaration processing on the text type initial data, acquiring components of the text type initial data, comparing the components with preset components, and when the components of the text type initial data are inconsistent with the preset components, declaring the text type initial data as the preset components, deleting noise data in the text type initial data, and generating the text type available data.
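The type-dependent preprocessing above amounts to a dispatch on the data type. The sketch below follows that scheme; the preset range, preset byte value, and preset text components are hypothetical placeholders, not values defined by this embodiment.

```python
def judge_numeric(value, lo, hi):
    """Judgment processing for digital data: keep values inside the preset
    range, return None (to be deleted as noise) for values outside it."""
    return value if lo <= value <= hi else None

def assign_byte(value, preset):
    """Assignment processing for byte data: replace a non-conforming value
    with the preset value."""
    return value if value == preset else preset

def declare_text(value, preset_components):
    """Declaration processing for text data: re-declare the text as the
    preset components when its components do not match them."""
    return value if value == preset_components else preset_components

# Correspondence between data type and data processing mode.
HANDLERS = {
    "digital": lambda v: judge_numeric(v, 0, 100),     # assumed range
    "byte":    lambda v: assign_byte(v, b"\x00"),      # assumed preset byte
    "text":    lambda v: declare_text(v, "header;body"),  # assumed components
}

def preprocess(data_type, value):
    """Apply the processing mode that corresponds to the data type."""
    return HANDLERS[data_type](value)
```

For example, `preprocess("digital", 200)` falls outside the assumed range and is flagged for deletion, while conforming values pass through unchanged.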
In the above steps, before the server performs the degree of discrimination calculation of each feature combination, the server performs corresponding data preprocessing according to different types of audio data to be processed, so that the accuracy of the degree of discrimination calculation of each subsequent feature combination is improved.
It should be understood that, although the steps in the flowcharts of FIGS. 2-3 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and the order of execution of these sub-steps or stages is not necessarily sequential; they may be performed in turns or alternately with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a voice data noise reduction apparatus, including: a receiving module 402, a discrimination computation module 404, an initial feature combination obtaining module 406, an available feature combination obtaining module 408, an initial audio data generating module 410, and a noise reduction module 412, wherein:
a receiving module 402, configured to receive a noise reduction request of audio data to be processed sent by a terminal, and obtain the audio data to be processed;
The differentiating degree calculating module 404 is configured to obtain a feature combination corresponding to the audio data to be processed, obtain an association relationship between features in the feature combination, and calculate a differentiating degree of each feature combination according to the feature corresponding to the feature combination and the association relationship between features.
An initial feature combination obtaining module 406, configured to screen each feature combination according to a preset discrimination threshold to obtain an initial feature combination;
the available feature combination obtaining module 408 is configured to screen the initial feature combination by using a preset evaluation index to obtain an available feature combination that meets the preset evaluation index;
an initial audio data generating module 410, configured to obtain audio data to be processed corresponding to the available feature combinations, and generate first initial audio data based on the distinction;
the noise reduction module 412 is configured to perform noise reduction processing on the first initial audio data based on the deep learning noise reduction model, and generate noise-reduced speech data.
According to the voice data noise reduction device, the feature combination corresponding to the audio data to be processed is traversed by utilizing the preset distinguishing degree threshold value, the feature combination conforming to the distinguishing degree threshold value is obtained, the feature combination conforming to the distinguishing degree threshold value is screened by utilizing the preset evaluation index, the available feature combination is obtained, and the reliability of distinguishing the voice data from the noise data is enhanced. And carrying out noise reduction processing on the first initial audio data based on the distinction degree by using a deep learning noise reduction model to obtain noise-reduced voice data. On the basis of improving the distinction degree of voice data and noise data, the trained deep learning noise reduction model is utilized, so that the noise reduction processing of the voice data is realized rapidly and efficiently, and the noise reduction effect of the voice data is further improved.
In one embodiment, a deep learning noise reduction model training module is provided that is further configured to:
Acquiring effective audio data corresponding to the feature combination which does not exceed the discrimination threshold and second initial audio data corresponding to the effective audio data from the training sample; slicing the effective audio data and the second initial audio data according to a preset length; generating a first voiceprint map of the effective audio data according to the sliced effective audio data, and extracting first voiceprint parameters of the effective audio data from the first voiceprint map; generating a second voiceprint map of the second initial audio data according to the sliced second initial audio data, and extracting second voiceprint parameters of the second initial audio data from the second voiceprint map; and taking the second voiceprint parameters of the second initial audio data as the input of the deep learning model, taking the first voiceprint parameters of the effective audio data at the corresponding moment as the output of the deep learning model, and training the deep learning model to obtain the deep learning noise reduction model.
According to the deep learning noise reduction model training module, the second voiceprint parameters of the second initial audio data are used as the input of the deep learning model, the first voiceprint parameters of the effective audio data at corresponding moments are used as the output of the deep learning model, training of the deep learning model is achieved, the deep learning noise reduction model applicable to the voice data is obtained, and noise reduction effect on the voice data is improved.
In one embodiment, a discrimination computation module is provided that is further configured to:
Acquiring characteristics corresponding to the audio data to be processed and association relations among the characteristics; generating a feature combination corresponding to the audio data to be processed according to the features and the association relation among the features; and respectively calculating the distinguishing degree of each feature combination according to the corresponding feature of each feature combination and the association relation between the features.
The discrimination calculation module realizes the respective calculation of the respective discrimination for different feature combinations, so that the effective audio data in the voice data and the noise data to be subjected to noise reduction are discriminated more quickly, and the working efficiency is improved.
In one embodiment, an initial feature combination acquisition module is provided that is further configured to:
comparing the distinguishing degree of each feature combination with a distinguishing degree threshold value respectively; and acquiring the feature combination corresponding to the degree of distinction exceeding the threshold value of the degree of distinction, and generating an initial feature combination.
According to the initial feature combination acquisition module, invalid data in the audio data to be processed can be deleted in consideration of comparison of the distinguishing degree and the distinguishing degree threshold value corresponding to each feature combination, and when the initial audio data needing noise reduction processing is acquired, the screening workload is reduced, and the working efficiency is improved.
In one embodiment, a usable feature acquisition module is provided that is further configured to:
Acquiring a preset evaluation index; the preset evaluation index comprises an AUC value, an accuracy rate and a recall rate; screening the initial feature combination according to the AUC value, the accuracy and the recall rate; and obtaining an initial feature combination meeting the requirements, and generating a usable feature combination.
According to the available feature acquisition module, the server screens the initial feature combination according to the preset evaluation index AUC value, the accuracy and the recall rate to generate the available feature combination, screens the feature combination again, and further improves the acquisition efficiency of the initial audio data.
In one embodiment, a noise reduction module is provided, further configured to:
Slicing the first initial audio data according to a predetermined length; generating a to-be-processed voiceprint map of the first initial audio data according to the sliced first initial audio data, and extracting to-be-processed voiceprint parameters of the first initial audio data from the to-be-processed voiceprint map; inputting the voice print parameters to be processed into the deep learning noise reduction model to obtain noise reduced voice data.
According to the noise reduction module, the noise reduction processing is carried out on the first initial audio data based on the distinction degree by utilizing the deep learning noise reduction model, so that the noise reduction effect of the voice data is improved.
In one embodiment, a data processing module is provided, further configured to:
Respectively acquiring data types corresponding to the feature combinations according to the corresponding relation between the feature combinations and the data types; the data types include a digital type, a byte type, and a text type; acquiring a data processing mode corresponding to the data type according to the corresponding relation between the data type and the data processing mode; the data processing mode comprises judgment processing, assignment processing and statement processing; and respectively carrying out data processing on the audio data to be processed corresponding to each characteristic combination according to each data processing mode.
According to the data processing module, the server performs corresponding data preprocessing according to different types of the audio data to be processed before performing the degree of distinction calculation of each feature combination, so that the accuracy of the degree of distinction calculation of each subsequent feature combination is improved.
For specific limitations of the voice data noise reduction device, reference may be made to the above limitations of the voice data noise reduction method, and no further description is given here. Each of the above modules in the voice data noise reduction apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware in, or be independent of, a processor in the computer device, or may be stored in software in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing speech data noise reduction data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of noise reduction of speech data.
It will be appreciated by those skilled in the art that the structure shown in FIG. 5 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the respective method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-volatile computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synclink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination of them that contains no contradiction should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the application; although they are described in detail, they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is determined by the appended claims.

Claims (10)

1. A method of noise reduction of speech data, the method comprising:
receiving a noise reduction request of audio data to be processed, which is sent by a terminal, and acquiring the audio data to be processed;
acquiring a feature combination corresponding to the audio data to be processed, acquiring an association relationship between the features in the feature combination and a value corresponding to each feature, and calculating the distinguishing degree of each feature combination of different audio data to be processed according to the features corresponding to the feature combination, the value corresponding to each feature, and the association relationship between the features; wherein the features of the audio data to be processed comprise: sampling frequency, bit rate, number of channels, frame rate, zero-crossing rate, and short-time energy; different features respectively correspond to different values, and different features corresponding to different audio data to be processed generate different feature combinations;
screening each feature combination according to a preset distinguishing degree threshold value to obtain an initial feature combination;
screening the initial feature combinations by using a preset evaluation index to obtain available feature combinations conforming to the preset evaluation index;
acquiring the audio data to be processed corresponding to the available feature combinations, and generating first initial audio data based on the distinguishing degree;
noise reduction processing is carried out on the first initial audio data based on a deep learning noise reduction model, and noise-reduced voice data are generated;
The method for obtaining the deep learning noise reduction model comprises the following steps: acquiring, from a training sample, effective audio data corresponding to a feature combination that does not exceed the distinguishing degree threshold and second initial audio data corresponding to the effective audio data; slicing the effective audio data and the second initial audio data according to a preset length; generating a first voiceprint map of the effective audio data from the sliced effective audio data, and extracting first voiceprint parameters of the effective audio data from the first voiceprint map; generating a second voiceprint map of the second initial audio data from the sliced second initial audio data, and extracting second voiceprint parameters of the second initial audio data from the second voiceprint map; and training a deep learning model, taking the second voiceprint parameters of the second initial audio data as its input and the first voiceprint parameters of the effective audio data at the corresponding moments as its output, to obtain the deep learning noise reduction model;
The performing noise reduction processing on the first initial audio data based on the deep learning noise reduction model to generate noise-reduced voice data comprises: slicing the first initial audio data according to a preset length; generating a to-be-processed voiceprint map of the first initial audio data from the sliced first initial audio data, and extracting to-be-processed voiceprint parameters of the first initial audio data from the to-be-processed voiceprint map; and inputting the to-be-processed voiceprint parameters into the deep learning noise reduction model to obtain the noise-reduced voice data;
wherein the first initial audio data to be noise-reduced is acquired and input into the deep learning noise reduction model; the second voiceprint parameters are matched with the first initial audio data corresponding thereto, and noise reduction is performed on the first initial audio data corresponding to the second voiceprint parameters by using the first voiceprint parameters, to obtain the noise-reduced voice data.
2. The method according to claim 1, wherein acquiring the feature combination corresponding to the audio data to be processed, acquiring the association relationship between the features in the feature combination, and calculating the distinguishing degree of each feature combination according to the features corresponding to the feature combination and the association relationship between the features comprises:
Acquiring the corresponding characteristics of the audio data to be processed and the association relation among the characteristics;
generating a feature combination corresponding to the audio data to be processed according to the features and the association relation among the features;
And respectively calculating the distinguishing degree of each feature combination according to the feature corresponding to each feature combination and the association relation between the features.
3. The method according to claim 1 or 2, wherein screening each of the feature combinations according to the preset distinguishing degree threshold to obtain the initial feature combination comprises:
comparing the distinguishing degree of each feature combination with the distinguishing degree threshold respectively;
and acquiring the feature combinations whose distinguishing degree exceeds the distinguishing degree threshold, and generating an initial feature combination.
4. The method according to claim 1 or 2, wherein screening the initial feature combinations by using a preset evaluation index to obtain available feature combinations that conform to the preset evaluation index comprises:
acquiring a preset evaluation index; the preset evaluation index comprises an AUC value, an accuracy rate, and a recall rate;
screening the initial feature combinations according to the AUC value, the accuracy rate, and the recall rate;
and obtaining an initial feature combination meeting the requirements, and generating a usable feature combination.
5. The method of claim 1, further comprising, prior to the step of calculating the distinguishing degree of each of the feature combinations:
respectively acquiring data types corresponding to the feature combinations according to the correspondence between the feature combinations and the data types; the data types comprise a digital type, a byte type, and a text type;
acquiring a data processing mode corresponding to each data type according to the correspondence between the data type and the data processing mode; the data processing modes comprise judgment processing, assignment processing, and statement processing;
and respectively carrying out data processing on the audio data to be processed corresponding to each characteristic combination according to each data processing mode.
6. A voice data noise reduction device, the device comprising:
the receiving module is used for receiving a noise reduction request of the audio data to be processed, which is sent by the terminal, and acquiring the audio data to be processed;
The distinguishing degree calculation module is configured to acquire the feature combination corresponding to the audio data to be processed, acquire the association relationship between the features in the feature combination and the value corresponding to each feature, and calculate the distinguishing degree of each feature combination of different audio data to be processed according to the features corresponding to the feature combination, the value corresponding to each feature, and the association relationship between the features; wherein the features of the audio data to be processed comprise: sampling frequency, bit rate, number of channels, frame rate, zero-crossing rate, and short-time energy; different features respectively correspond to different values, and different features corresponding to different audio data to be processed generate different feature combinations;
the initial feature combination acquisition module is used for screening each feature combination according to a preset distinguishing degree threshold value to acquire an initial feature combination;
the available feature combination acquisition module is used for screening the initial feature combination by utilizing a preset evaluation index to obtain an available feature combination conforming to the preset evaluation index;
the initial audio data generation module is used for acquiring the audio data to be processed corresponding to the available feature combinations and generating first initial audio data based on the distinction;
the noise reduction module is used for carrying out noise reduction processing on the first initial audio data based on the deep learning noise reduction model to generate noise-reduced voice data;
The device further comprises a deep learning noise reduction model training module configured to: acquire, from a training sample, effective audio data corresponding to a feature combination that does not exceed the distinguishing degree threshold and second initial audio data corresponding to the effective audio data; slice the effective audio data and the second initial audio data according to a preset length; generate a first voiceprint map of the effective audio data from the sliced effective audio data, and extract first voiceprint parameters of the effective audio data from the first voiceprint map; generate a second voiceprint map of the second initial audio data from the sliced second initial audio data, and extract second voiceprint parameters of the second initial audio data from the second voiceprint map; and train a deep learning model, taking the second voiceprint parameters of the second initial audio data as its input and the first voiceprint parameters of the effective audio data at the corresponding moments as its output, to obtain the deep learning noise reduction model;
The noise reduction module is further configured to: slice the first initial audio data according to a preset length; generate a to-be-processed voiceprint map of the first initial audio data from the sliced first initial audio data, and extract to-be-processed voiceprint parameters of the first initial audio data from the to-be-processed voiceprint map; and input the to-be-processed voiceprint parameters into the deep learning noise reduction model to obtain the noise-reduced voice data; wherein the first initial audio data to be noise-reduced is acquired and input into the deep learning noise reduction model, the second voiceprint parameters are matched with the first initial audio data corresponding thereto, and noise reduction is performed on the first initial audio data corresponding to the second voiceprint parameters by using the first voiceprint parameters, to obtain the noise-reduced voice data.
7. The apparatus of claim 6, wherein the discrimination computation module is further configured to:
acquiring the corresponding characteristics of the audio data to be processed and the association relation among the characteristics; generating a feature combination corresponding to the audio data to be processed according to the features and the association relation among the features; and respectively calculating the distinguishing degree of each feature combination according to the feature corresponding to each feature combination and the association relation between the features.
8. The apparatus of claim 6 or 7, wherein the initial feature combination acquisition module is further configured to:
comparing the distinguishing degree of each feature combination with the distinguishing degree threshold respectively; and acquiring the feature combinations whose distinguishing degree exceeds the distinguishing degree threshold, and generating an initial feature combination.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
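The pipeline recited in claims 1 and 6 — slice the audio by a preset length, extract voiceprint (spectrogram) parameters from each slice, then train a model that maps the noisy parameters to the clean ones — can be sketched as follows. This is a minimal illustration only, not the patented implementation: all function names are invented for the sketch, the per-frame magnitude spectrum stands in for the claimed voiceprint parameters, and a least-squares linear map stands in for the deep learning noise reduction model.

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_audio(samples, frame_len):
    """Slice a 1-D signal into fixed-length frames of a preset length (claim 1)."""
    n = len(samples) // frame_len
    return samples[: n * frame_len].reshape(n, frame_len)

def spectrogram_params(frames):
    """Per-frame magnitude spectrum, a stand-in for the claimed voiceprint parameters."""
    return np.abs(np.fft.rfft(frames, axis=1))

def screen_by_discrimination(scores, threshold):
    """Keep indices of feature combinations whose distinguishing degree exceeds the threshold (claim 3)."""
    return [i for i, s in enumerate(scores) if s > threshold]

def fit_linear_denoiser(noisy_spec, clean_spec):
    """Least-squares map from noisy to clean spectra; a toy stand-in for the deep model."""
    W, *_ = np.linalg.lstsq(noisy_spec, clean_spec, rcond=None)
    return W

# Synthetic "training sample": a clean tone (effective audio data) plus additive
# noise (second initial audio data).
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

frames_clean = slice_audio(clean, 256)          # slice by preset length
frames_noisy = slice_audio(noisy, 256)
spec_clean = spectrogram_params(frames_clean)   # "first voiceprint parameters"
spec_noisy = spectrogram_params(frames_noisy)   # "second voiceprint parameters"

# Train: noisy parameters in, clean parameters of the corresponding moments out.
W = fit_linear_denoiser(spec_noisy, spec_clean)
err_raw = np.linalg.norm(spec_noisy - spec_clean)
err_fit = np.linalg.norm(spec_noisy @ W - spec_clean)
```

On this toy data the fitted map reduces the spectral error relative to the raw noisy spectra, mirroring the claimed training objective; a real implementation would replace the linear map with a deep network trained on many slices.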
CN201910650447.6A 2019-07-18 2019-07-18 Voice data noise reduction method, device, computer equipment and storage medium Active CN110335616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910650447.6A CN110335616B (en) 2019-07-18 2019-07-18 Voice data noise reduction method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110335616A CN110335616A (en) 2019-10-15
CN110335616B true CN110335616B (en) 2024-10-15

Family

ID=68146065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910650447.6A Active CN110335616B (en) 2019-07-18 2019-07-18 Voice data noise reduction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110335616B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI763207B (en) * 2020-12-25 2022-05-01 宏碁股份有限公司 Method and apparatus for audio signal processing evaluation
CN114299981B (en) * 2021-12-29 2024-07-23 中国电信股份有限公司 Audio processing method, device, storage medium and equipment

Citations (2)

Publication number Priority date Publication date Assignee Title
CN107068161A (en) * 2017-04-14 2017-08-18 百度在线网络技术(北京)有限公司 Voice de-noising method, device and computer equipment based on artificial intelligence
CN109471853A (en) * 2018-09-18 2019-03-15 平安科技(深圳)有限公司 Data noise reduction, device, computer equipment and storage medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN111418010B (en) * 2017-12-08 2022-08-19 华为技术有限公司 Multi-microphone noise reduction method and device and terminal equipment
CN108831440A (en) * 2018-04-24 2018-11-16 中国地质大学(武汉) A kind of vocal print noise-reduction method and system based on machine learning and deep learning
CN108670200B (en) * 2018-05-30 2021-06-08 华南理工大学 A method and system for classification and detection of sleep snoring based on deep learning


Also Published As

Publication number Publication date
CN110335616A (en) 2019-10-15

Similar Documents

Publication Publication Date Title
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
EP2763134B1 (en) Method and apparatus for voice recognition
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
US20070129941A1 (en) Preprocessing system and method for reducing FRR in speaking recognition
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
CN108022587B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
WO2021042537A1 (en) Voice recognition authentication method and system
US11437044B2 (en) Information processing apparatus, control method, and program
CN115223584B (en) Audio data processing method, device, equipment and storage medium
CN110335616B (en) Voice data noise reduction method, device, computer equipment and storage medium
CN110689885A (en) Machine-synthesized speech recognition method, device, storage medium and electronic equipment
JP4102745B2 (en) Voice section detection apparatus and method
EP4506940A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN112216285A (en) Multi-person session detection method, system, mobile terminal and storage medium
CN113178196B (en) Audio data extraction method and device, computer equipment and storage medium
CN116013317A (en) Method, device, equipment and readable storage medium for improving audio generation quality
Merzougui et al. Diagnosing Spasmodic Dysphonia with the Power of AI
CN118865970B (en) Speech recognition method, device and robot equipment based on artificial intelligence
JP7287442B2 (en) Information processing device, control method, and program
Howard Speech fundamental period estimation using a neural network
CN119049501B (en) An intelligent authentication method and system based on voiceprint recognition
CN115512683B (en) Speech processing method, device, computer equipment and storage medium
Liao et al. Recurrent Neural Unit With Frequency Attention for Specific Emitter Identification
Ahmed et al. Optimizing speaker identification: a comprehensive study with deep neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant