
CN113921042A - Speech desensitization method, device, electronic device and storage medium - Google Patents


Info

Publication number
CN113921042A
CN113921042A
Authority
CN
China
Prior art keywords
speech
frame
sensitive
voice
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111144335.7A
Other languages
Chinese (zh)
Other versions
CN113921042B (en)
Inventor
曹鹏
吴华鑫
吴江照
潘嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Intelligent Voice Innovation Development Co ltd
Original Assignee
Hefei Intelligent Voice Innovation Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Intelligent Voice Innovation Development Co ltd
Priority to CN202111144335.7A
Publication of CN113921042A
Application granted
Publication of CN113921042B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention provides a speech desensitization method, apparatus, electronic device, and storage medium. The method includes: determining the speech data to be desensitized; inputting the amplitude spectrum of each speech frame of the speech data into a sensitive-speech detection model to obtain the speech mask of each speech frame output by the model, the model being trained on sample general speech together with a general mask for each of its sample speech frames, and on sample sensitive-word speech together with a sensitive mask for each of its sample speech frames; and eliminating sensitive information from the speech data based on the speech mask of each speech frame. By using the sensitive-speech detection model to output speech masks directly from the amplitude spectra of the input speech frames, sensitive words are located and desensitized, overcoming the problems of easily leaked sensitive information, over-suppression of speech, and low efficiency and recognition rates, and achieving real-time, accurate speech desensitization.

Description

Speech desensitization method, device, electronic device and storage medium

Technical Field

The present invention relates to the field of artificial intelligence, and in particular to a speech desensitization method, apparatus, electronic device, and storage medium.

Background

Real-time voice communication is increasingly common, for example in telephone calls, voice messaging tools, and online conferences. Such communication often contains a large amount of sensitive information, such as personal ID numbers, names, addresses, prices, and registration information. This sensitive information must therefore be masked to protect the security of the speech. Existing masking methods mainly convert speech to text, detect sensitive words in the text, replace the sensitive words with desensitized content, and synthesize the replaced content back into speech for the receiver. However, this approach cannot handle real-time scenarios, tends to leak sensitive information or over-suppress speech, and suffers from a low recognition rate.

Summary of the Invention

The present invention provides a speech desensitization method, apparatus, electronic device, and storage medium to address the defects of the prior art: the inability to handle real-time scenarios, the easy leakage of sensitive information or over-suppression of speech, and a low recognition rate.

The present invention provides a speech desensitization method, comprising:

determining speech data to be desensitized;

inputting the amplitude spectrum of each speech frame of the speech data into a sensitive-speech detection model to obtain the speech mask of each speech frame output by the sensitive-speech detection model, the sensitive-speech detection model being trained on sample general speech together with a general mask for each of its sample speech frames, and on sample sensitive-word speech together with a sensitive mask for each of its sample speech frames; and

eliminating sensitive information from the speech data based on the speech mask of each speech frame.

According to a speech desensitization method provided by the present invention, inputting the amplitude spectrum of each speech frame of the speech data into the sensitive-speech detection model to obtain the speech mask of each speech frame output by the model comprises:

inputting the amplitude spectra of the speech frames of the speech data into the sensitive-speech detection model frame by frame, and obtaining the speech masks of the speech frames output by the model frame by frame;

wherein, at any given moment, the input speech frame and the output speech frame differ by a preset number of frames; the input speech frame is the speech frame whose amplitude spectrum is input into the sensitive-speech detection model, and the output speech frame is the speech frame whose speech mask is output from the model.

According to a speech desensitization method provided by the present invention, the frame-by-frame input of the amplitude spectra into the sensitive-speech detection model and the frame-by-frame output of the speech masks comprises:

inputting the amplitude spectra of the speech frames into the sensitive-speech detection model frame by frame, where the model encodes a state vector for each speech frame based on that frame's amplitude spectrum, the amplitude spectra of the preset number of consecutive speech frames following it, and the state vector of the preceding frame, and performs sensitive-speech detection on each frame's state vector, yielding the speech masks output frame by frame.

According to a speech desensitization method provided by the present invention, eliminating sensitive information from the speech data based on the speech mask of each speech frame comprises:

desensitizing the amplitude spectrum of each speech frame of the speech data based on its speech mask, obtaining desensitized amplitude-spectrum data; and

applying an inverse transform to the desensitized amplitude-spectrum data to obtain desensitized speech data.

According to a speech desensitization method provided by the present invention, desensitizing the amplitude spectrum of each speech frame based on its speech mask to obtain desensitized amplitude-spectrum data comprises:

locating sensitive-word speech segments in the speech data based on the speech mask of each speech frame, and determining the desensitization mode of each sensitive-word speech segment;

if the desensitization mode is information desensitization, desensitizing the amplitude spectra of a specified number of speech frames following the sensitive-word speech segment, or the amplitude spectra of the speech frames within the segment together with those of the specified number of frames following it; and

if the desensitization mode is sensitive-word desensitization, desensitizing the amplitude spectra of the speech frames within the sensitive-word speech segment.

According to a speech desensitization method provided by the present invention, locating sensitive-word speech segments in the speech data based on the speech mask of each speech frame comprises:

identifying the speech frames of the speech data whose speech mask is below a preset speech-mask threshold as sensitive-word speech frames; and

treating each run of consecutive sensitive-word speech frames whose length exceeds a preset frame-count threshold as a sensitive-word speech segment.
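These two thresholding steps can be sketched as follows (an illustrative sketch only; the threshold values are assumptions, not taken from the patent, which leaves them as presets):

```python
def locate_sensitive_segments(masks, mask_threshold=0.5, min_frames=3):
    """Return (start, end) frame-index pairs (end exclusive) for each run of
    consecutive frames whose speech mask is below mask_threshold and whose
    length exceeds min_frames, i.e. the sensitive-word speech segments."""
    segments, start = [], None
    for i, m in enumerate(masks):
        if m < mask_threshold:
            if start is None:
                start = i  # a new candidate run of sensitive-word frames begins
        else:
            if start is not None and i - start > min_frames:
                segments.append((start, i))
            start = None
    if start is not None and len(masks) - start > min_frames:
        segments.append((start, len(masks)))  # run extends to the end of the data
    return segments
```

Short runs below the frame-count threshold are discarded, which filters out isolated low-mask frames that are unlikely to be a whole sensitive word.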

According to a speech desensitization method provided by the present invention, determining the desensitization mode of each sensitive-word speech segment specifically comprises:

taking a preset number of speech frames, counted backward from the tail of the sensitive-word speech segment, as the speech segment to be classified;

inputting the speech segment to be classified into a speech classification model to obtain the desensitization mode of the sensitive-word speech segment output by the speech classification model;

the speech classification model being trained on sample sensitive-word speech segments and their desensitization-mode labels.

According to a speech desensitization method provided by the present invention, the sample sensitive-word speech is obtained by adding sample noise to original sensitive-word speech, and the sensitive mask is determined from the amplitude spectrum of the sample noise and the amplitude spectrum of the original sensitive-word speech.
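One plausible reading of this label construction, given that frames whose mask falls below a threshold are later treated as sensitive, is an ideal-ratio-mask style computation over the two amplitude spectra. This is an assumption for illustration; the excerpt does not give the exact formula:

```python
import numpy as np

def sensitive_mask(orig_mag: np.ndarray, noise_mag: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical per-bin label mask: the share of sample-noise energy in the
    noisy mixture, in [0, 1]. Bins dominated by the original sensitive-word
    speech get values near 0, matching the convention that low mask values
    flag sensitive frames."""
    return noise_mag / (orig_mag + noise_mag + eps)
```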

The present invention also provides a speech desensitization apparatus, comprising:

a determination module for determining speech data to be desensitized;

a prediction module for inputting the amplitude spectrum of each speech frame of the speech data into a sensitive-speech detection model to obtain the speech mask of each speech frame output by the model, the sensitive-speech detection model being trained on sample general speech together with a general mask for each of its sample speech frames, and on sample sensitive-word speech together with a sensitive mask for each of its sample speech frames; and

an elimination module for eliminating sensitive information from the speech data based on the speech mask of each speech frame.

The present invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the speech desensitization methods described above.

The present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the speech desensitization methods described above.

The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the speech desensitization methods described above.

In the speech desensitization method, apparatus, electronic device, and storage medium provided by the present invention, the amplitude spectrum of each frame of the speech data to be desensitized is input into a sensitive-speech detection model to obtain the speech mask of each speech frame, and sensitive information in the speech data is then eliminated according to those masks. Because the model outputs speech masks directly from the amplitude spectra to locate and desensitize sensitive words, the round trip between speech and text is avoided and recognition efficiency is improved; the problems of easily leaked sensitive information, over-suppression of speech, and low efficiency and recognition rates are overcome, and real-time, accurate speech desensitization is achieved.

Brief Description of the Drawings

To describe the technical solutions of the present invention or of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below illustrate some embodiments of the present invention; those of ordinary skill in the art may obtain other drawings from them without creative effort.

Fig. 1 is a first schematic flowchart of the speech desensitization method provided by the present invention;

Fig. 2 is a schematic flowchart of obtaining desensitized speech data according to the present invention;

Fig. 3 is a schematic flowchart of the method for determining sensitive speech segments provided by the present invention;

Fig. 4 is a schematic flowchart of determining the desensitization mode of a sensitive-word speech segment according to the present invention;

Fig. 5 is a second schematic flowchart of the speech desensitization method provided by the present invention;

Fig. 6 illustrates the training method of the sensitive-speech detection model provided by the present invention;

Fig. 7 is a schematic diagram of the processing flow of the sensitive-speech detection model provided by the present invention;

Fig. 8 is a schematic structural diagram of the speech desensitization apparatus provided by the present invention;

Fig. 9 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

Existing speech desensitization methods mainly convert the speech data into text and compare the converted text against the sensitive words in a preset sensitive-word lexicon; if the text is found to contain a sensitive word, the speech passage corresponding to the sensitive word is replaced with another speech passage to complete the desensitization.

However, the speech-to-text step in existing methods is inefficient and cannot meet the desensitization requirements of real-time speech, and the subsequent steps depend heavily on it: once the transcription goes wrong, all subsequent desensitization steps fail. Moreover, because the transcription is not tailored to sensitive words, the recognition rate for sensitive words is low. In addition, once a sensitive word is detected and its speech passage is replaced with another, the replacement segment must be aligned with the original speech; aligning speech data via text is error-prone and can leak sensitive information, and to reduce the chance of leakage the speech segments near the sensitive information are often over-suppressed, degrading the actual experience.

How to achieve real-time and accurate speech desensitization is therefore an urgent problem in the art.

Fig. 1 is a first schematic flowchart of the speech desensitization method provided by the present invention. As shown in Fig. 1, an embodiment of the present invention provides a speech desensitization method, comprising:

Step 110: determine the speech data to be desensitized.

Specifically, the speech data to be desensitized may be real-time speech, such as voice calls or online conferences, or non-real-time speech, such as recordings or intercom-style voice messages; the embodiments of the present invention place no restriction on this. Here, desensitization means eliminating sensitive information from the speech data. The sensitive information may itself be the content to be eliminated, for example uncivil or illegal language, or it may serve as positioning information used to locate and eliminate specified information that follows it, for example a password, home address, or identity information; the embodiments of the present invention place no restriction on this either.

Step 120: input the amplitude spectrum of each speech frame of the speech data into a sensitive-speech detection model to obtain the speech mask of each speech frame output by the model; the sensitive-speech detection model is trained on sample general speech together with a general mask for each of its sample speech frames, and on sample sensitive-word speech together with a sensitive mask for each of its sample speech frames.

The conversion from speech to text is prone to errors caused by ambient sound, the speaker's regional accent, or semantics. By contrast, transforming the time-domain speech data into the frequency domain for processing and then restoring the processed frequency-domain signal to the time domain changes only the representation of the speech data and is not prone to such conversion errors. The embodiments of the present invention therefore do not transcribe the speech data into text, but instead transform it into the frequency domain for processing. Further, the amplitude spectrum, as one frequency-domain representation of a speech signal, can serve as a means of speech feature extraction; the embodiments of the present invention therefore detect and locate sensitive information via the amplitude spectra of the speech data.

Specifically, in the embodiments of the present invention, the amplitude spectrum of each speech frame of the speech data may be input into a pre-trained sensitive-speech detection model, which inspects each frame's amplitude spectrum and produces the speech mask corresponding to that frame for use in step 130. It should be noted that the amplitude spectrum of each speech frame is obtained by splitting the speech data into frames with a sliding window and applying a time-frequency transform to each frame. The speech mask output by the model is a label indicating whether the frame contains sensitive information; for example, it may take the binary form "0" or "1", or the form of a probability reflecting how likely the frame is to contain sensitive information. The embodiments of the present invention do not specifically limit this.
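The sliding-window framing and time-frequency transform described here can be sketched as follows (a minimal illustration assuming a Hann window and NumPy's real FFT; the patent does not fix the window, frame length, or transform):

```python
import numpy as np

def frame_magnitude_spectra(samples: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Split a mono signal into overlapping windowed frames and return the
    magnitude (amplitude) spectrum of each frame, shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = samples[i * hop : i * hop + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame))  # one-sided magnitude spectrum
    return spectra
```

Each row of the result is one frame's amplitude spectrum, i.e. one input to the detection model.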

Before step 120 is performed, the sensitive-speech detection model must be trained in advance. During training, sample general speech and sample sensitive-word speech serve as samples, with the general mask of each sample speech frame of the general speech and the sensitive mask of each sample speech frame of the sensitive-word speech serving as labels. Both are sample speech; the difference is that the sample general speech contains no sensitive words, i.e. it is speech that needs no desensitization, so the mask of each of its frames reflects that the frame contains no sensitive information and is here called the general mask. The sample sensitive-word speech contains only sensitive words, i.e. speech that must be silenced in its entirety, so the mask of each of its frames reflects that the frame contains sensitive information and is here called the sensitive mask.

During training, the amplitude spectra of the sample general speech and the sample sensitive-word speech are input into the model being trained, yielding the model's predicted mask for each sample speech frame. The predicted masks for the general-speech frames are then compared against the pre-annotated general masks, and those for the sensitive-word-speech frames against the pre-annotated sensitive masks, giving the training loss; the model parameters are iteratively updated based on this loss. In this process the model learns the correspondence between a sample frame's amplitude spectrum and its mask label, so that the trained sensitive-speech detection model can detect sensitive information from amplitude spectra.
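As an illustration of this supervised setup, a per-frame binary cross-entropy between predicted and annotated masks is one natural choice of training loss (an assumption; the excerpt does not specify the model architecture or loss function):

```python
import numpy as np

def frame_mask_loss(predicted: np.ndarray, labels: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted per-frame masks (probabilities)
    and annotated mask labels (here 1 for a general/clean frame, 0 for a
    sensitive frame, to match the low-mask-means-sensitive convention)."""
    p = np.clip(predicted, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))
```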

Step 130: eliminate sensitive information from the speech data based on the speech mask of each speech frame.

Specifically, the sensitive information is eliminated according to the per-frame speech masks output in step 120. This may be done by desensitizing the amplitude spectrum of each original speech frame according to its speech mask and then restoring the desensitized spectra to speech data, or by using the per-frame masks to locate runs of consecutive sensitive frames forming speech segments and then suppressing each segment as a whole, or suppressing a specified number of frames following it; the embodiments of the present invention place no restriction on this.
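The first variant, scaling each frame's amplitude spectrum by its mask and inverse-transforming back to audio, can be sketched as follows (an illustrative sketch assuming the original per-frame phases are kept and the frames were analyzed with a Hann window at 50% overlap, so plain overlap-add approximately reconstructs the signal; none of these parameters are fixed by the patent):

```python
import numpy as np

def desensitize_spectra(spectra: np.ndarray, phases: np.ndarray, masks: np.ndarray,
                        frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Scale each frame's magnitude spectrum by its speech mask (0 silences a
    sensitive frame, 1 keeps a clean frame untouched), then rebuild audio by
    inverse FFT and overlap-add of the frames."""
    out = np.zeros((len(spectra) - 1) * hop + frame_len)
    for i, (mag, phase, m) in enumerate(zip(spectra, phases, masks)):
        frame = np.fft.irfft(mag * m * np.exp(1j * phase), n=frame_len)
        out[i * hop : i * hop + frame_len] += frame
    return out
```

With an all-zero mask the output is silence; with an all-ones mask the windowed frames are reassembled unchanged.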

The speech desensitization method provided by the embodiments of the present invention applies a sensitive-speech detection model to the amplitude spectra of the input speech frames to locate sensitive information in the speech data and thereby desensitize it. No round trip between speech and text is involved, which avoids the leakage of sensitive information, the over-suppression of speech, and the low efficiency and recognition rate caused by transcription, achieving real-time and accurate speech desensitization.

Based on the above embodiment, step 120 comprises:

inputting the amplitude spectra of the speech frames of the speech data into the sensitive-speech detection model frame by frame, and obtaining the speech masks of the speech frames output by the model frame by frame;

wherein, at any given moment, the input speech frame and the output speech frame differ by a preset number of frames; the input speech frame is the speech frame whose amplitude spectrum is input into the sensitive-speech detection model, and the output speech frame is the speech frame whose speech mask is output from the model.

Considering the need to desensitize real-time speech, streaming processing, with its low latency and high throughput, is introduced into this embodiment of the present invention. Specifically, the amplitude spectrum of each speech frame in the speech data is input into the sensitive speech detection model frame by frame, and the speech mask corresponding to each frame is obtained from the model frame by frame. This frame-by-frame input and output realizes streaming sensitive speech detection on the speech data and guarantees the real-time performance of speech desensitization.

At the same time, in streaming sensitive speech detection the amplitude spectrum of a single frame carries little information, and improving the model's predictions requires information from the amplitude spectra of multiple frames. Therefore, a delay can be set between the model's input and output, so that when the model detects sensitive speech in the current frame it relies not only on the current frame's amplitude spectrum but can also refer to the amplitude spectra of a preset number of frames that follow it.

Specifically, the input and output speech frames at the same moment differ by a preset number of frames. It should be noted that this difference amounts to a delay of the preset number of frames: when a speech frame is input, the sensitive speech detection model outputs the mask of the frame that precedes it by the preset delay. For example, with a preset delay of 2 frames, when the amplitude spectrum of the 3rd speech frame is input, the speech mask of the 1st speech frame is output at the same moment.
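The input/output offset described above can be sketched as a small streaming wrapper. This is a minimal pure-Python illustration, not the patented model: `predict` is a hypothetical stand-in for the sensitive speech detection model, and the 2-frame delay matches the example.

```python
from collections import deque

def stream_with_delay(frames, predict, delay=2):
    """When the frame at index t arrives, emit the mask for frame t - delay.

    `frames` yields per-frame amplitude spectra; `predict` is a hypothetical
    callable taking (scored_frame, lookahead_frames) and returning its mask.
    """
    buf = deque()
    for spec in frames:
        buf.append(spec)
        if len(buf) > delay:
            cur = buf.popleft()            # oldest frame now has `delay` frames of look-ahead
            yield predict(cur, list(buf))  # buf holds exactly its look-ahead frames
```

With `delay=2`, the wrapper emits its first result only once the third frame has been consumed, mirroring the example in which the mask of the 1st frame appears as the 3rd frame is input.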

The method provided by this embodiment of the present invention detects sensitive information in the speech frames of streaming speech data by inputting speech frames frame by frame and outputting detection results frame by frame.

In the sensitive information detection of streaming speech frames, a delay of the preset number of frames is applied, so that the sensitive speech detection model can detect sensitive information in the current frame based on the amplitude spectra of the preset number of frames that follow it, combined with the current frame's amplitude spectrum and the detection result of the preceding frame, improving the reliability of detection.

Based on the above embodiment, in step 120, inputting the amplitude spectrum of each speech frame in the speech data into the sensitive speech detection model frame by frame and obtaining the speech mask of each frame output by the model frame by frame includes:

inputting the amplitude spectrum of each speech frame in the speech data into the sensitive speech detection model frame by frame, the sensitive speech detection model encoding the state vector of each speech frame based on the frame's own amplitude spectrum, the amplitude spectra of the preset number of consecutive frames that follow it, and the state vector of the frame preceding it, and performing sensitive speech detection based on each frame's state vector, to obtain the speech mask of each speech frame output by the model frame by frame.

Specifically, given the temporal nature of speech, the information represented by a speech frame is often related to the information represented by the frames before and after it. When streaming sensitive speech detection is performed with the sensitive speech detection model, the model can realize detection based on a left field of view through its own structure. Model structures such as causal convolution and LSTM (Long Short-Term Memory) networks carry a left field of view: during streaming, the model can encode the state vector of any input frame from the frame's own amplitude spectrum and the state vector of the preceding frame, and then perform sensitive speech detection from the state vectors of the frames.

However, a streaming sensitive speech detection model without a right field of view does not predict the current frame's speech mask well. Therefore, in this embodiment of the present invention, the amplitude spectra of the preset number of consecutive frames following each speech frame serve as the model's right field of view, assisting the model in predicting and outputting the speech mask of each frame.

Specifically, while the amplitude spectra of the speech frames are input into the sensitive speech detection model frame by frame, for any speech frame the model can encode the frame's state vector from the amplitude spectra of the preset number of consecutive frames following it, the frame's own amplitude spectrum, and the state vector of the preceding frame, and then perform sensitive speech detection from each frame's state vector, so that the model outputs each frame's speech mask frame by frame.

It should be noted that the state vector is obtained by the sensitive speech detection model by encoding the amplitude spectrum of a speech frame together with the amplitude spectra of its context frames. When the model processes a frame, it encodes the amplitude spectra of the following preset number of frames (as the model's right field of view), the frame's own amplitude spectrum, and the state vector of the preceding frame (as the model's left field of view) to obtain the frame's state vector, and then performs sensitive speech detection from that state vector to obtain the frame's speech mask. With this processing flow, the speech masks output frame by frame guarantee reliability and accuracy while preserving real-time performance.

For example, with a preset frame number of 2 and the 4th speech frame currently being processed, the sensitive speech detection model takes the two frames after the current frame, i.e., the 5th and 6th speech frames, as its right field of view. It then encodes the state vector of the 3rd speech frame (the frame preceding the current one), the amplitude spectrum of the current frame (the 4th speech frame), and the amplitude spectra of the 5th and 6th speech frames to obtain the state vector of the 4th frame, and from it the 4th frame's speech mask. Here, the model outputs the 4th frame's mask only after the amplitude spectrum of the 6th frame has been input, i.e., with a 2-frame delay.
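The per-frame state recurrence described above, each state built from the previous state, the current spectrum, and the look-ahead spectra, can be sketched with scalar "spectra" and a toy summing encoder. This is purely illustrative and is not the patented network:

```python
def encode_step(state_prev, spec_cur, lookahead_specs):
    """Toy encoder standing in for the learned network: the new state combines
    the previous state (left field of view), the current frame's spectrum, and
    the look-ahead spectra (right field of view)."""
    return state_prev + spec_cur + sum(lookahead_specs)

def stream_states(specs, lookahead=2, state0=0):
    """Emit one state per frame, each computed only once its look-ahead frames
    are available, so frame t is finished `lookahead` frames late."""
    states, state = [], state0
    for t in range(len(specs) - lookahead):
        state = encode_step(state, specs[t], specs[t + 1:t + 1 + lookahead])
        states.append(state)
    return states
```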

Based on the above embodiment, the sensitive speech detection model may adopt a CNN (Convolutional Neural Network) + RNN (Recurrent Neural Network) structure. To balance model performance with the real-time requirements of streaming, the CNN part of the model uses causal convolution, i.e., convolution layers without a right field of view, and the RNN part uses an LSTM structure, which likewise requires no right field of view. The whole model can therefore maintain one-frame-in, one-frame-out streaming.
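The causal-convolution layer named above can be illustrated in a few lines. This is a generic sketch of the layer type, not the patent's network:

```python
def causal_conv1d(x, kernel):
    """1-D causal convolution via left zero-padding: output[t] depends only on
    x[t - k + 1 .. t], i.e. the layer has no right field of view."""
    k = len(kernel)
    padded = [0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]
```

Because the padding is entirely on the left, changing a future sample never changes an earlier output, which is what lets the CNN part keep the one-frame-in, one-frame-out behavior.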

Based on the above embodiment, FIG. 2 is a schematic flowchart of obtaining desensitized speech data provided by the present invention. As shown in FIG. 2, step 130 includes:

Step 131, desensitizing the amplitude spectrum of each speech frame in the speech data based on the frame's speech mask, to obtain desensitized amplitude spectrum data;

Specifically, based on each speech frame's speech mask, the amplitude spectrum of each frame in the speech data is desensitized to obtain desensitized amplitude spectrum data. It should be noted that the desensitization may consist of applying each frame's speech mask to the frame's original amplitude spectrum. For example, an output speech mask of 0 or close to 0 indicates a sensitive speech frame, and multiplying that mask by the frame's original amplitude spectrum yields the amplitude spectrum of the desensitized frame.
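The per-frame mask application reduces to a bin-wise product. A minimal sketch with list-based spectra (the real spectra would be 513-dimensional vectors):

```python
def desensitize_spectrum(frames, masks):
    """Scale every amplitude bin by its mask value: masks near 1 leave ordinary
    speech untouched, masks near 0 suppress sensitive speech."""
    return [[m * a for m, a in zip(mask, spec)]
            for spec, mask in zip(frames, masks)]
```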

However, the output of the sensitive speech detection model will occasionally be wrong, and desensitizing on the basis of a single-frame speech mask can therefore mis-process and eliminate ordinary speech, harming the user experience. The desensitization may instead combine the consecutive frames whose masks indicate sensitive information into a speech segment, and then multiply the segment's speech masks by the segment's original amplitude spectra to obtain the desensitized amplitude spectrum data.

Step 132, inversely transforming the desensitized amplitude spectrum data to obtain desensitized speech data.

Specifically, the desensitized amplitude spectrum data is inversely transformed to obtain the desensitized speech data. It should be noted that the desensitized amplitude spectrum data is converted back into speech data by the inverse short-time Fourier transform, yielding the desensitized speech data.

Based on the above embodiment, step 131 includes:

locating sensitive-word speech segments in the speech data based on each frame's speech mask, and determining the desensitization mode of each sensitive-word speech segment;

if the desensitization mode is information desensitization, desensitizing the amplitude spectra of a specified number of speech frames following the sensitive-word speech segment, or the amplitude spectra of the frames in the segment together with those of the specified number of frames following it;

if the desensitization mode is sensitive-word desensitization, desensitizing the amplitude spectra of the speech frames in the sensitive-word speech segment.
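The two desensitization modes can be sketched as follows. Frames are represented by scalars, and `follow` (the specified number of frames after the segment) as well as the mode names are illustrative assumptions:

```python
def apply_mode(frames, seg_start, seg_end, mode, follow=3):
    """Mute frames according to the desensitization mode: 'word' mutes the
    sensitive-word segment [seg_start, seg_end] itself, 'info' mutes the
    `follow` frames after it (the text also allows muting both)."""
    out = list(frames)
    if mode == "word":
        targets = range(seg_start, seg_end + 1)
    else:  # "info"
        targets = range(seg_end + 1, min(seg_end + 1 + follow, len(out)))
    for t in targets:
        out[t] = 0.0
    return out
```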

Desensitization may consist of directly desensitizing a sensitive word itself, for example muting uncivil language, or of using the sensitive word as an anchor and desensitizing what follows it, for example locating words such as "account" or "password" in the speech and desensitizing the subsequent content. Accordingly, this embodiment of the present invention locates the sensitive word and determines its desensitization mode.

Specifically, whether each speech frame is a sensitive speech frame is judged from the speech mask output by the sensitive speech detection model for that frame, sensitive-word speech segments are obtained according to the condition for locating such segments, and the desensitization mode corresponding to each segment is looked up from the segment's amplitude spectra, the modes being information desensitization and sensitive-word desensitization. If the mode is information desensitization, the amplitude spectra of a specified number of frames following the segment, or of the frames in the segment together with the specified number of frames following it, are desensitized; if the mode is sensitive-word desensitization, the amplitude spectra of the frames in the segment are desensitized. It should be noted that the condition for locating a sensitive-word speech segment may be that the proportion of sensitive frames among several speech frames reaches a preset ratio, or that the number of consecutive sensitive frames reaches a preset number; this embodiment of the present invention places no limitation on it. The desensitization mode of a segment may be determined from a preset correspondence between segment amplitude spectra and desensitization modes, or the segment's amplitude spectra may be input into a desensitization-mode classification model that outputs the mode; this embodiment of the present invention places no limitation on it either.

Based on the above embodiment, FIG. 3 is a schematic flowchart of a method for determining sensitive speech segments provided by the present invention. As shown in FIG. 3, in step 131, locating sensitive-word speech segments in the speech data based on each frame's speech mask includes:

Step 310, taking the speech frames in the speech data whose speech mask is smaller than a preset speech mask threshold as sensitive-word speech frames;

Since a smaller amplitude value in the amplitude spectrum means a lower sound intensity, eliminating sensitive speech from the speech data requires driving the amplitude values of its spectrum to 0 or close to 0. A speech mask threshold is therefore preset to separate ordinary speech from sensitive speech: because sensitive speech must be silenced, its speech mask is necessarily 0 or close to 0, and sensitive-word speech frames can be picked out from the speech frames of the speech data by comparing each frame's speech mask with the preset speech mask threshold.

Specifically, the speech frames in the speech data whose speech mask is smaller than the preset speech mask threshold are taken as sensitive-word speech frames. It should be noted that the value of the speech mask may be the length of the mask vector or the mean of its elements; this embodiment of the present invention places no limitation on it.

Step 320, taking a run of consecutive sensitive-word speech frames whose number exceeds a preset frame number threshold as one sensitive-word speech segment.

Since a single frame carries little information, the sensitive speech detection model can be falsely triggered, while a real sensitive word lasts for a certain duration; a speech segment shorter than the preset number of frames can therefore be treated as a false trigger.

Specifically, a run of consecutive sensitive-word speech frames longer than the preset frame number threshold may be taken as one sensitive-word speech segment. It should be noted that during frame-by-frame processing, when a sensitive-word speech frame is detected after an ordinary speech frame, that frame is taken as the start point, and the next ordinary speech frame detected after it is taken as the end point; if the number of frames in the interval from the start point to the end point (excluding the end point) exceeds the preset frame number threshold, that interval is a sensitive-word speech segment.
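Steps 310 and 320 together amount to thresholding the per-frame mask values and keeping only sufficiently long runs. A minimal sketch with illustrative threshold values:

```python
def find_sensitive_segments(mask_values, mask_thresh=0.5, min_frames=3):
    """Frames whose mask value falls below `mask_thresh` are sensitive-word
    frames (step 310); a run of more than `min_frames` consecutive sensitive
    frames becomes one segment, returned as (start, end_exclusive) (step 320)."""
    segments, start = [], None
    for t, m in enumerate(list(mask_values) + [1.0]):  # sentinel closes a trailing run
        if m < mask_thresh and start is None:
            start = t
        elif m >= mask_thresh and start is not None:
            if t - start > min_frames:
                segments.append((start, t))
            start = None
    return segments
```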

Based on the above embodiment, FIG. 4 is a schematic flowchart of a method for determining the desensitization mode of a sensitive-word speech segment provided by the present invention. As shown in FIG. 4, the method provided by this embodiment of the present invention includes:

Step 410, truncating a preset number of speech frames forward from the tail of the sensitive-word speech segment as the speech segment to be classified;

To preserve the real-time performance of speech desensitization, this embodiment of the present invention truncates a stretch of speech frames forward from the tail of the sensitive-word speech segment as the segment to be classified. At the same time, to ease the training of the speech classification model in step 420, the number of frames in the sample speech segments is fixed, and the length of the segment being predicted is fixed to the same number of frames as the sample segments.

It should be noted that the tail of the sensitive-word speech segment is its last speech frame, and the preset number of truncated frames can be set according to the requirements of the speech classification model in step 420.
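Truncating the fixed-length tail window is a slice. In this sketch `n` stands for the classification model's input size, and the left-padding policy for segments shorter than `n` is an assumption not specified in the text:

```python
def tail_window(segment_frames, n=20):
    """Take the last `n` frames of a sensitive-word segment as the fixed-length
    input to the classification model; segments shorter than `n` are left-padded
    by repeating their first frame (assumed policy)."""
    if len(segment_frames) >= n:
        return segment_frames[-n:]
    pad = [segment_frames[0]] * (n - len(segment_frames))
    return pad + segment_frames
```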

Step 420, inputting the segment to be classified into a speech classification model and obtaining the desensitization mode of the sensitive-word speech segment output by the model;

the speech classification model being trained on sample sensitive-word speech segments and their desensitization mode labels.

Specifically, the segment to be classified obtained in step 410 is input into the speech classification model trained on sample sensitive-word speech segments and their desensitization mode labels, and the model predicts and outputs the desensitization mode of the segment, which later stages use to desensitize the speech data.

In the method provided by this embodiment of the present invention, truncating a preset number of frames forward from the tail of the sensitive-word speech segment for the mode judgment helps guarantee the real-time processing efficiency of the sensitive speech detection model without introducing delay.

Based on the above embodiment, the sample sensitive-word speech is obtained by adding sample noise to the original sensitive-word speech, and the sensitive mask is determined from the amplitude spectrum of the sample noise and the amplitude spectrum of the original sensitive-word speech.

Since sensitive words occur in all kinds of environments, training the sensitive speech model accurately requires simulating their occurrence under noise; this embodiment of the present invention therefore adds noise to the original sensitive-word speech.

For the sample sensitive-word speech obtained by adding noise, since it is no longer pure sensitive-word speech, the corresponding sensitive mask must likewise be determined from the amplitude spectrum of the sample noise and the amplitude spectrum of the original sensitive-word speech.

For example, the sensitive mask can be determined by the following formula:

Mask = N / (S + N)

where S is the amplitude spectrum of the sensitive-word speech and N is the amplitude spectrum of the background noise. The loss function for model training is the MSE (Mean Square Error) loss, used to reduce the error between the predicted speech mask and the true speech mask.

Based on the above embodiment, FIG. 5 is the second schematic flowchart of the speech desensitization method provided by the present invention. As shown in FIG. 5, a speech desensitization method provided by an embodiment of the present invention includes:

Step 510, performing sliding-window framing on the speech to obtain the set of speech frames of the speech data;

Step 520, performing an STFT (Short-Time Fourier Transform) on each speech frame in the set to obtain the amplitude spectrum of each frame;

Step 530, inputting the amplitude spectrum of each frame of the speech into the sensitive speech detection model frame by frame to obtain each frame's speech mask, where every frame of ordinary speech has a mask close to 1, while sensitive words, or speech resembling sensitive words, yield masks with values close to 0, and determining the sensitive-word speech segments according to steps 310 and 320;

Step 540, post-processing the sensitive-word speech segments. This may consist of directly multiplying the speech masks within the sensitive-word segment by the corresponding original STFT amplitude spectra, obtaining amplitude spectra with the sensitive word removed and only the background sound retained, and then transforming back into speech data with the ISTFT (Inverse Short-Time Fourier Transform), which filters out the sensitive word naturally. Alternatively, based on the detected time endpoints before and after the sensitive word, the sensitive word may be processed in a preset way, for example replaced with muted audio, and the complex sensitive information that may follow it may be processed further. For instance, to keep sensitive words short enough not to hurt matching efficiency, short sensitive words can be configured; after such a sensitive word ends, the audio frames covering 1 s after the end frame are replaced with 1 s of silence frames prepared in advance, muting that stretch of audio and thereby actively hiding both the sensitive word and the sensitive information that follows it.
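The 1 s replacement after the end frame translates into a frame count derived from the sample rate and window shift. This sketch assumes the 16 kHz / 512-sample window shift configuration given in step 610 and represents frames as scalars:

```python
def mute_after(frames, end_frame, sr=16000, hop=512, seconds=1.0, silence=0.0):
    """Replace the frames spanning `seconds` of audio after the end frame of a
    sensitive word with prepared silence frames (here the scalar `silence`)."""
    n_mute = int(seconds * sr / hop)   # ~31 frames cover 1 s at 16 kHz with hop 512
    out = list(frames)
    for t in range(end_frame + 1, min(end_frame + 1 + n_mute, len(out))):
        out[t] = silence
    return out
```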

In the speech desensitization method provided by this embodiment of the present invention, a sensitive speech detection model performs sensitive speech detection on the amplitude spectra of the input speech frames so as to locate the sensitive information in the speech data, thereby achieving speech desensitization. Since no conversion between speech and text takes place, the method avoids the problems caused by speech transcription, namely easy leakage of sensitive information, excessive elimination of speech, and low transcription efficiency and recognition rate, and achieves real-time, accurate speech desensitization.

Based on the above embodiment, FIG. 6 shows the training method of the sensitive speech detection model provided by the present invention. As shown in FIG. 6, an embodiment of the present invention provides a training method for a sensitive speech detection model, including:

Step 610, performing sliding-window framing on the original speech and applying the STFT; for example, for speech data at a 16 kHz sampling rate, with a frame window length of 1024 and a window shift of 512, taking the amplitude spectrum of the STFT converts each speech frame into a 513-dimensional frequency-domain feature vector;
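The 513-dimensional figure follows from the one-sided spectrum of a 1024-sample window. A quick sketch of the framing arithmetic, assuming no end padding:

```python
def stft_shape(n_samples, win_len=1024, hop=512):
    """Frame count and amplitude-spectrum dimensionality for sliding-window
    STFT framing: a length-L real frame has L // 2 + 1 one-sided bins
    (1024 -> 513); the frame count assumes no end padding."""
    n_frames = 1 + (n_samples - win_len) // hop
    return n_frames, win_len // 2 + 1
```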

Step 620, inputting the amplitude spectrum and label of each frame of the original speech into the initial sensitive speech detection model frame by frame, and outputting the corresponding speech mask. For ordinary speech data the output mask values are all 1; multiplying such a mask by the original amplitude spectrum does not change its values, i.e., ordinary speech is left unmodified. Sensitive-word speech data is randomly noised to simulate the occurrence of sensitive words in complex environments: the model takes as input the amplitude spectrum of the noised sensitive-word speech, and multiplying the output speech mask by the noised amplitude spectrum yields the amplitude spectrum of the original noise. In other words, the speech mask screens out only the sensitive-word component while preserving the original background sound as far as possible. The speech mask used as the model's output training label is obtained by the following formula:

Mask = N / (S + N)

where S is the amplitude spectrum of the sensitive-word speech and N is the amplitude spectrum of the background noise. The loss function for model training is the MSE loss, used to reduce the error between the predicted speech mask and the true speech mask.
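A per-bin sketch of the label construction; `eps` is an added numerical safeguard that is not part of the formula:

```python
def mask_label(s_spec, n_spec, eps=1e-8):
    """Training label Mask = N / (S + N), bin by bin: multiplying the label by
    the noisy spectrum S + N recovers the noise spectrum N, so the mask removes
    only the sensitive-word component while keeping the background."""
    return [n / (s + n + eps) for s, n in zip(s_spec, n_spec)]
```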

At the same time, to improve the detection performance of the sensitive speech detection model, a delay of several frames is applied to its training labels, which is equivalent to outputting the classification result of a historical frame at the current frame moment. Even though the model itself has no right field of view, the historical frame whose result is being output still effectively has part of a right field of view (the features from that historical frame up to the current frame); for the CNN part of the model, both sides of the field of view are present. FIG. 7 is a schematic diagram of the processing flow of the sensitive speech detection model provided by the present invention, as shown in FIG. 7:

假设时延为2帧，也就是64ms的时延，则在输入第2语音帧的时候，敏感语音检测模型输出第0语音帧，然后经过MSE损失函数进行训练收敛，完成全部帧的训练后进行延时对齐标签序号。Assuming a delay of 2 frames, i.e., a delay of 64 ms, when the 2nd speech frame is input, the sensitive speech detection model outputs the result for the 0th speech frame; training converges under the MSE loss function, and after all frames have been trained, the label indices are delay-aligned.
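The delay alignment above can be sketched as a simple index shift: the target emitted at input frame t is the label of frame t − delay. The padding choice and function name are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

DELAY = 2  # frames of label delay (about 64 ms at 32 ms per frame, per the example)

def delay_align(labels: np.ndarray, delay: int = DELAY) -> np.ndarray:
    # The target at input frame t is the label of frame t - delay;
    # the first `delay` positions are padded with the first label (assumption).
    pad = np.repeat(labels[:1], delay, axis=0)
    return np.concatenate([pad, labels[:-delay]], axis=0)

labels = np.array([0, 1, 2, 3, 4])
aligned = delay_align(labels)  # when frame 2 is input, the target is frame 0's label
```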

下面对本发明提供的语音脱敏装置进行描述，下文描述的语音脱敏装置与上文描述的语音脱敏方法可相互对应参照。The speech desensitization device provided by the present invention is described below; the speech desensitization device described below and the speech desensitization method described above may be referred to in correspondence with each other.

图8是本发明提供的语音脱敏装置的结构示意图,如图8所示,该装置包括:确定模块810,预测模块820,消除模块830。FIG. 8 is a schematic structural diagram of a speech desensitization device provided by the present invention. As shown in FIG. 8 , the device includes: a determination module 810 , a prediction module 820 , and an elimination module 830 .

其中,in,

确定模块810,用于确定待脱敏的语音数据;A determination module 810, configured to determine the voice data to be desensitized;

预测模块820，用于将语音数据中每一语音帧的幅度谱输入至敏感语音检测模型，得到敏感语音检测模型输出的每一语音帧的语音掩码；敏感语音检测模型基于样本通用语音以及其中每一样本语音帧的通用掩码，和样本敏感词语音以及其中每一样本语音帧的敏感掩码训练得到；The prediction module 820 is configured to input the amplitude spectrum of each speech frame in the speech data into the sensitive speech detection model to obtain the speech mask of each speech frame output by the sensitive speech detection model; the sensitive speech detection model is trained based on sample general speech together with the general mask of each of its sample speech frames, and sample sensitive-word speech together with the sensitive mask of each of its sample speech frames;

消除模块830,用于基于每一语音帧的语音掩码,消除语音数据中的敏感信息。The elimination module 830 is configured to eliminate sensitive information in the speech data based on the speech mask of each speech frame.

在本发明实施例中，通过确定模块810，用于确定待脱敏的语音数据；预测模块820，用于将语音数据中每一语音帧的幅度谱输入至敏感语音检测模型，得到敏感语音检测模型输出的每一语音帧的语音掩码；敏感语音检测模型基于样本通用语音以及其中每一样本语音帧的通用掩码，和样本敏感词语音以及其中每一样本语音帧的敏感掩码训练得到；消除模块830，用于基于每一语音帧的语音掩码，消除语音数据中的敏感信息，实现了使用敏感语音检测模型基于输入的语音帧的幅度谱输出语音掩码以定位敏感词，并将该敏感词脱敏，减少了语音和文字互转的过程，提高了识别效率，克服敏感信息易泄露或过度消除语音以及效率和识别率低的问题，实现了语音实时精准的脱敏。In this embodiment of the present invention, the determination module 810 determines the speech data to be desensitized; the prediction module 820 inputs the amplitude spectrum of each speech frame in the speech data into the sensitive speech detection model to obtain the speech mask of each speech frame output by the model, the sensitive speech detection model being trained based on sample general speech with the general mask of each of its sample speech frames and sample sensitive-word speech with the sensitive mask of each of its sample speech frames; and the elimination module 830 eliminates sensitive information in the speech data based on the speech mask of each speech frame. The sensitive speech detection model thus outputs, from the amplitude spectra of the input speech frames, speech masks that locate and desensitize sensitive words. This avoids the speech-to-text-and-back conversion process, improves recognition efficiency, overcomes the problems of sensitive information being easily leaked, excessive removal of speech, and low efficiency and recognition rate, and achieves real-time, accurate speech desensitization.

基于上述任一实施例,预测模块820,包括:Based on any of the above embodiments, the prediction module 820 includes:

流式预测子模块,用于将语音数据中各语音帧的幅度谱逐帧输入至敏感语音检测模型,得到敏感语音检测模型逐帧输出的各语音帧的语音掩码;The streaming prediction submodule is used to input the amplitude spectrum of each speech frame in the speech data to the sensitive speech detection model frame by frame, and obtain the speech mask of each speech frame output by the sensitive speech detection model frame by frame;

其中，同一时刻的输入语音帧和输出语音帧相差预设帧数，输入语音帧为输入敏感语音检测模型的幅度谱对应的语音帧，输出语音帧为从敏感语音检测模型中输出的语音掩码对应的语音帧。The input speech frame and the output speech frame at the same moment differ by a preset number of frames; the input speech frame is the speech frame corresponding to the amplitude spectrum input into the sensitive speech detection model, and the output speech frame is the speech frame corresponding to the speech mask output from the sensitive speech detection model.

基于上述任一实施例，流式预测子模块中将语音数据中各语音帧的幅度谱逐帧输入至敏感语音检测模型，得到敏感语音检测模型逐帧输出的各语音帧的语音掩码，包括:Based on any of the above embodiments, inputting, in the streaming prediction sub-module, the amplitude spectrum of each speech frame in the speech data into the sensitive speech detection model frame by frame to obtain the speech mask of each speech frame output frame by frame includes:

将语音数据中各语音帧的幅度谱逐帧输入至敏感语音检测模型，由敏感语音检测模型基于各语音帧的幅度谱、各语音帧之后连续的预设帧数个语音帧的幅度谱，以及各语音帧之前一帧的状态向量，编码各语音帧的状态向量，并基于各语音帧的状态向量进行敏感语音检测，得到敏感语音检测模型逐帧输出的各语音帧的语音掩码。The amplitude spectrum of each speech frame in the speech data is input into the sensitive speech detection model frame by frame; the sensitive speech detection model encodes the state vector of each speech frame based on the amplitude spectrum of that frame, the amplitude spectra of the preset number of consecutive speech frames following it, and the state vector of the preceding frame, and performs sensitive speech detection based on the state vector of each speech frame to obtain the speech mask of each speech frame output frame by frame.
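The frame-by-frame flow above can be sketched as a streaming loop. This is only a structural sketch: `encode` and `detect` are hypothetical stand-ins for the model's two stages (state encoding and mask prediction), and the toy implementations below are purely illustrative.

```python
import numpy as np

LOOKAHEAD = 2  # preset number of future frames the encoder may see (assumption)

def stream_masks(frames, encode, detect, state_dim):
    # The state of frame t is encoded from frame t, its LOOKAHEAD successor
    # frames, and the previous state vector; detection then runs on the state,
    # so the output naturally lags the input by LOOKAHEAD frames.
    state = np.zeros(state_dim)
    masks = []
    for t in range(len(frames) - LOOKAHEAD):
        future = frames[t + 1 : t + 1 + LOOKAHEAD]
        state = encode(frames[t], future, state)
        masks.append(detect(state))
    return masks

# Toy stand-ins: a leaky-average "encoder" and a threshold "detector".
enc = lambda cur, fut, st: 0.5 * st + 0.5 * cur
det = lambda st: 1.0 if st.mean() < 2.0 else 0.0
frames = [np.full(4, float(i)) for i in range(6)]
masks = stream_masks(frames, enc, det, state_dim=4)
```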

基于上述任一实施例,消除模块830,包括:Based on any of the above embodiments, the elimination module 830 includes:

脱敏子模块,用于基于每一语音帧的语音掩码,对语音数据中每一语音帧的幅度谱进行脱敏处理,得到脱敏后的幅度谱数据;The desensitization sub-module is used to desensitize the amplitude spectrum of each voice frame in the voice data based on the voice mask of each voice frame to obtain desensitized amplitude spectrum data;

逆转换子模块,用于对脱敏后的幅度谱数据进行逆变换,得到脱敏后的语音数据。The inverse transformation sub-module is used to inversely transform the desensitized amplitude spectrum data to obtain desensitized speech data.
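A hedged sketch of the "mask the magnitude spectrum, then inverse transform" step in the two sub-modules above, using SciPy's STFT/ISTFT with the original phase retained. The window, frame length, and sampling rate are illustrative, not the patent's parameters.

```python
import numpy as np
from scipy.signal import stft, istft

def desensitize_wave(wave, masks, fs=16000, nperseg=512):
    # Apply per-bin masks to the magnitude only, keep the original phase,
    # and invert back to a waveform (the "inverse transformation" step).
    _, _, Z = stft(wave, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    _, out = istft(mag * masks * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return out

wave = np.sin(2 * np.pi * 440 * np.arange(2048) / 16000)
_, _, Z = stft(wave, fs=16000, nperseg=512)
ones = np.ones(Z.shape)              # all-ones mask: general speech, unchanged
recon = desensitize_wave(wave, ones)
```

With an all-ones mask the round trip reproduces the input, matching the requirement that general speech is left unmodified.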

基于上述任一实施例,脱敏子模块,包括:Based on any of the above-mentioned embodiments, the desensitization sub-module includes:

定位子模块,用于基于每一语音帧的语音掩码,从语音数据中定位出敏感词语音段,并确定各敏感词语音段的脱敏方式;The positioning sub-module is used to locate the speech segment of the sensitive word from the speech data based on the speech mask of each speech frame, and determine the desensitization method of the speech segment of each sensitive word;

脱敏处理子模块，用于若脱敏方式为信息脱敏，则对敏感词语音段后指定帧数的语音帧的幅度谱，或对敏感词语音段中各语音帧的幅度谱以及敏感词语音段后指定帧数的语音帧的幅度谱进行脱敏处理；The desensitization processing sub-module is configured to, if the desensitization mode is information desensitization, desensitize the amplitude spectra of a specified number of speech frames after the sensitive-word speech segment, or the amplitude spectra of the speech frames in the sensitive-word speech segment together with those of the specified number of speech frames after the segment;

若脱敏方式为敏感词脱敏,则对敏感词语音段中各语音帧的幅度谱进行脱敏处理。If the desensitization method is sensitive word desensitization, the amplitude spectrum of each speech frame in the speech segment of the sensitive word is desensitized.
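The two desensitization modes above can be sketched as zeroing frame ranges of the magnitude spectrogram. `TAIL_FRAMES`, the zeroing strategy, and the "segment plus tail" variant of information desensitization chosen here are assumptions for illustration.

```python
import numpy as np

TAIL_FRAMES = 5  # hypothetical "specified number of frames" after the segment

def apply_mode(mag, seg, mode):
    # mag: (frames, bins) magnitude spectra; seg: (start, end), end exclusive.
    out = mag.copy()
    start, end = seg
    if mode == "word":    # sensitive-word desensitization: the segment only
        out[start:end] = 0.0
    elif mode == "info":  # information desensitization: segment plus tail frames
        out[start:min(end + TAIL_FRAMES, len(out))] = 0.0
    return out

mag = np.ones((10, 4))
word_out = apply_mode(mag, (2, 4), "word")
info_out = apply_mode(mag, (2, 4), "info")
```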

基于上述任一实施例,定位子模块,包括:Based on any of the above embodiments, the positioning sub-module includes:

敏感帧确定子模块,用于确定语音数据中,语音掩码小于预设语音掩码阈值的语音帧作为敏感词语音帧;The sensitive frame determination submodule is used to determine in the voice data, the voice frame whose voice mask is smaller than the preset voice mask threshold as the voice frame of the sensitive word;

敏感语音段确定子模块,用于将帧数大于预设帧数阈值的连续多个敏感词语音帧作为一段敏感词语音段。The sensitive speech segment determination sub-module is used to use a plurality of consecutive sensitive word speech frames whose frame number is greater than the preset frame number threshold as a sensitive word speech segment.
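The two sub-steps above — thresholding the mask per frame, then keeping only runs longer than the frame-count threshold — can be sketched as follows. The threshold values are illustrative assumptions.

```python
import numpy as np

MASK_THRESH = 0.5  # frames with a mask below this count as sensitive-word frames
MIN_FRAMES = 3     # a run must exceed this many frames to form a segment

def locate_segments(frame_masks):
    # Return (start, end) pairs (end exclusive) of sensitive-word segments:
    # consecutive runs of low-mask frames longer than MIN_FRAMES.
    flags = np.asarray(frame_masks) < MASK_THRESH
    segments, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > MIN_FRAMES:
                segments.append((start, i))
            start = None
    if start is not None and len(flags) - start > MIN_FRAMES:
        segments.append((start, len(flags)))
    return segments

masks = [1.0, 0.9, 0.1, 0.2, 0.1, 0.1, 0.9, 0.2, 1.0]
segs = locate_segments(masks)
```

The isolated low-mask frame at index 7 is discarded as too short, which guards against masking speech on a single spurious frame.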

基于上述任一实施例,定位子模块,还包括:Based on any of the foregoing embodiments, the positioning submodule further includes:

确定待分类语音段子模块，用于从敏感词语音段的尾部向前截取预设截取帧数个语音帧，作为待分类语音段；The to-be-classified speech segment determination sub-module is used for intercepting a preset number of speech frames forward from the tail of the sensitive-word speech segment as the speech segment to be classified;

方式预测子模块,用于将待分类语音段输入到语音分类模型,得到语音分类模型输出的敏感词语音段的脱敏方式;The mode prediction submodule is used to input the speech segment to be classified into the speech classification model, and obtain the desensitization method of the speech segment of the sensitive word output by the speech classification model;

分类模型训练模块，用于语音分类模型基于样本敏感词语音段及其脱敏方式标签训练得到。The classification model training module is configured such that the speech classification model is trained based on sample sensitive-word speech segments and their desensitization mode labels.
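A minimal sketch of preparing the classifier input described above: take a preset number of frames ending at the segment tail. The frame count and function name are hypothetical.

```python
CLASSIFY_FRAMES = 4  # preset number of frames intercepted from the segment tail

def tail_clip(frames, seg):
    # Take up to CLASSIFY_FRAMES frames ending at the segment tail; this clip
    # is the speech segment fed to the speech classification model that
    # outputs the desensitization mode for the segment.
    start, end = seg
    return frames[max(start, end - CLASSIFY_FRAMES):end]

frames = list(range(10))          # stand-ins for per-frame features
clip = tail_clip(frames, (2, 8))  # last 4 frames of the segment
```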

基于上述任一实施例,预测模块820中样本敏感词语音基于样本噪声,对原始敏感词语音进行加噪得到,敏感掩码基于样本噪声的幅度谱和原始敏感词语音的幅度谱确定。Based on any of the above embodiments, the sample sensitive word speech in the prediction module 820 is obtained by adding noise to the original sensitive word speech based on sample noise, and the sensitivity mask is determined based on the amplitude spectrum of the sample noise and the amplitude spectrum of the original sensitive word speech.

图9示例了一种电子设备的实体结构示意图，如图9所示，该电子设备可以包括：处理器(processor)910、通信接口(Communications Interface)920、存储器(memory)930和通信总线940，其中，处理器910，通信接口920，存储器930通过通信总线940完成相互间的通信。处理器910可以调用存储器930中的逻辑指令，以执行语音脱敏方法，该方法包括：确定待脱敏的语音数据；将语音数据中每一语音帧的幅度谱输入至敏感语音检测模型，得到敏感语音检测模型输出的每一语音帧的语音掩码；敏感语音检测模型基于样本通用语音以及其中每一样本语音帧的通用掩码，和样本敏感词语音以及其中每一样本语音帧的敏感掩码训练得到；基于每一语音帧的语音掩码，消除语音数据中的敏感信息。FIG. 9 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 9, the electronic device may include a processor 910, a communications interface 920, a memory 930, and a communication bus 940, where the processor 910, the communications interface 920, and the memory 930 communicate with one another through the communication bus 940. The processor 910 can call logic instructions in the memory 930 to execute the speech desensitization method, which includes: determining the speech data to be desensitized; inputting the amplitude spectrum of each speech frame in the speech data into a sensitive speech detection model to obtain the speech mask of each speech frame output by the sensitive speech detection model, the sensitive speech detection model being trained based on sample general speech and the general mask of each of its sample speech frames, and sample sensitive-word speech and the sensitive mask of each of its sample speech frames; and eliminating sensitive information in the speech data based on the speech mask of each speech frame.

此外，上述的存储器930中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。In addition, the above logic instructions in the memory 930 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

另一方面，本发明还提供一种计算机程序产品，所述计算机程序产品包括计算机程序，计算机程序可存储在非暂态计算机可读存储介质上，所述计算机程序被处理器执行时，计算机能够执行上述各方法所提供的语音脱敏方法，该方法包括：确定待脱敏的语音数据；将语音数据中每一语音帧的幅度谱输入至敏感语音检测模型，得到敏感语音检测模型输出的每一语音帧的语音掩码；敏感语音检测模型基于样本通用语音以及其中每一样本语音帧的通用掩码，和样本敏感词语音以及其中每一样本语音帧的敏感掩码训练得到；基于每一语音帧的语音掩码，消除语音数据中的敏感信息。In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program, which can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can execute the speech desensitization method provided by the above methods, the method including: determining the speech data to be desensitized; inputting the amplitude spectrum of each speech frame in the speech data into a sensitive speech detection model to obtain the speech mask of each speech frame output by the sensitive speech detection model, the sensitive speech detection model being trained based on sample general speech and the general mask of each of its sample speech frames, and sample sensitive-word speech and the sensitive mask of each of its sample speech frames; and eliminating sensitive information in the speech data based on the speech mask of each speech frame.

又一方面，本发明还提供一种非暂态计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时实现以执行上述各方法提供的语音脱敏方法，该方法包括：确定待脱敏的语音数据；将语音数据中每一语音帧的幅度谱输入至敏感语音检测模型，得到敏感语音检测模型输出的每一语音帧的语音掩码；敏感语音检测模型基于样本通用语音以及其中每一样本语音帧的通用掩码，和样本敏感词语音以及其中每一样本语音帧的敏感掩码训练得到；基于每一语音帧的语音掩码，消除语音数据中的敏感信息。In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the speech desensitization method provided by the above methods, the method including: determining the speech data to be desensitized; inputting the amplitude spectrum of each speech frame in the speech data into a sensitive speech detection model to obtain the speech mask of each speech frame output by the sensitive speech detection model, the sensitive speech detection model being trained based on sample general speech and the general mask of each of its sample speech frames, and sample sensitive-word speech and the sensitive mask of each of its sample speech frames; and eliminating sensitive information in the speech data based on the speech mask of each speech frame.

以上所描述的装置实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下，即可以理解并实施。The device embodiments described above are only illustrative, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件。基于这样的理解，上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在计算机可读存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solutions, in essence or in the parts contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments or certain parts of the embodiments.

最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1.一种语音脱敏方法，其特征在于，包括:确定待脱敏的语音数据;将所述语音数据中每一语音帧的幅度谱输入至敏感语音检测模型，得到所述敏感语音检测模型输出的每一语音帧的语音掩码;所述敏感语音检测模型基于样本通用语音以及其中每一样本语音帧的通用掩码，和样本敏感词语音以及其中每一样本语音帧的敏感掩码训练得到;基于所述每一语音帧的语音掩码，消除所述语音数据中的敏感信息。1. A speech desensitization method, characterized by comprising: determining speech data to be desensitized; inputting the amplitude spectrum of each speech frame in the speech data into a sensitive speech detection model to obtain a speech mask of each speech frame output by the sensitive speech detection model, the sensitive speech detection model being trained based on sample general speech and the general mask of each of its sample speech frames, and sample sensitive-word speech and the sensitive mask of each of its sample speech frames; and eliminating sensitive information in the speech data based on the speech mask of each speech frame.

2.根据权利要求1所述的语音脱敏方法，其特征在于，所述将所述语音数据中每一语音帧的幅度谱输入至敏感语音检测模型，得到所述敏感语音检测模型输出的每一语音帧的语音掩码，包括:将所述语音数据中各语音帧的幅度谱逐帧输入至敏感语音检测模型，得到所述敏感语音检测模型逐帧输出的各语音帧的语音掩码;其中，同一时刻的输入语音帧和输出语音帧相差预设帧数，所述输入语音帧为输入所述敏感语音检测模型的幅度谱对应的语音帧，所述输出语音帧为从所述敏感语音检测模型中输出的语音掩码对应的语音帧。2. The speech desensitization method according to claim 1, wherein inputting the amplitude spectrum of each speech frame in the speech data into the sensitive speech detection model to obtain the speech mask of each speech frame output by the sensitive speech detection model comprises: inputting the amplitude spectra of the speech frames in the speech data into the sensitive speech detection model frame by frame to obtain the speech masks of the speech frames output frame by frame; wherein the input speech frame and the output speech frame at the same moment differ by a preset number of frames, the input speech frame being the speech frame corresponding to the amplitude spectrum input into the sensitive speech detection model, and the output speech frame being the speech frame corresponding to the speech mask output from the sensitive speech detection model.

3.根据权利要求2所述的语音脱敏方法，其特征在于，所述将所述语音数据中各语音帧的幅度谱逐帧输入至敏感语音检测模型，得到所述敏感语音检测模型逐帧输出的各语音帧的语音掩码，包括:将所述语音数据中各语音帧的幅度谱逐帧输入至敏感语音检测模型，由所述敏感语音检测模型基于各语音帧的幅度谱、各语音帧之后连续的预设帧数个语音帧的幅度谱，以及各语音帧之前一帧的状态向量，编码各语音帧的状态向量，并基于各语音帧的状态向量进行敏感语音检测，得到所述敏感语音检测模型逐帧输出的各语音帧的语音掩码。3. The speech desensitization method according to claim 2, wherein inputting the amplitude spectra of the speech frames into the sensitive speech detection model frame by frame to obtain the speech masks output frame by frame comprises: inputting the amplitude spectra of the speech frames in the speech data into the sensitive speech detection model frame by frame, the sensitive speech detection model encoding the state vector of each speech frame based on the amplitude spectrum of that frame, the amplitude spectra of a preset number of consecutive speech frames following it, and the state vector of the preceding frame, and performing sensitive speech detection based on the state vector of each speech frame to obtain the speech masks of the speech frames output frame by frame.

4.根据权利要求1所述的语音脱敏方法，其特征在于，所述基于所述每一语音帧的语音掩码，消除所述语音数据中的敏感信息，包括:基于所述每一语音帧的语音掩码，对所述语音数据中每一语音帧的幅度谱进行脱敏处理，得到脱敏后的幅度谱数据;对所述脱敏后的幅度谱数据进行逆变换，得到脱敏后的语音数据。4. The speech desensitization method according to claim 1, wherein eliminating sensitive information in the speech data based on the speech mask of each speech frame comprises: desensitizing the amplitude spectrum of each speech frame in the speech data based on the speech mask of each speech frame to obtain desensitized amplitude spectrum data; and inversely transforming the desensitized amplitude spectrum data to obtain desensitized speech data.

5.根据权利要求4所述的语音脱敏方法，所述基于所述每一语音帧的语音掩码，对所述语音数据中每一语音帧的幅度谱进行脱敏处理，得到脱敏后的幅度谱数据，包括:基于所述每一语音帧的语音掩码，从所述语音数据中定位出敏感词语音段，并确定各敏感词语音段的脱敏方式;若所述脱敏方式为信息脱敏，则对所述敏感词语音段后指定帧数的语音帧的幅度谱，或对所述敏感词语音段中各语音帧的幅度谱以及所述敏感词语音段后指定帧数的语音帧的幅度谱进行脱敏处理;若所述脱敏方式为敏感词脱敏，则对所述敏感词语音段中各语音帧的幅度谱进行脱敏处理。5. The speech desensitization method according to claim 4, wherein desensitizing the amplitude spectrum of each speech frame in the speech data based on the speech mask of each speech frame to obtain the desensitized amplitude spectrum data comprises: locating sensitive-word speech segments from the speech data based on the speech mask of each speech frame, and determining a desensitization mode of each sensitive-word speech segment; if the desensitization mode is information desensitization, desensitizing the amplitude spectra of a specified number of speech frames after the sensitive-word speech segment, or the amplitude spectra of the speech frames in the sensitive-word speech segment together with those of the specified number of speech frames after it; and if the desensitization mode is sensitive-word desensitization, desensitizing the amplitude spectra of the speech frames in the sensitive-word speech segment.

6.根据权利要求5所述的语音脱敏方法，所述基于所述每一语音帧的语音掩码，从所述语音数据中定位出敏感词语音段，包括:确定所述语音数据中，语音掩码小于预设语音掩码阈值的语音帧作为敏感词语音帧;将帧数大于预设帧数阈值的连续多个敏感词语音帧作为一段敏感词语音段。6. The speech desensitization method according to claim 5, wherein locating sensitive-word speech segments from the speech data based on the speech mask of each speech frame comprises: determining speech frames in the speech data whose speech mask is smaller than a preset speech mask threshold as sensitive-word speech frames; and taking a plurality of consecutive sensitive-word speech frames whose number is greater than a preset frame number threshold as one sensitive-word speech segment.

7.根据权利要求5所述的语音脱敏方法，所述确定各敏感词语音段的脱敏方式，具体步骤包括:从敏感词语音段的尾部向前截取预设截取帧数个语音帧，作为待分类语音段;将所述待分类语音段输入到语音分类模型，得到所述语音分类模型输出的所述敏感词语音段的脱敏方式;所述语音分类模型基于样本敏感词语音段及其脱敏方式标签训练得到。7. The speech desensitization method according to claim 5, wherein determining the desensitization mode of each sensitive-word speech segment comprises: intercepting a preset number of speech frames forward from the tail of the sensitive-word speech segment as a speech segment to be classified; and inputting the speech segment to be classified into a speech classification model to obtain the desensitization mode of the sensitive-word speech segment output by the speech classification model, the speech classification model being trained based on sample sensitive-word speech segments and their desensitization mode labels.

8.根据权利要求1至7任一项所述的语音脱敏方法，所述样本敏感词语音基于样本噪声，对原始敏感词语音进行加噪得到，所述敏感掩码基于所述样本噪声的幅度谱和所述原始敏感词语音的幅度谱确定。8. The speech desensitization method according to any one of claims 1 to 7, wherein the sample sensitive-word speech is obtained by adding sample noise to original sensitive-word speech, and the sensitive mask is determined based on the amplitude spectrum of the sample noise and the amplitude spectrum of the original sensitive-word speech.

9.一种语音脱敏装置，其特征在于，包括:确定模块，用于确定待脱敏的语音数据;预测模块，用于将所述语音数据中每一语音帧的幅度谱输入至敏感语音检测模型，得到所述敏感语音检测模型输出的每一语音帧的语音掩码;所述敏感语音检测模型基于样本通用语音以及其中每一样本语音帧的通用掩码，和样本敏感词语音以及其中每一样本语音帧的敏感掩码训练得到;消除模块，用于基于所述每一语音帧的语音掩码，消除所述语音数据中的敏感信息。9. A speech desensitization device, characterized by comprising: a determination module configured to determine speech data to be desensitized; a prediction module configured to input the amplitude spectrum of each speech frame in the speech data into a sensitive speech detection model to obtain a speech mask of each speech frame output by the sensitive speech detection model, the sensitive speech detection model being trained based on sample general speech and the general mask of each of its sample speech frames, and sample sensitive-word speech and the sensitive mask of each of its sample speech frames; and an elimination module configured to eliminate sensitive information in the speech data based on the speech mask of each speech frame.

10.一种电子设备，包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序，其特征在于，所述处理器执行所述程序时实现如权利要求1至8任一项所述的语音脱敏方法的步骤。10. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the speech desensitization method according to any one of claims 1 to 8.

11.一种非暂态计算机可读存储介质，其上存储有计算机程序，其特征在于，所述计算机程序被处理器执行时实现如权利要求1至8任一项所述的语音脱敏方法的步骤。11. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the speech desensitization method according to any one of claims 1 to 8.
CN202111144335.7A 2021-09-28 2021-09-28 Voice desensitization method, device, electronic device and storage medium Active CN113921042B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111144335.7A CN113921042B (en) 2021-09-28 2021-09-28 Voice desensitization method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111144335.7A CN113921042B (en) 2021-09-28 2021-09-28 Voice desensitization method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113921042A true CN113921042A (en) 2022-01-11
CN113921042B CN113921042B (en) 2025-02-14

Family

ID=79236682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111144335.7A Active CN113921042B (en) 2021-09-28 2021-09-28 Voice desensitization method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113921042B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937457A (en) * 2022-03-31 2022-08-23 上海淇玥信息技术有限公司 Audio information processing method and device and electronic equipment
CN116168690A (en) * 2023-04-19 2023-05-26 易方信息科技股份有限公司 Method, device, equipment and storage medium for real-time voice desensitization based on deep learning

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4227177A (en) * 1978-04-27 1980-10-07 Dialog Systems, Inc. Continuous speech recognition method
US4384335A (en) * 1978-12-14 1983-05-17 U.S. Philips Corporation Method of and system for determining the pitch in human speech
US5706392A (en) * 1995-06-01 1998-01-06 Rutgers, The State University Of New Jersey Perceptual speech coder and method
CN105006230A (en) * 2015-06-10 2015-10-28 合肥工业大学 Voice sensitive information detecting and filtering method based on unspecified people
US9640194B1 (en) * 2012-10-04 2017-05-02 Knowles Electronics, Llc Noise suppression for speech processing based on machine-learning mask estimation
US9704478B1 (en) * 2013-12-02 2017-07-11 Amazon Technologies, Inc. Audio output masking for improved automatic speech recognition
CN109637520A (en) * 2018-10-16 2019-04-16 平安科技(深圳)有限公司 Sensitive content recognition methods, device, terminal and medium based on speech analysis
CN110188577A (en) * 2019-05-22 2019-08-30 上海上湖信息技术有限公司 A kind of information display method, device, equipment and medium
CN110502924A (en) * 2019-08-23 2019-11-26 恩亿科(北京)数据科技有限公司 A kind of data desensitization method, data desensitization device and readable storage medium storing program for executing
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium
CN113436643A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Method, device, equipment and storage medium for training and applying speech enhancement model

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4227177A (en) * 1978-04-27 1980-10-07 Dialog Systems, Inc. Continuous speech recognition method
US4384335A (en) * 1978-12-14 1983-05-17 U.S. Philips Corporation Method of and system for determining the pitch in human speech
US5706392A (en) * 1995-06-01 1998-01-06 Rutgers, The State University Of New Jersey Perceptual speech coder and method
US9640194B1 (en) * 2012-10-04 2017-05-02 Knowles Electronics, Llc Noise suppression for speech processing based on machine-learning mask estimation
US9704478B1 (en) * 2013-12-02 2017-07-11 Amazon Technologies, Inc. Audio output masking for improved automatic speech recognition
CN105006230A (en) * 2015-06-10 2015-10-28 合肥工业大学 Voice sensitive information detecting and filtering method based on unspecified people
CN109637520A (en) * 2018-10-16 2019-04-16 平安科技(深圳)有限公司 Sensitive content recognition methods, device, terminal and medium based on speech analysis
CN110188577A (en) * 2019-05-22 2019-08-30 上海上湖信息技术有限公司 A kind of information display method, device, equipment and medium
CN110502924A (en) * 2019-08-23 2019-11-26 恩亿科(北京)数据科技有限公司 A kind of data desensitization method, data desensitization device and readable storage medium storing program for executing
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium
CN113436643A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Method, device, equipment and storage medium for training and applying speech enhancement model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937457A (en) * 2022-03-31 2022-08-23 上海淇玥信息技术有限公司 Audio information processing method and device and electronic equipment
CN116168690A (en) * 2023-04-19 2023-05-26 易方信息科技股份有限公司 Method, device, equipment and storage medium for real-time voice desensitization based on deep learning

Also Published As

Publication number Publication date
CN113921042B (en) 2025-02-14

Similar Documents

Publication Publication Date Title
AU2020363882B2 (en) Z-vectors: speaker embeddings from raw audio using sincnet, extended cnn architecture, and in-network augmentation techniques
KR100636317B1 (en) Distributed speech recognition system and method
WO2019106517A1 (en) Automatic blocking of sensitive data contained in an audio stream
EP3807878B1 (en) Deep neural network based speech enhancement
JP2019522810A (en) Neural network based voiceprint information extraction method and apparatus
JP7636088B2 (en) Speech enhancement method, device, equipment, and computer program
CN108877779B (en) Method and device for detecting voice tail point
WO2013138122A2 (en) Automatic realtime speech impairment correction
KR20150115885A (en) Keyboard typing detection and suppression
JP7664330B2 (en) Turn off text echo
CN112750461B (en) Voice communication optimization method and device, electronic equipment and readable storage medium
CN113921042A (en) Speech desensitization method, device, electronic device and storage medium
JP7615510B2 (en) Speech enhancement method, speech enhancement device, electronic device, and computer program
CN113299306B (en) Echo cancellation method, apparatus, electronic device, and computer-readable storage medium
CN115472174A (en) Sound noise reduction method and device, electronic equipment and storage medium
CN118985025A (en) General automatic speech recognition for joint acoustic echo cancellation, speech enhancement and speech separation
CN114360561A (en) A speech enhancement method based on deep neural network technology
CN112750469B (en) Method for detecting music in speech, method for optimizing speech communication and corresponding device
US20250095662A1 (en) Robust spread-spectrum speech watermarking using linear prediction and deep spectral shaping
US20240311474A1 (en) Presentation attacks in reverberant conditions
US12230256B2 (en) Input-aware and input-unaware iterative speech recognition
WO2022068675A1 (en) Speaker speech extraction method and apparatus, storage medium, and electronic device
CN114360572A (en) Speech denoising method, device, electronic device and storage medium
Soltanmohammadi et al. Low-complexity streaming speech super-resolution
CN113257284B (en) Voice activity detection model training, voice activity detection method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant