
CN115831127B - Method, device and storage medium for constructing voiceprint reconstruction model based on speech conversion - Google Patents


Info

Publication number
CN115831127B
CN115831127B (Application CN202310029836.3A)
Authority
CN
China
Prior art keywords
vector
voice
training set
speech
constructing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310029836.3A
Other languages
Chinese (zh)
Other versions
CN115831127A
Inventor
陈艳姣
徐文渊
邓江毅
苗钱浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310029836.3A
Publication of CN115831127A
Application granted
Publication of CN115831127B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An embodiment of this specification provides a method, a device, and a storage medium for constructing a voiceprint reconstruction model based on voice conversion. The method comprises: constructing a first training set and a second training set based on voice conversion, wherein the first training set indicates the original voice of an impersonator and the second training set indicates the voice the impersonator obtains through voice conversion; constructing a first speech vector from the first training set; constructing a second speech vector from the second training set; and determining a feature vector of the impersonator's original voice from the first speech vector and the second speech vector by a vector decomposition method. The technical scheme provided by this application addresses the inability of the prior art to recover an impersonator's original voice.

Description

Method, device and storage medium for constructing voiceprint reconstruction model based on speech conversion

Technical Field

This document relates to the fields of artificial intelligence and computer security, and in particular to a method, device, and storage medium for constructing a voiceprint reconstruction model based on voice conversion.

Background

Some users exploit voice conversion technology to assume another person's identity by imitating that person's voice, which creates risks to personal security.

Existing voice conversion detection techniques are usually based on artificial intelligence and determine whether a piece of audio was genuinely recorded or artificially generated.

However, such schemes cannot recover the impersonator's original voice, so it is difficult to establish the impersonator's identity. For personal security, these methods treat only the symptoms, not the root cause, and cannot thoroughly eliminate the underlying hazard.

Summary of the Invention

In view of the above analysis, this application aims to propose a method, device, and storage medium for constructing a voiceprint reconstruction model based on voice conversion, capable of recovering the original voice of an impersonator who imitates another's voice, so as to identify the impersonator.

In a first aspect, one or more embodiments of this specification provide a method for constructing a voiceprint reconstruction model based on voice conversion, comprising:

constructing a first training set and a second training set based on voice conversion, wherein the first training set indicates the impersonator's original voice and the second training set indicates the voice the impersonator obtains through voice conversion;

constructing a first speech vector from the first training set;

constructing a second speech vector from the second training set; and

determining a feature vector of the impersonator's original voice from the first speech vector and the second speech vector by a vector decomposition method.

Further, constructing the first training set and the second training set based on voice conversion comprises:

collecting a plurality of original voice recordings and a plurality of voice conversion models;

constructing the first training set from the plurality of original voice recordings; and

using any one of the voice conversion models to convert the voiceprint of any one original recording into the voiceprint of another original recording, yielding one training sample of the second training set.

Further, constructing the first speech vector from the first training set comprises:

determining the speech vector corresponding to each training sample in the first training set; and

taking the average of the speech vectors corresponding to the training samples as the first speech vector.

Further, constructing the second speech vector from the second training set comprises:

determining the speech vector corresponding to each training sample in the second training set; and

taking the average of the speech vectors corresponding to the training samples as the second speech vector.

Further, determining the feature vector of the impersonator's original voice from the first speech vector and the second speech vector by a vector decomposition method comprises:

creating a coordinate system whose axis is the direction of the second speech vector;

decomposing, in this coordinate system, the first speech vector into an orthogonal component and a parallel component; and

determining the feature vector of the impersonator's original voice from the parallel component and the first speech vector.

In a second aspect, one or more embodiments of this specification provide a device for constructing a voiceprint reconstruction model based on voice conversion, comprising a training set construction module, a vector construction module, and a data processing module.

The training set construction module is configured to construct a first training set and a second training set based on voice conversion, wherein the first training set indicates the impersonator's original voice and the second training set indicates the voice the impersonator obtains through voice conversion.

The vector construction module is configured to construct a first speech vector from the first training set, and to construct a second speech vector from the second training set.

The data processing module is configured to determine a feature vector of the impersonator's original voice from the first speech vector and the second speech vector by a vector decomposition method.

Further, the training set construction module is configured to collect a plurality of original voice recordings and a plurality of voice conversion models; construct the first training set from the plurality of original voice recordings; and use any one of the voice conversion models to convert the voiceprint of any one original recording into the voiceprint of another original recording, yielding one training sample of the second training set.

Further, the vector construction module is configured to determine the speech vector corresponding to each training sample in the first training set, and to take the average of these speech vectors as the first speech vector.

Further, the data processing module is configured to create a coordinate system whose axis is the direction of the second speech vector; decompose, in this coordinate system, the first speech vector into an orthogonal component and a parallel component; and determine the feature vector of the impersonator's original voice from the parallel component and the first speech vector.

In a third aspect, one or more embodiments of this specification provide a storage medium storing computer-executable instructions which, when executed, implement the method of the first aspect.

Compared with the prior art, this application achieves at least the following technical effects:

1. A training set is constructed from the impersonator's original voice and the voice the impersonator obtains through voice conversion, providing the data basis for recovering the impersonator's original voice.

2. The vector decomposition method effectively separates the impersonator's original voice from the imitated voice, so that the voiceprint reconstruction model can recover the impersonator's original voice.

3. The scheme is computationally simple and widely applicable, covering scenarios such as audio in languages not included in the training set and physical-domain settings such as imitated telephone recordings.

Brief Description of the Drawings

To explain more clearly the technical solutions in one or more embodiments of this specification or in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments recorded in this specification; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of a method for constructing a voiceprint reconstruction model based on voice conversion provided by one or more embodiments of this specification.

Detailed Description

To enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, these solutions are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only a part of the embodiments of this specification, not all of them. All other embodiments obtained by those of ordinary skill in the art from one or more embodiments of this specification without creative effort fall within the protection scope of this document.

Voice conversion technology based on machine learning can make an original speaker's voice sound like a target speaker's while keeping the spoken content unchanged, and it has been widely applied in film dubbing, assistance for speech impairments, voice imitation, and other fields. An impersonator can therefore use this technology to transform their own voice into someone else's, a behavior that compromises personal security. If a system merely detects whether a voice has been imitated, the impersonator can simply switch methods and continue imitating other people. The existing technology thus cannot effectively deter impersonators, and it is difficult to address both the symptoms and the root cause.

To solve the above technical problems, an embodiment of this application provides a method for constructing a voiceprint reconstruction model based on voice conversion, which, as shown in Fig. 1, comprises the following steps.

Step 1: construct a first training set and a second training set based on voice conversion.

In this embodiment, the first training set indicates the impersonator's original voice, and the second training set indicates the voice the impersonator obtains through voice conversion. Step 1 specifically comprises:

(1.1) Collecting speaker speech data sets: collect several open-source speaker speech data sets to form a training speaker set of n speakers in total; each speaker contributes several audio recordings of varying lengths, ensuring that the number of speakers and the amount of training data are sufficient. This set S is the first training set, and s1, s2, ..., sn are the original voice recordings.

(1.2) Preprocessing the speaker speech data sets: rename all speaker speech data sets to a uniform convention and use the FFmpeg tool to resample the audio into wav format at the required sampling rate, facilitating the subsequent training of voice conversion models.

(1.3) Training voice conversion models: survey several existing mainstream voice conversion methods and adopt open-source pre-trained models, or train models to reproduce the expected results, obtaining multiple voice conversion models.

(1.4) Generating the voice conversion data sets: take the speakers in the speech data set in turn as the original speaker Ssource and the target speaker Starget, and let each of the t trained voice conversion models generate its own voice conversion data set. These voice conversion data sets form the second training set.

Because samples for the second training set are difficult to obtain directly, steps 1.1, 1.3, and 1.4 serve to guarantee a sufficient number of samples in the second training set.

Specifically, any one of the voice conversion models converts the voiceprint of any one original recording into the voiceprint of another original recording, yielding one training sample of the second training set. For example, if the first training set contains 3 original recordings A, B, and C, and 2 voice conversion models a and b have been collected, then using model a to convert the voiceprint of A (original speaker Ssource) into that of B (target speaker Starget) yields one sample of the second training set. Proceeding analogously over all ordered speaker pairs and both models yields 2 × (3 × 2) = 12 training samples, which greatly enlarges the second training set.
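The combinatorial expansion in step 1.4 can be sketched as follows. The function and data names here are illustrative stand-ins, not the patent's actual implementation; the lambdas merely simulate conversion models so the enumeration logic can run.

```python
from itertools import permutations

def build_conversion_set(originals, converters):
    """Enumerate (model, source, target) combinations to build the
    second training set. `originals` maps speaker id -> audio and
    `converters` maps model name -> conversion function; both are
    placeholders for real recordings and trained models."""
    samples = []
    for model_name, convert in converters.items():
        # Ordered speaker pairs with source != target, as in the example.
        for src, tgt in permutations(originals, 2):
            converted = convert(originals[src], originals[tgt])
            samples.append({"model": model_name, "source": src,
                            "target": tgt, "audio": converted})
    return samples

# Toy stand-ins: 3 original recordings (A, B, C) and 2 "conversion models".
originals = {"A": [0.1], "B": [0.2], "C": [0.3]}
converters = {
    "a": lambda s, t: [x + t[0] for x in s],
    "b": lambda s, t: [x - t[0] for x in s],
}
samples = build_conversion_set(originals, converters)
print(len(samples))  # 2 models x 6 ordered pairs = 12
```

With 3 speakers and 2 models this yields the 12 converted samples discussed above; the count grows multiplicatively as more recordings or models are added.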

Step 2: construct the first speech vector from the first training set.

In this embodiment, a preliminary representation of the voiceprint must be obtained before constructing the first speech vector. Specifically:

(2.1.1) Constructing the feature spectrum: the target speaker's evidence audio is input, and a Mel filter bank converts its time-domain information into a feature spectrogram of the corresponding dimensions.

(2.1.2) Extracting time-domain features: one module and three further modules are constructed; the former extracts time-domain features, while the latter model the correlations among the global channels.

(2.1.3) Constructing the feature map: the outputs of the three latter modules are concatenated into a feature map, giving the feature vector of the target speaker's evidence audio.

Then, from these preliminary features, the speech vector corresponding to each training sample in the first training set is determined, and the average of these speech vectors is taken as the first speech vector.
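The averaging step above reduces to a mean over the sample axis of the per-sample embeddings; a minimal sketch (the vectors and dimensions are illustrative):

```python
import numpy as np

def mean_speech_vector(sample_vectors):
    """Average the per-sample speech vectors of a training set to obtain
    a single representative vector (e.g. the 'first speech vector')."""
    stacked = np.stack(sample_vectors)  # shape: (num_samples, dim)
    return stacked.mean(axis=0)         # shape: (dim,)

vecs = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
print(mean_speech_vector(vecs))  # [2. 3.]
```

The same routine applies unchanged to the second training set in step 3.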

Step 3: construct the second speech vector from the second training set.

In this embodiment, a preliminary representation of the voiceprint must likewise be obtained before constructing the second speech vector. Specifically:

(3.1.1) Constructing the feature spectrum: the converted audio is input, and a Mel filter bank converts its time-domain information into a feature spectrogram of the corresponding dimensions.

(3.1.2) Extracting time-domain features: one module and three further modules are constructed; the former extracts time-domain features, while the latter model the correlations among the global channels.

(3.1.3) Constructing the feature map: the outputs of the three latter modules are concatenated into a feature map, from which the feature vector of the converted audio is output.

Then, from these preliminary features, the speech vector corresponding to each training sample in the second training set is determined, and the average of these speech vectors is taken as the second speech vector.

Step 4: determine the feature vector of the impersonator's original voice from the first speech vector and the second speech vector by a vector decomposition method.

In this embodiment, the vector decomposition method proceeds as follows.

A coordinate system is created whose axis is the direction of the second speech vector; in this coordinate system, the first speech vector is decomposed into an orthogonal component and a parallel component; and the feature vector of the impersonator's original voice is determined from the parallel component and the first speech vector.

Specifically, the average vector of the target speaker's audio feature vectors is computed by averaging the frame-level feature information along the time axis.

(2.2.2) The normalized direction vector of unit length is computed, where a small constant prevents division-by-zero errors.

(2.2.3) The feature vector of the converted audio is decomposed into a parallel component and an orthogonal component, using an identity matrix of the appropriate size.
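Steps (2.2.1)–(2.2.3) amount to projecting the frame-level features onto the target speaker's mean direction. The sketch below uses an explicit projection rather than the patent's (elided) matrix form, and all names (`M`, `n_bar`, `eps`) are this sketch's own:

```python
import numpy as np

def decompose(M, n_bar, eps=1e-6):
    """Split converted-audio feature frames M (frames x dim) into
    components parallel and orthogonal to the target speaker's mean
    vector n_bar. eps guards against division by zero when normalizing."""
    n_hat = n_bar / (np.linalg.norm(n_bar) + eps)  # unit direction vector
    M_par = np.outer(M @ n_hat, n_hat)             # projection onto n_hat
    M_orth = M - M_par                             # remainder is orthogonal
    return M_par, M_orth

M = np.array([[3.0, 4.0], [1.0, 0.0]])
n_bar = np.array([1.0, 0.0])
M_par, M_orth = decompose(M, n_bar)  # M_par + M_orth reconstructs M
```

As the description notes, M_orth suppresses the target speaker's contribution, while keeping the parallel part available prevents discarding useful information.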

In this embodiment, the component M⊥ decomposed out of M weakens the influence of the target speaker's voiceprint features (N) and increases the weight of the original speaker's voiceprint features, driving the recovered voiceprint features to be optimized toward those of the original speaker.

(2.2.4) Through a residual connection, a TDNN computes a feature vector matched to the length of the converted audio; the connected tensors have the same dimensions.

It should be noted that the optimization direction cannot be exactly orthogonal in practice, so if only the component M⊥ were taken as the output, a considerable amount of useful information could be lost and the recovery would be unsatisfactory. Step 2.2.4 therefore reinforces the impersonator's original voiceprint.

The resulting feature vector of the impersonator's original voice is then refined through dimension normalization, data pooling, and fixing the voiceprint length.

Preferably, the AAM-Softmax loss function is used as the output layer, so that the recovered original speaker's voiceprint is sufficiently distinguishable from those of other suspects.

In summary, the model constructed by this application can recover the impersonator's original voice.

To illustrate the feasibility of the above embodiments, the following takes the construction and training of the REVELIO voiceprint reconstruction model (whose task is to extract features from input converted audio and reconstruct the original speaker's voiceprint) as an example, and elaborates the method for reconstructing the original speaker's voiceprint from converted speech, comprising the following steps.

(1) Data set construction, comprising the following sub-steps:

(1.1) Collecting speaker speech data sets: collect open-source speaker speech data sets, including LibriSpeech, to form a training speaker set; each speaker contributes several audio recordings of varying lengths, ensuring that the number of speakers and the amount of training data are sufficient.

(1.2) Preprocessing the speaker speech data sets: rename all speaker speech data sets to a uniform convention and resample the audio into the required format at the desired sampling rates, facilitating the subsequent training of voice conversion models.

(1.3) Training voice conversion models: survey several existing mainstream voice conversion methods and adopt open-source pre-trained models, or train models to reproduce the expected results, obtaining multiple voice conversion models, among them:

VQVC: voice conversion based on vector quantization (VQ);

VQVC+: voice conversion based on vector quantization and a U-Net structure;

AGAIN: voice conversion based on activation guidance and adaptive instance normalization (AdaIN);

BNE: sequence-to-sequence (Seq2seq) voice conversion based on a bottleneck feature extractor (BNE) and a location-relative mixture-of-logistic (MoL) attention mechanism.

(1.4) Generating the voice conversion data sets: take the speakers in the speech data set in turn as the original speaker and the target speaker, and let each trained voice conversion model generate its own voice conversion data set.

(2) Model construction, comprising the following sub-steps:

(2.1) Feature extraction module: obtain a preliminary representation of the voiceprint from the converted audio, comprising the following sub-steps:

(2.1.1) Constructing the feature spectrum: the converted audio and the target speaker's evidence audio are input, and a Mel filter bank converts the time-domain information of both into feature spectrograms of the corresponding dimensions;

(2.1.2) Extracting time-domain features: one multi-layer module and three further multi-layer modules are constructed; the former extracts time-domain features, while the latter model the correlations among the global channels;

(2.1.3) Constructing the feature map: the outputs of the three latter modules, which have equal dimensions, are concatenated, and the final feature vectors of the converted audio and the target speaker's evidence audio are output with their respective dimensions;

(2.2) Differential correction module: drive the voiceprint to adjust in the direction orthogonal to the target speaker's voiceprint so as to reduce the influence of the target speaker's information, comprising the following sub-steps:

(2.2.1) Computing the average vector of the target speaker's audio feature vectors: the frame-level feature information is averaged along the time axis;

(2.2.2) Computing the normalized direction vector, where a small constant, for example 1e-6, prevents division-by-zero errors;

(2.2.3) Decomposing the feature vector of the converted audio into a parallel component and an orthogonal component, using an identity matrix of the appropriate size;

(2.2.4) Computing the combined feature through a residual connection, the connected tensors having the same dimensions;

(2.3) Dimension normalization module: unify inputs of different time lengths into outputs of a fixed dimension, comprising the following sub-steps:

(2.3.1) Data pooling: the features are first passed through a multi-layer module, and the output is then pooled through an attention layer, producing a joint representation of all data related to the channel count and frame content;

(2.3.2) Fixing the voiceprint length: the above result is further passed through another multi-layer module, so that the representation, originally dependent on the length of the converted audio, is normalized into a voiceprint feature vector of fixed length;

(2.4) Voiceprint enhancement module: the AAM-Softmax loss function is used as the output layer so that the recovered original speaker's voiceprint is sufficiently distinguishable from those of other suspects.
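The additive angular margin (AAM) softmax output layer named above can be sketched as follows. The margin and scale values are illustrative, and this numpy forward pass is a simplified stand-in for the trained layer, not the patent's implementation:

```python
import numpy as np

def aam_softmax_logits(embeddings, weights, labels, margin=0.2, scale=30.0):
    """AAM-Softmax logits: cosine similarities between L2-normalized
    embeddings and class weights, with an additive angular margin applied
    to each sample's true class before rescaling."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)   # (batch, num_classes)
    theta = np.arccos(cos)              # angles between embeddings and classes
    rows = np.arange(len(labels))
    theta[rows, labels] += margin       # penalize the true class's angle
    return scale * np.cos(theta)
```

Because the true-class logit is computed at an enlarged angle, the model must keep embeddings well inside their class region, which is what gives the recovered voiceprints their discriminability.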

(3) Model training, comprising the following sub-steps:

(3.1) Initializing the model: pre-train a model and use the parameters of its feature extractor to initialize the feature extractor of this model;

(3.2) Training the model: define the set of all possible original speaker identity labels involved in training, and feed the converted audio together with the corresponding target speaker evidence audio into the model; the model's output is then used to train its ability to classify the original speaker identity labels correctly;

(3.3) Iterating to convergence: compute the loss from the model's classification output, back-propagate the gradients and update the parameters, and iterate until the loss converges below a given threshold, completing the training.
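The iterate-until-convergence loop of step (3.3) can be sketched generically. A tiny logistic-regression classifier stands in for the real REVELIO model here; the data, learning rate, and threshold are all illustrative:

```python
import numpy as np

def train_until_converged(X, y, lr=0.5, threshold=0.1, max_iters=5000):
    """Minimal stand-in for step (3.3): repeat forward pass, loss
    computation, gradient step, and parameter update until the loss
    drops below a threshold (or an iteration budget is exhausted)."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])
    loss = np.inf
    for _ in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # forward pass (sigmoid)
        loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
        if loss < threshold:                 # convergence check
            break
        w -= lr * X.T @ (p - y) / len(y)     # gradient descent update
    return w, loss
```

In the actual scheme, the forward pass, loss, and parameter set are those of the voiceprint reconstruction model with its AAM-Softmax output, but the control flow is the same.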

本申请实施例提供了一种基于语音转换的声纹重构模型构建装置,其特在于,包括:训练集构建模块、向量构建模块和数据处理模块;The embodiment of the present application provides a speech conversion-based voiceprint reconstruction model construction device, which is characterized in that it includes: a training set construction module, a vector construction module and a data processing module;

所述训练集构建模块用于基于语音转换构建第一训练集和第二训练集,所述第一训练集用于指示伪装者的原声,所述第二训练集用于指示伪装者通过语音转化得到的声音;The training set construction module is used to construct a first training set and a second training set based on speech conversion, the first training set is used to indicate the original sound of the pretender, and the second training set is used to indicate that the pretender converts the get the sound;

所述向量构建模块用于根据所述第一训练集,构建第一语音向量;根据所述第二训练集,构建第二语音向量;The vector construction module is used to construct a first speech vector according to the first training set; construct a second speech vector according to the second training set;

所述数据处理模块用于根据所述第一语音向量和所述第二语音向量,通过向量分解法,确定伪装者原声的特征向量。The data processing module is used to determine the feature vector of the pretender's original voice through a vector decomposition method according to the first speech vector and the second speech vector.

In an embodiment of the present application, the training-set construction module is configured to collect a plurality of original-voice recordings and a plurality of voice-conversion models; construct the first training set from the plurality of original-voice recordings; and use any one of the voice-conversion models to convert the voiceprint corresponding to one original-voice recording into the voiceprint corresponding to another, each such conversion yielding one training sample of the second training set.
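A minimal sketch of this construction, assuming `originals` is a list of original-voice recordings and each entry of `vc_models` is a callable that converts a source recording toward a target speaker's voiceprint (both hypothetical interfaces):

```python
import random

def build_training_sets(originals, vc_models, n_pairs=1000, seed=0):
    """First set: the original (unconverted) recordings. Second set:
    each sample is one original converted toward another speaker's
    voiceprint by a randomly chosen voice-conversion model."""
    rng = random.Random(seed)
    first_set = list(originals)
    second_set = []
    for _ in range(n_pairs):
        src, tgt = rng.sample(originals, 2)   # two distinct original recordings
        vc = rng.choice(vc_models)            # any of the collected VC models
        second_set.append(vc(src, tgt))       # convert src's voiceprint toward tgt's
    return first_set, second_set
```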

In an embodiment of the present application, the vector construction module is configured to determine the speech vector corresponding to each training sample in the first training set, and to take the average of these speech vectors as the first speech vector.
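As a sketch, with a hypothetical `embed` function mapping each training sample to its speech vector, the first speech vector is simply the element-wise mean:

```python
import numpy as np

def mean_speech_vector(samples, embed):
    """Average the per-sample speech vectors of a training set to
    obtain its representative speech vector."""
    vecs = np.stack([embed(s) for s in samples])  # one row per sample
    return vecs.mean(axis=0)
```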

In an embodiment of the present application, the data processing module is configured to create a coordinate system whose axis is the direction of the second speech vector; decompose the first speech vector into an orthogonal component and a parallel component in this coordinate system; and determine the feature vector of the impersonator's original voice from the parallel component and the first speech vector.

An embodiment of the present application provides a storage medium storing computer-executable instructions which, when executed, implement the methods described in the above embodiments.

The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized with hardware entity modules. For example, a programmable logic device (PLD), such as a field-programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program to "integrate" a digital system onto a single PLD by themselves, without needing a chip manufacturer to design and fabricate a dedicated integrated-circuit chip. Moreover, today, instead of fabricating integrated-circuit chips by hand, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must likewise be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. Those skilled in the art will also appreciate that a hardware circuit implementing a logical method flow can be readily obtained merely by lightly programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The controller may be implemented in any suitable manner; for example, it may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by that (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320; a memory controller may also be implemented as part of the memory's control logic. Those skilled in the art also know that, besides implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included within it for realizing various functions can also be regarded as structures within the hardware component. The means for realizing various functions can even be regarded as both software modules for implementing a method and structures within a hardware component.

The systems, apparatuses, modules, or units set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above apparatus is described with its functions divided into various units. Of course, when implementing the embodiments of this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.

Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.

This specification is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data-processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data-processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data-processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-persistent storage in a computer-readable medium, in forms such as random-access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.

Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact-disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic-tape or magnetic-disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprise" and "include", and any other variants thereof, are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a set of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.

One or more embodiments of this specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. One or more embodiments of this specification may also be practiced in distributed computing environments, where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiment.

The above is merely an embodiment of this document and is not intended to limit it. Various modifications and variations of this document will occur to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this document shall fall within the scope of its claims.

Claims (8)

1. A method for constructing a voiceprint reconstruction model based on voice conversion, characterized by comprising the following steps:
constructing a first training set and a second training set based on voice conversion, wherein the first training set indicates an impersonator's original voice and the second training set indicates the voice obtained by the impersonator through voice conversion;
constructing a first speech vector according to the first training set;
constructing a second speech vector according to the second training set;
determining a feature vector of the impersonator's original voice from the first speech vector and the second speech vector by a vector decomposition method;
wherein determining the feature vector of the impersonator's original voice from the first speech vector and the second speech vector by the vector decomposition method comprises:
creating a coordinate system with the direction of the second speech vector as a coordinate axis;
decomposing the first speech vector into an orthogonal component and a parallel component in the coordinate system;
determining the feature vector of the impersonator's original voice according to the parallel component and the first speech vector;
specifically, computing the average vector $\bar{\mathbf{x}}$ of the target speaker's audio feature vectors $X$: denoting by $\mathbf{x}_i$ the $i$-th frame of $X$, the average is $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$;
computing the normalized direction vector $\mathbf{e} = \bar{\mathbf{x}} / (\|\bar{\mathbf{x}}\| + \epsilon)$ of unit length, wherein the constant $\epsilon$ is used to prevent division-by-zero errors;
decomposing the feature vector $\mathbf{v}$ of the voice-converted audio into a parallel component $\mathbf{v}_{\parallel} = \mathbf{e}\mathbf{e}^{\top}\mathbf{v}$ and an orthogonal component $\mathbf{v}_{\perp} = (\mathbf{I} - \mathbf{e}\mathbf{e}^{\top})\mathbf{v}$, wherein $\mathbf{I}$ is an identity matrix of size $d \times d$.
2. The method of claim 1, wherein constructing the first training set and the second training set based on voice conversion comprises:
collecting a plurality of original-voice recordings and a plurality of voice-conversion models;
constructing the first training set using the plurality of original-voice recordings;
and converting, using any one of the voice-conversion models, the voiceprint corresponding to one of the original-voice recordings into the voiceprint corresponding to another of the original-voice recordings, to obtain one training sample of the second training set.
3. The method of claim 1, wherein constructing the first speech vector according to the first training set comprises:
determining a speech vector corresponding to each training sample in the first training set;
and determining the average vector of the speech vectors corresponding to the training samples as the first speech vector.
4. The method of claim 1, wherein constructing the second speech vector according to the second training set comprises:
determining a speech vector corresponding to each training sample in the second training set;
and determining the average vector of the speech vectors corresponding to the training samples as the second speech vector.
5. An apparatus for constructing a voiceprint reconstruction model based on voice conversion, characterized by comprising: a training-set construction module, a vector construction module, and a data processing module;
the training-set construction module being configured to construct a first training set and a second training set based on voice conversion, wherein the first training set indicates an impersonator's original voice and the second training set indicates the voice obtained by the impersonator through voice conversion;
the vector construction module being configured to construct a first speech vector according to the first training set and a second speech vector according to the second training set;
the data processing module being configured to determine a feature vector of the impersonator's original voice from the first speech vector and the second speech vector by a vector decomposition method;
wherein the data processing module is configured to create a coordinate system with the direction of the second speech vector as a coordinate axis; decompose the first speech vector into an orthogonal component and a parallel component in the coordinate system; and determine the feature vector of the impersonator's original voice according to the parallel component and the first speech vector;
specifically, computing the average vector $\bar{\mathbf{x}}$ of the target speaker's audio feature vectors $X$: denoting by $\mathbf{x}_i$ the $i$-th frame of $X$, the average is $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$;
computing the normalized direction vector $\mathbf{e} = \bar{\mathbf{x}} / (\|\bar{\mathbf{x}}\| + \epsilon)$ of unit length, wherein the constant $\epsilon$ is used to prevent division-by-zero errors;
decomposing the feature vector $\mathbf{v}$ of the voice-converted audio into a parallel component $\mathbf{v}_{\parallel} = \mathbf{e}\mathbf{e}^{\top}\mathbf{v}$ and an orthogonal component $\mathbf{v}_{\perp} = (\mathbf{I} - \mathbf{e}\mathbf{e}^{\top})\mathbf{v}$, wherein $\mathbf{I}$ is an identity matrix of size $d \times d$.
6. The apparatus of claim 5, wherein the training-set construction module is configured to collect a plurality of original-voice recordings and a plurality of voice-conversion models; construct the first training set using the plurality of original-voice recordings; and convert, using any one of the voice-conversion models, the voiceprint corresponding to one of the original-voice recordings into the voiceprint corresponding to another, to obtain one training sample of the second training set.
7. The apparatus of claim 5, wherein the vector construction module is configured to determine a speech vector corresponding to each training sample in the first training set, and to determine the average vector of the speech vectors corresponding to the training samples as the first speech vector.
8. A storage medium, comprising:
for storing computer-executable instructions which, when executed, implement the method of any of claims 1-4.
CN202310029836.3A 2023-01-09 2023-01-09 Method, device and storage medium for constructing voiceprint reconstruction model based on speech conversion Active CN115831127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310029836.3A CN115831127B (en) 2023-01-09 2023-01-09 Method, device and storage medium for constructing voiceprint reconstruction model based on speech conversion

Publications (2)

Publication Number Publication Date
CN115831127A CN115831127A (en) 2023-03-21
CN115831127B true CN115831127B (en) 2023-05-05

Family

ID=85520470

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108877818A (en) * 2018-05-16 2018-11-23 陕西师范大学 A kind of the audio camouflage and restoration methods of combination key and random units orthogonal basis
CN109215680A (en) * 2018-08-16 2019-01-15 公安部第三研究所 A kind of voice restoration method based on convolutional neural networks
CN113270112A (en) * 2021-04-29 2021-08-17 中国人民解放军陆军工程大学 Electronic camouflage voice automatic distinguishing and restoring method and system
CN114566170A (en) * 2022-03-01 2022-05-31 北京邮电大学 Lightweight voice spoofing detection algorithm based on class-one classification
CN115035901A (en) * 2022-06-07 2022-09-09 广东电网有限责任公司 Voiceprint recognition method based on neural network and related device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9800721B2 (en) * 2010-09-07 2017-10-24 Securus Technologies, Inc. Multi-party conversation analyzer and logger

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Research on robust restoration of electronically disguised speech based on i-vectors; Zheng Linlin, Zhang Xiongwei, Sun Meng, Li Jiakang, Zhang Xingyu; Journal of Data Acquisition and Processing (No. 5) *
A survey of digital audio authentication research; Li Wei, Wang Zhurong, Li Xiaoqiang, Liu Yaduo; Computer Science (No. 10) *
A study of voice-change patterns in electronically disguised speech; Zhang Guiqing, Jin Yizhu, Liu Hongwei, Cui Xiaoyi; Evidence Science (No. 4) *
A survey of voice-disguise methods and countermeasures; Zheng Linlin, Sun Meng, Zhang Xiongwei, Pan Zhixin; Information Technology and Network Security (No. 8) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant