CN113056784B - Voice information processing method and device, storage medium and electronic equipment - Google Patents
Voice information processing method and device, storage medium and electronic equipment
- Publication number: CN113056784B (application CN201980076330.XA)
- Authority: CN (China)
- Prior art keywords: information, voice, target, voiceprint, parameters
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00 — Speech recognition
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G10L17/00 — Speaker identification or verification techniques
Abstract
This embodiment discloses a method for processing voice information. The method includes: collecting voice information and extracting target voice feature information, which is input into a preset model to obtain target voiceprint parameters; acquiring the voice to be recognized and extracting its first voiceprint parameter; matching the first voiceprint parameter with the target voiceprint parameters; obtaining identification information according to the matching result; and marking the identification information in the played video. The accuracy of voice information processing is thereby improved.
Description
Technical Field

The present invention relates to the field of speech processing, and in particular to a voice information processing method, device, storage medium and electronic device.
Background Art

With the development of information technology, the data users consume is no longer limited to text and pictures; video has become the main medium of information transmission.

At present, to help users better understand video content, adding subtitles to videos using speech recognition technology has become a common choice; subtitles also speed up the sharing of videos across languages. However, existing subtitles carry only the text content of the speech, so in some videos it is difficult to determine the speaker's identity from the text alone, which hinders the user's understanding of the video content.
Summary of the Invention

The embodiments of the present application provide a voice information processing method, device, storage medium and electronic device that can improve the accuracy of voice information processing.

In a first aspect, an embodiment of the present application provides a method for processing voice information, including:

collecting voice information of a target user, and extracting target voice feature information from the voice information;

inputting the target voice feature information into a preset model to obtain target voiceprint parameters;

acquiring the voice information to be recognized in a played video, and extracting a first voiceprint parameter of the voice information to be recognized;

matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information in the played video.
In a second aspect, an embodiment of the present application provides a voice information processing device, including:

a collection unit, configured to collect voice information of a target user and extract target voice feature information from the voice information;

an input unit, configured to input the target voice feature information into a preset model to obtain target voiceprint parameters;

an acquisition unit, configured to acquire the voice information to be recognized in a played video and extract a first voiceprint parameter of the voice information to be recognized;

a matching unit, configured to match the first voiceprint parameter with the target voiceprint parameters, obtain identification information of the matched target voiceprint parameter according to the matching result, and mark the identification information in the played video.
In a third aspect, an embodiment of the present application provides a storage medium storing a computer program which, when run on a computer, causes the computer to execute the voice information processing method provided in any embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides an electronic device including a processor and a memory, wherein the memory stores a computer program, and the processor, by calling the computer program, executes the following steps:

collecting voice information of a target user, and extracting target voice feature information from the voice information;

inputting the target voice feature information into a preset model to obtain target voiceprint parameters;

acquiring the voice information to be recognized in a played video, and extracting a first voiceprint parameter of the voice information to be recognized;

matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information in the played video.
Brief Description of the Drawings

The technical solution and other beneficial effects of the present application will become apparent from the detailed description of specific embodiments below, taken in conjunction with the accompanying drawings.

FIG. 1 is a flow chart of the voice information processing method provided in an embodiment of the present application.

FIG. 2 is another flow chart of the voice information processing method provided in an embodiment of the present application.

FIG. 3 is a module diagram of the voice information processing device provided in an embodiment of the present application.

FIG. 4 is another module diagram of the voice information processing device provided in an embodiment of the present application.

FIG. 5 is a schematic structural diagram of the electronic device provided in an embodiment of the present application.

FIG. 6 is another schematic structural diagram of the electronic device provided in an embodiment of the present application.
Detailed Description

Please refer to the drawings, in which the same reference symbols represent the same components. The principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be regarded as limiting other specific embodiments not described in detail herein.

The term "module" as used herein can be regarded as a software object executed on a computing system. The different components, modules, engines and services described herein can be regarded as objects implemented on that computing system. The devices and methods described herein are preferably implemented in software, but can also be implemented in hardware; both fall within the scope of protection of this application.

An embodiment of the present application provides a voice information processing method. The execution subject of the method may be the voice information processing device provided in the embodiments of the present application, or an electronic device integrating that device, where the device may be implemented in hardware or software. The electronic device may be a smart phone, a tablet computer, a personal digital assistant (PDA), or the like.

A detailed analysis follows.
An embodiment of the present invention provides a method for processing voice in video, including:

collecting voice information of a target user, and extracting target voice feature information from the voice information;

inputting the target voice feature information into a preset model to obtain target voiceprint parameters;

acquiring the voice information to be recognized in a played video, and extracting a first voiceprint parameter of the voice information to be recognized;

matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information in the played video.
In one embodiment, before the step of inputting the target voice feature information into the preset model, the method may further include: training background data with a preset algorithm to generate a preset model containing the common voice feature information corresponding to each target user, the background data including the voice information of each target user.

In one embodiment, the step of inputting the target voice feature information into the preset model to obtain the target voiceprint parameters may include: inputting the target voice feature information into the preset model to obtain target difference feature information relative to the common voice feature information; determining a second voiceprint parameter from the target difference feature information; and performing channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.

In one embodiment, the step of performing channel compensation on the second voiceprint parameter may include: performing channel compensation on the second voiceprint parameter using linear discriminant analysis.

In one embodiment, the step of matching the first voiceprint parameter with the target voiceprint parameters and obtaining identification information of the matched target voiceprint parameter according to the matching result may include: matching the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values; and, when a matching value is greater than a preset threshold, obtaining the identification information of the matched target voiceprint parameter.

In one embodiment, the step of obtaining identification information of the matched target voiceprint parameter may include: sorting the matching values, taking the maximum of the matching values greater than the preset threshold, and obtaining the matched target voiceprint parameter from that maximum matching value; and obtaining the corresponding identification information from the target voiceprint parameter.

In one embodiment, the step of marking the identification information in the played video includes: inputting the voice information to be recognized into a speech recognition model to generate corresponding text information; combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and marking the subtitle information in the played video.
An embodiment of the present application provides a voice information processing method, as shown in FIG. 1, which is a flow chart of the method. The voice information processing method may include the following steps:

In step S101, voice information of a target user is collected, and target voice feature information is extracted from the voice information.

Here, the target user refers to the main speakers in a video. Understandably, in interviews, movies, variety shows and similar videos, the speech is in most cases concentrated in a limited number of roles. For example, in an interview video the target users are the host and the guests; in a movie or TV series the target users are the actors with major roles; and in a music video (MV) of an idol group the target users are all members of the group.

The target user's voice information refers to annotated voice information, which therefore contains the target user's identification information. The identification information may be the target user's identity information, such as name, gender, age, title and other personal information. The target voice feature information refers to the voiceprint features of the target voice. Understandably, because the vocal organs used in speaking, such as the tongue, teeth, larynx, lungs and nasal cavity, differ greatly in size and shape from person to person, every person's voiceprint is different. Voiceprint features are therefore unique to each person, just as everyone has a unique fingerprint. Specifically, the target voice feature information includes the Mel-Frequency Cepstral Coefficients (MFCC) of the target voice information.

In some embodiments, to ensure the stability of the target voice feature information, the voice information may be processed to remove silence and noise, producing processed voice information; the target voice feature information is then extracted from the processed voice information and processed with feature mean-variance normalization and feature warping.
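As a minimal, hedged sketch of this feature-extraction step (the library, sample rate, silence threshold and coefficient count below are illustrative assumptions; the patent does not specify them):

```python
# Illustrative only: librosa and all numeric choices here are assumptions,
# not part of the patent's disclosure.
import numpy as np
import librosa

def extract_mfcc(wav_path, n_mfcc=20):
    """Remove silence, compute MFCCs, and apply mean-variance normalization."""
    y, sr = librosa.load(wav_path, sr=16000)
    # Crude silence removal: keep only intervals above a 20 dB threshold.
    intervals = librosa.effects.split(y, top_db=20)
    y = np.concatenate([y[start:end] for start, end in intervals])
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    # Per-utterance mean/variance normalization stabilizes the features;
    # feature warping would additionally map each coefficient to a Gaussian.
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / \
           (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc.T  # (n_frames, n_mfcc)
```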
In step S102, the target voice feature information is input into a preset model to obtain target voiceprint parameters.

Here, the preset model may be a Universal Background Model (UBM). By inputting the target voice feature information, i.e. the target voiceprint features, into the UBM, target voiceprint parameters containing the target user's identification information are obtained. Different target voiceprint parameters correspond to the identification information of different target users; that is, the target user of each piece of voice information can be determined from its target voiceprint parameters. Likewise, if different voice segments yield the same target voiceprint parameters, their speakers can be regarded as the same user. In addition, the process of inputting the target voice feature information into the preset model to obtain the target voiceprint parameters is the process of establishing a voiceprint model from the target voiceprint parameters; understandably, different target voiceprint parameters correspond to the voiceprint models of different target users.
In some embodiments, before the step of inputting the target voice feature information into the preset model, the method may further include: training background data with a preset algorithm to generate a preset model containing the common voice feature information corresponding to each target user, the background data including the voice information of each target user.

In this case, the step of inputting the target voice feature information into the preset model to obtain the target voiceprint parameters may include:

(1) inputting the target voice feature information into the preset model to obtain target difference feature information relative to the common voice feature information;

(2) determining a second voiceprint parameter from the target difference feature information;

(3) performing channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.

Here, the preset algorithm may be the EM algorithm. The background data, i.e. the target voice feature information it contains, is trained with the EM algorithm to generate the universal background model, and the common voice feature information corresponding to each target user is obtained through the UBM. The common voice feature information is thus the common voiceprint features derived from all target users.
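A minimal sketch of this training step, assuming scikit-learn's GaussianMixture (which is fitted by EM); the component count and diagonal covariances are conventional choices for a UBM, not requirements of the patent:

```python
# Sketch under stated assumptions: a GMM fitted by EM over pooled background
# features serves as the universal background model (UBM).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(per_user_features, n_components=512):
    """per_user_features: list of (n_frames, n_mfcc) arrays, one per target user."""
    background = np.vstack(per_user_features)  # pool all users' frames
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',  # diagonal covariances are usual for UBMs
                          max_iter=100)
    ubm.fit(background)  # EM training
    return ubm  # ubm.means_ encodes the common voice feature information
```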
Further, the target voice feature information is input into the UBM; from the target voice feature information and the common voice feature information, the target difference feature information relative to the common voice feature information can be computed, and the second voiceprint parameter of each piece of voice information is determined from the target difference feature information, where the second voiceprint parameter contains the target user's identification information. Understandably, because voiceprints are unique, the target voice feature information of different target users differs; obtaining the difference features relative to the common voice feature information amplifies the distinctiveness of each user's features, so the target user of each piece of voice information can be determined more accurately from the target difference feature information than from the raw target voice feature information.

In addition, because the voice information in the background data and the voice information to be recognized are collected over different transmission channels, large channel differences exist, which degrade recognition performance and lower the recognition rate. Channel compensation is therefore applied to the second voiceprint parameter so that it minimizes intra-class differences and maximizes inter-class differences, yielding easily distinguishable low-dimensional target voiceprint parameters.
In step S103, the voice information to be recognized in the played video is acquired, and the first voiceprint parameter of the voice information to be recognized is extracted.

The voice information to be recognized may be acquired in real time from a video being played or from a live stream, or from a locally stored video. The first voiceprint parameter is extracted in the same way as the target voiceprint parameters above: the first voice feature information is extracted from the voice information to be recognized and input into the preset model; the first difference feature information relative to the common voice feature information is computed from the first voice feature information and the common voice feature information corresponding to each target user in the preset model; and the first voiceprint parameter of the voice information to be recognized is determined from the first difference feature information.
In step S104, the first voiceprint parameter is matched with the target voiceprint parameters, identification information of the matched target voiceprint parameter is obtained according to the matching result, and the identification information is marked in the played video.

The first voiceprint parameter is matched against the target voiceprint parameters to obtain a matching result, from which the target voiceprint parameter matching the first voiceprint parameter can be determined. Because each target voiceprint parameter corresponds to a particular target user, i.e. contains that user's identification information, the target user corresponding to the first voiceprint parameter can be confirmed from the matched target voiceprint parameter.

In some embodiments, the step of matching the first voiceprint parameter with the target voiceprint parameters and obtaining identification information of the matched target voiceprint parameter according to the matching result may include:

(1) matching the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values;

(2) when a matching value is greater than a preset threshold, obtaining the identification information of the matched target voiceprint parameter.

When a matching value is greater than the preset threshold, the first voiceprint parameter and the matched target voiceprint parameter are highly similar, so the speaker behind the first voiceprint parameter and the target user corresponding to the matched target voiceprint parameter can be regarded as the same user; the identification information of the matched target voiceprint parameter can therefore be used as the identification information of the voice to be recognized.
In addition, in some embodiments, the step of marking the identification information in the played video may include:

(1.1) inputting the voice information to be recognized into a speech recognition model to generate corresponding text information;

(2.1) combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized;

(3.1) marking the subtitle information in the played video.

While the voice information to be recognized is input into the preset model to obtain the identification information, it is simultaneously input into the speech recognition model to obtain the text information. The time information corresponding to the text information and the identification information is recorded separately; the identification information and the text information are combined according to the time information to generate the subtitle information of the voice information to be recognized, and the subtitle information is marked into the played video according to the time information, as shown in the sketch below.
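A minimal sketch of this combination step (the SRT output format and the tuple layout are our assumptions; the patent only requires that identification information, text and time information be combined):

```python
# Hypothetical helper: merge speaker labels and recognized text into
# time-stamped subtitles in SRT form.
def make_srt(entries):
    """entries: list of (start_seconds, end_seconds, speaker_label, text)."""
    def fmt(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d},{int((t % 1) * 1000):03d}"
    lines = []
    for i, (start, end, speaker, text) in enumerate(entries, 1):
        lines += [str(i), f"{fmt(start)} --> {fmt(end)}", f"[{speaker}] {text}", ""]
    return "\n".join(lines)

# Example: one subtitle carrying both the speaker's identity and the text.
print(make_srt([(12.0, 14.5, "Host", "Welcome to the show.")]))
```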
In some embodiments, the subtitle information may be marked at a preset position of the played video in a preset combination; for example, the identification information and the text may be combined side by side and placed at the bottom of the video frame. Alternatively, the identification information may be marked in a distinct form in a first region of the video while the text information is marked in a different form in a second region; for example, the identification information may be added at the top of the frame in a smaller font than the text information, with the text information added at the bottom of the frame.

As can be seen from the above, the voice information processing method provided in this embodiment collects the voice information of a target user and extracts its target voice feature information; inputs the target voice feature information into a preset model to obtain target voiceprint parameters; acquires the voice information to be recognized in a played video and extracts its first voiceprint parameter; matches the first voiceprint parameter with the target voiceprint parameters, obtains identification information of the matched target voiceprint parameter according to the matching result, and marks the identification information in the played video. In this way, the target user's identification information, for example identity information, can be marked in the played video, helping users better understand the content while watching and ensuring a good user experience; at the same time, voiceprint recognition adds the identification information to the video automatically, greatly reducing manual work and saving labor costs.
The method described in the above embodiment is further illustrated in detail below with an example.

Please refer to FIG. 2, another flow chart of the voice information processing method provided in an embodiment of the present application.

Specifically, the method includes the following steps:
In step S201, voice information of a target user is collected, and target voice feature information is extracted from the voice information.

The target user's voice information refers to annotated voice information and therefore contains the target user's identification information; the identification information may be identity information such as name, gender, age, title and other personal information. The target voice feature information refers to the voiceprint features of the target voice; since voiceprint features are unique to each person, they can be used to determine which user a piece of voice information belongs to.

In some embodiments, to ensure the stability of the target voice feature information, the voice information may be processed to remove silence and noise, producing processed voice information; the target voice feature information is then extracted from the processed voice information and processed with feature mean-variance normalization and feature warping.
In step S202, background data is trained with a preset algorithm to generate a preset model containing the common voice feature information corresponding to each target user, the background data including the voice information of each target user.

Here, the preset algorithm may be the EM algorithm. The background data, i.e. the target voice feature information it contains, is trained with the EM algorithm to generate the universal background model, and the common voice feature information corresponding to each target user is obtained through the UBM. The common voice feature information is thus the common voiceprint features derived from all target users.
In step S203, the target voice feature information is input into the preset model to obtain target difference feature information relative to the common voice feature information.

The target voice feature information of each voice segment is input into the preset model; the target difference feature information is then derived from the target voice feature information of each segment and the common voice feature information of all target users obtained in step S202.
In step S204, a second voiceprint parameter is determined from the target difference feature information.

Here, the second voiceprint parameter is obtained by transforming the target difference feature information through a Total Variability Space (TVS) based model, whose total variability matrix can be estimated with the EM algorithm.
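In the standard i-vector formulation (conventional notation from the literature, not fixed by the patent), the speaker- and channel-dependent GMM supervector is modeled as a low-rank offset from the UBM mean supervector:

$$s = m + T\,w$$

where $s$ is the utterance supervector, $m$ the UBM mean supervector, $T$ the total variability matrix estimated by EM, and the posterior mean of the latent factor $w$ is the i-vector, i.e. the second voiceprint parameter here.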
In step S205, channel compensation is performed on the second voiceprint parameter using linear discriminant analysis to obtain the corresponding target voiceprint parameter.

To mitigate the loss of recognition accuracy caused by channel differences, linear discriminant analysis (LDA) can be used for channel compensation. It should be noted that LDA uses label information to find the optimal projection direction, so that the projected sample set has the smallest intra-class differences and the largest inter-class differences. Applied to voiceprint recognition, the voiceprint parameter vectors of the same speaker form one class: minimizing intra-class differences reduces channel-induced variation, and maximizing inter-class differences enlarges the differences between speakers, so linear discriminant analysis yields easily distinguishable low-dimensional target voiceprint parameters.
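A minimal sketch of this compensation step, assuming scikit-learn's LinearDiscriminantAnalysis with speaker identities as class labels; the output dimensionality is an illustrative choice:

```python
# Sketch under stated assumptions: LDA projects the second voiceprint
# parameters so intra-speaker (channel) variation shrinks and inter-speaker
# variation grows.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def channel_compensate(second_voiceprints, speaker_labels, out_dim=150):
    """second_voiceprints: (n_utterances, dim) array; speaker_labels: (n_utterances,).
    Note: out_dim must be at most n_speakers - 1 for LDA."""
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    target_voiceprints = lda.fit_transform(second_voiceprints, speaker_labels)
    return lda, target_voiceprints  # reuse lda.transform() on test voiceprints
```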
In addition, the process of obtaining the target voiceprint parameters from the target voice feature information is the process of establishing the corresponding voiceprint models; here the voiceprint models are the i-vector voiceprint models corresponding to each target user.
In step S206, the voice information to be recognized in the played video is acquired, and the first voiceprint parameter of the voice information to be recognized is extracted.

The voice feature information of the voice information to be recognized is extracted and input into the preset model; the first difference feature information relative to the common voice feature information is computed from the first voice feature information and the common voice feature information corresponding to each target user in the preset model; the first voiceprint parameter of the voice information to be recognized is determined from the first difference feature information; and channel compensation is applied to the first voiceprint parameter to obtain the processed first voiceprint parameter. The steps for extracting the first voiceprint parameter are the same as those for extracting the target voiceprint parameters above and are not repeated here.
In step S207, the first voiceprint parameter is matched with the target voiceprint parameters to generate corresponding matching values.

Here, the first voiceprint parameter is matched for similarity against the target voiceprint parameters of each target user to generate the corresponding matching values.

In step S208, when a matching value is greater than the preset threshold, the matching values are sorted, the maximum matching value among those greater than the preset threshold is obtained, and the matched target voiceprint parameter is obtained from the maximum matching value.

When a matching value is greater than a preset threshold such as 0.8, the first voiceprint parameter successfully matches the corresponding target voiceprint parameter, and the target user behind the first voiceprint parameter and the target user of that target voiceprint parameter can, with high probability, be regarded as the same user. If several matching values exceed the preset threshold, they are sorted and the maximum matching value is taken; the target user behind the first voiceprint parameter is then regarded, with high probability, as the same person as the target user of the target voiceprint parameter with the maximum matching value, and that target voiceprint parameter is obtained. A sketch of this scoring and selection is given below.
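A minimal sketch of this scoring and selection step; cosine similarity is a conventional i-vector scoring function and is our assumption here, as is taking the 0.8 threshold from the example above:

```python
# Sketch under stated assumptions: cosine similarity as the matching value,
# argmax selection, and rejection below a preset threshold.
import numpy as np

def identify_speaker(first_vp, target_vps, labels, threshold=0.8):
    """Return (label, score) of the best-matching target voiceprint, or (None, score)."""
    scores = [np.dot(first_vp, vp) /
              (np.linalg.norm(first_vp) * np.linalg.norm(vp))
              for vp in target_vps]
    best = int(np.argmax(scores))       # maximum matching value
    if scores[best] > threshold:        # match only above the preset threshold
        return labels[best], scores[best]
    return None, scores[best]           # no target user matches the speaker
```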
In addition, in some embodiments, when all matching values are below the preset threshold, the first voiceprint parameter matches none of the target voiceprint parameters, i.e. the speaker of the voice to be recognized matches no target user in the model; in this case the voiceprint model outputs a non-match result.

In step S209, the corresponding identification information is obtained from the target voiceprint parameter.

Since the target voiceprint parameter contains the target user's identification information, the corresponding identification information can be obtained from the successfully matched target voiceprint parameter.
In step S210, the voice information to be recognized is input into a speech recognition model to generate corresponding text information.

While the voice information to be recognized is input into the voiceprint model to obtain the identification information, it is simultaneously input into the speech recognition model to obtain the text information.

In step S211, the identification information is combined with the text information to generate subtitle information corresponding to the voice information to be recognized.

When the identification information and the text information are obtained, the time information corresponding to each is recorded, and the identification information and the text information are combined according to the time information to generate the subtitle information of the voice information to be recognized.

In step S212, the subtitle information is marked in the played video.

The subtitle information is marked in a preset region of the played video according to the time information, ensuring that the subtitles stay synchronized with the voice in the video.
As can be seen from the above, the voice information processing method provided in this embodiment collects the voice information of a target user and extracts its target voice feature information; inputs the target voice feature information into a preset model to obtain target voiceprint parameters; acquires the voice information to be recognized in a played video and extracts its first voiceprint parameter; matches the first voiceprint parameter with the target voiceprint parameters, obtains identification information of the matched target voiceprint parameter according to the matching result, and marks the identification information in the played video. In this way, the target user's identification information, for example identity information, can be marked in the played video, helping users better understand the content while watching and ensuring a good user experience. Moreover, using speech recognition and voiceprint recognition to add subtitle information to videos automatically greatly reduces manual annotation and saves labor costs.

To facilitate implementation of the voice information processing method provided in the embodiments of the present application, an apparatus based on that method is also provided. The terms have the same meanings as in the voice information processing method above; for implementation details, refer to the description in the method embodiments.
An embodiment of the present invention provides a device for processing voice in video, including:

a collection unit, configured to collect voice information of a target user and extract target voice feature information from the voice information;

an input unit, configured to input the target voice feature information into a preset model to obtain target voiceprint parameters;

an acquisition unit, configured to acquire the voice information to be recognized in a played video and extract a first voiceprint parameter of the voice information to be recognized;

a matching unit, configured to match the first voiceprint parameter with the target voiceprint parameters, obtain identification information of the matched target voiceprint parameter according to the matching result, and mark the identification information in the played video.
In one embodiment, the device may further include: a training unit, configured to train background data with a preset algorithm to generate a preset model containing the common voice feature information corresponding to each target user, the background data including the voice information of each target user.

In one embodiment, the input unit may include: an input subunit, configured to input the target voice feature information into the preset model to obtain target difference feature information relative to the common voice feature information; a determination subunit, configured to determine a second voiceprint parameter from the target difference feature information; and a processing subunit, configured to perform channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.

In one embodiment, the matching unit may include: a matching subunit, configured to match the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values; and an acquisition subunit, configured to obtain the identification information of the matched target voiceprint parameter when a matching value is greater than a preset threshold.

In one embodiment, the matching unit may further include: a generation subunit, configured to input the voice information to be recognized into a speech recognition model to generate corresponding text information; a combination subunit, configured to combine the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and a marking subunit, configured to mark the subtitle information in the played video.
Please refer to FIG. 3, a module diagram of the voice information processing device provided in an embodiment of the present application. Specifically, the voice information processing device 300 includes: a collection unit 31, an input unit 32, an acquisition unit 33 and a matching unit 34.

The collection unit 31 is configured to collect the voice information of a target user and extract the target voice feature information of the voice information.

The target user's voice information collected by the collection unit 31 refers to annotated voice information and therefore contains the target user's identification information; the identification information may be identity information such as name, gender, age, title and other personal information.

The target voice feature information extracted by the collection unit 31 refers to the voiceprint features of the target voice. Understandably, because the vocal organs used in speaking, such as the tongue, teeth, larynx, lungs and nasal cavity, differ greatly in size and shape from person to person, every person's voiceprint is different. Voiceprint features are therefore unique to each person, just as everyone has a unique fingerprint. Specifically, the voiceprint features can be represented by Mel-Frequency Cepstral Coefficients (MFCC).
The input unit 32 is configured to input the target voice feature information into the preset model to obtain the target voiceprint parameters.

The input unit 32 inputs the target voice feature information of the target user's voice information into the preset model to obtain the corresponding adjusted voice feature information; since the preset model contains the common voice feature information corresponding to each target user, the input unit 32 can determine the corresponding target voiceprint parameters from the adjusted voice feature information and the common voice feature information.

The acquisition unit 33 is configured to acquire the voice information to be recognized in the played video and extract the first voiceprint parameter of the voice information to be recognized.

The acquisition unit 33 may acquire the voice information to be recognized in real time from a video being played or from a live stream, or from a locally stored video. The steps by which the acquisition unit 33 extracts the first voiceprint parameter are the same as those by which the input unit 32 obtains the target voiceprint parameters.

The matching unit 34 is configured to match the first voiceprint parameter with the target voiceprint parameters, obtain identification information of the matched target voiceprint parameter according to the matching result, and mark the identification information in the played video.

The matching unit 34 matches the first voiceprint parameter against the target voiceprint parameters to obtain the matching result, from which the matched target voiceprint parameter can be determined. Because each target voiceprint parameter corresponds to a particular target user, i.e. contains that user's identification information, the target user corresponding to the first voiceprint parameter can be confirmed from the target voiceprint parameter.
Reference may also be made to FIG. 4, another module diagram of the voice information processing device provided in an embodiment of the present application. The voice information processing device 300 may further include: a training unit 35, configured to train background data with a preset algorithm to generate a preset model containing the common voice feature information corresponding to each target user, the background data including the voice information of each target user.

The input unit 32 may include: an input subunit 321, configured to input the target voice feature information into the preset model to obtain target difference feature information relative to the common voice feature information; a determination subunit 322, configured to determine a second voiceprint parameter from the target difference feature information; and a processing subunit 323, configured to perform channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.

The matching unit 34 may include: a matching subunit 341, configured to match the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values; an acquisition subunit 342, configured to obtain the identification information of the matched target voiceprint parameter when a matching value is greater than a preset threshold; a generation subunit 343, configured to input the voice information to be recognized into a speech recognition model to generate corresponding text information; a combination subunit 344, configured to combine the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and a marking subunit 345, configured to mark the subtitle information in the played video.
An embodiment of this application further provides an electronic device. Referring to FIG. 5, the electronic device 500 includes a processor 501 and a memory 502, the processor 501 being electrically connected to the memory 502.

The processor 501 is the control center of the electronic device 500. It connects all parts of the device through various interfaces and lines, and performs the various functions of the electronic device 500 and processes its data by running or loading computer programs stored in the memory 502 and invoking data stored in the memory 502, thereby carrying out overall monitoring of the electronic device 500.

The memory 502 may be used to store software programs and modules; the processor 501 performs various functional applications and data processing by running the computer programs and modules stored in it. The memory 502 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the computer programs required for at least one function (such as a sound playback function or an image playback function), while the data storage area may store data created through use of the electronic device. In addition, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to give the processor 501 access to the memory 502.

In this embodiment of the application, the processor 501 of the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502, and runs the computer programs stored there, thereby implementing the following functions:
collecting the voice information of a target user and extracting the target voice feature information of that voice information;

inputting the target voice feature information into a preset model to obtain target voiceprint parameters;

acquiring the voice information to be recognized in a played video and extracting the first voiceprint parameter of that voice information;

matching the first voiceprint parameter against the target voiceprint parameters, obtaining the identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information in the played video.
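For illustration only, the first two of these steps can be sketched as follows. The patent does not fix a feature type or a pooling scheme; the MFCC features and mean pooling below are assumptions, and `extract_features`/`to_voiceprint` are hypothetical helper names (in the patented flow, the trained preset model of the later steps would replace the naive pooling).

```python
# A minimal sketch of steps one and two, assuming MFCC frame features
# (via librosa) and mean pooling as a toy stand-in for the preset model.
import numpy as np
import librosa

def extract_features(wav_path: str) -> np.ndarray:
    """Load the collected voice information and extract per-frame
    target voice feature information (here: 20 MFCCs per frame)."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)
    return mfcc.T  # shape: (n_frames, 20)

def to_voiceprint(features: np.ndarray) -> np.ndarray:
    """Collapse frame features into a single, length-normalized
    voiceprint vector; a trained model would do this in practice."""
    v = features.mean(axis=0)
    return v / np.linalg.norm(v)
```

Matching such vectors against enrolled target voiceprints is sketched further below, after the threshold-and-maximum selection steps.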
In some implementations, before the target voice feature information is input into the preset model, the processor 501 may further perform the following step:

training background data with a preset algorithm to generate a preset model containing the common voice feature information of the target users, the background data comprising the voice information of each target user.
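The passage does not name the preset algorithm. One common way to build a background model that captures voice characteristics shared across speakers is a Gaussian-mixture universal background model (UBM); the sketch below uses scikit-learn's `GaussianMixture` as a stand-in and should be read as one plausible instantiation, not as the patented method.

```python
# Hedged sketch: a diagonal-covariance GMM trained on pooled background
# data as one possible "preset model" of common voice feature information.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_background_model(background_feats: np.ndarray,
                           n_components: int = 64) -> GaussianMixture:
    """background_feats pools frame-level features from every target
    user's voice information, shape (n_frames_total, n_dims)."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(background_feats)
    return ubm
```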
In some implementations, when inputting the target voice feature information into the preset model to obtain the target voiceprint parameters, the processor 501 may perform the following steps:

inputting the target voice feature information into the preset model to obtain target difference feature information relative to the common voice feature information;

determining a second voiceprint parameter from the target difference feature information;

performing channel compensation on the second voiceprint parameter by linear discriminant analysis to obtain the corresponding target voiceprint parameter.
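Linear discriminant analysis is the one technique this passage names outright. Below is a sketch of the channel-compensation step, assuming a labeled development set of second voiceprint parameters is available to fit the projection (the i-vector-style extraction that would produce those vectors is an assumption, not stated in the text):

```python
# Sketch of LDA channel compensation on second voiceprint parameters.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_compensation(dev_vectors: np.ndarray,
                     dev_speaker_ids: np.ndarray) -> LinearDiscriminantAnalysis:
    """Fit LDA on development vectors labeled by speaker, keeping the
    directions that separate speakers while suppressing channel and
    session variability."""
    lda = LinearDiscriminantAnalysis()  # default: min(n_classes-1, n_dims) components
    lda.fit(dev_vectors, dev_speaker_ids)
    return lda

def compensate(lda: LinearDiscriminantAnalysis,
               second_vp: np.ndarray) -> np.ndarray:
    """Project one second voiceprint parameter into the compensated
    space, yielding the target voiceprint parameter."""
    return lda.transform(second_vp.reshape(1, -1))[0]
```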
In some implementations, when matching the first voiceprint parameter against the target voiceprint parameters and obtaining the identification information of the matched target voiceprint parameter according to the matching result, the processor 501 may perform the following steps:

matching the first voiceprint parameter against the target voiceprint parameters to generate corresponding matching values;

when a matching value is greater than a preset threshold, obtaining the identification information of the matched target voiceprint parameter.
In some implementations, when obtaining the identification information of the matched target voiceprint parameter, the processor 501 may perform the following steps:

sorting the matching values, taking the largest of those that exceed the preset threshold, and obtaining the matched target voiceprint parameter according to that maximum matching value;

obtaining the corresponding identification information according to that target voiceprint parameter.
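The selection logic of these two steps is mechanical enough to write out directly. The cosine score is an assumed similarity measure (the patent only speaks of a "matching value"), and `best_match` is a hypothetical helper name:

```python
# Sketch: score every target, keep values above the preset threshold,
# sort, and return the identification info of the maximum matching value.
import numpy as np

def best_match(first_vp: np.ndarray,
               targets: dict[str, np.ndarray],
               threshold: float) -> str | None:
    """targets maps each target user's identification information (here,
    a label used as the dict key) to an enrolled target voiceprint."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = sorted((cosine(first_vp, t), label)
                    for label, t in targets.items())
    if scored and scored[-1][0] > threshold:
        return scored[-1][1]
    return None  # no matching value exceeded the threshold
```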
In some implementations, when marking the identification information in the played video, the processor 501 may perform the following steps:

inputting the voice information to be recognized into a speech recognition model to generate corresponding text information;

combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized;

marking the subtitle information in the played video.
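Combining the matched speaker's identification information with the recognized text into subtitle information is then a formatting step. The SRT-style cue below is an assumption — the patent does not specify a subtitle format — and the speaker name and timestamps are made-up example values:

```python
def make_cue(index: int, start: str, end: str,
             speaker: str, text: str) -> str:
    """Render one SRT-style subtitle cue that prefixes the recognized
    text with the matched speaker's identification information."""
    return f"{index}\n{start} --> {end}\n{speaker}: {text}\n"

# Example usage with hypothetical values:
print(make_cue(1, "00:00:01,000", "00:00:03,500", "Zhang San", "Hello."))
```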
Referring also to FIG. 6, in some implementations the electronic device 500 may further include a display 503, a radio frequency circuit 504, an audio circuit 505, and a power supply 506, each electrically connected to the processor 501.

The display 503 may be used to present information entered by or provided to the user, as well as various graphical user interfaces composed of graphics, text, icons, video, and any combination thereof. The display 503 may include a display panel, which in some implementations may be configured as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) panel.

The radio frequency circuit 504 may be used to transmit and receive radio frequency signals, establishing wireless links with network devices or other electronic devices and exchanging signals with them.

The audio circuit 505 may provide an audio interface between the user and the electronic device through a speaker and a microphone.

The power supply 506 may supply power to the components of the electronic device 500. In some embodiments, the power supply 506 may be logically connected to the processor 501 through a power management system, which then handles charging, discharging, power-consumption management, and related functions.

Although not shown in FIG. 6, the electronic device 500 may further include a camera, a Bluetooth module, and the like, which are not described in detail here.
An embodiment of this application further provides a storage medium storing a computer program which, when run on a computer, causes the computer to perform the voice information processing method of any of the above embodiments, for example: collecting the voice information of a target user and extracting its target voice feature information; inputting the target voice feature information into a preset model to obtain target voiceprint parameters; acquiring the voice information to be recognized in a played video and extracting its first voiceprint parameter; and matching the first voiceprint parameter against the target voiceprint parameters, obtaining the identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information in the played video.

In the embodiments of this application, the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The description of each of the above embodiments has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of the other embodiments.

It should be noted that, for the voice information processing method of the embodiments of this application, those of ordinary skill in the art will understand that all or part of the flow of the method may be completed by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium — for example, in the memory of an electronic device — and executed by at least one processor of that device, and its execution may include the flow of the method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.

In the voice information processing apparatus of the embodiments of this application, the functional modules may be integrated in one processing chip, may each exist physically on their own, or two or more of them may be integrated in one module. The integrated module may be implemented in hardware or as a software functional module; if implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

The voice information processing method and apparatus, storage medium, and electronic device provided in the embodiments of this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application; the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. At the same time, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of this application. In summary, the contents of this specification should not be construed as limiting this application.