
CN115440229A - Audio processing method, device, processor and system - Google Patents

Audio processing method, device, processor and system

Info

Publication number
CN115440229A
Authority
CN
China
Prior art keywords
audio
duration
role
threshold
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211119654.7A
Other languages
Chinese (zh)
Inventor
李志杰
李健
陈明
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd
Priority to CN202211119654.7A
Publication of CN115440229A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M3/00: Automatic or semi-automatic exchanges
    • H04M3/42: Systems providing special services or facilities to subscribers
    • H04M3/56: Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568: Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities; audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Telephone Function (AREA)

Abstract

The present application provides an audio processing method, device, processor and system. The method includes: acquiring at least one audio segment and performing voiceprint recognition on it with a voiceprint recognition model to obtain a first recognition result; when the first recognition result indicates that the at least one audio segment is a non-target silent segment and its duration is greater than or equal to a first duration threshold, acquiring the highest recognition score in the first recognition result; when the audio duration of the at least one audio segment is greater than or equal to a second duration threshold and the highest recognition score is less than a score threshold, determining that the role corresponding to the at least one audio segment is an unknown role; and registering the unknown role in the model library of the voiceprint recognition. Through an unknown role separation algorithm, the method achieves the technical effect of speech role separation and solves the technical problem that speakers' voiceprints usually have to be registered in advance before role separation can be performed.

Figure 202211119654

Description

Audio processing method, device, processor and system

Technical field

The present application relates to the field of data processing, and in particular to an audio processing method, device, processor and system.

Background

At present, both voice conferencing systems and interrogation transcript systems rely on a role separation system, which separates the speech of multiple speakers and then, based on the separation result, transcribes the speech or displays the speaker roles.

However, when current role separation technology performs voiceprint-based role separation, the speakers' voiceprints usually have to be registered in advance. In practical application scenarios this makes the technology hard to use and the preparation costly.

The information disclosed in this Background section is intended only to aid understanding of the background of the technology described herein. The Background may therefore contain information that does not constitute prior art already known in this country to a person skilled in the art.

Summary of the invention

The main purpose of the present application is to provide an audio processing method, device, processor and system that solve the prior-art problem that speakers' voiceprints must be registered in advance before voiceprint role separation can be performed.

According to one aspect of the embodiments of the present invention, an audio processing method is provided, including: acquiring at least one audio segment, and performing voiceprint recognition on the at least one audio segment using a voiceprint recognition model to obtain a first recognition result; when the first recognition result indicates that the at least one audio segment is a non-target silent segment and the duration of the at least one audio segment is greater than or equal to a first duration threshold, acquiring the highest recognition score in the first recognition result; when the audio duration of the at least one audio segment is greater than or equal to a second duration threshold and the highest recognition score is less than a score threshold, determining that the role corresponding to the at least one audio segment is an unknown role, the second duration threshold being greater than the first duration threshold; and registering the unknown role in the model library of the voiceprint recognition.

Optionally, determining that the role corresponding to the at least one audio segment is an unknown role, when the audio duration of the at least one audio segment is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold, includes: a first determining step: when the audio duration of the at least one audio segment is less than the second duration threshold and the highest recognition score is less than the score threshold, determining that the role corresponding to the at least one audio segment is a candidate unknown role; a second determining step: acquiring a subsequent audio segment of the at least one audio segment to obtain a first updated audio segment, and performing the voiceprint recognition on the first updated audio segment to obtain a second recognition result; when the second recognition result indicates that the audio duration of the first updated audio segment is greater than or equal to the second duration threshold and the highest recognition score is greater than the score threshold, updating the candidate unknown role to a known role; when the second recognition result indicates that the audio duration of the first updated audio segment is greater than or equal to the second duration threshold and the highest recognition score is less than or equal to the score threshold, updating the candidate unknown role to the unknown role; and when the second recognition result indicates that the audio duration of the first updated audio segment is less than the second duration threshold, repeating the second determining step until the role corresponding to the first updated audio segment is determined to be the known role or the unknown role.
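The candidate-resolution loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: segments are modelled as `(samples, duration)` tuples, and `best_score_fn` is a hypothetical stand-in for whatever voiceprint recognizer returns the highest recognition score; neither name comes from the application.

```python
from typing import Callable, Iterable, List, Tuple

# A segment is modelled as (samples, duration_in_seconds); this shape is an assumption.
Segment = Tuple[List[float], float]


def resolve_candidate(
    first: Segment,
    following: Iterable[Segment],
    best_score_fn: Callable[[List[float]], float],
    second_duration: float,
    score_threshold: float,
) -> str:
    """Resolve a candidate unknown role by appending subsequent segments."""
    samples = list(first[0])
    duration = first[1]
    for seg_samples, seg_duration in following:
        # Second determining step: extend the audio with the next segment.
        samples.extend(seg_samples)
        duration += seg_duration
        if duration < second_duration:
            continue  # still too short: repeat the second determining step
        best = best_score_fn(samples)  # hypothetical recognizer call
        # Strictly greater means known; less than or equal means unknown.
        return "known" if best > score_threshold else "unknown"
    return "candidate_unknown"  # audio ran out before a decision was possible
```

The sketch only scores the audio once it is long enough, which is a simplification; the text runs recognition on every updated segment.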

Optionally, determining that the role corresponding to the at least one audio segment is a candidate unknown role, when the audio duration of the at least one audio segment is less than the second duration threshold and the highest recognition score is less than the score threshold, includes: when the audio duration of the at least one audio segment is less than the first duration threshold and the highest recognition score is less than a first score threshold, determining that the role corresponding to the at least one audio segment is the candidate unknown role; when the audio duration of the at least one audio segment is greater than or equal to the first duration threshold and less than a third duration threshold, and the highest recognition score is greater than or equal to the first score threshold and less than a second score threshold, determining that the role corresponding to the at least one audio segment is the candidate unknown role, the first duration threshold being less than the third duration threshold and the first score threshold being less than the second score threshold; and when the audio duration of the at least one audio segment is greater than or equal to the third duration threshold and less than the second duration threshold, and the highest recognition score is greater than or equal to the second score threshold and less than a third score threshold, determining that the role corresponding to the at least one audio segment is the candidate unknown role, the third duration threshold being less than the second duration threshold and the second score threshold being less than the third score threshold.
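A literal reading of the three duration and score bands above could look like the following sketch. The function name, parameter names and any numeric values used with it are hypothetical; the application gives no concrete thresholds, and it does not say what happens to combinations that fall outside the three bands, so the sketch simply returns `False` for them.

```python
def is_candidate_unknown(
    duration: float,
    best_score: float,
    t1: float,  # first duration threshold
    t3: float,  # third duration threshold
    t2: float,  # second duration threshold
    s1: float,  # first score threshold
    s2: float,  # second score threshold
    s3: float,  # third score threshold
) -> bool:
    """Return True when (duration, best_score) falls in one of the three bands."""
    if not (t1 < t3 < t2 and s1 < s2 < s3):
        raise ValueError("thresholds must satisfy t1 < t3 < t2 and s1 < s2 < s3")
    if duration < t1:
        return best_score < s1          # shortest band
    if duration < t3:
        return s1 <= best_score < s2    # middle band
    if duration < t2:
        return s2 <= best_score < s3    # longest band still below t2
    return False  # at or above t2 the known/unknown decision applies instead
```

Each band pairs a longer duration with a higher score range, so the bar for staying a candidate rises as more audio accumulates.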

Optionally, determining that the role corresponding to the at least one audio segment is an unknown role, when the audio duration of the at least one audio segment is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold, includes: when the audio duration of the at least one audio segment is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold, determining that the role corresponding to the at least one audio segment is the unknown role; and when the audio duration of the at least one audio segment is greater than or equal to the second duration threshold and the highest recognition score is greater than or equal to the score threshold, determining that the role corresponding to the at least one audio segment is a known role.

Optionally, acquiring at least one audio segment and performing voiceprint recognition on the at least one audio segment using a voiceprint recognition model to obtain the first recognition result includes: a third determining step: when the first recognition result indicates that the at least one audio segment is the target silent segment, acquiring the duration of the at least one audio segment, and when that duration is greater than a fourth duration threshold, determining that the at least one audio segment is the target silent segment and that the corresponding role is empty; a fourth determining step: when the duration of the at least one audio segment is less than or equal to the fourth duration threshold, acquiring a subsequent audio segment of the at least one audio segment to obtain a second updated audio segment, and performing the voiceprint recognition on the second updated audio segment to obtain a third recognition result; when the third recognition result indicates that the duration of the second updated audio segment is greater than the fourth duration threshold, determining that the at least one audio segment is the target silent segment; and when the third recognition result indicates that the audio duration of the second updated audio segment is less than or equal to the fourth duration threshold, repeating the fourth determining step until the second updated audio segment is determined to be the target silent segment or the non-target silent segment.
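The silence check above can be sketched as a loop that keeps extending the audio until the silence is long enough or speech appears. The `is_silent` callable is a hypothetical placeholder for the part of the recognition result that flags silence, and the `(samples, duration)` segment shape is likewise an assumption made for illustration.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Segment = Tuple[List[float], float]  # (samples, duration in seconds), assumed shape


def detect_target_silence(
    segments: Iterable[Segment],
    is_silent: Callable[[List[float]], bool],
    fourth_duration: float,
) -> Optional[bool]:
    """True: target silent segment; False: non-target; None: audio ran out."""
    samples: List[float] = []
    duration = 0.0
    for seg_samples, seg_duration in segments:
        # Extend the buffered audio with the next (subsequent) segment.
        samples.extend(seg_samples)
        duration += seg_duration
        if not is_silent(samples):
            return False  # speech detected: non-target silent segment
        if duration > fourth_duration:
            return True   # silence long enough: the role is empty
        # Otherwise repeat the fourth determining step with more audio.
    return None
```

Returning `None` when the stream ends early is a design choice of this sketch; the text does not cover that case.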

Optionally, determining that the at least one audio segment is the target silent segment, when the duration of the at least one audio segment is greater than the fourth duration threshold, includes: when the duration of the at least one audio segment is greater than the second duration threshold, determining that the at least one audio segment is the target silent segment; and when the duration of the at least one audio segment is less than or equal to the second duration threshold and greater than the fourth duration threshold, determining that the at least one audio segment is the target silent segment, the second duration threshold being greater than the fourth duration threshold.

Optionally, the method further includes: a fifth determining step: when the historical role is not empty, determining whether the historical role is the same as the current role, where the current role is the role corresponding to the current at least one audio segment and the historical role is the role corresponding to the audio segment preceding the at least one audio segment; a sixth determining step: when the historical role is the same as the current role, determining that no role switch has occurred; and a seventh determining step: when the historical role differs from the current role, determining whether the duration of the at least one audio segment is greater than or equal to the second duration threshold; when the duration is greater than or equal to the second duration threshold, determining that the role switch has occurred; when the duration is less than the second duration threshold, acquiring a subsequent audio segment of the at least one audio segment to obtain a third updated audio segment, and repeating the fifth to seventh determining steps in sequence at least once until it is determined that the role switch has or has not occurred. During the repetition, the current role is the role corresponding to the third updated audio segment.
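As a rough illustration of the fifth to seventh determining steps, the sketch below walks over successive observations of the growing segment. Each observation is assumed to be a `(current_role, cumulative_duration)` pair produced elsewhere by the recognizer; that input shape and the function name are assumptions, not part of the application.

```python
from typing import Iterable, Optional, Tuple


def detect_role_switch(
    history_role: Optional[str],
    observations: Iterable[Tuple[str, float]],
    second_duration: float,
) -> Optional[bool]:
    """True: switch occurred; False: no switch; None: undecided."""
    if history_role is None:
        return None  # the steps apply only when the historical role is not empty
    for current_role, duration in observations:
        # Fifth and sixth steps: same role means no switch.
        if current_role == history_role:
            return False
        # Seventh step: a different role counts only once the audio is long enough.
        if duration >= second_duration:
            return True
        # Too short: take the next observation (the updated segment) and repeat.
    return None
```

Requiring the second duration threshold before confirming a switch keeps a brief misrecognition from splitting one speaker's turn.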

According to another aspect of the embodiments of the present invention, an audio processing device is provided. The processing device includes: a first acquisition unit, configured to acquire at least one audio segment and perform voiceprint recognition on the at least one audio segment using a voiceprint recognition model to obtain a first recognition result; a second acquisition unit, configured to acquire the highest recognition score in the first recognition result when the first recognition result indicates that the at least one audio segment is a non-target silent segment and the duration of the at least one audio segment is greater than or equal to a first duration threshold; a determination unit, configured to determine that the role corresponding to the at least one audio segment is an unknown role when the audio duration of the at least one audio segment is greater than or equal to a second duration threshold and the highest recognition score is less than a score threshold, the second duration threshold being greater than the first duration threshold; and a registration unit, configured to register the unknown role in the model library of the voiceprint recognition.

According to a further aspect of the present application, a processor is provided. The processor is configured to run a program, and the program, when run, executes any one of the processing methods described.

According to a further aspect of the present application, an audio processing system is provided. The processing system includes a voiceprint recognition system, one or more processors, a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any one of the processing methods described.

In the embodiments of the present invention, voiceprint recognition combined with an unknown role separation algorithm automatically identifies multiple unknown roles present in the audio. This achieves the technical effect of speech role separation and solves the technical problem that speakers' voiceprints usually have to be registered in advance before role separation can be performed.

Description of drawings

The accompanying drawings, which form a part of the present application, provide a further understanding of the present application. The illustrative embodiments of the present application and their descriptions explain the present application and do not unduly limit it. In the drawings:

Fig. 1 is a schematic flowchart of one embodiment of an audio processing method according to the present application;

Fig. 2 is a schematic flowchart of unknown role detection in a further embodiment of an audio processing method according to the present application;

Fig. 3 is a schematic flowchart of silence detection in a further embodiment of an audio processing method according to the present application;

Fig. 4 is a schematic flowchart of role switching in yet another embodiment of an audio processing method according to the present application;

Fig. 5 is a schematic diagram of one embodiment of an audio processing device according to the present application.

Detailed description

It should be noted that, where no conflict arises, the embodiments in the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings of those embodiments. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the present application without creative effort fall within the scope of protection of the present application.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present application distinguish similar objects and do not necessarily describe a specific order or sequence. Data so used may be interchanged where appropriate for the purposes of the embodiments of the present application described herein. Furthermore, the terms "comprising" and "having", and any variations of them, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

For ease of description, some of the nouns and terms used in the embodiments of the present application are explained below:

Voiceprint recognition: a technology that extracts a speaker's voice characteristics and speech content information and automatically verifies the speaker's identity;

Automatic speech recognition: a technology that converts human speech into text.

As noted in the Background, voiceprint role separation in the prior art usually requires the speakers' voiceprints to be registered in advance. To solve this problem, a typical embodiment of the present application provides an audio processing method, device, processor and system.

Fig. 1 is a flowchart of an audio processing method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:

Step S101: acquire at least one audio segment, and perform voiceprint recognition on the at least one audio segment using a voiceprint recognition model to obtain a first recognition result.

Audio segments may be acquired by any one or more of the following: ordinary microphone capture, array microphone capture, daisy-chained ("hand-in-hand") microphone capture, computer speakers, mobile phone recording, and network audio capture. This variety of acquisition mechanisms allows the method to be applied in different scenarios.

The at least one audio segment may be a single audio segment or multiple audio segments; the number of audio segments may differ between application scenarios.

The voiceprint recognition model of the present application may be any feasible voiceprint recognition model in the prior art. Specifically, it may be a template model or a stochastic model. A template model is a non-parametric model: the training feature parameters are compared with the test feature parameters, and the distortion between the two serves as the similarity measure. Examples are the VQ (vector quantization) model and the DTW (dynamic time warping) model. The VQ model generates a codebook by clustering and quantization; during recognition the test data is quantized and encoded, and the amount of distortion serves as the decision criterion. The DTW model compares the input feature vector sequence to be recognized with the feature vectors extracted during training and performs recognition by optimal path matching. A stochastic model is a parametric model that uses a probability density function to model the speaker: the training process estimates the parameters of the probability density function, and matching is done by computing the similarity of the test utterance under the corresponding model. Examples are the GMM (Gaussian mixture model), which is among the best performing and most commonly used models for text-independent speaker recognition, and the HMM (hidden Markov model), a statistical model that describes a Markov process with hidden, unknown parameters. More specifically, the voiceprint recognition model has a training phase and a testing phase. The training phase comprises four parts: training speech, feature extraction, model training, and the model library. The testing phase comprises test speech, feature extraction, and scoring and decision.
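To make the template-model idea concrete, here is a minimal sketch of the classic DTW distance between two feature-vector sequences, where a lower accumulated distortion means a closer match. It illustrates the general technique only and is not the applicant's implementation; the Euclidean frame distance and the function name are assumptions.

```python
import math
from typing import List, Sequence


def dtw_distance(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Accumulated distortion along the optimal warping path between a and b."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance (assumed)
            cost[i][j] = d + min(
                cost[i - 1][j],      # repeat frame of b
                cost[i][j - 1],      # repeat frame of a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]
```

In a template-style recognizer, a test sequence would be compared against each enrolled template and the smallest distance taken as the best match.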

Step S102: when the first recognition result indicates that the at least one audio segment is a non-target silent segment and the duration of the at least one audio segment is greater than or equal to the first duration threshold, acquire the highest recognition score in the first recognition result.

The first recognition result includes at least: information indicating whether the at least one audio segment is a non-target silent segment, the duration of the at least one audio segment, and the recognition scores of the at least one audio segment. In this step, an audio segment that is a non-target silent segment is not a silent segment, so the highest recognition score is acquired only when the at least one audio segment is not silent. A silent segment corresponds to no role, so identifying an unknown role does not arise for it. In addition, if the at least one audio segment is too short, that is, shorter than the first duration threshold, the role determined from it may be inaccurate. The other precondition for acquiring the highest recognition score is therefore that the duration of the at least one audio segment is greater than or equal to the first duration threshold.

The highest recognition score is the highest score obtained when the at least one audio segment is matched against the roles in the model library of the voiceprint recognition model.

步骤S103,在至少一个音频片段的音频时长大于或者等于第二时长阈值且最高识别分数小于分数阈值的情况下,确定至少一个音频片段对应的角色为未知角色,第二时长阈值大于第一时长阈值;Step S103, when the audio duration of at least one audio clip is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold, determine that the character corresponding to the at least one audio clip is an unknown character, and the second duration threshold is greater than the first duration threshold ;

上述步骤中,若音频片段太短,则不能准确地确定对应的角色是否为未知角色,因此,要同时满足至少一个音频片段的音频时长大于或者等于第二时长阈值且最高识别分数小于分数阈值的情况下,才认为模型库内没有与该音频匹配对应的角色,即,认为该音频对应的角色为未知角色。In the above steps, if the audio clip is too short, it cannot be accurately determined whether the corresponding character is an unknown character. Therefore, the audio duration of at least one audio clip is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold. In this case, it is considered that there is no character matching the audio in the model library, that is, the character corresponding to the audio is considered to be an unknown character.

步骤S104,将未知角色注册至声纹识别模型的库中。Step S104, registering the unknown character into the library of the voiceprint recognition model.

In the above audio processing method, comparing the duration of the audio segment and the highest recognition score against their corresponding thresholds makes it possible to determine whether the role corresponding to the audio is an unknown role. When it is, the unknown role is registered in the model library used for voiceprint recognition, so that voiceprint role separation can subsequently be performed without advance registration. This solves the problem in the prior art that roles must be registered in advance before voiceprint role separation can be performed. Compared with prior-art solutions that require advance registration, this solution is easier to use and applicable to a wider range of scenarios.
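Steps S102 to S104 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the application: the names `RecognitionResult`, `classify_segment`, and `register_if_unknown`, the use of a plain dictionary as the model library, the `role_N` naming scheme, and the threshold values in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RecognitionResult:
    """Hypothetical container for the first recognition result."""
    is_silence: bool          # True if the segment is a target silent segment
    duration: float           # segment duration in seconds
    scores: Dict[str, float]  # role name -> match score against the library


def classify_segment(result: RecognitionResult, t1: float, t2: float,
                     score_threshold: float) -> str:
    """Apply the S102/S103 conditions; t1 and t2 are the first and second duration thresholds."""
    if not t2 > t1:
        raise ValueError("the second duration threshold must exceed the first")
    if result.is_silence:
        return "silence"      # a silent segment has no role
    if result.duration < t1:
        return "too_short"    # no highest score is obtained (S102 precondition fails)
    # Highest recognition score; an empty library yields -inf, so nothing matches.
    best = max(result.scores.values(), default=float("-inf"))
    if result.duration >= t2:
        return "unknown" if best < score_threshold else "known"
    return "undecided"        # between t1 and t2: left to the refinements of S103


def register_if_unknown(result: RecognitionResult, embedding: List[float],
                        library: Dict[str, List[float]], t1: float, t2: float,
                        score_threshold: float) -> Optional[str]:
    """S104: register an unknown role in the library and return its assigned name."""
    if classify_segment(result, t1, t2, score_threshold) != "unknown":
        return None
    name = f"role_{len(library) + 1}"  # hypothetical naming scheme
    library[name] = embedding
    return name
```

For example, with assumed thresholds `t1=1.0`, `t2=3.0`, and `score_threshold=0.6`, a 4-second non-silent segment scored against an empty library is classified as unknown and registered, which mirrors the point that no advance registration is needed.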

In a specific embodiment of the present application, in addition to steps S101 to S104, step S103 is further refined and specifically includes the following steps.

Step S1031, a first determining step: when the audio duration of the at least one audio segment is less than the duration threshold and the highest recognition score is less than the score threshold, determining that the role corresponding to the at least one audio segment is a candidate unknown role. That is, an audio segment that may belong to an unknown role is identified first.

Step S1032, a second determining step: acquiring the audio segment that follows the at least one audio segment to obtain a first updated audio segment, and performing the voiceprint recognition on the first updated audio segment to obtain a second recognition result. When the second recognition result indicates that the audio duration of the first updated audio segment is greater than or equal to the duration threshold and the highest recognition score is greater than the score threshold, the candidate unknown role is updated to a known role. When the second recognition result indicates that the audio duration of the first updated audio segment is greater than or equal to the duration threshold and the highest recognition score is less than or equal to the score threshold, the candidate unknown role is updated to the unknown role. That is, once a possible unknown role has been identified, the subsequent audio is acquired and recognized, and the recognition result for that audio determines whether the candidate unknown role is in fact an unknown role.

Step S1033: when the second recognition result indicates that the audio duration of the first updated audio segment is less than the duration threshold, repeating the second determining step until the role corresponding to the first updated audio segment is determined to be the known role or the unknown role.

In this method, a candidate unknown role is determined first and then confirmed using subsequent audio segments, which makes the determination of whether the role is unknown more accurate.
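The loop formed by steps S1032 and S1033 can be sketched as follows. This is a minimal sketch under assumptions: audio is represented as a list of samples, the updated audio segment is formed by appending the follow-up audio to what has been accumulated, `best_score` is a hypothetical callback standing in for the voiceprint recognition model, and `resolve_candidate` and `sample_rate` are illustrative names. For brevity the sketch calls the recognizer only once the updated segment is long enough, whereas the text runs recognition on every updated segment.

```python
from typing import Callable, Iterable, List


def resolve_candidate(initial: List[float],
                      followups: Iterable[List[float]],
                      best_score: Callable[[List[float]], float],
                      sample_rate: int,
                      duration_threshold: float,
                      score_threshold: float) -> str:
    """Resolve a candidate unknown role to 'known' or 'unknown'.

    Returns 'candidate' if the audio ends before the updated segment
    reaches the duration threshold.
    """
    audio = list(initial)  # segment that produced the candidate unknown role (S1031)
    for chunk in followups:
        audio.extend(chunk)                # S1032: form the updated audio segment
        duration = len(audio) / sample_rate
        if duration < duration_threshold:
            continue                       # S1033: still too short, repeat S1032
        score = best_score(audio)          # highest score against the model library
        # S1032: strictly greater -> known role; less than or equal -> unknown role.
        return "known" if score > score_threshold else "unknown"
    return "candidate"
```

Returning `"candidate"` when the stream ends early is a choice made for this sketch; the text does not specify what happens if no further audio arrives.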

In an embodiment of the present application, in addition to steps S1031 to S1033, step S1031 is further refined. FIG. 2 is a flowchart of unknown role detection according to an embodiment of the present application. As shown in FIG. 2, determining that the role corresponding to the at least one audio segment is a candidate unknown role when the audio duration of the at least one audio segment is less than the duration threshold and the highest recognition score is less than the score threshold includes:

when the audio duration of the at least one audio segment is less than the first duration threshold and the highest recognition score is less than a first score threshold, determining that the role corresponding to the at least one audio segment is the candidate unknown role;

when the audio duration of the at least one audio segment is greater than or equal to the first duration threshold and less than a third duration threshold, and the highest recognition score is greater than or equal to the first score threshold and less than a second score threshold, determining that the role corresponding to the at least one audio segment is the candidate unknown role, where the first duration threshold is less than the third duration threshold and the first score threshold is less than the second score threshold;

when the audio duration of the at least one audio segment is greater than or equal to the third duration threshold and less than the second duration threshold, and the highest recognition score is greater than or equal to the second score threshold and less than a third score threshold, determining that the role corresponding to the at least one audio segment is the candidate unknown role, where the third duration threshold is less than the second duration threshold and the second score threshold is less than the third score threshold.
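The three candidate conditions pair each duration band with one score band, and can be sketched as a single predicate. The function name `is_candidate_unknown` and the numeric values in the usage note are assumptions made for illustration; only the ordering of the thresholds (first < third < second for durations, first < second < third for scores) comes from the text.

```python
def is_candidate_unknown(duration: float, best: float,
                         t1: float, t3: float, t2: float,
                         s1: float, s2: float, s3: float) -> bool:
    """Return True if the segment's role is a candidate unknown role.

    t1 < t3 < t2 are the first, third and second duration thresholds;
    s1 < s2 < s3 are the first, second and third score thresholds.
    """
    if not (t1 < t3 < t2 and s1 < s2 < s3):
        raise ValueError("thresholds must satisfy t1 < t3 < t2 and s1 < s2 < s3")
    if duration < t1:                 # first case
        return best < s1
    if duration < t3:                 # second case: t1 <= duration < t3
        return s1 <= best < s2
    if duration < t2:                 # third case: t3 <= duration < t2
        return s2 <= best < s3
    return False                      # duration >= t2: decided directly in step S103
```

With assumed thresholds of 1, 2 and 3 seconds and scores of 0.3, 0.5 and 0.7, a 1.5-second segment with a highest score of 0.4 is a candidate unknown role, while the same segment with a highest score of 0.1 matches none of the three cases as written.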

In the above steps, several different duration thresholds and score thresholds are set in order to improve the sensitivity of model recognition; the present application sets three duration thresholds and three score thresholds. By comparing the duration of the audio segment and the highest recognition score against the different corresponding thresholds, a candidate unknown role, that is, a possible unknown role, can be identified first and confirmed later, which improves the accuracy of the recognition model. There are therefore three cases in which the role corresponding to the audio is considered a candidate unknown role. First case: the audio duration of the at least one audio segment is less than the first duration threshold and the highest recognition score is less than the first score threshold. Second case: the audio duration of the at least one audio segment is greater than or equal to the first duration threshold and less than the third duration threshold, and the highest recognition score is greater than or equal to the first score threshold and less than the second score threshold. Third case: the audio duration of the at least one audio segment is greater than or equal to the third duration threshold and less than the second duration threshold, and the highest recognition score is greater than or equal to the second score threshold and less than the third score threshold.

In an embodiment of the present application, in addition to steps S101 to S104, step S103 is further refined and specifically includes: when the audio duration of the at least one audio segment is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold, determining that the role corresponding to the at least one audio segment is the unknown role, that is, the role can be determined to be unknown when both conditions are satisfied at the same time; and when the audio duration of the at least one audio segment is greater than or equal to the second duration threshold and the highest recognition score is greater than or equal to the score threshold, determining that the role corresponding to the at least one audio segment is a known role, that is, the role can be determined to be known when both of these conditions are satisfied at the same time.

Through the above steps, the role corresponding to the at least one audio segment can be determined to be known or unknown, which achieves recognition of unknown roles. After the role corresponding to the audio has been determined to be a known role or an unknown role, automatic speech recognition can further be invoked to convert the speech into text, or machine translation or similar technologies can be invoked. In addition, the audio role names and the speech-to-text results can be displayed, enabling functions such as playing back the audio alongside the processing results for comparison, selecting audio segments, and modifying audio role names.

In actual voiceprint recognition, silence detection can be performed on the audio in order to identify the silent segments in it. Subsequent role separation and role switching recognition continue only when the audio is not silent, which improves the efficiency of audio recognition. In an embodiment of the present application, in addition to steps S101 to S104, step S101 is further refined. FIG. 3 is a flowchart of silence detection according to an embodiment of the present application. As shown in FIG. 3, the method includes the following steps.

Step S1011, a third determining step: when the first recognition result indicates that the at least one audio segment is the target silent segment, acquiring the duration of the at least one audio segment, and when that duration is greater than a fourth duration threshold, determining that the at least one audio segment is the target silent segment and that the corresponding role is empty. That is, the at least one audio segment is considered silent and has no corresponding role.

Step S1012, a fourth determining step: when the duration of the at least one audio segment is less than or equal to the fourth duration threshold, acquiring the audio segment that follows the at least one audio segment to obtain a second updated audio segment, and performing the voiceprint recognition on the second updated audio segment to obtain a third recognition result; when the third recognition result indicates that the duration of the second updated audio segment is greater than the fourth duration threshold, determining that the at least one audio segment is the target silent segment. That is, after a possible silent segment has been identified, the subsequent audio is acquired and recognized, and the recognition result for that audio determines whether the segment is silent or non-silent.

Step S1013: when the third recognition result indicates that the audio duration of the second updated audio segment is less than or equal to the fourth duration threshold, repeating the fourth determining step until the second updated audio segment is determined to be the target silent segment or the non-target silent segment.

In this method, a possible target silent segment is identified first and then confirmed using subsequent audio segments, which makes the determination of whether the segment is a target silent segment more accurate.
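The accumulate-and-recheck behaviour of steps S1011 to S1013 can be sketched as follows. The sketch rests on several assumptions: each incoming chunk is summarized as an `(is_silent, duration)` pair, the updated segment's duration is the running sum of the chunk durations, and the arrival of a non-silent chunk before the fourth duration threshold is exceeded is taken to mean a non-target silent segment (the text states that the loop ends with one of the two outcomes but not how the non-target outcome is detected). The function name `detect_target_silence` is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def detect_target_silence(chunks: Iterable[Tuple[bool, float]],
                          t4: float) -> Optional[bool]:
    """Decide whether a run of audio is a target silent segment.

    chunks: (is_silent, duration_in_seconds) for the current segment and
            each follow-up segment, in order.
    t4:     the fourth duration threshold.
    Returns True (target silence), False (non-target silence), or None if
    the audio ends before a decision is reached.
    """
    total = 0.0
    for is_silent, duration in chunks:
        if not is_silent:
            return False   # assumption: speech resumed, so this was only a pause
        total += duration  # duration of the updated audio segment
        if total > t4:
            return True    # S1011/S1012: long enough to be target silence
        # S1013: still <= t4, fetch the next chunk and re-check
    return None
```

With an assumed `t4` of 0.5 seconds, two consecutive silent chunks of 0.3 and 0.4 seconds are classified as target silence, whereas a 0.3-second silent chunk followed by speech is treated as a speaker pause.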

In an embodiment of the present application, in addition to steps S1011 to S1013, step S1011 is further refined. Determining that the at least one audio segment is the target silent segment when its duration is greater than the fourth duration threshold includes: when the duration of the at least one audio segment is greater than the third duration threshold, determining that the at least one audio segment is the target silent segment; and when the duration of the at least one audio segment is less than or equal to the third duration threshold and greater than the fourth duration threshold, determining that the at least one audio segment is the target silent segment, where the third duration threshold is greater than the fourth duration threshold. In other words, a segment identified as a possible target silent segment is confirmed as a target silent segment both when its duration exceeds the third duration threshold and when its duration lies between the fourth and third duration thresholds. Through the above steps, setting the fourth duration threshold makes it possible to exclude the brief silences caused by a speaker pausing, which makes the voiceprint recognition result more accurate.

In actual use, the speaker may also change, so role switching detection needs to be performed on the audio as well. In another embodiment of the present application, in addition to steps S101 to S104, the method further includes the steps shown in FIG. 4, which is a flowchart of role switching according to an embodiment of the present application.

A fifth determining step: when the historical role is not empty, determining whether the historical role is the same as the current role, where the current role is the role corresponding to the current at least one audio segment and the historical role is the role corresponding to the audio segment preceding the at least one audio segment. That is, it is first determined whether the at least one audio segment and the preceding audio segment correspond to the same role.

A sixth determining step: when the historical role is the same as the current role, determining that no role switching has occurred.

A seventh determining step: when the historical role differs from the current role, determining whether the duration of the at least one audio segment is greater than or equal to the third duration threshold. When the duration of the at least one audio segment is greater than or equal to the third duration threshold, it is determined that role switching has occurred. When the duration of the at least one audio segment is less than the third duration threshold, the audio segment that follows the at least one audio segment is acquired to obtain a third updated audio segment, and the fifth to seventh determining steps are repeated in sequence at least once until it is determined that role switching has or has not occurred. During the repetition, the current role is the role corresponding to the third updated audio segment. That is, once a possible role switch has been detected, the subsequent audio is acquired and recognized, and the recognition result for that audio determines whether role switching has occurred.

In this method, a possible role switch is detected first and then confirmed using subsequent audio segments, which makes the determination of whether role switching has occurred more accurate.
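The fifth to seventh determining steps can be sketched as a short loop. This sketch assumes that each observation is a `(role, duration)` pair giving the recognized role and the duration of the current segment and then of each updated segment; the function name `detect_role_switch` and the use of `None` both for an empty historical role and for an undecided result are choices made for illustration only.

```python
from typing import Iterable, Optional, Tuple


def detect_role_switch(history_role: Optional[str],
                       observations: Iterable[Tuple[str, float]],
                       t3: float) -> Optional[bool]:
    """Decide whether a role switch has occurred.

    history_role: role of the preceding audio segment, or None if empty.
    observations: (current_role, duration_in_seconds) for the current
                  segment and then for each updated segment.
    t3:           the third duration threshold.
    Returns True (switch), False (no switch), or None if undecided.
    """
    if history_role is None:
        return None              # the check applies only when the historical role is not empty
    for current_role, duration in observations:
        if current_role == history_role:
            return False         # sixth step: same role, no switch
        if duration >= t3:
            return True          # seventh step: different role and long enough
        # different role but too short: acquire more audio and repeat
    return None
```

The duration gate means that a brief segment attributed to a different role is not immediately reported as a switch; it is reported only if the role still differs once the updated segment reaches the third duration threshold.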

It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.

An embodiment of the present application further provides an audio processing device. It should be noted that the audio processing device of the embodiment of the present application can be used to execute the audio processing method provided by the embodiments of the present application. The audio processing device provided by the embodiment of the present application is described below.

FIG. 5 is a schematic diagram of an audio processing device according to an embodiment of the present application. As shown in FIG. 5, the device includes:

a first acquisition unit 10, configured to acquire at least one audio segment and to perform voiceprint recognition on the at least one audio segment using a voiceprint recognition model to obtain a first recognition result.

The audio segments may be acquired through any one or more of the following: audio capture with an ordinary microphone, audio capture with an array microphone, audio capture with hand-in-hand (daisy-chained) microphones, computer speakers, mobile phone recording, and network audio capture. This gives the speech acquisition mechanism the diversity needed for use in different scenarios.

The at least one audio segment may be a single audio segment or multiple audio segments; the number of audio segments may differ between application scenarios.

The voiceprint recognition model of the present application may be any feasible voiceprint recognition model in the prior art, and may be either a template model or a stochastic model. A template model is a non-parametric model that compares the training feature parameters with the test feature parameters and uses the distortion between them as the similarity measure; examples are the VQ (vector quantization) model and the DTW (dynamic time warping) model. The VQ model generates a codebook by clustering and quantization, quantizes and encodes the test data during recognition, and uses the amount of distortion as the decision criterion. The DTW model compares the input feature vector sequence to be recognized with the feature vectors extracted during training and performs recognition by optimal path matching. A stochastic model is a parametric model that models the speaker with a probability density function; the training process estimates the parameters of the probability density function, and matching is performed by computing the similarity of the test utterance to the corresponding model. Examples are the GMM (Gaussian mixture model), which is one of the best-performing and most commonly used models in text-independent speaker recognition, and the HMM (hidden Markov model), which is a statistical model describing a Markov process with hidden unknown parameters.

More specifically, the voiceprint recognition model includes a training stage and a testing stage. The training stage comprises four parts: training speech, feature extraction, model training, and the model library. The testing stage comprises test speech, feature extraction, and scoring and decision.
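To illustrate the template-matching idea behind DTW mentioned above, the following is a minimal sketch of the classic dynamic time warping distance. It is a simplification made for illustration: each frame is a single number compared by absolute difference, whereas a real voiceprint system would compare feature vectors, and the function name `dtw_distance` is hypothetical. A smaller distance corresponds to lower distortion, that is, higher similarity.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated distortion aligning a[:i] with b[:j]
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distortion
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b[j-1] is reused
                                 cost[i][j - 1],      # b advances, a[i-1] is reused
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is `0.0`, because the optimal warping path absorbs the repeated frame; this tolerance to differences in speaking rate is the reason for aligning sequences along an optimal path rather than frame by frame.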

a second acquisition unit 20, configured to obtain the highest recognition score in the first recognition result when the first recognition result indicates that the at least one audio segment is a non-target silent segment and the duration of the at least one audio segment is greater than or equal to a first duration threshold;

The first recognition result includes at least information indicating whether the at least one audio segment is a non-target silent segment, the duration of the at least one audio segment, and the corresponding recognition scores of the at least one audio segment. In the above unit, the highest recognition score is obtained only when the at least one audio segment is a non-target silent segment, that is, when it is not a silent segment, because a silent segment does not correspond to any role and therefore does not involve identifying an unknown role. In addition, if the at least one audio segment is too short, that is, shorter than the first duration threshold, the determined role may be inaccurate. A second precondition for obtaining the highest recognition score is therefore that the duration of the at least one audio segment is greater than or equal to the first duration threshold.

In addition, the highest recognition score is the highest score obtained by matching the at least one audio segment against the roles in the model library of the voiceprint recognition model.

a determining unit 30, configured to determine that the role corresponding to the at least one audio segment is an unknown role when the audio duration of the at least one audio segment is greater than or equal to a second duration threshold and the highest recognition score is less than a score threshold, where the second duration threshold is greater than the first duration threshold;

In the above unit, if the audio segment is too short, it cannot be accurately determined whether the corresponding role is an unknown role. Therefore, only when the audio duration of the at least one audio segment is greater than or equal to the second duration threshold and, at the same time, the highest recognition score is less than the score threshold is it concluded that the model library contains no role matching the audio, that is, that the role corresponding to the audio is an unknown role.

a registration unit 40, configured to register the unknown role in the library of the voiceprint recognition model.

In the above audio processing device, comparing the duration of the audio segment and the highest recognition score against their corresponding thresholds makes it possible to determine whether the role corresponding to the audio is an unknown role. When it is, the unknown role is registered in the model library used for voiceprint recognition, so that voiceprint role separation can subsequently be performed without advance registration. This solves the problem in the prior art that roles must be registered in advance before voiceprint role separation can be performed. Compared with prior-art solutions that require advance registration, this solution is easier to use and applicable to a wider range of scenarios.

In a specific embodiment of the present application, in addition to the first acquisition unit, the second acquisition unit, the determining unit, and the registration unit, the determining unit is further refined and includes a first determining module, a second determining module, and a third determining module.

The first determining module is configured to determine that the role corresponding to the at least one audio segment is a candidate unknown role when the audio duration of the at least one audio segment is less than the duration threshold and the highest recognition score is less than the score threshold. That is, an audio segment that may belong to an unknown role is identified first.

The second determining module is configured to acquire the audio segment that follows the at least one audio segment to obtain a first updated audio segment, and to perform the voiceprint recognition on the first updated audio segment to obtain a second recognition result. When the second recognition result indicates that the audio duration of the first updated audio segment is greater than or equal to the duration threshold and the highest recognition score is greater than the score threshold, the candidate unknown role is updated to a known role. When the second recognition result indicates that the audio duration of the first updated audio segment is greater than or equal to the duration threshold and the highest recognition score is less than or equal to the score threshold, the candidate unknown role is updated to the unknown role. That is, once a possible unknown role has been identified, the subsequent audio is acquired and recognized, and the recognition result for that audio determines whether the candidate unknown role is in fact an unknown role.

The third determining module is configured to invoke the second determining module repeatedly, when the second recognition result indicates that the audio duration of the first updated audio segment is less than the duration threshold, until the role corresponding to the first updated audio segment is determined to be the known role or the unknown role.

In this device, a candidate unknown role is determined first and then confirmed using subsequent audio segments, which makes the determination of whether the role is unknown more accurate.

In an embodiment of the present application, in addition to the first determining module, the second determining module and the third determining module described above, the first determining module is further refined to include a first determining submodule, a second determining submodule and a third determining submodule. The first determining submodule is configured to determine that the role corresponding to the at least one audio clip is the candidate unknown role when the audio duration of the at least one audio clip is less than the first duration threshold and the highest recognition score is less than a first score threshold. The second determining submodule is configured to determine that the role corresponding to the at least one audio clip is the candidate unknown role when the audio duration of the at least one audio clip is greater than or equal to the first duration threshold and less than a third duration threshold, and the highest recognition score is greater than or equal to the first score threshold and less than a second score threshold, where the first duration threshold is less than the third duration threshold and the first score threshold is less than the second score threshold. The third determining submodule is configured to determine that the role corresponding to the at least one audio clip is the candidate unknown role when the audio duration of the at least one audio clip is greater than or equal to the third duration threshold and less than the second duration threshold, and the highest recognition score is greater than or equal to the second score threshold and less than a third score threshold, where the third duration threshold is less than the second duration threshold and the second score threshold is less than the third score threshold.

To improve the sensitivity of model recognition, several different duration thresholds and score thresholds are set in the above units; the present application sets three duration thresholds and three score thresholds. In the above apparatus, by comparing the duration of the audio clip and the highest recognition score against the corresponding thresholds, a candidate unknown role, that is, a possible unknown role, can be determined first and confirmed later, which improves the accuracy of the recognition model. Accordingly, the role corresponding to the audio can be regarded as a candidate unknown role in three cases. Case 1: the audio duration of the at least one audio clip is less than the first duration threshold and the highest recognition score is less than the first score threshold. Case 2: the audio duration of the at least one audio clip is greater than or equal to the first duration threshold and less than the third duration threshold, and the highest recognition score is greater than or equal to the first score threshold and less than the second score threshold. Case 3: the audio duration of the at least one audio clip is greater than or equal to the third duration threshold and less than the second duration threshold, and the highest recognition score is greater than or equal to the second score threshold and less than the third score threshold.
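The three cases pair each duration band with a score band. A minimal sketch of that check follows; the numeric values are hypothetical (the application gives none), and only the ordering (first duration threshold < third < second; first score threshold < second < third) is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values; only the ordering comes from the description.
    d1: float = 1.0   # first duration threshold (seconds)
    d3: float = 2.0   # third duration threshold
    d2: float = 3.0   # second duration threshold
    s1: float = 0.30  # first score threshold
    s2: float = 0.45  # second score threshold
    s3: float = 0.60  # third score threshold


def is_candidate_unknown(duration: float, top_score: float,
                         t: Thresholds = Thresholds()) -> bool:
    """Return True if the clip falls into one of the three candidate cases."""
    case1 = duration < t.d1 and top_score < t.s1
    case2 = t.d1 <= duration < t.d3 and t.s1 <= top_score < t.s2
    case3 = t.d3 <= duration < t.d2 and t.s2 <= top_score < t.s3
    return case1 or case2 or case3
```

Read this way, the score a clip must stay below to remain a candidate rises with its duration, so longer clips are held to a stricter standard before being treated as possibly unknown. A clip whose duration and score fall into different bands matches none of the three cases in this sketch.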

In an embodiment of the present application, in addition to the first acquiring unit, the second acquiring unit, the determining unit and the registering unit described above, the determining unit is further refined to include a fourth determining module and a fifth determining module. The fourth determining module is configured to determine that the role corresponding to the at least one audio clip is the unknown role when the audio duration of the at least one audio clip is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold; that is, the role can be determined to be unknown only when both conditions hold. The fifth determining module is configured to determine that the role corresponding to the at least one audio clip is a known role when the audio duration of the at least one audio clip is greater than or equal to the second duration threshold and the highest recognition score is greater than or equal to the score threshold; that is, the role can be determined to be known only when both conditions hold.

The above units determine whether the role corresponding to the at least one audio clip is known or unknown, thereby identifying unknown roles. Once the role corresponding to the audio has been determined to be known or unknown, automatic speech recognition can additionally be invoked to convert the speech to text, or machine translation can be invoked. The audio role names and the speech-to-text results can also be displayed, which enables functions such as listening back to the audio alongside the processing result, selecting audio clips, and editing audio role names.

In actual voiceprint recognition, silence detection can be performed on the audio. Its purpose is to identify the silent clips in the audio, so that role separation and role-switch recognition continue only when the audio is not silent, which improves the efficiency of audio recognition. In an embodiment of the present application, in addition to the first acquiring unit, the second acquiring unit, the determining unit and the registering unit described above, the first acquiring unit is refined to include a sixth determining module, a seventh determining module and an eighth determining module. The sixth determining module is configured to acquire the duration of the at least one audio clip when the first recognition result indicates that the at least one audio clip is the target silent clip, and, when that duration is greater than a fourth duration threshold, to determine that the at least one audio clip is the target silent clip and that the corresponding role is empty; that is, the clip is considered silent and has no corresponding role. The seventh determining module is configured, when the duration of the at least one audio clip is less than or equal to the fourth duration threshold, to acquire a subsequent audio clip of the at least one audio clip to obtain a second updated audio clip, to perform the voiceprint recognition on the second updated audio clip to obtain a third recognition result, and, when the third recognition result indicates that the duration of the updated audio is greater than the fourth duration threshold, to determine that the at least one audio clip is the target silent clip. In other words, once a silent clip has been identified, subsequent audio clips are acquired and recognized, and their recognition result decides whether the clip is silent or not. The eighth determining module is configured to repeatedly execute the fourth determining module when the third recognition result indicates that the audio duration of the second updated audio clip is less than or equal to the fourth duration threshold, until the second updated audio clip is determined to be the target silent clip or the non-target silent clip. In this apparatus, identifying a target silent clip first and then confirming it with subsequent audio clips makes the determination of whether the clip is a target silent clip more accurate.

In an embodiment of the present application, in addition to the sixth determining module, the seventh determining module and the eighth determining module described above, the sixth determining module is refined to include a fourth determining submodule and a fifth determining submodule. The fourth determining submodule is configured to determine that the at least one audio clip is the target silent clip when the duration of the at least one audio clip is greater than the third duration threshold; that is, after the clip has been identified as silent, it can be confirmed as the target silent clip if its duration is greater than the third duration threshold. The fifth determining submodule is configured to determine that the at least one audio clip is the target silent clip when the duration of the at least one audio clip is less than or equal to the third duration threshold and greater than the fourth duration threshold, where the third duration threshold is greater than the fourth duration threshold; that is, after the clip has been identified as silent, it can be confirmed as the target silent clip if its duration is less than or equal to the third duration threshold and greater than the fourth duration threshold. With the above apparatus, setting the fourth duration threshold makes it possible to exclude silence caused by speaker pauses in the audio, which makes the voiceprint recognition result more accurate.
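The silence-confirmation logic can be sketched as follows. This is an illustrative reading rather than the applicant's implementation: the `Clip` type, its `silent` flag (standing in for what the recognition result reports), the return labels, and the choice to treat a short silent run that is interrupted by speech as a speaker pause are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Clip:
    duration: float  # seconds
    silent: bool     # hypothetical flag taken from the recognition result


def confirm_silence(first: Clip, following: Iterable[Clip],
                    fourth_duration_threshold: float) -> str:
    if not first.silent:
        return "non_target_silent"
    total = first.duration
    if total > fourth_duration_threshold:
        return "target_silent"  # long enough on its own; role is empty
    for clip in following:
        if not clip.silent:
            # Assumption: speech resumed, so the short gap was only a pause.
            return "non_target_silent"
        total += clip.duration
        if total > fourth_duration_threshold:
            return "target_silent"
    return "undecided"  # audio ended before the threshold was exceeded
```

Because both refined cases (longer than the third threshold, or between the fourth and third thresholds) end with the same outcome, the sketch only needs the comparison against the fourth duration threshold.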

In actual use the speaker may also change, so role-switch detection must also be performed on the audio. In another embodiment of the present application, in addition to the first acquiring unit, the second acquiring unit, the determining unit and the registering unit described above, the apparatus further includes a ninth determining module, a tenth determining module, an eleventh determining module and a twelfth determining module. The ninth determining module is configured to determine, when the historical role is not empty, whether the historical role is the same as the current role, where the current role is the role corresponding to the current at least one audio clip and the historical role is the role corresponding to the audio clip preceding the at least one audio clip; that is, it first determines whether the at least one audio clip and the preceding audio clip correspond to the same role. The tenth determining module is configured to determine that no role switch has occurred when the historical role is the same as the current role; that is, if the at least one audio clip and the preceding audio clip correspond to the same role, no role switch has occurred. The eleventh determining module is configured, when the historical role differs from the current role, to determine whether the duration of the at least one audio clip is greater than or equal to the third duration threshold, and to determine that the role switch has occurred when the duration is greater than or equal to the third duration threshold; that is, when the two clips correspond to different roles and the duration is greater than or equal to the third duration threshold, a role switch is determined to have occurred. The twelfth determining module is configured, when the duration of the at least one audio clip is less than the third duration threshold, to acquire a subsequent audio clip of the at least one audio clip to obtain a third updated audio clip, and to repeat the ninth through eleventh determining modules in sequence at least once until it is determined that the role switch has or has not occurred; during the repetition, the current role is the role corresponding to the third updated audio clip. In other words, once a possible role switch has been identified, subsequent audio clips are acquired and recognized, and their recognition result decides whether a role switch has occurred. In this apparatus, first checking for a role switch and then confirming it with subsequent audio clips makes the determination more accurate, so that role switches in the audio can be identified accurately.
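The role-switch check performed by the ninth through twelfth determining modules can be sketched as a loop. The `Clip` type, the `identify_role` callback (a stand-in for running voiceprint recognition and taking the best-matching role), and the use of `None` for "cannot be decided" are assumptions made for illustration; the case of an empty historical role is not covered by the passage and is simply reported as undecided here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Clip:
    duration: float    # seconds
    speaker_hint: str  # hypothetical stand-in for the audio payload


def detect_role_switch(
    history_role: Optional[str],
    first: Clip,
    following: Iterable[Clip],
    identify_role: Callable[[List[Clip]], str],
    third_duration_threshold: float,
) -> Optional[bool]:
    if history_role is None:
        return None  # no historical role to compare against
    merged = [first]
    pending = iter(following)
    while True:
        current_role = identify_role(merged)  # ninth module: compare roles
        if current_role == history_role:
            return False                      # tenth module: no switch
        if sum(c.duration for c in merged) >= third_duration_threshold:
            return True                       # eleventh module: switch confirmed
        nxt = next(pending, None)             # twelfth module: extend and repeat
        if nxt is None:
            return None                       # audio ended before a decision
        merged.append(nxt)
```

Requiring a minimum duration before confirming a switch keeps a brief misrecognition from being reported as a change of speaker.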

The audio processing apparatus includes a processor and a memory. The first acquiring unit, the second acquiring unit, the determining unit, the registering unit and so on are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.

The processor includes a kernel, which fetches the corresponding program unit from the memory. One or more kernels may be provided, and the unknown role corresponding to the audio is identified by adjusting kernel parameters.

The memory may include forms such as volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

An embodiment of the present invention provides a processor configured to run a program, where the audio processing method described above is executed when the program runs.

An embodiment of the present invention provides a device that includes a processor, a memory, and a program stored in the memory and executable on the processor. When executing the program, the processor implements at least the following steps:

Step S101: acquire at least one audio clip, and perform voiceprint recognition on the at least one audio clip using a voiceprint recognition model to obtain a first recognition result;

Step S102: when the first recognition result indicates that the at least one audio clip is a non-target silent clip and the duration of the at least one audio clip is greater than or equal to a first duration threshold, acquire the highest recognition score in the first recognition result;

Step S103: when the audio duration of the at least one audio clip is greater than or equal to the duration threshold and the highest recognition score is less than a score threshold, determine that the role corresponding to the at least one audio clip is an unknown role;

Step S104: register the unknown role in the library of the voiceprint recognition model.
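Steps S101 to S104 can be illustrated end to end with a toy example. Everything concrete here is an assumption: the application does not say how the voiceprint model scores a clip or stores roles, so the sketch uses cosine similarity over made-up embedding vectors, a dictionary as the library, and hypothetical names (`VoiceprintLibrary`, `process_clip`, `unknown_N`).

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class VoiceprintLibrary:
    """Toy stand-in for the model's library: role name -> reference embedding."""
    roles: Dict[str, List[float]] = field(default_factory=dict)
    next_id: int = 1

    def scores(self, embedding: List[float]) -> Dict[str, float]:
        return {name: cosine(embedding, ref) for name, ref in self.roles.items()}

    def register_unknown(self, embedding: List[float]) -> str:
        name = f"unknown_{self.next_id}"
        self.next_id += 1
        self.roles[name] = list(embedding)
        return name


def process_clip(lib: VoiceprintLibrary, embedding: List[float], duration: float,
                 is_silent: bool, first_duration_threshold: float,
                 second_duration_threshold: float,
                 score_threshold: float) -> Optional[str]:
    scores = lib.scores(embedding)                       # S101: recognition
    if is_silent or duration < first_duration_threshold:
        return None                                      # S102 precondition fails
    top_name, top_score = max(scores.items(), key=lambda kv: kv[1],
                              default=(None, float("-inf")))  # S102: top score
    if duration < second_duration_threshold:
        return None            # too short to decide; left to later refinement
    if top_score < score_threshold:                      # S103: unknown role
        return lib.register_unknown(embedding)           # S104: register it
    return top_name                                      # known role
```

Once an unknown role has been registered, a later clip from the same speaker scores high against the new entry and is reported under that name, which is the effect the method aims for: no enrolment is needed in advance.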

The device referred to herein may be a server, a PC, a tablet (PAD), a mobile phone, or the like.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with at least the following method steps:

Step S101: acquire at least one audio clip, and perform voiceprint recognition on the at least one audio clip using a voiceprint recognition model to obtain a first recognition result;

Step S102: when the first recognition result indicates that the at least one audio clip is a non-target silent clip and the duration of the at least one audio clip is greater than or equal to a first duration threshold, acquire the highest recognition score in the first recognition result;

Step S103: when the audio duration of the at least one audio clip is greater than or equal to the duration threshold and the highest recognition score is less than a score threshold, determine that the role corresponding to the at least one audio clip is an unknown role;

Step S104: register the unknown role in the library of the voiceprint recognition model.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling or communication connections shown or discussed may be indirect couplings or communication connections between units or modules through some interfaces, and may be electrical or take other forms.

The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

From the above description it can be seen that the above embodiments of the present application achieve the following technical effects:

1) In the audio processing method of the present application, at least one audio clip is first acquired, and voiceprint recognition is performed on it using a voiceprint recognition model to obtain a first recognition result. Then, when the first recognition result indicates that the at least one audio clip is a non-target silent clip and its duration is greater than or equal to the first duration threshold, the highest recognition score in the first recognition result is acquired. When the audio duration of the at least one audio clip is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold, the role corresponding to the at least one audio clip is determined to be an unknown role, the second duration threshold being greater than the first duration threshold. Finally, the unknown role is registered in the library of the voiceprint recognition model. By setting a recognition score threshold and audio duration thresholds and evaluating the audio against them, unknown roles appearing in the audio are detected, and the detected unknown roles are registered in the voiceprint recognition model library.

2) In the audio processing apparatus of the present application, the first acquiring unit is configured to acquire at least one audio clip and to perform voiceprint recognition on it using a voiceprint recognition model to obtain a first recognition result. The at least one audio clip may be one audio clip or several; the number may differ between application scenarios. The second acquiring unit is configured to acquire the highest recognition score in the first recognition result when the first recognition result indicates that the at least one audio clip is a non-target silent clip and its duration is greater than or equal to the first duration threshold. A non-target silent clip is a clip that is not silent, so the highest recognition score is acquired only when the at least one audio clip is not silent. In addition, if the at least one audio clip is too short, that is, shorter than the first duration threshold, the determined role may be inaccurate; a further precondition for acquiring the highest recognition score is therefore that the duration of the at least one audio clip is greater than or equal to the first duration threshold. The determining unit is configured to determine that the role corresponding to the at least one audio clip is an unknown role when the audio duration of the at least one audio clip is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold, the second duration threshold being greater than the first duration threshold; if the audio clip is too short, it cannot be accurately determined whether the corresponding role is unknown. The registering unit is configured to register the unknown role in the library of the voiceprint recognition model. In the above audio processing apparatus, comparing the duration of the audio clip and the highest recognition score against the corresponding thresholds determines whether the role corresponding to the audio is unknown, and an unknown role is registered in the voiceprint recognition model library. Voiceprint role separation can then be carried out without advance registration, which solves the prior-art problem that registration is required in advance before voiceprint role separation can be performed. Compared with prior-art solutions that require advance registration, this solution is easier to use and applies to a wider range of scenarios.

The above are merely preferred embodiments of the present application and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, improvement or the like made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (10)

1. An audio processing method, comprising:
acquiring at least one audio clip, and performing voiceprint recognition on the at least one audio clip by adopting a voiceprint recognition model to obtain a first recognition result;
under the condition that the first recognition result indicates that the at least one audio clip is a non-target silent clip and the duration of the at least one audio clip is greater than or equal to a first duration threshold, acquiring the highest recognition score in the first recognition result;
determining that the role corresponding to the at least one audio clip is an unknown role under the condition that the audio duration of the at least one audio clip is greater than or equal to a second duration threshold and the highest recognition score is less than a score threshold, wherein the second duration threshold is greater than the first duration threshold;
registering the unknown role in the library of the voiceprint recognition model.
2. The method of claim 1, wherein determining that the corresponding role of the at least one audio piece is an unknown role in the case that the audio duration of the at least one audio piece is greater than or equal to a second duration threshold and the highest identification score is less than the score threshold comprises:
a first determination step, in which a role corresponding to the at least one audio clip is determined to be a candidate unknown role under the condition that the audio duration of the at least one audio clip is smaller than the second duration threshold and the highest identification score is smaller than the score threshold;
a second determining step, in which a subsequent audio clip of the at least one audio clip is obtained to obtain a first updated audio clip, the voiceprint recognition is performed on the first updated audio clip to obtain a second recognition result, the candidate unknown role is updated to be the known role under the condition that the second recognition result represents that the audio duration of the first updated audio clip is greater than or equal to the second duration threshold and the highest recognition score is greater than the score threshold, and the candidate unknown role is updated to be the unknown role under the condition that the second recognition result represents that the audio duration of the first updated audio clip is greater than or equal to the second duration threshold and the highest recognition score is less than or equal to the score threshold;
and under the condition that the second recognition result represents that the audio duration of the first updated audio clip is smaller than the second duration threshold, repeatedly executing the second determining step until the role corresponding to the first updated audio clip is determined to be the known role or the unknown role.
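The loop in claim 2 can be sketched as follows, as a minimal illustration under stated assumptions. The function name `resolve_role`, the `score_fn` callback (a stand-in for the voiceprint recognition model), and the string labels are hypothetical; clips are represented only by their durations in seconds.

```python
from typing import Callable, Iterable, List


def resolve_role(
    clip_durations: Iterable[float],
    score_fn: Callable[[List[float]], float],
    second_duration_threshold: float,
    score_threshold: float,
) -> str:
    """Append subsequent clips until the accumulated audio is long enough,
    then settle the candidate as 'known' or 'unknown'."""
    gathered: List[float] = []
    for duration in clip_durations:
        gathered.append(duration)      # obtain the (first) updated audio clip
        score = score_fn(gathered)     # highest score for the audio so far
        if sum(gathered) >= second_duration_threshold:
            # Claim 2 wording: strictly greater -> known, otherwise unknown.
            return "known" if score > score_threshold else "unknown"
        # Too short to decide: the role stays a candidate, fetch the next clip.
    return "candidate_unknown"         # assumption: audio ended undecided
```

Returning `"candidate_unknown"` when the audio runs out is an assumption of this sketch; the claim itself only describes repeating the step until a decision is reached.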
3. The method of claim 2, wherein determining that the role corresponding to the at least one audio clip is a candidate unknown role in the case that the audio duration of the at least one audio clip is less than the second duration threshold and the highest recognition score is less than the score threshold comprises:
determining the role corresponding to the at least one audio clip as the candidate unknown role under the condition that the audio duration of the at least one audio clip is smaller than a first duration threshold and the highest identification score is smaller than a first score threshold;
determining that the role corresponding to the at least one audio clip is the candidate unknown role under the condition that the audio duration of the at least one audio clip is greater than or equal to the first duration threshold and less than a third duration threshold and the highest identification score is greater than or equal to the first score threshold and less than a second score threshold, wherein the first duration threshold is less than the third duration threshold and the first score threshold is less than the second score threshold;
determining that the role corresponding to the at least one audio clip is the candidate unknown role under the condition that the audio duration of the at least one audio clip is greater than or equal to the third duration threshold and less than a second duration threshold and the highest identification score is greater than or equal to the second score threshold and less than a third score threshold, wherein the third duration threshold is less than the second duration threshold and the second score threshold is less than the third score threshold.
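The three duration and score bands of claim 3 can be written as a small lookup, shown here as a hedged sketch. The class name `Tiers`, the function name `is_candidate_unknown`, and the example values are hypothetical. The bands are implemented literally as worded in the claim, so each duration band is paired only with its own score band.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tiers:
    t1: float  # first duration threshold
    t3: float  # third duration threshold (t1 < t3 < t2)
    t2: float  # second duration threshold
    s1: float  # first score threshold
    s2: float  # second score threshold (s1 < s2 < s3)
    s3: float  # third score threshold


def is_candidate_unknown(duration: float, score: float, tiers: Tiers) -> bool:
    """Return True when (duration, score) falls inside one of the three bands."""
    neg_inf = float("-inf")
    bands = [
        (neg_inf, tiers.t1, neg_inf, tiers.s1),    # short clip, low score
        (tiers.t1, tiers.t3, tiers.s1, tiers.s2),  # medium clip, medium score
        (tiers.t3, tiers.t2, tiers.s2, tiers.s3),  # longer clip, higher score
    ]
    return any(
        d_lo <= duration < d_hi and s_lo <= score < s_hi
        for d_lo, d_hi, s_lo, s_hi in bands
    )
```

With, say, `Tiers(t1=1.0, t3=2.0, t2=3.0, s1=0.3, s2=0.5, s3=0.7)`, a 1.5-second clip scoring 0.4 falls in the middle band and is treated as a candidate unknown role.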
4. The method of claim 1, wherein determining that the role corresponding to the at least one audio clip is an unknown role in the case that the audio duration of the at least one audio clip is greater than or equal to the second duration threshold and the highest recognition score is less than the score threshold comprises:
determining that the role corresponding to the at least one audio clip is the unknown role under the condition that the audio duration of the at least one audio clip is greater than or equal to the second duration threshold and the highest identification score is less than the score threshold;
and under the condition that the audio duration of the at least one audio clip is greater than or equal to the second duration threshold and the highest identification score is greater than or equal to the score threshold, determining that the role corresponding to the at least one audio clip is a known role.
5. The method of claim 1, wherein obtaining at least one audio clip and performing voiceprint recognition on the at least one audio clip by using a voiceprint recognition model to obtain the first recognition result comprises:
a third determining step, in which, when the first recognition result indicates that the at least one audio clip is the target mute segment, the duration of the at least one audio clip is obtained, and when the duration of the at least one audio clip is greater than a fourth duration threshold, the at least one audio clip is determined to be the target mute segment, and the corresponding role is empty;
a fourth determining step, in which, when the duration of the at least one audio clip is less than or equal to the fourth duration threshold, a subsequent audio clip of the at least one audio clip is obtained to obtain a second updated audio clip, the voiceprint recognition is performed on the second updated audio clip to obtain a third recognition result, and when the third recognition result indicates that the duration of the second updated audio clip is greater than the fourth duration threshold, the at least one audio clip is determined to be the target mute segment;
and in the case that the third recognition result indicates that the duration of the second updated audio clip is less than or equal to the fourth duration threshold, repeating the fourth determining step until the second updated audio clip is determined to be the target mute segment or the non-target mute segment.
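The accumulation of silence in claim 5 can be sketched as follows. This is an illustration under assumptions: the function name `detect_target_silence` and the `(is_silent, duration)` tuples are hypothetical, and the sketch assumes that the run is classed as a non-target mute segment as soon as a non-silent clip arrives or the audio ends, a stopping rule the claim does not spell out.

```python
from typing import Iterable, Tuple


def detect_target_silence(
    clips: Iterable[Tuple[bool, float]],
    fourth_duration_threshold: float,
) -> bool:
    """Return True once accumulated silence exceeds the fourth threshold.

    Each clip is (is_silent, duration_in_seconds); `is_silent` stands in
    for what the recognition result reports about that clip.
    """
    silent_total = 0.0
    for is_silent, duration in clips:
        if not is_silent:
            return False           # assumption: speech ends the silent run
        silent_total += duration   # duration of the updated audio clip
        if silent_total > fourth_duration_threshold:
            return True            # target mute segment; role is empty
    return False                   # assumption: audio ended below threshold
```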
6. The method of claim 5, wherein determining that the at least one audio clip is the target mute segment if the duration of the at least one audio clip is greater than the fourth duration threshold comprises:
determining the at least one audio clip as the target mute segment when the duration of the at least one audio clip is greater than the second duration threshold;
and under the condition that the duration of the at least one audio clip is less than or equal to the second duration threshold and greater than the fourth duration threshold, determining that the at least one audio clip is the target mute segment, wherein the second duration threshold is greater than the fourth duration threshold.
7. The audio processing method of claim 1, further comprising:
a fifth determining step of determining, under the condition that the historical role is not empty, whether the historical role is the same as the current role, wherein the current role is the role corresponding to the at least one audio clip, and the historical role is the role corresponding to an audio clip preceding the at least one audio clip;
a sixth determination step of determining that role switching has not occurred in a case where the historical role is the same as the current role;
a seventh determining step of determining, when the historical role is different from the current role, whether the duration of the at least one audio clip is greater than or equal to the second duration threshold, and determining that role switching has occurred when the duration of the at least one audio clip is greater than or equal to the second duration threshold;
and under the condition that the duration of the at least one audio clip is smaller than the second duration threshold, acquiring a subsequent audio clip of the at least one audio clip to obtain a third updated audio clip, and sequentially and repeatedly executing the fifth determining step to the seventh determining step at least once until the role switching is determined to occur or not to occur, wherein in the repeated executing process, the current role is a role corresponding to the third updated audio clip.
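The fifth to seventh determining steps of claim 7 can be sketched as a single loop, shown here as a minimal illustration. The function name `detect_role_switch` and the `(current_role, duration)` observations are hypothetical; each observation stands for the role and total duration of the audio clip after it has been updated with a subsequent clip.

```python
from typing import Iterable, Optional, Tuple


def detect_role_switch(
    historical_role: Optional[str],
    observations: Iterable[Tuple[str, float]],
    second_duration_threshold: float,
) -> Optional[bool]:
    """Return True if a role switch occurred, False if not, None if undecided."""
    if historical_role is None:
        return None                    # claim 7 only covers a non-empty history
    for current_role, duration in observations:
        if current_role == historical_role:
            return False               # sixth step: same role, no switch
        if duration >= second_duration_threshold:
            return True                # seventh step: different and long enough
        # Different role but too short: fetch the next clip and repeat.
    return None                        # assumption: audio ended undecided
```

Requiring the differing clip to reach the second duration threshold before a switch is reported means a brief misrecognition does not immediately trigger a role change.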
8. An audio processing apparatus, characterized in that the processing apparatus comprises:
a first acquisition unit, configured to acquire at least one audio clip and perform voiceprint recognition on the at least one audio clip by using a voiceprint recognition model to obtain a first recognition result;
a second acquisition unit, configured to acquire the highest identification score in the first recognition result under the condition that the first recognition result represents that the at least one audio clip is a non-target mute segment and the duration of the at least one audio clip is greater than or equal to a first duration threshold;
a determining unit, configured to determine that the role corresponding to the at least one audio clip is an unknown role under the condition that the audio duration of the at least one audio clip is greater than or equal to a second duration threshold and the highest identification score is less than a score threshold, wherein the second duration threshold is greater than the first duration threshold;
and a registration unit, configured to register the unknown role in the voiceprint recognition model library.
9. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.
10. An audio processing system, comprising: one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of any of claims 1-7.
CN202211119654.7A 2022-09-14 2022-09-14 Audio processing method, device, processor and system Pending CN115440229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211119654.7A CN115440229A (en) 2022-09-14 2022-09-14 Audio processing method, device, processor and system

Publications (1)

Publication Number Publication Date
CN115440229A true CN115440229A (en) 2022-12-06

Family ID: 84247315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211119654.7A Pending CN115440229A (en) 2022-09-14 2022-09-14 Audio processing method, device, processor and system

Country Status (1)

Country Link
CN (1) CN115440229A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4837830A (en) * 1987-01-16 1989-06-06 Itt Defense Communications, A Division Of Itt Corporation Multiple parameter speaker recognition system and methods
JP2000206988A (en) * 1999-01-12 2000-07-28 Olympus Optical Co Ltd Voice processing device
US20140314216A1 (en) * 2013-04-22 2014-10-23 Ge Aviation Systems Limited Unknown speaker identification system
CN110505399A (en) * 2019-08-13 2019-11-26 聚好看科技股份有限公司 Control method, device and the acquisition terminal of Image Acquisition
KR20210015542A (en) * 2019-08-02 2021-02-10 서울시립대학교 산학협력단 Apparatus for identifying speaker based on in-depth neural network capable of enrolling unregistered speakers, method thereof and computer recordable medium storing program to perform the method
CN113113022A (en) * 2021-04-15 2021-07-13 吉林大学 Method for automatically identifying identity based on voiceprint information of speaker
JP2022071960A (en) * 2020-10-29 2022-05-17 株式会社Nsd先端技術研究所 Utterance cutting and dividing system and method therefor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SWATI VIVEKANANTHAN: "Forensic Speech Enhancement of Voiceprints and Speaker Identification", 《2022 IEEE INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTING AND COMMUNICATION TECHNOLOGIES》, 30 March 2022 (2022-03-30) *
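付瑞; 牛泰龙; 常琳: "Exploring the Application of Voiceprint Recognition Technology in Broadcast Monitoring", Modern Television Technology (现代电视技术), no. 03, 15 March 2020 (2020-03-15) *
董乘宇: "Text-Dependent Speaker Verification System", 《China Master's Theses Full-text Database (Information Science and Technology)》, 15 May 2007 (2007-05-15) *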

Similar Documents

Publication Publication Date Title
Sahidullah et al. Introduction to voice presentation attack detection and recent advances
CN110136727B (en) Speaker identification method, device and storage medium based on speaking content
KR102339594B1 (en) Object recognition method, computer device, and computer-readable storage medium
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
JP6394709B2 (en) SPEAKER IDENTIFYING DEVICE AND FEATURE REGISTRATION METHOD FOR REGISTERED SPEECH
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
US8972260B2 (en) Speech recognition using multiple language models
US11727939B2 (en) Voice-controlled management of user profiles
CN108780645B (en) Speaker verification computer system for text transcription adaptation of a generic background model and an enrolled speaker model
CN110136749A (en) Speaker-related end-to-end voice endpoint detection method and device
CN108766445A (en) Method for recognizing sound-groove and system
CN105938716A (en) Multi-precision-fitting-based automatic detection method for copied sample voice
CN112309406B (en) Voiceprint registration method, device and computer-readable storage medium
CN109448704A (en) Construction method, device, server and the storage medium of tone decoding figure
US20130030794A1 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
WO2020211006A1 (en) Speech recognition method and apparatus, storage medium and electronic device
WO2014203328A1 (en) Voice data search system, voice data search method, and computer-readable storage medium
CN111108551B (en) A voiceprint identification method and related device
CN112992155A (en) Far-field voice speaker recognition method and device based on residual error neural network
JPWO2020003413A1 (en) Information processing equipment, control methods, and programs
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
CN115424606A (en) Voice interaction method, voice interaction device and computer readable storage medium
CN115440229A (en) Audio processing method, device, processor and system
CN117912450A (en) Voice quality detection method, related method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination