CN114333790A - Data processing method, device, equipment, storage medium and program product

Info

Publication number: CN114333790A
Application number: CN202111472195.6A
Authority: CN
Inventors: 袁有根; 吕志强; 黄申
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-12-03
Filing date: 2021-12-03
Publication date: 2022-04-12
Anticipated expiration: 2041-12-03
Also published as: CN114333790B

Abstract

The application provides a data processing method, apparatus, device, storage medium and program product. The method comprises: acquiring voice data to be recognized and acquiring a target syllable sequence of a reference keyword; calling a keyword detection model to process the voice data to be recognized, determining a syllable sequence to be detected of the voice data to be recognized, and determining a keyword detection result of the voice data to be recognized according to the syllable sequence to be detected and the target syllable sequence. The keyword detection model is trained on a training data set that comprises sample voice data of one or more language categories, and it can perform keyword detection on voice data in any of those language categories. The method can be applied to scenarios such as cloud technology, artificial intelligence, smart platforms and in-vehicle Internet; it enables multilingual keyword detection, makes keyword detection automated and intelligent, and improves the efficiency of keyword detection.

Description

Data processing method, apparatus, equipment, storage medium and program product

Technical Field

The present application relates to the field of computer technology, specifically to the field of artificial intelligence, and in particular to a data processing method, a data processing apparatus, a computer device, a computer-readable storage medium, and a computer program product.

Background

With the continuous development and application of computer technology, more and more scenarios rely on data processing technology, for example performing keyword detection on speech data in order to wake up smart devices or to count how often a keyword occurs. How to implement keyword detection is currently an active research topic.

Summary of the Invention

The present application provides a data processing method, a data processing apparatus, a computer device, a computer-readable storage medium and a computer program product, which can realize keyword detection for multiple language types, make keyword detection automated and intelligent, and improve the efficiency of keyword detection.

The present application provides a data processing method, the method including: acquiring speech data to be recognized, and acquiring a target syllable sequence of a reference keyword;

invoking a keyword detection model to process the speech data to be recognized, determining the syllable sequence to be detected of the speech data to be recognized, and determining the keyword detection result of the speech data to be recognized according to the syllable sequence to be detected and the target syllable sequence;

wherein the keyword detection model is obtained by training on a training data set, and the training data set includes sample speech data of one or more language categories; the keyword detection model can perform keyword detection on speech data of any one of the one or more language categories.

The present application provides a data processing apparatus, the apparatus including:

an acquisition module, configured to acquire the speech data to be recognized and to acquire the target syllable sequence of the reference keyword;

a processing module, configured to invoke the keyword detection model to process the speech data to be recognized, determine the syllable sequence to be detected of the speech data to be recognized, and determine the keyword detection result of the speech data to be recognized according to the syllable sequence to be detected and the target syllable sequence.

The present application provides a computer device, including a memory and a processor, wherein the memory stores a data processing program, and when the data processing program is executed by the processor, the steps of the above data processing method are implemented.

The present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions are executed by a processor to perform the steps of the above data processing method.

The present application provides a computer program product, wherein the computer program product includes a computer program or computer instructions, and the computer program or computer instructions are executed by a processor to implement the above data processing method.

The present application first acquires the speech data to be recognized and the target syllable sequence of the reference keyword, then invokes the keyword detection model to process the speech data to be recognized and determine its syllable sequence to be detected, and finally determines the keyword detection result of the speech data to be recognized according to the syllable sequence to be detected and the target syllable sequence. This makes keyword detection automated and intelligent and improves its efficiency. The keyword detection model proposed in this application can be trained on sample speech data of multiple language categories and can perform keyword detection on speech data of multiple language categories, which broadens the applicability of keyword detection and further improves its intelligence.

Description of the Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of the architecture of a data processing system provided by an exemplary embodiment of the present application;

FIG. 2 is a schematic flowchart of a data processing method provided by an exemplary embodiment of the present application;

FIG. 3 is a schematic flowchart of keyword detection provided by an exemplary embodiment of the present application;

FIG. 4 is a schematic flowchart of another data processing method provided by an exemplary embodiment of the present application;

FIG. 5 is a schematic structural diagram of a factorized time delay neural network provided by an exemplary embodiment of the present application;

FIG. 6 is a schematic diagram of the structure and flow of a syllable recognition network provided by an exemplary embodiment of the present application;

FIG. 7 is a schematic flowchart of another keyword detection provided by an exemplary embodiment of the present application;

FIG. 8 is a schematic block diagram of a data processing apparatus provided by an exemplary embodiment of the present application;

FIG. 9 is a schematic block diagram of a computer device provided by another exemplary embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative work fall within the protection scope of the present application.

The embodiments of the present application provide a data processing method that makes keyword detection automated and intelligent and improves its efficiency. The data processing method provided in the embodiments of the present application may be implemented with one or more artificial intelligence technologies.

Artificial intelligence (AI) is the theory, methods, technology and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include natural language processing, computer vision, and machine learning/deep learning. The solutions provided by the embodiments of the present application involve natural language processing and machine learning, which are described below.

Natural language processing (NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing technologies usually include speech recognition, text processing, semantic understanding, machine translation, question answering, and knowledge graphs. This application mainly relates to speech recognition. Specifically, the terminal device acquires the speech data to be detected and uses speech recognition to convert speech data into the corresponding syllable sequence form (that is, the syllable sequence to be detected and the target syllable sequence in this application). The keyword detection operation can then be performed based on the syllable sequence to be detected and the target syllable sequence.

Machine learning (ML) is an interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning. This application mainly relates to artificial neural networks. Specifically, the terminal device trains an initialized neural network on the training data set to obtain the keyword detection model, and then uses the keyword detection model to process the speech data to be recognized and obtain the syllable sequence to be detected. Keyword detection is performed based on the syllable sequence to be detected and the target syllable sequence of the keyword, which improves the efficiency and intelligence of keyword detection.

With continued research and progress, artificial intelligence technology has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, smart customer service, Internet of Vehicles, and intelligent transportation. As the technology develops, artificial intelligence will be applied in more fields and deliver increasingly important value.

Keyword detection (Spoken Term Detection) is a subfield of speech recognition whose purpose is to detect all occurrences of specified words in a speech signal. Existing keyword detection techniques are generally keyword/filler detection methods. The main idea is to use a keyword model to output the keyword detection result while a filler model absorbs non-keyword speech. To build a complete keyword/filler speech keyword detection system, traditional methods select phonemes as the modeling unit, use various deep neural networks as the acoustic model, and guide model training with a loss function.

The main disadvantage of keyword/filler keyword detection is that the acoustic model has many parameters and is large, which makes inference slow. In addition, traditional keyword/filler methods are built for a single language and do not support multilingual keyword detection tasks.

The present invention proposes a multilingual keyword detection method based on a factorized time delay neural network (Factorized Time Delay Neural Network, TDNN-F). The method first mixes the corpora of the individual languages in proportion and trains the acoustic model (AM) with multi-task learning, then decodes with a weighted finite state transducer (Weighted Finite State Transducer, WFST) and outputs the keyword results. In multi-task learning, syllables are used as the output units and each language has its own output. Each language is treated as a separate branch; the encoder part of every branch shares the same weights, and only the final normalization layer (Softmax layer) is independent per language. The model can therefore learn information common to all languages while the Softmax layer maps the encoder output to the corresponding language. The method can be applied effectively to multilingual keyword detection, whereas previous keyword detection methods were generally built for a single language and do not support multilingual keyword detection tasks.
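The shared-encoder arrangement with one independent Softmax output per language can be sketched as follows. This is a minimal toy sketch in plain Python, not the TDNN-F architecture of the application: the class name `MultilingualAcousticModel`, the layer sizes, the random weights, the ReLU activation and the language labels are all assumptions made for illustration, and no training is shown.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MultilingualAcousticModel:
    """Hypothetical toy model: shared encoder, one output head per language."""

    def __init__(self, feature_dim, hidden_dim, syllables_per_language, seed=0):
        rng = random.Random(seed)
        # Encoder weights are shared by every language branch.
        self.encoder = random_matrix(hidden_dim, feature_dim, rng)
        # Each language owns an independent output layer over its syllable set.
        self.heads = {
            language: random_matrix(n_syllables, hidden_dim, rng)
            for language, n_syllables in syllables_per_language.items()
        }

    def forward(self, frame, language):
        """Return syllable posteriors for one feature frame in one language."""
        hidden = [max(0.0, h) for h in matvec(self.encoder, frame)]  # ReLU (assumed)
        return softmax(matvec(self.heads[language], hidden))


model = MultilingualAcousticModel(
    feature_dim=4,
    hidden_dim=8,
    syllables_per_language={"mandarin": 5, "cantonese": 6},  # illustrative sizes
)
frame = [0.2, -0.1, 0.4, 0.3]
posteriors_mandarin = model.forward(frame, "mandarin")
posteriors_cantonese = model.forward(frame, "cantonese")
```

Both calls pass through the same `encoder` weights; only the head selected by `language` differs, which is why each language can have a syllable inventory of a different size.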

In addition, this application adds a singular value decomposition (SVD) layer to the TDNN, which effectively reduces the number of trainable parameters of the acoustic model and noticeably speeds up inference with little difference in detection quality. Experiments show that the method outperforms a cascade of language identification followed by monolingual systems: at the same accuracy, the number of correctly reported keywords increases by 1.2% to 3.6%.
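The parameter saving from a low-rank factorization can be illustrated with simple arithmetic. The sketch below assumes a single dense weight matrix replaced by two factors that meet at a narrow bottleneck; the dimensions 1024 and 128 are illustrative values, not figures from the application, and bias terms are ignored.

```python
def dense_params(in_dim, out_dim):
    """Weights in one full in_dim x out_dim matrix (bias ignored)."""
    return in_dim * out_dim


def factorized_params(in_dim, out_dim, bottleneck_dim):
    """Weights in two factors: in_dim x bottleneck and bottleneck x out_dim."""
    return in_dim * bottleneck_dim + bottleneck_dim * out_dim


# Illustrative sizes only; the application does not state layer dimensions here.
in_dim, out_dim, bottleneck_dim = 1024, 1024, 128

full = dense_params(in_dim, out_dim)
small = factorized_params(in_dim, out_dim, bottleneck_dim)
ratio = small / full

print(full, small, ratio)
```

Under these assumed sizes the factorized layer keeps a quarter of the weights of the dense layer, which is the kind of reduction that makes inference cheaper.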

The keyword detection method proposed in this application can be applied to fields such as wake-up interaction for smart devices and speech keyword detection in audio and video files. Given the limited computing resources of smart devices and the need to process massive numbers of audio and video files, keyword detection first needs to be fast enough, then accurate enough, and finally the detection method should offer more functions and greater flexibility.

This application can be applied to scenarios such as cloud technology, artificial intelligence, smart platforms, and in-vehicle Internet. In the field of cloud technology, the speech data to be recognized, the syllable sequence to be detected, the reference keyword, the target syllable sequence and the keyword detection result can be stored on a cloud server, which simplifies data management and reuse; when data such as the keyword detection result is needed, it can be fetched directly from the cloud server. In the field of artificial intelligence, the keyword detection method can be used to wake up smart devices and perform similar operations, making smart devices more user-friendly and enabling more intelligent application services built on keyword detection. In the field of smart platforms, with the user's consent, the keyword detection method can be used to collect statistics on and analyze the speech data entered by the user, determine different types of user classification labels for different keywords, and make recommendations based on those labels to ensure a good user experience. In the field of in-vehicle Internet, the keyword detection method can be used to wake up in-vehicle devices, and the speech of passengers in the vehicle can be collected in real time so that the device interacts intelligently according to the detected keywords. For example, when the keywords "tired", "sleepy" or "bored" are detected, the in-vehicle device can play music; when the keywords "smoking", "hot" or "stuffy" are detected, the in-vehicle device can open the window.
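The in-vehicle example above amounts to a mapping from detected keywords to device actions. A minimal sketch follows; the table name `ACTIONS_BY_KEYWORD`, the action labels and the function `actions_for` are hypothetical names chosen for illustration, not part of the application.

```python
# Hypothetical keyword-to-action table based on the in-vehicle example.
ACTIONS_BY_KEYWORD = {
    "tired": "play_music",
    "sleepy": "play_music",
    "bored": "play_music",
    "smoking": "open_window",
    "hot": "open_window",
    "stuffy": "open_window",
}


def actions_for(detected_keywords):
    """Return the distinct actions triggered, in first-seen order."""
    actions = []
    for keyword in detected_keywords:
        action = ACTIONS_BY_KEYWORD.get(keyword)
        if action is not None and action not in actions:
            actions.append(action)
    return actions


triggered = actions_for(["hot", "tired", "stuffy"])
print(triggered)
```

Keywords that are not in the table are ignored, and an action triggered by several keywords is reported only once.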

The application is described in detail through the following embodiments:

Please refer to FIG. 1, which is a schematic diagram of the architecture of a data processing system provided by an exemplary embodiment of the present application. As shown in FIG. 1, the data processing system may include a terminal device 101 and a server 102, which are connected through a network, for example a wireless network. With the data processing method proposed in this application, the terminal device 101 can collect the speech data to be recognized and itself perform feature extraction, syllable recognition and keyword matching, sending the collected speech data, the keyword detection result and the intermediate data produced during processing to the server 102 so that the server 102 can carry out subsequent data management. Alternatively, the server 102 can perform feature extraction, syllable recognition and keyword matching. In that case, the terminal device 101 collects the speech data to be recognized and sends it to the server 102, the server 102 performs the above processing to obtain the keyword detection result and returns it to the terminal device 101, and subsequent operations then follow.

In this embodiment of the present application, the terminal device 101 may acquire the speech data to be recognized and the reference keyword and send them to the server 102. The server 102 determines the syllable sequence to be detected from the speech data to be recognized and determines the target syllable sequence from the reference keyword. The server 102 invokes the keyword detection model to match the syllable sequence to be detected against the target syllable sequence, obtains the keyword detection result of the speech data to be recognized, and sends the keyword detection result to the terminal device 101. The terminal device 101 displays the received keyword detection result in the user interface.

The terminal device 101 is also referred to as a terminal, user equipment (UE), an access terminal, a subscriber unit, a mobile device, a user terminal, a wireless communication device, a user agent, or a user apparatus. The terminal device may be, but is not limited to, a smart home appliance, a handheld device with wireless communication functions (such as a smartphone or tablet computer), a computing device (such as a personal computer (PC)), an in-vehicle terminal, an intelligent voice interaction device, a wearable device, or another smart device.

The server 102 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery network (CDN) services, and big data and artificial intelligence platforms.

It can be understood that the architecture diagram of the system described in the embodiments of the present application is intended to illustrate the technical solutions of the embodiments more clearly and does not limit them. For example, the terminal device 101 may include more than the three devices shown in FIG. 1; likewise, the server 102 may consist of multiple servers (that is, a server cluster) rather than the single server shown in FIG. 1, and the server 102 may also be a local server set up on the terminal device 101. Those of ordinary skill in the art will appreciate that, as the system architecture evolves and new business scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.

Please refer to FIG. 2, which is a schematic flowchart of a data processing method provided by an exemplary embodiment of the present application. The method is described by taking its application to the terminal device in FIG. 1 as an example (that is, the terminal device 101 above, referred to below simply as the terminal device). The method may include the following steps:

S201. Acquire the speech data to be recognized, and acquire the target syllable sequence of the reference keyword.

In the embodiment of the present application, the speech data to be recognized is the target speech on which keyword detection is to be performed, and the reference keyword is the keyword to be detected. The target syllable sequence of the reference keyword is acquired to serve as the original data for keyword detection; all subsequent keyword detection steps are based on this original data. The purpose of this embodiment is to determine whether the target speech contains the reference keyword.

In an embodiment, the speech data to be recognized may be speech data captured by the terminal device through a sound pickup device it is equipped with (such as a microphone), speech data obtained by the terminal device over the Internet from other devices in the network, or speech data read directly from a local storage device (such as a USB flash drive).

In an embodiment, the speech data to be recognized may be speech data in the official language of any country (such as Chinese or English) or in a regional language used in any region of any country (such as Cantonese, Min, or Sichuanese).

In an embodiment, the reference keyword is preset; keyword detection is performed on the speech data to be recognized to determine whether the reference keyword is present in it. The reference keyword may be a keyword in the official language of any country (such as Chinese, English, or Russian) or in a regional language used in any region of any country (such as Cantonese, Min, or Sichuanese). In addition, there may be one reference keyword or several.

In an embodiment, the target syllable sequence of the reference keyword may be obtained as follows:

(1) Acquire the reference keyword, and acquire the target language category of the voice data to be recognized.

The reference keyword and the voice data to be recognized may each be in a different language. Language matching therefore has to be performed between the reference keyword and the voice data to be recognized before keyword detection can be carried out in subsequent steps based on the syllable sequence of the reference keyword and the syllable sequence of the voice data to be recognized.

(2) Acquire the syllable sequence of the reference keyword that matches the target language category, and determine that syllable sequence as the target syllable sequence.

In an embodiment, the language category of the voice data to be recognized (that is, the target language category) may be acquired first; the reference keyword is then acquired and converted to its pronunciation, yielding the syllable sequence of the reference keyword in the target language category (that is, the target syllable sequence).

In an embodiment, the target syllable sequence of the reference keyword may be generated by a pronunciation dictionary model. The pronunciation dictionary model may include multiple branch networks and can perform text conversion on reference keywords in different languages to generate syllable sequences in the target language. For example, when the language category of the voice data to be recognized is language A and the language category of the reference keyword is language B, the pronunciation dictionary model can convert the language-B reference keyword into a language-A syllable sequence. It should be noted that the pronunciation dictionary model may alternatively first translate the language-B reference keyword into language A and then perform syllable recognition on the translation result to obtain the target syllable sequence.
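As a minimal illustrative sketch, the dictionary-based generation of a target syllable sequence can be pictured as a per-language lookup table. The table layout, the language code "zh", and the function name `target_syllables` are assumptions made for illustration only; the two entries reuse the tone-numbered syllables given in the description of FIG. 3, and the branch networks and translation path of the pronunciation dictionary model are not modeled here.

```python
# Hypothetical per-language pronunciation lexicon: keyword text -> syllable sequence.
LEXICON = {
    "zh": {"你好": ["ni3", "hao3"], "世界": ["shi4", "jie4"]},
}


def target_syllables(keyword, target_language, lexicon=LEXICON):
    """Return the syllable sequence of `keyword` in `target_language`, or None if unknown."""
    return lexicon.get(target_language, {}).get(keyword)
```

A keyword that has no entry for the target language category yields `None`, which a caller could treat as a signal to fall back to the translate-then-recognize path mentioned above.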

S202. Invoke a keyword detection model to process the voice data to be recognized, determine the syllable sequence to be detected of the voice data to be recognized, and determine the keyword detection result of the voice data to be recognized according to the syllable sequence to be detected and the target syllable sequence.

In this embodiment of the present application, the keyword detection model can obtain the syllable sequence of the voice data to be recognized (that is, the syllable sequence to be detected) through operations such as feature extraction and acoustic modeling. Once the syllable sequence to be detected and the target syllable sequence are available, a similarity-based approach can be used: the segment of the syllable sequence to be detected that is most similar to the target syllable sequence is located, and when the similarity of that segment satisfies a preset condition, it is determined that the target syllable sequence occurs in the syllable sequence to be detected (that is, the reference keyword is present in the voice data to be recognized), which gives the keyword detection result.
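The similarity-based judgment can be sketched as a sliding-window comparison. This is only one possible reading: the similarity measure (fraction of positions whose syllables agree), the default threshold of 0.8, and the function names are assumptions, since the application does not fix a particular measure or preset condition.

```python
def best_match(detected, target):
    """Slide a window of len(target) over `detected`.

    Returns (start_index, similarity) of the most similar window, where
    similarity is the fraction of positions whose syllables are identical.
    """
    n, m = len(detected), len(target)
    if m == 0 or n < m:
        return None, 0.0
    best_start, best_sim = 0, -1.0
    for start in range(n - m + 1):
        window = detected[start:start + m]
        sim = sum(a == b for a, b in zip(window, target)) / m
        if sim > best_sim:
            best_start, best_sim = start, sim
    return best_start, best_sim


def contains_keyword(detected, target, threshold=0.8):
    """Apply the (assumed) preset condition: best similarity >= threshold."""
    start, sim = best_match(detected, target)
    return sim >= threshold, start, sim
```

An edit-distance or probability-based score could replace the position-wise comparison without changing the overall structure.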

In an embodiment, if the reference keyword includes multiple sub-keywords, that is, the target syllable sequence includes multiple sub-syllable sequences, the keyword detection model may perform keyword detection for each of the sub-syllable sequences separately and output the sub-keywords whose sub-syllable sequences satisfy the preset condition as the keyword detection result.

In an embodiment, the voice data to be recognized may further include a syllable time sequence corresponding to the syllable sequence to be detected. After the sub-keywords satisfying the preset condition have been identified, their occurrence times can therefore be determined from those sub-keywords and the syllable time sequence; the sub-keywords satisfying the preset condition are then output together with their occurrence times.
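One way to picture the occurrence-time lookup is shown below. The representation of the syllable time sequence as a list of (start, end) pairs in seconds, aligned index by index with the syllable sequence to be detected, is an assumption for illustration.

```python
def keyword_time_span(syllable_times, match_start, match_len):
    """Return (start_time, end_time) of a matched keyword.

    syllable_times[i] is the assumed (start, end) time in seconds of the i-th
    detected syllable; the match covers match_len syllables from match_start.
    """
    first_start, _ = syllable_times[match_start]
    _, last_end = syllable_times[match_start + match_len - 1]
    return first_start, last_end
```

For a match that starts at the third syllable and spans two syllables, the function returns the start of the third syllable and the end of the fourth.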

In an embodiment, the keyword detection model is obtained by training with a training data set that includes sample voice data of one or more language categories; the keyword detection model can perform keyword detection on voice data of any of the one or more language categories.

The keyword detection model may include a feature extraction network and a syllable recognition network. Keyword detection based on the syllable sequence to be detected and the target syllable sequence relies mainly on the syllable recognition network. The keyword detection model may be trained through the following steps:

(1) Acquire a training data set. The training data set includes one or more training data subsets, and each training data subset includes sample voice data of one of the one or more language categories together with the reference syllable sequences corresponding to the sample voice data.

When the training data set includes a single training data subset (that is, sample voice data of only one language category), the syllable recognition network of the keyword detection model may be trained on the sample voice data of that language category. When the training data set includes multiple training data subsets (that is, sample voice data of multiple language categories), the corpora of the languages may be mixed in proportion, for example according to the language distribution of online voice data, so that the model performs better in actual service.

It should be noted that, during training of the keyword detection model, the weight given to each language strongly influences which language the final model is biased towards. To keep the final model from favoring any one language, the language weights can be made equal. This works well when the languages have comparable amounts of training data, but in actual service the amount of training corpus differs considerably between languages. The language weights therefore need to be adjusted according to the number of languages, the actual training data, and service requirements.

(2) Perform feature extraction on the sample voice data in the one or more training data subsets to obtain sample voice features of the sample voice data.

(3) Train an initial syllable recognition network using the obtained sample voice features and the corresponding reference syllable sequences to obtain a trained syllable recognition network.

In this embodiment of the present application, the sample voice features of the sample voice data are input into the initial syllable recognition network to obtain a predicted syllable sequence. The network parameters of the initial syllable recognition network are then adjusted using a loss function computed from the predicted syllable sequence and the reference syllable sequence. When the predicted syllable sequences for subsequent sample voice features satisfy a preset condition, training of the initial syllable recognition network is deemed complete, and the trained syllable recognition network is obtained.

In an embodiment, model training may proceed as follows: back-propagation is performed using the loss function to update the weights of the nodes in the initial syllable recognition network, and when the loss function satisfies a preset condition, the final weights of the nodes are determined. The loss function may be one or more of a 0-1 loss function, an absolute-value loss function, a mean-squared-error loss function, a logarithmic loss function, and an exponential loss function.

(4) Generate the trained keyword detection model from the trained syllable recognition network.

After the syllable recognition network has been trained, the trained keyword detection model can be generated from the feature extraction network, the syllable recognition network, and so on. The feature extraction network may also be trained with the training data set to make feature extraction more accurate.
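The proportional mixing of per-language corpora mentioned for step (1) can be sketched as a simple allocation of a sample budget. The function name `mix_counts`, the language codes, and the example weights are hypothetical; in practice the weights would come from the online language distribution or from tuning against service requirements.

```python
def mix_counts(weights, total):
    """Split `total` training samples across languages in proportion to `weights`."""
    norm = sum(weights.values())
    counts = {lang: int(total * w / norm) for lang, w in weights.items()}
    # Rounding down can leave a few samples unassigned; hand them out one
    # each, starting with the highest-weight languages.
    leftover = total - sum(counts.values())
    for lang in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        counts[lang] += 1
    return counts
```

Setting all weights equal reproduces the equal-weight option discussed above, while unequal weights bias the mix towards the languages that dominate real traffic.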

In an embodiment, the feature extraction network and the syllable recognition network may be trained jointly on the same training data set. The joint training may be performed as follows:

The feature extraction network and the syllable recognition network each have an optimizer, and each optimizer is associated with the network parameters of its network. By combining the parameters of the two networks, joint training can be performed to obtain the trained feature extraction network and the trained syllable recognition network. It should be noted that joint training is one way to improve prediction accuracy, which is also affected by other factors such as the quality of the training samples. When using the method provided in this application, a suitable training method should therefore be chosen in light of the actual situation and the prediction results, so as to achieve higher prediction accuracy.

For example, the keyword detection model includes a feature extraction network, a syllable recognition network, and a keyword matching network. The syllable recognition network includes one or more recognition sub-networks; each recognition sub-network performs syllable recognition on voice data of one specified language category, and the specified language category is one of the one or more language categories.

Each recognition sub-network in the syllable recognition network can perform syllable recognition on voice data of one specified language category. There are two cases: in the first, the syllable recognition network includes a single recognition sub-network; in the second, it includes multiple recognition sub-networks.

In the first case, when the syllable recognition network includes a single recognition sub-network, the syllable recognition network performs syllable recognition on voice data of the language category corresponding to that sub-network and obtains the syllable sequence corresponding to the voice data (that is, the syllable sequence to be detected).

In the second case, when the syllable recognition network includes multiple recognition sub-networks, the network parameters of any two of the recognition sub-networks match. In other words, the weights of the encoding layers of the recognition sub-networks corresponding to the language categories are shared. Weight sharing in a neural network means that information learned from one local region is applied to other parts of the data to be processed, which reduces computation and speeds up model training. It should be noted that the weights may be shared between any two of the recognition sub-networks, or among all of them.

Referring to FIG. 3, FIG. 3 is a schematic flowchart of keyword detection. First, the voice data to be recognized (for example, "你好，世界", that is, "hello, world") is input into the feature extraction network to obtain its voice features. The voice features are then input into the syllable recognition network to obtain the syllable sequence to be detected. A vocabulary generation operation is performed on the reference keywords using the pronunciation dictionary to obtain the corresponding target syllable sequences (for example, the syllable sequence of "你好" is 'ni3 hao3' and that of "世界" is 'shi4 jie4', where the digit '3' or '4' denotes the tone of the syllable). Finally, the keyword matching network matches the target syllable sequence against the syllable sequence to be detected to produce the output (that is, the keyword matching result).

The feature extraction network uses digital signal processing algorithms to convert continuous speech into discrete vector representations that effectively characterize the relevant speech features and benefit subsequent speech tasks. The syllable recognition network is mainly trained from given feature sequences and state sequences; the state output probabilities of the model may be modeled by a graph neural network (Graph Neural Network, GNN), a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Network, RNN), a time-delay neural network (Time Delay Neural Network, TDNN), or the like, while the state transition probabilities may be modeled by a hidden Markov model (Hidden Markov Model, HMM) or the like. The vocabulary generation operation uses a dictionary to generate the syllable sequence corresponding to each keyword. The keyword matching network mainly uses the trained model to search the space of possible states and find the state sequence most likely to have produced the input voice features, that is, the one with the maximum output probability. By cascading the above networks and methods, the keyword detection model can detect the keywords in the speech and their start and end positions.

For example, the present application may use multi-task learning and train each language as a separate branch; the weights of the encoder part are shared across the branches, and the final Softmax layer of each branch is independent. In this way, information common to the languages can be learned, and the Softmax layer of each branch can map the encoder output to the corresponding language.
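The shared-encoder, per-language-Softmax arrangement can be illustrated with a toy forward pass. Everything below is an assumption made to keep the sketch self-contained: the class name, the single-layer ReLU encoder, and the plain-list matrices stand in for the real multi-layer network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class MultilingualSyllableNet:
    """Toy model: one encoder shared by all languages, one output head per language."""

    def __init__(self, encoder_weights, heads):
        self.encoder_weights = encoder_weights  # shared across every branch
        self.heads = heads                      # language -> independent output matrix

    def forward(self, feature, language):
        # Shared encoder (a single ReLU layer here), then the language's own Softmax head.
        hidden = [max(0.0, h) for h in matvec(self.encoder_weights, feature)]
        return softmax(matvec(self.heads[language], hidden))
```

Because each head has its own output dimension, languages with different syllable inventories can coexist while every training example still updates the shared encoder.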

For example, after the floating-point weights of the model are obtained, a clustering method is used to group them into clusters by closeness of value. For each cluster only the weight of the cluster center needs to be stored, along with the cluster index (Index) of each weight, which greatly reduces the amount of data. When the weights are updated, back-propagation computes the gradient (Gradient) of each weight; the gradients of weights belonging to the same cluster are summed, and the weight update is performed using the learning rate (Learning Rate, LR).
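A minimal sketch of this cluster-based weight compression follows. It assumes the cluster centers have already been produced by some clustering method (for example k-means, which the application does not name), and it applies a plain gradient-descent step to each center; the function names are hypothetical.

```python
def quantize(weights, centroids):
    """Map each weight to the index of its nearest cluster center."""
    return [
        min(range(len(centroids)), key=lambda c: abs(w - centroids[c]))
        for w in weights
    ]


def update_centroids(centroids, indices, grads, lr):
    """Sum the gradients of all weights in each cluster, then step each center once."""
    summed = [0.0] * len(centroids)
    for idx, g in zip(indices, grads):
        summed[idx] += g
    return [c - lr * s for c, s in zip(centroids, summed)]
```

Only the list of cluster indices and the short list of centers need to be stored, which is where the reduction in data volume comes from.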

In the present application, the voice data to be recognized and the target syllable sequence of the reference keyword are acquired first; the keyword detection model is then invoked to process the voice data to be recognized and determine its syllable sequence to be detected; and the keyword detection result is determined from the syllable sequence to be detected and the target syllable sequence. This automates keyword detection, makes it more intelligent, and improves its efficiency. The keyword detection model proposed in the present application can be trained on sample voice data of multiple language categories and can perform keyword detection on voice data of multiple language categories, and the language categories of the reference keyword and of the voice data to be recognized can be matched, which broadens the applicability of keyword detection and further improves its intelligence. In training the keyword detection model, weight sharing speeds up model training, and mixing the training data set according to the language distribution of online voice data improves the detection accuracy of the keyword detection model.

Referring to FIG. 4, FIG. 4 is a schematic flowchart of a data processing method provided by another exemplary embodiment of the present application. Taking application of the method to the terminal device in FIG. 1 as an example, the method may include the following steps:

S401. Acquire the voice data to be recognized, and acquire the target syllable sequence of the reference keyword.

For the specific implementation of step S401, refer to the description of step S201 in the foregoing embodiment; details are not repeated here.

S402. Invoke the feature extraction network of the keyword detection model to process the voice data to be recognized to obtain voice features of the voice data to be recognized.

In an embodiment, step S402 may be implemented as follows:

(1) Invoke the feature extraction network to perform framing on the voice data to be recognized to obtain time-domain features of the voice data to be recognized.

In this embodiment of the present application, because a sound needs to last at least 10 ms to be heard by the human ear, the digital signal is divided into multiple audible blocks, that is, frames. For example, 10 s of voice data to be recognized may be divided into frames of 25 ms each, yielding 400 voice frames, and these 400 voice frames serve as the time-domain features of the voice data to be recognized.

(2) Invoke the feature extraction network to perform windowing and a frequency-domain transform on the time-domain features to obtain frequency-domain features.

The frequency-domain features are obtained by windowing the time-domain features and then applying a frequency-domain transform (typically a Fourier transform). Windowing applies a window function to each frame, attenuating the samples at both ends of the frame so that the frame can be treated as one period of a periodic signal; applying the frequency-domain transform directly without windowing would cause spectral leakage.

(3) Use the frequency-domain features as the voice features of the voice data to be recognized.

After the frequency-domain features of the voice data to be recognized are obtained, they can be used as the voice features of the voice data to be recognized.
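Steps (1) to (3) can be sketched in a few lines of standard-library Python. The 16 kHz sample rate, the Hamming window, the naive DFT, and the function names are assumptions chosen for illustration; the application does not specify them, and production front ends usually use an FFT and overlapping frames.

```python
import cmath
import math


def split_frames(samples, sample_rate, frame_ms=25):
    """Cut `samples` into consecutive, non-overlapping frames of `frame_ms` milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def hamming(n):
    """Hamming window of length n > 1; it tapers both ends of a frame."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Window one frame and return |DFT| for bins 0..n//2 (naive O(n^2) DFT)."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    return [
        abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]
```

With these assumptions, 10 s of audio at 16 kHz (160,000 samples) splits into 400 frames of 400 samples each, matching the 25 ms example above, and each frame's magnitude spectrum plays the role of its frequency-domain feature.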

S403. Invoke the syllable recognition network of the keyword detection model to process the voice features to obtain the syllable sequence to be detected of the voice data to be recognized.

In this embodiment of the present application, the syllable sequence to be detected is obtained so that it can be matched against the target syllable sequence of the reference keyword to produce the keyword detection result. The keyword detection model contains a syllable recognition network, whose function is to generate the syllable sequence corresponding to the input voice data.

In an embodiment, step S403 may be implemented as follows:

(1) Invoke the syllable recognition network of the keyword detection model to process the voice features, and determine the target language category of the voice data to be recognized according to the voice features.

In this embodiment of the present application, the syllable recognition network of the keyword detection model first determines the language category corresponding to the voice features, so that the recognition sub-network for that language category can be used to process the voice features, which improves processing accuracy.

In an embodiment, the syllable recognition network includes a language recognition network, and the keyword detection model invokes the language recognition network to process the voice data to be recognized to obtain its target language category. The language recognition network may identify the language using methods such as CV syllable segmentation or the linear prediction residual.

In an embodiment, the language recognition network is independent of the syllable recognition network and is placed after the feature extraction network and before the syllable recognition network. The language recognition network first obtains the voice features output by the feature extraction network, processes them to obtain the target language category of the voice data to be recognized, and then sends the voice features and the target language category to the syllable recognition network for processing.

(2) Invoke the recognition sub-network corresponding to the target language category in the syllable recognition network to process the voice features to obtain the syllable distribution probability of the voice features.

According to the syllable characteristics of each language, the syllable classes of that language category can be set in advance. For example, if language A has 1000 syllable classes, then for a segment of speech in that language category, passing each frame through the corresponding recognition sub-network yields a probability distribution over the 1000 syllable classes (that is, the syllable distribution probability).

(3) Determine the syllable sequence to be detected of the voice data to be recognized according to the syllable distribution probability.

In an embodiment, after the recognition sub-network corresponding to the target language category is invoked to process each frame of the voice features of the voice data to be recognized, the syllable distribution probability of each frame is obtained, and the syllable with the highest probability is taken as the target syllable of that frame. Once the syllable distribution probabilities of all voice features have been processed, multiple target syllables have been obtained; these are combined to form the syllable sequence to be detected of the voice data to be recognized.
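Step (3) amounts to a per-frame arg-max over the syllable distribution. In the sketch below, the function name, the three-entry syllable table, and the probability values are hypothetical.

```python
def decode_syllables(frame_probs, syllable_table):
    """Pick the highest-probability syllable for every frame and return them in order."""
    return [
        syllable_table[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
```

Given one probability vector per frame, the function returns one syllable label per frame; merging consecutive repeats of the same label, if desired, would be an additional step that the text above does not describe.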

In an embodiment, the syllable recognition network is constructed from a time-delay neural network and a singular value decomposition network.

In this embodiment of the present application, the time-delay neural network (TDNN) addresses the problem that the hidden Markov model (Hidden Markov Model, HMM), a traditional method in speech recognition, cannot adapt to dynamic time-domain changes in speech signals. A TDNN has relatively few structural parameters and does not require phonetic symbols to be aligned with the audio on the timeline before recognition; experiments show that a TDNN performs better than an HMM. Two notable characteristics of a TDNN are that it adapts dynamically to changes in time-domain features and that it has few parameters. In a traditional deep neural network the input layer is connected to the hidden layer one to one. A TDNN changes this: the features of a hidden layer depend not only on the input at the current moment but also on the input at future moments. A TDNN can therefore express the temporal relationships among speech features, and weight sharing makes the TDNN easier to train.

Singular value decomposition (SVD) is a matrix factorization method. Unlike eigenvalue decomposition, which applies only to square matrices, SVD does not require the matrix to be square and can be applied to any matrix. With SVD, the original data set can be represented by a smaller data set, which removes noise and redundant information, greatly reduces the amount of data, and speeds up computation.

The present application uses syllables as the modeling unit, constructs an SVD network based on the SVD method, and combines it with the TDNN, using a factorized time-delay neural network (TDNN-F) as the syllable recognition network. The main function of the syllable recognition network is to convert the speech input into a syllable output that represents it acoustically. Specifically, for each speech frame, the probability distribution over the syllables to which the frame may belong is computed.

Referring to FIG. 5, FIG. 5 is a schematic structural diagram of the TDNN-F network proposed in the present application. The TDNN-F network is a multi-layer network in which every layer has a strong ability to abstract speech features; it also has a wide context window, so it can capture broader context information and model temporal dependencies in speech more effectively. In addition to the ordinary layer structure of the TDNN-F network (the structure formed by the solid black circles in FIG. 5), an SVD layer can be inserted between certain TDNN layers (the structure boxed as SVD in FIG. 5). For example, suppose the TDNN-F model has N layers, each TDNN layer has M nodes, and each SVD layer has K nodes, where K is much smaller than M. The number of model parameters D1 of the TDNN is then:

D1 = N * M * M

and the number of model parameters D2 of the TDNN-F is:

D2 = N * (M + M) * K

As the two formulas show, because K is much smaller than M, the number of TDNN parameters D1 is much larger than the number of TDNN-F parameters D2. Compared with the TDNN network, the TDNN-F network therefore effectively reduces the number of trainable model parameters and speeds up model inference.
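The two parameter-count formulas can be checked numerically. The values N = 10, M = 1024, and K = 128 used in the example are hypothetical and chosen only to make the ratio easy to read.

```python
def tdnn_params(n_layers, m):
    """D1 = N * M * M: each layer is a full M-by-M weight matrix."""
    return n_layers * m * m


def tdnn_f_params(n_layers, m, k):
    """D2 = N * (M + M) * K: each layer is factored through a K-node bottleneck."""
    return n_layers * (m + m) * k


d1 = tdnn_params(10, 1024)
d2 = tdnn_f_params(10, 1024, 128)
```

With these example values, D1 is 10,485,760 and D2 is 2,621,440, so the factorized network has a quarter of the parameters; in general the ratio D1 / D2 equals M / (2K).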

需要说明的是，较大参数的网络模型可以获得更好的效果，但是同时会减慢模型的推理速度。因此，为了在速度和效果上取得一个较好的平衡，网络模型通常需要进行参数优化。在本申请提出的TDNN-F网络中，影响网络效果的主要参数有网络层数、每层节点数、SVD节点数和上下文视野宽度。It should be noted that a network model with larger parameters can achieve better results, but at the same time, it will slow down the inference speed of the model. Therefore, in order to achieve a better balance between speed and effect, the network model usually needs to optimize the parameters. In the TDNN-F network proposed in this application, the main parameters affecting the network effect are the number of network layers, the number of nodes in each layer, the number of SVD nodes, and the width of the context view.

Based on the TDNN-F network described above, a syllable recognition network can be constructed. Please refer to FIG. 6, which is a schematic diagram of the structure and flow of a syllable recognition network proposed by the present application. The syllable recognition network is based on the TDNN-F network and uses the Chain Model method (a sequence-discriminative training method) together with multi-task learning, treating each language category as an independent branch. The specific flow is as follows: the speech features of the speech data to be recognized are taken as the input of the syllable recognition network; the Chain Model network processes the speech features separately for the different language categories; and the network then outputs the syllable sequences (that is, the syllable sequences to be detected) corresponding to the multiple language categories (for example, language 1 and language N).

The Chain Model network has a multi-layer structure (in FIG. 6, the layer with solid black circles is the input layer, the layers with hollow circles are the intermediate layers, and the layer with hatched circles is the output layer). This multi-layer structure in turn contains multiple TDNN-F networks. Each TDNN-F network consists of the i-th layer of the Chain Model network (where i denotes any one of the intermediate layers), the (i-1)-th layer, a Dropout layer (a neuron-dropping layer that sets designated neural network units to zero with a given probability), a TDNN-F layer, and multiple channels. The channels include a direct channel from layer i-1 to layer i (that is, a skip connection) and an ordinary channel from layer i-1 to layer i through the TDNN-F layer and the Dropout layer. The skip connection and the Dropout method make model training more stable.
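The two channels between layer i-1 and layer i can be illustrated with a small sketch. It is a simplified stand-in, not the network of FIG. 6: the TDNN-F layer is reduced to a plain low-rank linear map without time-delay context or bias, and the dropout uses the common "inverted" scaling, which the application does not specify. All function names and weights are hypothetical.

```python
import random


def low_rank_layer(x, w_down, w_up):
    """Project M inputs down to K hidden units, then back up to M outputs."""
    hidden = [sum(xi * w for xi, w in zip(x, row)) for row in w_down]
    return [sum(hj * w for hj, w in zip(hidden, row)) for row in w_up]


def dropout(values, p, training, rng):
    """Zero each unit with probability p during training (inverted dropout)."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def skip_block(x_prev, w_down, w_up, p=0.1, training=False, rng=None):
    """Layer i = direct channel (x_prev) + ordinary channel (layer, dropout)."""
    rng = rng or random.Random(0)
    branch = dropout(low_rank_layer(x_prev, w_down, w_up), p, training, rng)
    return [a + b for a, b in zip(x_prev, branch)]


# Toy weights with M = 2 and K = 1 (illustrative values only).
x_prev = [1.0, 2.0]
w_down = [[0.5, 0.5]]    # K rows of length M
w_up = [[1.0], [-1.0]]   # M rows of length K
y = skip_block(x_prev, w_down, w_up)
```

Because the direct channel passes x_prev through unchanged, the block still carries information forward even when dropout zeroes the whole ordinary channel.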

In one embodiment, the TDNN-F-based syllable recognition network constructed in this application can use sub-sampling layers and weight sharing to reduce the amount of computation. It can also use a frame-level cross-entropy loss function, with an LF-MMI loss function added on top of it, to guide error propagation in the TDNN-F model. At the network output stage, the syllable recognition network performs a keyword matching operation on the branch of each language category. When the matching result of a keyword satisfies a preset condition, the network outputs that keyword together with its predicted probability and its start and end positions.

Based on the TDNN-F-based syllable recognition network, this application proposes a TDNN-F-based multilingual speech keyword detection method, which is implemented as follows:

Please refer to FIG. 7, which is a schematic flowchart of the TDNN-F-based multilingual speech keyword detection method proposed by the present application. First, the speech data to be recognized (for example, "你好，世界", that is, "hello, world") is input into the feature extraction network to obtain its speech features. The speech features are then input into the syllable recognition network to obtain the syllable sequence to be detected. Based on TDNN-F and multi-task learning, the syllable recognition network can process the speech features separately for different language types and obtain the syllable sequences to be detected corresponding to multiple languages (for example, language 1 and language N). Next, a method based on the hidden Markov model (HMM) and lattice-free maximum mutual information (LF-MMI) is used to realize the state transitions of the model. The proposed method is based on the HMM, which describes a Markov process with hidden, unknown parameters. The Markov process includes visible states (for example, the syllables in the speech data, such as S1, S2, ... in FIG. 7) and hidden states (for example, the contextual relationships between the syllables in the speech data, such as the arrow above S2 in FIG. 7 that points back to S2). In this way, the syllable sequence of the speech data to be detected and the hidden states between its syllables can be fully used, which improves data utilization.

In this embodiment of the present application, a multilingual pronunciation dictionary can be used to perform a vocabulary generation operation on the reference keywords (that is, the multilingual keywords in the figure). For example, the syllable sequences of the keyword "你好" ("hello") include 'ni3 hao3' and 'nei5 hou2', and the syllable sequences of "世界" ("world") include 'shi4 jie4' and 'sai3 jaai3', where a trailing digit such as '2', '3' or '4' denotes the tone of the syllable. After the target syllable sequences of the reference keywords in multiple language types are obtained, the keyword matching network matches the target syllable sequences against the syllable sequence to be detected to produce the output result (that is, the keyword matching result).
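The vocabulary generation step can be pictured as a dictionary lookup. The sketch below is illustrative only: the syllable strings are those given in the example above, but the dictionary layout, the placeholder language labels `language_1` and `language_2`, the pairing of syllable strings with those labels, and the function name are assumptions.

```python
# Hypothetical multilingual pronunciation dictionary. The syllables come from
# the example in the text; the language labels are placeholders.
PRONUNCIATIONS = {
    "你好": {"language_1": ["ni3", "hao3"], "language_2": ["nei5", "hou2"]},
    "世界": {"language_1": ["shi4", "jie4"], "language_2": ["sai3", "jaai3"]},
}


def target_syllable_sequences(keyword_parts):
    """Return {language: syllable list} for a keyword built from dictionary entries."""
    languages = None
    for part in keyword_parts:
        available = set(PRONUNCIATIONS[part])
        languages = available if languages is None else languages & available
    sequences = {}
    for lang in sorted(languages or []):
        # Concatenate the syllables of each part in keyword order.
        sequences[lang] = [
            syllable
            for part in keyword_parts
            for syllable in PRONUNCIATIONS[part][lang]
        ]
    return sequences


targets = target_syllable_sequences(["你好", "世界"])
```

Each entry of `targets` is one candidate target syllable sequence that the keyword matching network would compare against the syllable sequence to be detected.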

S404: Call the keyword matching network of the keyword detection model to process the syllable sequence to be detected and the target syllable sequence, and obtain the keyword detection result of the speech data to be recognized.

In one embodiment, the syllable sequence to be detected may include one or more syllable elements arranged in a first order, where the first order may follow the order in which the syllables occur in the speech data to be recognized. The target syllable sequence includes one or more syllable elements arranged in a second order, where the second order may follow the order of the syllables in the keyword, or may be obtained by rearranging the components of the keyword. For example, when the keyword is "你好，世界" ("hello, world"), the syllable elements can be ordered according to the syllable order of the keyword, giving syllable elements ordered as "你好世界" ("hello world"). Alternatively, the order can be adjusted according to the components of the keyword (which consists of the two parts "你好" and "世界"), giving the syllable order of "世界你好" ("world hello") and syllable elements ordered on that basis.

In one embodiment, step S404 above can be implemented as follows:

(1) Call the keyword matching network of the keyword detection model to process the syllable sequence to be detected and the target syllable sequence, and detect whether the target syllable sequence contains a matching syllable element that matches a first syllable element, where the first syllable element is any syllable element in the syllable sequence to be detected.

In this embodiment of the present application, the first syllable element can be any syllable element in the syllable sequence to be detected. The above step can be understood as follows: the keyword matching network of the keyword detection model is called to match the target syllable sequence against each syllable element in the syllable sequence to be detected in turn (in each matching pass, the selected syllable element of the syllable sequence to be detected serves as the first syllable element), and to detect whether the target syllable sequence contains a matching syllable element that matches the first syllable element.

In one embodiment, whether the target syllable sequence contains a matching syllable element for the first syllable element can be determined as follows: the keyword matching network is called to match each syllable element in the target syllable sequence against the first syllable element, which yields one matching probability per syllable element of the target syllable sequence. The maximum of these matching probabilities is then taken. If the maximum matching probability is greater than the syllable matching probability threshold, the syllable element corresponding to it is determined to be the matching syllable element; that is, the target syllable sequence is determined to contain a matching syllable element for the first syllable element.
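The maximum-probability test described above can be sketched as follows. The scoring function `toy_probability` is a made-up stand-in for whatever the keyword matching network actually computes, and the function names and the default threshold of 0.9 are assumptions for illustration.

```python
def toy_probability(candidate, target):
    """Hypothetical score: exact match, same syllable with another tone, or neither."""
    if candidate == target:
        return 0.95
    if candidate[:-1] == target[:-1]:
        return 0.6
    return 0.05


def best_match(first_syllable, target_sequence, match_probability, threshold=0.9):
    """Return (index, probability) of the best target element, or None."""
    if not target_sequence:
        return None
    probs = [match_probability(first_syllable, t) for t in target_sequence]
    best_index = max(range(len(probs)), key=probs.__getitem__)
    # Accept the best candidate only if it clears the syllable threshold.
    if probs[best_index] > threshold:
        return best_index, probs[best_index]
    return None


hit = best_match("hao3", ["ni3", "hao3"], toy_probability)
miss = best_match("hao4", ["ni3", "hao3"], toy_probability)
```

Here `hit` identifies the second target element, while `miss` is `None` because the best score for "hao4" stays below the threshold.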

(2) If there is a matching syllable element for the first syllable element, detect whether the target syllable sequence contains a matching syllable element that matches a second syllable element, where the second syllable element is the syllable element immediately following the first syllable element in the syllable sequence to be detected.

In one embodiment, after it is determined that the target syllable sequence contains a matching syllable element for the first syllable element, the syllable element immediately following the first syllable element can be taken as the second syllable element, and the keyword matching network is called to match the target syllable sequence against the second syllable element using the method of step (1) above.

(3) If there is a matching syllable element for the second syllable element, determine the keyword detection result of the speech data to be recognized according to the first syllable element and the second syllable element.

In one embodiment, the reference keyword has two syllable elements, and syllable matching is performed using the methods of steps (1) and (2). If the target syllable sequence contains a matching syllable element for the first syllable element and also contains a matching syllable element for the second syllable element (in this case the second syllable element consists of a single syllable element), it is determined that the target syllable sequence is present in the speech data to be recognized.

In one embodiment, the reference keyword has K syllable elements (K greater than 2), so the second syllable element may comprise K-1 syllable elements (which may be ordered according to the second order). After it is determined, using the method of step (1), that the target syllable sequence contains a matching syllable element for the first syllable element, the first of the K-1 syllable elements is matched using the method of step (1). When that match succeeds, the subsequent syllable elements among the K-1 syllable elements are checked in turn. If all K-1 syllable elements can be matched with syllable elements in the syllable sequence to be detected, it is determined that the target syllable sequence is present in the speech data to be recognized, and the keyword detection result of the speech data to be recognized is output.
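Steps (1) to (3) amount to matching the syllables one after another and stopping at the first failure. The sketch below shows one simplified reading of that procedure. It assumes, purely for illustration, that the syllable sequence to be detected is available as a list of (syllable, probability) pairs and that a syllable matches when the labels are equal and the probability exceeds the threshold; neither assumption is stated in the application.

```python
def find_keyword(detected, target, threshold=0.9):
    """Search `detected` for the consecutive syllables of `target`.

    detected: list of (syllable, probability) pairs (hypothetical format).
    target:   list of syllables of the reference keyword.
    Returns start/end indices and per-syllable probabilities, or None.
    """
    if not target:
        return None
    k = len(target)
    for start in range(len(detected) - k + 1):
        probs = []
        for offset, wanted in enumerate(target):
            syllable, prob = detected[start + offset]
            if syllable != wanted or prob <= threshold:
                break  # stop at the first syllable that fails to match
            probs.append(prob)
        else:
            # All K syllables matched in order.
            return {"start": start, "end": start + k - 1, "probabilities": probs}
    return None


detected = [("jin1", 0.97), ("ni3", 0.96), ("hao3", 0.98), ("a5", 0.91)]
found = find_keyword(detected, ["ni3", "hao3"])
absent = find_keyword(detected, ["shi4", "jie4"])
```

The returned indices play the role of the start and end positions of the keyword, and the per-syllable probabilities can be fused into a single predicted probability.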

In one embodiment, outputting the keyword detection result of the speech data to be recognized may specifically include outputting the target keyword determined to be present in the speech data to be recognized, the start and end times of the target keyword, and the predicted probability of the target keyword.

In one embodiment, the predicted probability of the target keyword is obtained by fusing the matching probabilities of all syllable elements in the target keyword. For example, suppose the target keyword contains K syllable elements (for example, K = 3), the syllable matching probability threshold is V (for example, V = 0.9), and the matching probabilities of all K syllable elements are greater than the threshold (for example, 0.96, 0.98 and 0.94). The average of the three matching probabilities (here 0.96) can be used as the predicted probability of the target keyword, or their product (here approximately 0.88) can be used instead.
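The two fusion options can be reproduced with the numbers from the example above. This is a minimal sketch; the function names are assumptions, and the application does not prescribe a particular implementation.

```python
import math
from statistics import fmean


def fuse_mean(probabilities):
    """Predicted keyword probability as the average of the syllable probabilities."""
    return fmean(probabilities)


def fuse_product(probabilities):
    """Predicted keyword probability as the product of the syllable probabilities."""
    return math.prod(probabilities)


syllable_probs = [0.96, 0.98, 0.94]
mean_prob = fuse_mean(syllable_probs)        # about 0.96
product_prob = fuse_product(syllable_probs)  # about 0.884
```

The product is always at most the smallest individual probability, so it penalizes long keywords more strongly than the average does.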

The keyword matching network may include multiple matching strategies. Determining the keyword result from the matching probabilities obtained by multiple matching strategies improves the accuracy of keyword detection. The matching strategies may include methods such as dynamic programming, longest-sequence-first matching, and exhaustive search for the optimal path, and may be implemented with one or more of these methods. When keyword matching is performed with multiple matching strategies, the specific steps are as follows:

(1) Call the multiple matching strategies in the keyword matching network to process the syllable sequence to be detected (including the first syllable element and the second syllable element) and the target syllable sequence respectively, and obtain multiple posterior probability sets, where each posterior probability set contains the multiple predicted probabilities of the reference keyword obtained by one of the matching strategies.

Exemplarily, the keyword matching network includes Z matching strategies (for example, Z = 3). The Z matching strategies are used to process the syllable sequence to be detected and the target syllable sequence, giving Z posterior probability sets (for example, each set contains three posterior probabilities, and the sets are [0.94, 0.96, 0.98], [0.88, 0.94, 0.96] and [0.90, 0.94, 0.92]).

(2) Take the maximum of the predicted probabilities in each posterior probability set as the strategy probability, and determine the predicted probability of the reference keyword according to the multiple strategy probabilities.

Exemplarily, the maximum predicted probability of each posterior probability set is determined (for example, 0.98 for the first matching strategy, 0.96 for the second, and 0.94 for the third), and the predicted probability of the reference keyword is then determined from these maxima (for example, averaging the three maxima 0.98, 0.96 and 0.94 gives a predicted probability of 0.96 for the reference keyword).

(3) Determine the keyword detection result of the speech data to be recognized according to the predicted probability of the reference keyword.

Exemplarily, if the syllable matching probability threshold is V (for example, V = 0.9) and the predicted probability of the reference keyword is 0.96, which is greater than the threshold, the reference keyword is determined to be the target keyword and the keyword detection result is output.
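The three steps of the multi-strategy procedure can be sketched with the example numbers given above. Averaging the strategy probabilities follows the worked example; the function names are assumptions, and other ways of combining the strategy probabilities are equally compatible with step (2).

```python
from statistics import fmean


def keyword_probability(posterior_sets):
    """Take the maximum of each strategy's set, then average the maxima."""
    strategy_probs = [max(probabilities) for probabilities in posterior_sets]
    return fmean(strategy_probs)


def is_target_keyword(posterior_sets, threshold=0.9):
    """Step (3): compare the fused probability with the threshold V."""
    return keyword_probability(posterior_sets) > threshold


posterior_sets = [
    [0.94, 0.96, 0.98],  # first matching strategy
    [0.88, 0.94, 0.96],  # second matching strategy
    [0.90, 0.94, 0.92],  # third matching strategy
]
fused = keyword_probability(posterior_sets)  # about 0.96
accepted = is_target_keyword(posterior_sets)
```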

During keyword detection, if the speech data to be detected is not accurate (for example, the user's pronunciation is non-standard, so the syllable sequence to be detected that is recognized from the speech data is ambiguous), a correction network can be added to the keyword detection model proposed in this application, either before the feature extraction network or after the syllable recognition network. Using built-in grammar rules or syllable pronunciation rules for multiple language categories, the correction network corrects the speech data to be detected or the syllable sequence to be detected, yielding a more accurate syllable sequence to be detected and improving the accuracy of the subsequent keyword matching.

In addition, while generating the target syllable sequence from the reference keyword, the method proposed in this application also generates a fuzzy syllable sequence, which is a non-standard form of the target syllable sequence (for example, if the reference keyword is "世界" ("world"), its target syllable sequence may be "shi4 jie4" and its fuzzy syllable sequence may be "si4 jie4"). By generating the fuzzy syllable sequence and performing keyword detection based on the target syllable sequence, the fuzzy syllable sequence and the syllable sequence to be detected, non-standard pronunciations of the keyword in the speech data to be detected can also be detected, which makes the method more intelligent and more widely applicable. In certain settings (for example, keyword detection on the speech of elderly people, children, or speakers with non-standard pronunciation), this can improve the user experience.
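One way to picture fuzzy syllable generation is a table of substitutions applied to the target syllable sequence. The application does not say how fuzzy sequences are produced, so the rule table below is an assumption: it contains only the single substitution implied by the "shi4" to "si4" example, and the names `CONFUSIONS` and `fuzzy_variants` are hypothetical.

```python
# Hypothetical confusion rules: standard onset -> non-standard onset.
# Only the substitution implied by the example in the text is listed.
CONFUSIONS = {"sh": "s"}


def fuzzy_variants(target_sequence):
    """Return fuzzy syllable sequences that differ from the target in one syllable."""
    variants = []
    for index, syllable in enumerate(target_sequence):
        for standard, nonstandard in CONFUSIONS.items():
            if syllable.startswith(standard):
                variant = list(target_sequence)
                variant[index] = nonstandard + syllable[len(standard):]
                variants.append(variant)
    return variants


fuzzy = fuzzy_variants(["shi4", "jie4"])
```

For the target sequence of the example, the sketch yields the single fuzzy sequence "si4 jie4".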

Keyword matching based on the fuzzy syllable sequence, the target syllable sequence and the syllable sequence to be detected can be implemented as follows:

The keyword matching network of the keyword detection model is called to process the syllable sequence to be detected, the target syllable sequence and the fuzzy syllable sequence, and the keyword detection result of the speech data to be recognized is obtained. Using the method of step S404 above, a first predicted probability for the reference keyword (that is, for the target syllable sequence) and a second predicted probability for the fuzzy syllable sequence can be obtained, and the keyword detection result of the speech data to be recognized is determined jointly from the first and second predicted probabilities.

Exemplarily, the first predicted probability of the target syllable sequence is 0.88, the second predicted probability of the fuzzy syllable sequence is 0.92, and the first syllable matching probability threshold is 0.9. The first predicted probability is less than the first threshold (which indicates that the speech data to be detected does not contain the standard syllable sequence of the keyword), but the second predicted probability is greater than the first threshold (which indicates that the speech data contains a non-standard syllable sequence of the keyword). It is therefore judged that a non-standard form of the keyword is present in the speech data to be detected, and a further check is made. For example, a second syllable matching probability threshold P can be set (for example, P = 0.85). If the first predicted probability of the target syllable sequence is less than the first syllable matching probability threshold but greater than the second syllable matching probability threshold, and the second predicted probability of the fuzzy syllable sequence is greater than the first syllable matching probability threshold, it is determined that the keyword is present in the speech data to be recognized.
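The joint decision in this example can be written as a small rule. This is a sketch of one reading of the paragraph above: the function name is an assumption, the fuzzy probability is compared with the first threshold, and accepting the keyword directly when the first predicted probability already exceeds the first threshold is inferred from step S404 rather than stated here.

```python
def keyword_present(first_prob, second_prob,
                    first_threshold=0.9, second_threshold=0.85):
    """Decide from the target (first) and fuzzy (second) predicted probabilities."""
    if first_prob > first_threshold:
        return True  # the standard form of the keyword matched
    # Non-standard form: the target score must clear the looser second
    # threshold and the fuzzy score must clear the first threshold.
    return first_prob > second_threshold and second_prob > first_threshold


example = keyword_present(0.88, 0.92)
```

With the values from the example (0.88 and 0.92), the rule reports that the keyword is present in its non-standard form.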

This application tested the proposed TDNN-F-based keyword detection method. Two target language categories (language 1 and language 2) were selected, and 27.5 hours of speech corpus in each language category were used as test data. To evaluate the performance of the TDNN-F-based keyword detection model, a cascade of language identification followed by monolingual keyword detection was also built as a baseline for comparison.

Please refer to the table below:

| Method | Precision | Recall | Harmonic mean |
| --- | --- | --- | --- |
| Language identification + monolingual keyword detection cascade | 91.6% | 64.9% | 76.0% |
| TDNN-F-based keyword detection | 91.7% | 68.5% | 78.4% |
| Improvement | +0.1% | +3.6% | +2.45% |

The table lists the keyword detection results for language 1. For language 1, the cascade of language identification and monolingual keyword detection achieves a precision of 91.6%, a recall of 64.9% and a harmonic mean of 76.0%. The TDNN-F-based keyword detection method achieves a precision of 91.7%, a recall of 68.5% and a harmonic mean of 78.4%. Compared with the cascade method, the TDNN-F-based method therefore improves precision by 0.1%, recall by 3.6% and the harmonic mean by 2.45%.

The results show that, compared with the cascade of language identification and monolingual keyword detection, the TDNN-F-based keyword detection method proposed in this application improves recall (coverage) by 3.6% while keeping precision essentially the same. This indicates that the TDNN-F-based keyword detection system can effectively increase the number of keywords recalled and performs better overall.
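The harmonic mean column for language 1 can be recomputed from the first two columns (the accuracy rate, read here as precision, and the recall). The sketch below uses the standard harmonic mean of two values; the function name is an assumption, and the inputs are the language 1 percentages quoted above.

```python
def harmonic_mean(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Language 1 figures quoted in the text.
baseline = harmonic_mean(91.6, 64.9)   # about 75.97
proposed = harmonic_mean(91.7, 68.5)   # about 78.42
improvement = proposed - baseline      # about 2.45
```

Rounded to one decimal place these values give the 76.0% and 78.4% shown for language 1, and the unrounded difference gives the reported improvement of about 2.45.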

Please refer to the table below:

| Method | Precision | Recall | Harmonic mean |
| --- | --- | --- | --- |
| Language identification + monolingual keyword detection cascade | 84.2% | 78.5% | 81.1% |
| TDNN-F-based keyword detection | 84.4% | 79.4% | 81.82% |
| Improvement | +0.2% | +1.2% | +0.73% |

The table lists the keyword detection results for language 2. For language 2, the cascade of language identification and monolingual keyword detection achieves a precision of 84.2%, a recall of 78.5% and a harmonic mean of 81.1%. The TDNN-F-based keyword detection method achieves a precision of 84.4%, a recall of 79.4% and a harmonic mean of 81.82%. Compared with the cascade method, the TDNN-F-based method is reported to improve precision by 0.2%, recall by 1.2% and the harmonic mean by 0.73%.

The results show that, compared with the cascade of language identification and monolingual keyword detection, the TDNN-F-based keyword detection method proposed in this application improves recall (coverage) by 1.2% while keeping precision essentially the same. This result again confirms the effectiveness of the TDNN-F-based multilingual keyword detection method.

The syllable recognition network in the keyword detection model proposed in this application includes a language identification network. Language identification is first performed on the speech to be detected, and the recognition sub-network corresponding to the target language category is then called to process the speech features and determine the syllable sequence to be detected of the speech data to be recognized. This lets the keyword detection model process speech data in different languages in a targeted manner and improves processing efficiency. This application uses the syllable as the modeling unit and combines singular value decomposition with a time-delay neural network to build a keyword detection model based on a factorized time-delay neural network, which improves keyword detection accuracy and efficiency; skip connections and neuron dropout further improve detection efficiency and the stability of model training. This application also orders the syllable sequence of the reference keyword and generates the corresponding fuzzy syllable sequence, so that the keyword detection model can detect keywords with different syllable orders as well as non-standard keywords, which improves applicability and intelligence. Finally, multiple matching strategies are preset in the keyword matching network, and the keyword detection result is determined from the keyword prediction probabilities obtained by these strategies, which further improves the accuracy of keyword detection.

Please refer to FIG. 8, which is a schematic block diagram of a data processing apparatus provided by an embodiment of the present application. The data processing apparatus may specifically include:

an acquisition module 801, configured to acquire the speech data to be recognized and to acquire the target syllable sequence of a reference keyword;

a processing module 802, configured to call a keyword detection model to process the speech data to be recognized, determine the syllable sequence to be detected of the speech data to be recognized, and determine the keyword detection result of the speech data to be recognized according to the syllable sequence to be detected and the target syllable sequence;

wherein the keyword detection model is obtained by training with a training data set, and the training data set includes sample speech data of one or more language categories; the keyword detection model can perform keyword detection on speech data of any one of the one or more language categories.

In one embodiment, the keyword detection model includes a feature extraction network, a syllable recognition network and a keyword matching network. The syllable recognition network includes one or more recognition sub-networks, each of which is used to perform syllable recognition on speech data of one specified language category, the specified language category being one of the one or more language categories. When the syllable recognition network includes multiple recognition sub-networks, the network parameters of any two of the recognition sub-networks match.

Optionally, when acquiring the target syllable sequence of the reference keyword, the acquisition module 801 is specifically configured to:

获取参考关键词，以及获取所述待识别语音数据的目标语种类别；Obtain reference keywords, and obtain the target language category of the speech data to be recognized;

获取所述参考关键词的与所述目标语种类别相匹配的音节序列，将所述参考关键词的与所述目标语种类别相匹配的音节序列确定为目标音节序列。Acquire the syllable sequence of the reference keyword that matches the target language category, and determine the syllable sequence of the reference keyword that matches the target language category as the target syllable sequence.
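The lookup described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory lexicon that maps each reference keyword to one syllable sequence per language category; the keyword name, language codes, and syllable spellings are illustrative and are not taken from the application.

```python
# Hypothetical lexicon: reference keyword -> language category -> syllable sequence.
# All entries are illustrative assumptions.
LEXICON = {
    "greeting": {
        "zh": ["ni3", "hao3"],
        "en": ["he", "llo"],
    },
}


def get_target_syllables(keyword, language):
    """Return the syllable sequence of `keyword` that matches `language`."""
    by_language = LEXICON.get(keyword)
    if by_language is None:
        raise KeyError(f"unknown reference keyword: {keyword!r}")
    if language not in by_language:
        raise KeyError(f"no syllable sequence for {keyword!r} in {language!r}")
    # Return a copy so callers cannot modify the lexicon entry.
    return list(by_language[language])
```

A caller would first determine the target language category of the speech data and then pass it to `get_target_syllables` together with the reference keyword.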

Optionally, when invoking the keyword detection model to process the speech data to be recognized, determining the syllable sequence to be detected of the speech data to be recognized, and determining the keyword detection result of the speech data to be recognized according to the syllable sequence to be detected and the target syllable sequence, the processing module 802 is specifically configured to:

Invoke the feature extraction network of the keyword detection model to process the speech data to be recognized to obtain speech features of the speech data to be recognized;

Invoke the syllable recognition network of the keyword detection model to process the speech features to obtain the syllable sequence to be detected of the speech data to be recognized;

Invoke the keyword matching network of the keyword detection model to process the syllable sequence to be detected and the target syllable sequence to obtain the keyword detection result of the speech data to be recognized.
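The three-stage flow above can be sketched as a simple composition of callables. This is only an illustration of the data flow: the function names are hypothetical, and the toy stand-ins below replace the trained networks, which the application does not specify at this level of detail.

```python
from typing import Callable, List, Sequence

Features = List[List[float]]


def detect_keyword(
    audio: Sequence[float],
    target_syllables: Sequence[str],
    extract_features: Callable[[Sequence[float]], Features],
    recognize_syllables: Callable[[Features], List[str]],
    match_keyword: Callable[[Sequence[str], Sequence[str]], bool],
) -> bool:
    """Run feature extraction, syllable recognition, then keyword matching."""
    features = extract_features(audio)
    detected = recognize_syllables(features)
    return match_keyword(detected, target_syllables)


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_features(audio):
    # One single-value feature vector per audio sample.
    return [[x] for x in audio]


def toy_recognizer(features):
    # Map each frame to one of two made-up syllables by thresholding.
    return ["ni3" if f[0] < 0.5 else "hao3" for f in features]


def toy_matcher(detected, target):
    # Simplest possible matcher: exact equality of the two sequences.
    return list(detected) == list(target)
```

For example, `detect_keyword([0.1, 0.9], ["ni3", "hao3"], toy_features, toy_recognizer, toy_matcher)` passes the audio through all three stages and returns a boolean detection result.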

Optionally, when invoking the syllable recognition network of the keyword detection model to process the speech features to obtain the syllable sequence to be detected of the speech data to be recognized, the processing module 802 is specifically configured to:

Invoke the syllable recognition network of the keyword detection model to process the speech features, and determine the target language category of the speech data to be recognized according to the speech features;

Invoke the recognition sub-network corresponding to the target language category in the syllable recognition network to process the speech features to obtain a syllable distribution probability of the speech features;

Determine the syllable sequence to be detected of the speech data to be recognized according to the syllable distribution probability.
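The steps above can be sketched as follows. The application does not state how the syllable sequence is derived from the distribution probability, so this sketch assumes a greedy decode: take the most probable syllable per frame, merge consecutive repeats, and drop an assumed `<blank>` filler symbol. The language identifier, sub-network callables, and syllable inventories are hypothetical placeholders.

```python
BLANK = "<blank>"  # assumed filler symbol for frames that carry no syllable


def decode_syllables(frame_probs, syllables):
    """Greedy decode: argmax per frame, merge consecutive repeats, drop blanks."""
    sequence = []
    previous = None
    for probs in frame_probs:
        best = syllables[max(range(len(probs)), key=probs.__getitem__)]
        if best != previous and best != BLANK:
            sequence.append(best)
        previous = best
    return sequence


def recognize(features, identify_language, subnetworks, inventories):
    """Pick the sub-network for the identified language, then decode its output."""
    language = identify_language(features)
    frame_probs = subnetworks[language](features)
    return language, decode_syllables(frame_probs, inventories[language])
```

Here `subnetworks` maps each language category to a callable that returns one probability vector per frame, and `inventories` maps each language category to the syllable labels that index those vectors.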

Optionally, the syllable sequence to be detected includes one or more syllable elements arranged in a first order, and the target syllable sequence includes one or more syllable elements arranged in a second order; when invoking the keyword matching network of the keyword detection model to process the syllable sequence to be detected and the target syllable sequence to obtain the keyword detection result of the speech data to be recognized, the processing module 802 is specifically configured to:

Invoke the keyword matching network of the keyword detection model to process the syllable sequence to be detected and the target syllable sequence, and detect whether the target syllable sequence contains a matching syllable element that matches a first syllable element, where the first syllable element is any syllable element in the syllable sequence to be detected;

If a matching syllable element that matches the first syllable element exists, detect whether the target syllable sequence contains a matching syllable element that matches a second syllable element, where the second syllable element is the syllable element immediately after the first syllable element in the syllable sequence to be detected;

If a matching syllable element that matches the second syllable element exists, determine the keyword detection result of the speech data to be recognized according to the first syllable element and the second syllable element.
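One way to read the element-by-element matching above is as a search for a contiguous, in-order occurrence of the target syllable sequence inside the syllable sequence to be detected. The sketch below implements that reading; treating the match as contiguous and exact is an assumption, and the function name is hypothetical.

```python
def contains_keyword(detected, target):
    """Return True if `target` occurs as a contiguous run inside `detected`."""
    if not target:
        return False
    for start in range(len(detected) - len(target) + 1):
        # Compare the element at `start`, then the element right after it,
        # and so on, against the target syllables in their own order.
        if all(detected[start + k] == target[k] for k in range(len(target))):
            return True
    return False
```

With this reading, a detected sequence in which the keyword syllables appear but are separated by other syllables is not reported as a match.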

In one embodiment, the syllable recognition network is constructed based on a time-delay neural network and a singular value decomposition network.
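The application does not describe the layer layout, so the following is only a sketch of one common way to combine the two ideas: a time-delay layer splices neighboring frames at fixed offsets, and its weight matrix is replaced by two smaller factors, as a truncated singular value decomposition would produce. The function names, the ReLU activation, and the weight shapes are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def tdnn_layer(frames, offsets, weight_a, weight_b):
    """Apply one factorized time-delay layer with a ReLU activation.

    frames:   list of feature vectors, one per time step.
    offsets:  temporal context to splice, e.g. (-1, 0, 1).
    weight_b: rank x spliced_dim matrix projecting the context to a bottleneck.
    weight_a: out_dim x rank matrix expanding the bottleneck to the output.
    """
    first = max(0, -min(offsets))
    last = len(frames) - max(0, max(offsets))
    outputs = []
    for t in range(first, last):
        # Concatenate the frames at t + offset into one context vector.
        spliced = [x for off in offsets for x in frames[t + off]]
        bottleneck = matvec(weight_b, spliced)
        outputs.append([max(0.0, v) for v in matvec(weight_a, bottleneck)])
    return outputs
```

A full weight matrix would hold out_dim x spliced_dim parameters, while the two factors hold rank x (out_dim + spliced_dim), which is smaller whenever the rank is low enough.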

Optionally, the processing module 802 is further configured to:

Obtain the training data set, where the training data set includes one or more training data subsets, and each training data subset includes sample speech data of any one of the one or more language categories and a reference syllable sequence corresponding to the sample speech data;

Perform feature extraction on the sample speech data in the one or more training data subsets to obtain sample speech features of the sample speech data;

Train an initial syllable recognition network using the obtained sample speech features and the corresponding reference syllable sequences to obtain a trained syllable recognition network;

Generate a trained keyword detection model according to the trained syllable recognition network.
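The data preparation part of the procedure above can be sketched as follows. The `Sample` type, the function name, and the language codes are hypothetical; the training step itself (loss function, optimizer, network update) is not specified in this passage and is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class Sample:
    audio: List[float]              # sample speech data
    reference_syllables: List[str]  # reference syllable sequence


def build_training_examples(
    subsets: Dict[str, List[Sample]],
    extract_features: Callable[[Sequence[float]], List[List[float]]],
) -> List[Tuple[str, List[List[float]], List[str]]]:
    """Pool per-language subsets into (language, features, reference) triples."""
    examples = []
    for language, samples in subsets.items():
        for sample in samples:
            features = extract_features(sample.audio)
            examples.append((language, features, sample.reference_syllables))
    return examples
```

Each resulting triple pairs the sample speech features with the reference syllable sequence and keeps the language category, so that a trainer could route the example to the matching recognition sub-network.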

It should be noted that the functions of each functional module of the data processing apparatus in the embodiments of the present application may be implemented according to the methods in the above method embodiments; for the specific implementation process, refer to the relevant descriptions of the above method embodiments, which are not repeated here.

Please refer to FIG. 9, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device shown in the figure may include a processor 901, a storage device 902, and a network interface 903, which can exchange data with one another.

The storage device 902 may include volatile memory, such as random-access memory (RAM); the storage device 902 may also include non-volatile memory, such as flash memory or a solid-state drive (SSD); the storage device 902 may also include a combination of the above types of memory.

The processor 901 may be a central processing unit (CPU). In one embodiment, the processor 901 may also be a graphics processing unit (GPU). The processor 901 may also be a combination of a CPU and a GPU. In one embodiment, the storage device 902 is used to store program instructions, and the processor 901 can invoke the program instructions to perform the following operations:

Acquire speech data to be recognized, and acquire a target syllable sequence of a reference keyword;

Invoke a keyword detection model to process the speech data to be recognized, determine a syllable sequence to be detected of the speech data to be recognized, and determine a keyword detection result of the speech data to be recognized according to the syllable sequence to be detected and the target syllable sequence;

Optionally, when acquiring the target syllable sequence of the reference keyword, the processor 901 is specifically configured to:

Optionally, when invoking the keyword detection model to process the speech data to be recognized, determining the syllable sequence to be detected of the speech data to be recognized, and determining the keyword detection result of the speech data to be recognized according to the syllable sequence to be detected and the target syllable sequence, the processor 901 is specifically configured to:

Optionally, when invoking the syllable recognition network of the keyword detection model to process the speech features to obtain the syllable sequence to be detected of the speech data to be recognized, the processor 901 is specifically configured to:

Optionally, the syllable sequence to be detected includes one or more syllable elements arranged in a first order, and the target syllable sequence includes one or more syllable elements arranged in a second order; when invoking the keyword matching network of the keyword detection model to process the syllable sequence to be detected and the target syllable sequence to obtain the keyword detection result of the speech data to be recognized, the processor 901 is specifically configured to:

Optionally, the processor 901 is further configured to:

Obtain the training data set, where the training data set includes one or more training data subsets, and each training data subset includes sample speech data of any one of the one or more language categories and a reference syllable sequence corresponding to the sample speech data;

In a specific implementation, the processor 901, the storage device 902, and the network interface 903 described in the embodiments of the present application can execute the implementations described in the embodiments of the data processing method provided in FIG. 2 or FIG. 4, and can also execute the implementations described in the embodiment of the data processing apparatus provided in FIG. 8, which are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed method, apparatus, and system may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.

In addition, it should be pointed out that the embodiments of the present application further provide a computer-readable storage medium, which stores the computer program executed by the aforementioned data processing apparatus. The computer program includes program instructions, and when a processor executes the program instructions, it can perform the methods in the embodiments corresponding to FIG. 2 and FIG. 4 above, which are therefore not repeated here. The beneficial effects of the same methods are likewise not repeated. For technical details not disclosed in the computer-readable storage medium embodiments of the present application, refer to the description of the method embodiments of the present application. As an example, the program instructions may be deployed and executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; multiple computer devices distributed across multiple sites and interconnected by a communication network can form a blockchain system.

According to one aspect of the present application, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device can perform the methods in the embodiments corresponding to FIG. 2 and FIG. 4 above, which are therefore not repeated here.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and when executed, it may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

What is disclosed above is merely some embodiments of the present application and certainly cannot be used to limit the scope of rights of the present application. Those of ordinary skill in the art can understand all or part of the processes for implementing the above embodiments, and equivalent changes made according to the claims of the present application still fall within the scope covered by the invention.

Claims

1. A method of data processing, the method comprising:

acquiring voice data to be recognized and acquiring a target syllable sequence of a reference keyword;

calling a keyword detection model to process the voice data to be recognized, determining a syllable sequence to be detected of the voice data to be recognized, and determining a keyword detection result of the voice data to be recognized according to the syllable sequence to be detected and the target syllable sequence;

wherein the keyword detection model is obtained by training with a training data set, the training data set comprising sample voice data of one or more language categories; and the keyword detection model is capable of performing keyword detection on voice data of any one of the one or more language categories.

2. The method of claim 1, wherein the keyword detection model comprises a feature extraction network, a syllable recognition network, a keyword matching network; the syllable recognition network comprises one or more recognition sub-networks, each recognition sub-network is used for carrying out syllable recognition on the voice data of a specified language category, and the specified language category is contained in the one or more language categories; when the syllable recognition network includes a plurality of recognition subnetworks, network parameters of any two recognition subnetworks of the plurality of recognition subnetworks match.

3. The method of claim 1, wherein obtaining the target syllable sequence of the reference keyword comprises:

acquiring a reference keyword and acquiring a target language category of the voice data to be recognized;

and acquiring the syllable sequence of the reference keyword matched with the target language category, and determining the syllable sequence of the reference keyword matched with the target language category as a target syllable sequence.

4. The method according to claim 2, wherein the invoking the keyword detection model to process the voice data to be recognized, determining a syllable sequence to be detected of the voice data to be recognized, and determining a keyword detection result of the voice data to be recognized according to the syllable sequence to be detected and the target syllable sequence comprises:

calling a feature extraction network of a keyword detection model to process the voice data to be recognized to obtain voice features of the voice data to be recognized;

calling a syllable recognition network of the keyword detection model to process the voice features to obtain a syllable sequence to be detected of the voice data to be recognized;

and calling a keyword matching network of the keyword detection model to process the syllable sequence to be detected and the target syllable sequence to obtain a keyword detection result of the voice data to be recognized.

5. The method according to claim 4, wherein the step of processing the voice feature by calling the syllable recognition network of the keyword detection model to obtain the syllable sequence to be detected of the voice data to be recognized comprises:

calling a syllable recognition network of the keyword detection model to process the voice features, and determining the target language category of the voice data to be recognized according to the voice features;

calling a recognition sub-network corresponding to the target language category in the syllable recognition network to process the voice features to obtain a syllable distribution probability of the voice features;

and determining the syllable sequence to be detected of the voice data to be recognized according to the syllable distribution probability.

6. The method of claim 4, wherein the syllable sequence to be detected comprises one or more syllable elements arranged in a first order, and the target syllable sequence comprises one or more syllable elements arranged in a second order; the step of processing the syllable sequence to be detected and the target syllable sequence by calling the keyword matching network of the keyword detection model to obtain a keyword detection result of the voice data to be recognized comprises the following steps:

calling a keyword matching network of the keyword detection model to process the syllable sequence to be detected and the target syllable sequence, and detecting whether a matching syllable element matched with a first syllable element exists in the target syllable sequence, wherein the first syllable element is any one syllable element in the syllable sequence to be detected;

if the matching syllable element matched with the first syllable element exists, detecting whether a matching syllable element matched with a second syllable element exists in the target syllable sequence, wherein the second syllable element is the syllable element arranged immediately after the first syllable element in the syllable sequence to be detected;

and if the matched syllable element matched with the second syllable element exists, determining a keyword detection result of the voice data to be recognized according to the first syllable element and the second syllable element.

7. The method of claim 2, wherein the syllable recognition network is constructed based on a time-delay neural network and a singular value decomposition network.

8. The method of claim 1, further comprising:

obtaining the training data set, wherein the training data set comprises one or more training data subsets, and each training data subset comprises sample voice data of any one of the one or more language categories and a reference syllable sequence corresponding to the sample voice data;

performing feature extraction processing on the sample voice data in the one or more training data subsets to obtain sample voice features of the sample voice data;

training an initial syllable recognition network by using the obtained sample voice features and the corresponding reference syllable sequences to obtain a trained syllable recognition network;

and generating a trained keyword detection model according to the trained syllable recognition network.

9. A data processing apparatus, characterized in that the apparatus comprises:

an acquisition module, configured to acquire voice data to be recognized and acquire a target syllable sequence of a reference keyword;

a processing module, configured to call a keyword detection model to process the voice data to be recognized, determine a syllable sequence to be detected of the voice data to be recognized, and determine a keyword detection result of the voice data to be recognized according to the syllable sequence to be detected and the target syllable sequence;

10. A computer device, comprising a memory and a processor, wherein the memory stores a data processing program, and the data processing program, when executed by the processor, implements the data processing method according to any one of claims 1 to 8.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, implement the data processing method according to any one of claims 1 to 8.

12. A computer program product, characterized in that it comprises a computer program or computer instructions which, when executed by a processor, implement the data processing method according to any one of claims 1 to 8.